We analyze the bottlenecks in the parallel FFT algorithm and describe the optimizations we carried out for it on the Blue Gene/L supercomputer. We identified three avenues for improving the performance of the algorithm: single-node FFT performance, Alltoall collective performance, and overlap of computation and communication. Performance at all three levels has been optimized using the double-hummer intrinsics of the Blue Gene/L CPU, careful ordering and synchronization of messages in Alltoall communications, and suitable interleaving of message exchanges with computations. Using these optimizations, we obtained a 20% performance improvement over the baseline version on the 64-rack Blue Gene/L system. We give a brief overview of the Alltoall optimizations, describe our computation-communication overlap strategy, and present results for strong scaling and weak scaling of parallel FFT on Blue Gene/L. We also discuss the fundamental limits to scaling of the parallel transpose algorithm for computing the FFT. © 2008 Springer Berlin Heidelberg.
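For readers unfamiliar with the parallel transpose algorithm named in the abstract, the following is a minimal single-process sketch of its general structure for a 2D transform: local row transforms, an Alltoall-style block exchange that transposes the data across ranks, then a second round of local row transforms. It is not the authors' implementation. The "ranks" are simulated with Python lists, a naive O(n²) DFT stands in for an optimized single-node FFT, and all function names (`dft`, `fft2d_reference`, `parallel_fft2d_transpose`) are hypothetical.

```python
import cmath


def dft(x):
    """Naive O(n^2) DFT of a 1-D sequence (stand-in for a tuned node-local FFT)."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def fft2d_reference(matrix):
    """Serial 2-D DFT: transform every row, then every column."""
    rows = [dft(r) for r in matrix]
    cols = [dft(list(col)) for col in zip(*rows)]
    return [list(r) for r in zip(*cols)]


def parallel_fft2d_transpose(matrix, p):
    """2-D DFT of an n x n matrix over p simulated ranks (assumes p divides n)."""
    n = len(matrix)
    assert n % p == 0, "this sketch assumes the rank count divides n"
    rp = n // p  # rows owned by each rank

    # Distribute contiguous row slabs, then do the first local transforms.
    local = [matrix[r * rp:(r + 1) * rp] for r in range(p)]
    local = [[dft(row) for row in rows] for rows in local]

    # Alltoall: rank r sends rank s the rp x rp block in column range s.
    send = [
        [[row[s * rp:(s + 1) * rp] for row in local[r]] for s in range(p)]
        for r in range(p)
    ]

    # Each rank s assembles its rp original columns as full-length rows.
    local_t = []
    for s in range(p):
        rows = [
            [send[r][s][i][c] for r in range(p) for i in range(rp)]
            for c in range(rp)
        ]
        local_t.append(rows)

    # Second round of local transforms (now along the original columns).
    local_t = [[dft(row) for row in rows] for rows in local_t]

    # Gather and undo the transpose so the result is in natural order.
    gathered = [row for rows in local_t for row in rows]
    return [[gathered[c][r] for c in range(n)] for r in range(n)]


if __name__ == "__main__":
    data = [[float(4 * i + j) for j in range(4)] for i in range(4)]
    out = parallel_fft2d_transpose(data, p=2)
    ref = fft2d_reference(data)
    err = max(abs(out[i][j] - ref[i][j]) for i in range(4) for j in range(4))
    print("max abs difference vs serial reference:", err)
```

In this structure every rank exchanges a block with every other rank between the two compute phases, which is why the abstract singles out Alltoall performance and computation-communication overlap as optimization targets alongside the single-node FFT.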
CITATION STYLE
Sabharwal, Y., Garg, S. K., Garg, R., Gunnels, J. A., & Sahoo, R. K. (2008). Optimization of fast Fourier transforms on the Blue Gene/L supercomputer. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 5374 LNCS, pp. 309–322). Springer Verlag. https://doi.org/10.1007/978-3-540-89894-8_29