With local core counts on the rise, taking advantage of shared memory to optimize collective operations can improve performance. We study several on-host, shared-memory-optimized algorithms for MPI_Bcast, MPI_Reduce, and MPI_Allreduce, using tree-based and reduce-scatter algorithms. For small-data operations, where synchronization costs are relatively large, fan-in/fan-out algorithms generally perform best. For large messages, data manipulation constitutes the largest cost, and reduce-scatter algorithms are best for reductions. These optimizations improve performance by up to a factor of three. Memory and cache sharing effects require deliberate process layout and careful radix selection for tree-based methods. © 2008 Springer-Verlag Berlin Heidelberg.
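As a rough illustration of the two algorithm families named in the abstract, the following Python sketch simulates them sequentially on plain lists. It is not the authors' implementation: it uses no MPI and no shared memory, the function names `fan_in_reduce` and `reduce_scatter_allreduce` are hypothetical, and summation is assumed as the reduction operator. It only shows the communication pattern: a radix-k fan-in tree combining values onto rank 0, and a reduce-scatter in which each rank reduces one segment of the buffer before the segments are gathered back.

```python
from typing import List


def fan_in_reduce(values: List[float], radix: int = 2) -> float:
    """Simulate a radix-k fan-in tree reduction (sum) onto rank 0.

    values[r] stands for the contribution of rank r (an assumption of
    this sketch; a real implementation would read shared-memory buffers).
    """
    if radix < 2:
        raise ValueError("radix must be at least 2")
    if not values:
        raise ValueError("need at least one rank")
    partial = list(values)
    n = len(partial)
    stride = 1
    while stride < n:
        # At this tree level, every rank that is a multiple of
        # stride * radix acts as a parent and absorbs up to radix - 1
        # children located stride apart.
        for parent in range(0, n, stride * radix):
            for k in range(1, radix):
                child = parent + k * stride
                if child < n:
                    partial[parent] += partial[child]
        stride *= radix
    return partial[0]


def reduce_scatter_allreduce(buffers: List[List[float]]) -> List[List[float]]:
    """Simulate an allreduce (sum) built from reduce-scatter plus gather.

    buffers[r] is the input vector of rank r; all vectors are assumed
    to have the same length.
    """
    p = len(buffers)
    n = len(buffers[0])
    # Split the vector into p contiguous segments; rank r owns segment r.
    bounds = [(r * n) // p for r in range(p + 1)]
    # Reduce-scatter phase: each rank sums only its own segment across
    # the inputs of all ranks, so the arithmetic is divided among ranks.
    owned = [
        [sum(buf[i] for buf in buffers) for i in range(bounds[r], bounds[r + 1])]
        for r in range(p)
    ]
    # Gather phase: every rank collects all reduced segments.
    full = [x for segment in owned for x in segment]
    return [list(full) for _ in range(p)]
```

In this sketch the `radix` argument controls how many children each parent combines per level, which is the tuning knob the abstract refers to for tree-based methods; the simulation says nothing about the cache and memory effects that make the choice matter in practice.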
Graham, R. L., & Shipman, G. (2008). MPI support for multi-core architectures: Optimized shared memory collectives. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 5205 LNCS, pp. 130–140). https://doi.org/10.1007/978-3-540-87475-1_21