Data transposition is required in many numerical applications. When implemented on a distributed-memory computer, data transposition requires all-to-all communication, a time-consuming operation. The Direct Exchange algorithm, commonly used for this task, is inefficient if the number of processors is large. We investigate a series of more sophisticated techniques: the Ring Exchange, Mesh Exchange and Cube Exchange algorithms. These data transposition schemes were incorporated into a parallel solver for the shallow-water equations. We compare the performance of these schemes with that of the Direct Exchange algorithm and the MPI all-to-all communication routine, MPI_Alltoall. The numerical experiments were performed on a Cray T3E computer with 512 processors and on an Ethernet-connected cluster of 36 Sun workstations. Both the analysis and the numerical results indicate that the more sophisticated Mesh and Cube Exchange algorithms perform better than the simpler, well-known Direct and Ring Exchange schemes and better than the MPI_Alltoall routine. We also generalize the Mesh and Cube Exchange algorithms to a d-dimensional mesh algorithm, which can be viewed as a generalization of the standard hypercube data transposition algorithm.
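To make the communication pattern concrete, the following is a minimal single-process simulation of a block transpose driven by a direct, pairwise exchange schedule. It is only a sketch under stated assumptions: the abstract does not give the authors' schedule, so the choice that process p sends to process (p + k) mod P at step k is one common convention rather than the paper's method, and the function name `direct_exchange_transpose` and the row-block data layout are hypothetical. No real message passing takes place; each "send" is a list copy.

```python
def direct_exchange_transpose(local_rows, P):
    """Simulate transposing an N x N matrix distributed by rows over P processes.

    local_rows[p] holds the b = N / P consecutive rows owned by process p.
    Returns the transposed matrix in the same row-block layout.
    Assumption: at step k, process p sends one block to (p + k) % P.
    """
    b = len(local_rows[0])   # rows owned by each process
    n = P * b                # global matrix dimension
    result = [[[None] * n for _ in range(b)] for _ in range(P)]

    def block(p, q):
        # The b x b block that process p must deliver to process q.
        return [row[q * b:(q + 1) * b] for row in local_rows[p]]

    def store(q, p, blk):
        # Process q transposes the block received from p and places it
        # in the columns that correspond to p's original rows.
        for i in range(b):
            for j in range(b):
                result[q][i][p * b + j] = blk[j][i]

    # Step 0 handles the local diagonal block; steps 1..P-1 are the
    # P - 1 exchanges that each process performs under this schedule.
    for k in range(P):
        for p in range(P):
            q = (p + k) % P
            store(q, p, block(p, q))
    return result
```

Under this schedule every process takes part in P - 1 exchange steps, so the number of messages per process grows linearly with P, which is consistent with the abstract's remark that Direct Exchange becomes inefficient when the number of processors is large.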
Christara, C., Ding, X., & Jackson, K. (2005). An Efficient Transposition Algorithm for Distributed Memory Computers. In High Performance Computing Systems and Applications (pp. 349–370). Kluwer Academic Publishers. https://doi.org/10.1007/0-306-47015-2_38