Abstract
In the era of the accelerator, load balancing strategies that are well understood for traditional homogeneous supercomputers must be reworked to address the problem of distributing work across heterogeneous hardware such that neither the CPU nor the accelerator is left idle. Whereas partitioning for a homogeneous system need only balance node-level workload against single-node performance, partitioning for an accelerator-enabled system requires a nested partitioning scheme that ensures both an optimal intra-node and an optimal inter-node load balance. We refer to this as enclave partitioning. Our parallelization scheme allows the same shared-memory-level code to be used on both the CPU and the accelerator, and also allows inter-node communication code to be reused for CPU-accelerator communication. This is in contrast to the traditional "offload" model, in which accelerator code can differ significantly from CPU code in both form and programming language. Using a hybrid MPI-OpenMP implementation of acoustic-elastic wave propagation, we demonstrate the efficacy of the proposed partitioning scheme on a heterogeneous, Intel® Xeon Phi™-accelerated supercomputer (Stampede). With our approach we have realized speedups of up to 5.78x for a 7th-order discretization and 6.88x for a 15th-order discretization relative to a baseline, pure-MPI implementation. We present strong and weak scaling results as well as individual node performance to illustrate the benefits and limits of accelerator-enabled scientific computing.
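The two-level idea can be illustrated with a minimal sketch. This is not the authors' algorithm: it assumes identical nodes, a flat list of mesh elements of equal cost, and hypothetical relative throughputs `cpu_rate` and `acc_rate` for the CPU and the accelerator on each node. The function names `split_by_weights` and `nested_partition` are invented for illustration.

```python
def split_by_weights(items, weights):
    """Split a list into contiguous chunks with sizes proportional to weights."""
    total = sum(weights)
    n = len(items)
    chunks, start, acc = [], 0, 0.0
    for w in weights:
        acc += w
        # Cumulative rounding keeps the chunk sizes summing to n.
        end = round(n * acc / total)
        chunks.append(items[start:end])
        start = end
    return chunks


def nested_partition(elements, num_nodes, cpu_rate, acc_rate):
    """Two-level split: first across nodes, then CPU vs. accelerator per node.

    cpu_rate and acc_rate are assumed (hypothetical) relative throughputs.
    """
    # Inter-node level: identical nodes, so each gets an equal share.
    per_node = split_by_weights(elements, [1.0] * num_nodes)
    result = []
    for chunk in per_node:
        # Intra-node level: share proportional to device throughput, so
        # both devices would finish at about the same time.
        cpu_part, acc_part = split_by_weights(chunk, [cpu_rate, acc_rate])
        result.append({"cpu": cpu_part, "accelerator": acc_part})
    return result


if __name__ == "__main__":
    parts = nested_partition(list(range(120)), num_nodes=3,
                             cpu_rate=1.0, acc_rate=3.0)
    for rank, p in enumerate(parts):
        print(rank, len(p["cpu"]), len(p["accelerator"]))
```

In practice the throughput ratio would have to be measured for the given discretization order, and a real mesh partitioner would also account for communication cost between the two parts, which this sketch ignores.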
Sundar, H., & Ghattas, O. (2015). A nested partitioning algorithm for adaptive meshes on heterogeneous clusters. In Proceedings of the International Conference on Supercomputing (Vol. 2015-June, pp. 319–328). Association for Computing Machinery. https://doi.org/10.1145/2751205.2751246