Many-core processors, such as graphic processing units (GPUs), are promising platforms for intrinsic parallel algorithms such as the lattice Boltzmann method (LBM). Although tremendous speedup has been obtained on a single GPU compared with mainstream CPUs, the performance of the LBM for multiple GPUs has not been studied extensively and systematically. In this article, we carry out LBM simulation on a GPU cluster with many nodes, each having multiple Fermi GPUs. Asynchronous execution with CUDA stream functions, OpenMP and non-blocking MPI communication are incorporated to improve efficiency. The algorithm is tested for two-dimensional Couette flow and the results are in good agreement with the analytical solution. For both the one- and two-dimensional decomposition of space, the algorithm performs well as most of the communication time is hidden. Direct numerical simulation of a two-dimensional gas-solid suspension containing more than one million solid particles and one billion gas lattice cells demonstrates the potential of this algorithm in large-scale engineering applications. The algorithm can be directly extended to the three-dimensional decomposition of space and other modeling methods including explicit grid-based methods. © 2012 Science China Press and Springer-Verlag Berlin Heidelberg.
CITATION STYLE
Xiong, Q. G., Li, B., Xu, J., Fang, X. J., Wang, X. W., Wang, L. M., … Ge, W. (2012). Efficient parallel implementation of the lattice Boltzmann method on large clusters of graphic processing units. Chinese Science Bulletin, 57(7), 707–715. https://doi.org/10.1007/s11434-011-4908-y
Mendeley helps you to discover research relevant for your work.