We describe a parallel implementation of a compressible Lattice Boltzmann code on a multi-GPU cluster based on Nvidia Fermi processors. We analyze how to optimize the algorithm for GP-GPU architectures, describe the implementation choices that we have adopted and compare our performance results with an implementation optimized for latest generation multi-core CPUs. Our program runs at ≈ 30% of the double-precision peak performance of one GPU and shows almost linear scaling when run on the multi-GPU cluster. © 2012 Springer-Verlag.
CITATION STYLE
Biferale, L., Mantovani, F., Pivanti, M., Pozzati, F., Sbragaglia, M., Scagliarini, A., … Tripiccione, R. (2012). A multi-GPU implementation of a D2Q37 Lattice Boltzmann code. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 7203 LNCS, pp. 640–650). https://doi.org/10.1007/978-3-642-31464-3_65
Mendeley helps you to discover research relevant for your work.