Fast Simulations of Gravitational Many-body Problem on RV770 GPU
Management (2009)
- arXiv: 0904.3659
Available from arxiv.org
Abstract
The gravitational many-body problem concerns the motion of bodies that interact through gravity. Solving it on a CPU takes a long time because of its O(N^2) computational complexity. In this paper, we show how to speed up the gravitational many-body problem by using a GPU. After extensive optimizations, the peak performance obtained so far is about 1 Tflops.
arXiv:0904.3659v1 [astro-ph.IM] 23 Apr 2009
Kazuki Fujiwara and Naohito Nakasato
Department of Computer Science and Engineering, University of Aizu
Aizu-Wakamatsu, Fukushima 965-0815, Japan
Contact Email: nakasato@u-aizu.ac.jp

I. INTRODUCTION

Gravitational many-body simulation is a fundamental technique in astrophysics because gravity drives structure formation in the universe. The length scales involved in structure formation range from less than 1 cm for the aggregation of dust to more than 10^24 cm for the formation of cosmological structure. At all scales, gravity is a key physical process for understanding structure formation, because of its long-range nature.

Suppose we simulate structure formation with N particles. The flow of a many-body simulation is as follows: first, we calculate the mutual gravitational forces between the N particles; then we integrate the orbits of the N particles; and we repeat this process as necessary. Although the procedure is simple, the force calculation is a challenging task in computational science. A simple and exact method for the force calculation has O(N^2) computational complexity, which is prohibitively expensive for large N. The exact force calculation is necessary in some types of simulations, such as few-body problems, numerical integration of planets orbiting a star (e.g., the Solar System), and the evolution of dense star clusters.
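The simulation flow described above (compute the mutual forces, integrate the orbits, repeat) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' GPU implementation: it is plain CPU Python, it uses units in which G = 1, it adds a small softening length `eps` to avoid division by zero, and it uses a leapfrog (kick-drift-kick) integrator chosen here only for illustration. The function names `accelerations`, `leapfrog_step`, and `simulate` are hypothetical.

```python
import math


def accelerations(pos, mass, eps):
    """Direct O(N^2) summation of softened gravitational accelerations (G = 1)."""
    n = len(pos)
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # a particle exerts no force on itself
            # Separation vector pointing from particle i toward particle j,
            # so that the resulting acceleration is attractive.
            d = [pos[j][k] - pos[i][k] for k in range(3)]
            r2 = d[0] * d[0] + d[1] * d[1] + d[2] * d[2] + eps * eps
            inv_r3 = 1.0 / (r2 * math.sqrt(r2))
            for k in range(3):
                acc[i][k] += mass[j] * d[k] * inv_r3
    return acc


def leapfrog_step(pos, vel, mass, dt, eps):
    """Advance all particles by one kick-drift-kick step, in place."""
    acc = accelerations(pos, mass, eps)
    for i in range(len(pos)):
        for k in range(3):
            vel[i][k] += 0.5 * dt * acc[i][k]  # half kick
            pos[i][k] += dt * vel[i][k]        # full drift
    acc = accelerations(pos, mass, eps)        # forces at the new positions
    for i in range(len(pos)):
        for k in range(3):
            vel[i][k] += 0.5 * dt * acc[i][k]  # second half kick


def simulate(pos, vel, mass, dt, steps, eps=1e-3):
    """Repeat the force-calculation and orbit-integration cycle `steps` times."""
    for _ in range(steps):
        leapfrog_step(pos, vel, mass, dt, eps)
    return pos, vel
```

The double loop in `accelerations` is where the O(N^2) cost comes from: every step evaluates N(N - 1) pairwise interactions, which is the part a GPU or special-purpose hardware is meant to accelerate.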
For simulations that do not require the exact force, several approximation techniques have been proposed [1]–[3]. The particle-mesh/particle-particle-mesh method [1] and the oct-tree method [2] reduce the computational complexity of the force calculation to O(N log N). The fast-multipole method [3] further reduces it to O(N). A computational technique for evaluating the exact force rapidly is to rely on special-purpose hardware such as GRAPE [4], [5].

Precisely, the exact force calculation is expressed by the following equations:

$$ a_i = \sum_{j=1,\, j \neq i}^{N} f(x_i, x_j, m_j) = \sum_{j=1,\, j \neq i}^{N} \frac{m_j (x_i - x_j)}{(|x_i - x_j|^2 + \epsilon^2)^{3/2}}, $$

$$ p_i = \sum_{j=1,\, j \neq i}^{N} p(x_i, x_j, m_j) = \sum_{j=1,\, j \neq i}^{N} \frac{m_j}{(|x_i - x_j|^2 + \epsilon^2)^{1/2}}, $$

where a_i and p_i are the force vector and potential for particle i, and x_i, m_i, and \epsilon are the position of a particle, its mass, and a parameter that prevents division by zero, respectively. It is apparent that the force calculations for different particles are independent. Therefore, the exact force calculation is expensive but massively parallel. The GRAPE system takes full advantage of this fact by computing different forces in parallel with many computing pipelines. It is natural to take the same approach to utilize a recent graphic
processing unit (GPU), which has a large number of arithmetic units (about 200 to 800), for the exact force calculation. The rise of the GPU forces us to rethink how we approach parallel computing, since the performance of recent GPUs is impressive, reaching 1 Tflops. Acceleration techniques for the exact force calculation with GPUs have already been reported ([6] and many others). In this paper, we report our technique to speed up the exact force calculation on the RV770 GPU from AMD/ATi. As far as we know, our implementation on an RV770 GPU running at 750 MHz shows the fastest performance, about 1 Tflops, thanks to the efficient cache architecture of the RV770 GPU. Furthermore, a loop-unrolling technique is highly effective on the RV770 GPU. In the following sections, we briefly describe our method, implementation, and performance.

II. OUR COMPUTING SYSTEM WITH RV770 GPU

The computing system used in the present paper consists of a host computer and an extension board. The main component of the extension board is a GPU processor that acts as an accelerator attached to the host computer.

A. Architecture of RV770 GPU

The RV770 processor from AMD/ATi is the company's latest GPU (R700 architecture), with many enhancements for general-purpose computing on GPU (GPGPU). It has 800 arithmetic units (called stream cores), each of which is capable of executing a single-precision floating-point (FP) multiply-add in one cycle. At the time of writing, the fastest RV770 processor runs at 750 MHz and offers a peak performance of 800 × 2 × 750 × 10^6 = 1.2 Tflops.

Internally, there are two types of stream cores in the processor. One is a simple stream core that can execute only FP multiply-add and integer operations and operates on 32-bit registers. The other is a transcendental stream core that can handle transcendental functions in addition to the above simple operations. Moreover, these units are organized hierarchically.
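The peak-performance figure quoted above is a product of three numbers from the text: 800 stream cores, 2 floating-point operations per multiply-add, and a 750 MHz clock. The short check below reproduces it. The helper name `peak_flops` is hypothetical, and the efficiency estimate is only approximate because the reported sustained figure is itself stated as "about 1 Tflops".

```python
def peak_flops(stream_cores, flops_per_core_per_cycle, clock_hz):
    """Theoretical peak: every core retires its maximum flops on every cycle."""
    return stream_cores * flops_per_core_per_cycle * clock_hz


# Values taken from the text: 800 stream cores, one multiply-add
# (counted as 2 flops) per core per cycle, 750 MHz clock.
rv770_peak = peak_flops(800, 2, 750e6)
print(f"peak: {rv770_peak / 1e12:.1f} Tflops")  # prints "peak: 1.2 Tflops"

# Rough fraction of peak reached by the reported "about 1 Tflops";
# approximate, since the sustained figure is only quoted to one digit.
approx_efficiency = 1.0e12 / rv770_peak
```

Under these numbers the reported performance corresponds to roughly five sixths of the theoretical peak.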
One level above the stream cores is a five-way very long instruction word unit called a thread processor (TP), which consists of four simple stream cores and one transcendental stream core. Therefore, one RV770 processor has 160 TPs. A TP can execute at most five single-precision/integer operations, or four simple single-precision/integer operations with one transcendental operation, or double-precision operations by combining the four simple stream cores. Moreover, a unit called a SIMD engine consists of 16 TPs. Each SIMD engine has a memory region called a local data store that can be used to explicitly exchange data between TPs. At the top level of the RV770, there are 10 SIMD engines, a controller unit called an ultra-threaded dispatch processor, and other units such as units for graphics processing, memory controllers, and DMA engines.

The external memory attached to the RV770 in the present work is 1 GB of GDDR5 memory with a bus width of 256 bits. It has a data clock rate of 3600 MHz and offers a bandwidth of 115.2 GB sec^-1. In addition to this large memory bandwidth, each SIMD engine on the RV770 has a two-level cache memory. Figure 1 shows a block diagram of the RV770.

The RV770 processor and its memory chips are mounted on an extension board. The extension board is connected to the host computer through a PCI-Express Gen2 x16 bus. The theoretical communication speed between the host computer and the RV770 GPU is at most 8 GB sec^-1 (one-way). The measured communication speed of our system is about 5 to 6 GB sec^-1 for data sizes larger than 1 MB.

B. CAL for programming RV770 GPU

After the introduction of unified shaders on GPUs around 2000, it became possible to write programs on GPUs by using shader languages such as HLSL, GLSL, and Cg. However, those languages were not designed for general computing on GPUs. Even so, an early attempt to implement the force calculation has been reported [7]. In 2006, Nvidia Inc. provided CUDA (Compute Unified Device Architecture), which



