Memory Hierarchy Optimization for Large Tridiagonal System Solvers on GPU
GPUs are now commodity hardware with hundreds of cores and support for thousands of threads, and they can accelerate a wide range of applications. From a programmer's perspective, GPUs offer a stream processing model that requires new techniques to exploit their capabilities. In this paper we apply the split-and-merge technique to two parallel tridiagonal system solvers on the GPU: cyclic reduction and recursive doubling. The split-and-merge technique naturally splits the algorithm flow into parallel paths that can be solved in shared memory and later merged in global memory. In this way, large systems of equations can be solved efficiently by exploiting the memory hierarchy of the GPU. The results show a significant speedup over the direct implementation of the algorithms on the GPU.
Keywords: GPGPU, CUDA, tridiagonal system solver, cyclic reduction, recursive doubling