We present an efficient approach for the development of 3-D partial differential equation solvers that push toward the hardware limits of modern GPUs and scale to the world's largest supercomputers. The approach relies on the accelerated pseudo-transient method and, for the implementation, on the automatic generation of computation kernels with optimized on-chip memory usage. We report performance and scaling results on LUMI and Piz Daint, an AMD-GPU and an NVIDIA-GPU supercomputer, respectively.
For the development of massively scalable, high-performance 3-D partial differential equation (PDE) solvers, we rely on a powerful matrix-free pseudo-transient iterative algorithm. The algorithm allows the update rules for the fields to be formulated in a form close to explicit time integration. This formulation is optimally suited for both shared- and distributed-memory parallelization.
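As a concrete illustration (not part of the abstract itself), the following minimal 1-D sketch shows what such explicit-like pseudo-transient update rules can look like for a steady diffusion problem; all variable names and numerical parameters are illustrative assumptions rather than the tuned values used in the actual solvers.

```julia
# Minimal 1-D sketch of an accelerated pseudo-transient iteration for the steady
# diffusion problem d/dx(D*dH/dx) = 0. All parameter choices below are illustrative
# assumptions; the point is that both update rules read like explicit time steps.
nx, lx, D = 128, 10.0, 1.0
dx  = lx / nx
H   = zeros(nx); H[end] = 1.0               # initial guess; end points act as fixed BCs
qx  = zeros(nx - 1)                         # pseudo-transient flux on cell faces
vp  = 1.0                                   # pseudo wave speed (assumed)
θr  = lx / (2π * vp)                        # flux relaxation time (heuristic choice)
ρ   = D / (vp^2 * θr)                       # pseudo density yielding wave speed vp
dτ  = 0.9 * dx / vp                         # CFL-like pseudo-time step
for it in 1:5_000
    # flux update rule: θr*∂q/∂τ + q = -D*∂H/∂x, integrated as an explicit-like step
    qx .-= (qx .+ D .* diff(H) ./ dx) .* (dτ / (θr + dτ))
    # field update rule: ρ*∂H/∂τ = -∂q/∂x (interior points only)
    H[2:end-1] .-= (dτ / ρ) .* diff(qx) ./ dx
end
println(maximum(abs.(diff(qx) ./ dx)))      # residual of the steady-state equation
```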
With respect to implementation, our approach is instantiated in the packages ParallelStencil.jl and ImplicitGlobalGrid.jl. It is fully architecture-agnostic and, for GPUs, it includes the automatic generation of computation kernels with optimized on-chip memory usage as well as the automatic determination of their launch parameters. On-chip memory usage is optimized by explicitly reading large arrays through the memory shared by the threads of a block (when beneficial and when hardware resources allow for it) and by steering register usage in order to keep data that is to be reused on the chip whenever possible. The optimizations are applicable to complex real-world applications: a simplified optimization strategy requiring less on-chip memory can be used for part of the kernel input arrays when the most aggressive strategy would lead to resource exhaustion. Kernel launch parameters such as blocks, threads, and required shared memory can be defined automatically in agreement with the applied optimizations. Furthermore, communication can be automatically hidden behind computation, including when on-chip memory usage optimization is activated.
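To make the workflow concrete, below is a condensed sketch in the style of the packages' documented multi-GPU examples; grid sizes, physics, and variable names are assumptions for illustration. Launch parameters are determined automatically by @parallel, the on-chip memory optimizations described above are opt-in features not shown explicitly here, and @hide_communication overlaps the halo update with the computation.

```julia
# Condensed sketch of a multi-GPU 3-D diffusion solver using ParallelStencil.jl and
# ImplicitGlobalGrid.jl (illustrative sizes and physics). The @parallel kernel is
# architecture-agnostic; set USE_GPU = false to run on CPU threads instead.
const USE_GPU = true
using ImplicitGlobalGrid, ParallelStencil, ParallelStencil.FiniteDifferences3D
@static if USE_GPU
    @init_parallel_stencil(CUDA, Float64, 3)
else
    @init_parallel_stencil(Threads, Float64, 3)
end

@parallel function diffusion3D_step!(T2, T, Ci, lam, dt, dx, dy, dz)
    @inn(T2) = @inn(T) + dt*(lam*@inn(Ci)*(@d2_xi(T)/dx^2 + @d2_yi(T)/dy^2 + @d2_zi(T)/dz^2))
    return
end

function diffusion3D()
    lam, lx, ly, lz = 1.0, 10.0, 10.0, 10.0            # physics (illustrative)
    nx, ny, nz      = 128, 128, 128                    # local grid size per process/GPU
    me, dims        = init_global_grid(nx, ny, nz)     # set up the implicit global grid (MPI)
    dx, dy, dz      = lx/(nx_g()-1), ly/(ny_g()-1), lz/(nz_g()-1)
    T   = @rand(nx, ny, nz)                            # temperature field
    T2  = copy(T)
    Ci  = @ones(nx, ny, nz)                            # inverse heat capacity
    dt  = min(dx, dy, dz)^2/lam/6.1                    # explicit stability criterion
    for it in 1:1000
        @hide_communication (8, 2, 2) begin            # overlap halo exchange with computation
            @parallel diffusion3D_step!(T2, T, Ci, lam, dt, dx, dy, dz)
            update_halo!(T2)
        end
        T, T2 = T2, T
    end
    finalize_global_grid()
end

diffusion3D()
```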
We compare the performance achieved for representative computation kernels with the performance obtained using alternative ways of expressing equivalent computations in Julia, for example GPU array broadcasting or straightforward CUDA.jl kernel programming. Finally, we report the scaling of earth-science applications on the European flagship supercomputer LUMI at CSC in Finland, hosting AMD GPUs, and on the Piz Daint supercomputer at the Swiss National Supercomputing Centre (CSCS) in Switzerland, hosting NVIDIA GPUs.
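For illustration, one such alternative formulation (a sketch, reusing the field names of the previous snippet, not the exact baseline used in the measurements) expresses the same inner-point update with array broadcasting, which works on both CPU arrays and CUDA.jl GPU arrays:

```julia
# Broadcasting-based formulation of the same inner-point diffusion update, as an
# example of the alternative expressions referred to above (illustrative sketch).
using CUDA
function diffusion3D_step_broadcast!(T2, T, Ci, lam, dt, dx, dy, dz)
    @views T2[2:end-1, 2:end-1, 2:end-1] .= T[2:end-1, 2:end-1, 2:end-1] .+
        dt .* lam .* Ci[2:end-1, 2:end-1, 2:end-1] .* (
            (T[3:end, 2:end-1, 2:end-1] .- 2 .* T[2:end-1, 2:end-1, 2:end-1] .+ T[1:end-2, 2:end-1, 2:end-1]) ./ dx^2 .+
            (T[2:end-1, 3:end, 2:end-1] .- 2 .* T[2:end-1, 2:end-1, 2:end-1] .+ T[2:end-1, 1:end-2, 2:end-1]) ./ dy^2 .+
            (T[2:end-1, 2:end-1, 3:end] .- 2 .* T[2:end-1, 2:end-1, 2:end-1] .+ T[2:end-1, 2:end-1, 1:end-2]) ./ dz^2)
    return nothing
end
```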
Co-authors: Ludovic Räss¹ ², Ivan Utkin¹
¹ ETH Zurich | ² Swiss Federal Institute for Forest, Snow and Landscape Research (WSL)