In this tutorial, we will implement Conway's Game of Life, the algorithm that made cellular automata popular, on graphics processing units (GPUs). The tutorial runs on NVIDIA GPUs with CUDA, but it can also run on AMD GPUs using HIP; waLBerla fully supports both. For a basic understanding of the GPU support in waLBerla, please read Introduction to GPU Programming with waLBerla first.
This tutorial is an extension of Tutorial - Basics 3: Writing a Simple Cellular Automaton in waLBerla to GPUs.
To run a simulation on a graphics processing unit (GPU), we have to allocate data on the GPU and write a kernel that operates on this data. In this tutorial we first allocate a field on the GPU and learn about the functionality for transferring data between CPU and GPU fields.
Since initialization and output routines are usually not time critical, they are implemented for CPU fields only. In waLBerla we set up the complete simulation using CPU fields, copy the initialized fields over to the GPU, do the complete computation there, and, in the end, copy everything back to do the output from the CPU field. So only the time-critical kernels have to be written for the GPU.
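To illustrate this pattern, here is a minimal sketch of the data flow. It assumes a CPU field cpuField and a GPU field gpuField like the ones created later in this tutorial, and uses gpu::fieldCpy() from gpu/FieldCopy.h, which copies the contents of its second argument into its first:

```cpp
#include "gpu/FieldCopy.h"

// upload the initialized CPU field to the GPU (destination first, source second)
gpu::fieldCpy( gpuField, cpuField );

// ... run all time steps with GPU kernels ...

// download the result to the CPU field, e.g. for output
gpu::fieldCpy( cpuField, gpuField );
```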
Thus the setup code of the GPU GameOfLife program is very similar to its CPU version, which was implemented in a previous tutorial (Tutorial - Basics 3: Writing a Simple Cellular Automaton in waLBerla). One difference is that fields which are frequently transferred from/to the GPU should be allocated with a different field allocator: gpu::HostFieldAllocator. This allocator uses gpuHostAlloc() instead of "new", so that the memory is marked "pinned": it is always held in RAM and cannot be swapped out to disk. Data transfer from pinned memory is faster than from normal memory. Using this allocator is not mandatory; the data transfer functions also work (slightly slower) with normally allocated fields.
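As a hedged sketch, the field creation function from the CPU tutorial could be adapted as follows; ScalarField is assumed to be the GhostLayerField< real_t, 1 > type used in the CPU version:

```cpp
#include "gpu/HostFieldAllocator.h"

ScalarField * createField( IBlock * const block, StructuredBlockStorage * const storage )
{
   return new ScalarField( storage->getNumberOfXCells( *block ),   // number of cells in x direction
                           storage->getNumberOfYCells( *block ),   // number of cells in y direction
                           storage->getNumberOfZCells( *block ),   // number of cells in z direction
                           uint_c( 1 ),                            // one ghost layer
                           real_c( 0 ),                            // initial value
                           field::fzyx,                            // memory layout
                           make_shared< gpu::HostFieldAllocator<real_t> >() ); // pinned host memory
}

// register the CPU field as block data, just like in the CPU tutorial
BlockDataID cpuFieldID = blocks->addStructuredBlockData< ScalarField >( &createField, "CPU Field" );
```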
Now we initialize the CPU field just like in the previous tutorial (Tutorial - Basics 3). Then two GPU fields are created: a "source" and a "destination" field. The helper function gpu::addGPUFieldToStorage() creates a gpu::GPUField of the same size and layout as the given CPU field:
The contents of the new GPU fields are initialized with the contents of the given CPU field.
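A sketch of this step, assuming the CPU field is registered under cpuFieldID and has type ScalarField:

```cpp
#include "gpu/AddGPUFieldToStorage.h"

// create two GPU fields with the same size and layout as the CPU field,
// both initialized with the contents of the CPU field
BlockDataID gpuFieldSrcID = gpu::addGPUFieldToStorage< ScalarField >( blocks, cpuFieldID, "GPU Field Src" );
BlockDataID gpuFieldDstID = gpu::addGPUFieldToStorage< ScalarField >( blocks, cpuFieldID, "GPU Field Dst" );
```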
After reading the GPU introduction page mentioned above, you should know what a FieldAccessor is and how to call a GPU kernel. So we can now start writing a kernel for the Game of Life algorithm. We place it in a separate file with a ".cu" extension (this is basically the only part that differs between CUDA and HIP). The build system then automatically detects that this file has to be compiled with the CUDA (or HIP) C++ compiler.
The kernel gets two field accessors as arguments, one for the source and one for the destination field. Both accessors have to be configured using the built-in variables blockIdx and threadIdx provided by CUDA or HIP, such that afterwards the get() and getNeighbor() functions of the accessor class work correctly.
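A sketch of such a kernel for the two-dimensional Game of Life; the kernel name gameOfLifeKernel and the 0.5 threshold used to test whether a cell is alive are illustrative choices:

```cpp
#include "gpu/FieldAccessor.h"

namespace walberla {

__global__ void gameOfLifeKernel( gpu::FieldAccessor<real_t> src, gpu::FieldAccessor<real_t> dst )
{
   // configure both accessors for the cell handled by this thread
   src.set( blockIdx, threadIdx );
   dst.set( blockIdx, threadIdx );

   // count the living neighbors of the current cell (2D -> 8 neighbors)
   int liveNeighbors = 0;
   for( int dy = -1; dy <= 1; ++dy )
      for( int dx = -1; dx <= 1; ++dx )
         if( ( dx != 0 || dy != 0 ) && src.getNeighbor( dx, dy, 0 ) > real_t(0.5) )
            ++liveNeighbors;

   // apply the Game of Life rules
   if( liveNeighbors < 2 || liveNeighbors > 3 )
      dst.get() = real_t(0);      // cell dies of under- or over-population
   else if( liveNeighbors == 3 )
      dst.get() = real_t(1);      // dead cell with exactly three neighbors becomes alive
   else
      dst.get() = src.get();      // otherwise the state is kept
}

} // namespace walberla
```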
To call this kernel we create gpu::FieldIndexing objects from the GPU fields. From these, the gridDim and blockDim for the kernel launch can be obtained, as well as the accessor objects (via gpuAccess()) that carry the indexing information needed inside the GPU kernel, for example to resolve neighbor accesses. The kernel can then be launched as usual with the triple angle bracket syntax.
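Put together, a sweep functor along the following lines could launch the kernel for each block; gameOfLifeKernel refers to the kernel sketched above, and the class name is our own choice:

```cpp
class GameOfLifeSweepCUDA
{
 public:
   GameOfLifeSweepCUDA( BlockDataID srcID, BlockDataID dstID ) : srcID_( srcID ), dstID_( dstID ) {}

   void operator()( IBlock * block )
   {
      auto srcGpuField = block->getData< gpu::GPUField<real_t> >( srcID_ );
      auto dstGpuField = block->getData< gpu::GPUField<real_t> >( dstID_ );

      // FieldIndexing provides gridDim/blockDim for the launch and the accessors for the kernel
      auto srcIndexing = gpu::FieldIndexing<real_t>::xyz( *srcGpuField );
      auto dstIndexing = gpu::FieldIndexing<real_t>::xyz( *dstGpuField );

      gameOfLifeKernel<<< srcIndexing.gridDim(), srcIndexing.blockDim() >>>(
            srcIndexing.gpuAccess(), dstIndexing.gpuAccess() );

      // the result of this step becomes the source of the next one
      srcGpuField->swapDataPointers( *dstGpuField );
   }

 private:
   BlockDataID srcID_;
   BlockDataID dstID_;
};
```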
All the computations are done on the GPU. The CPU field is not updated automatically! It was just used for setup reasons.
To see if our kernel works, we create a VTK writer. The VTK writer operates on the CPU field, so it works exactly as in other examples. However, since our data lives on the GPU, we need an addBeforeFunction that copies the data from the device back to the host before the output is written. This is done using gpu::fieldCpyFunctor. Note that copying data is costly, so we usually do not want to do this in every time step; in this example it is only done every second time step.
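A hedged sketch of the VTK setup; the output identifier and the write frequency of 2 (output every second time step) are example values:

```cpp
// create a VTK output object that writes every second time step
auto vtkOutput = vtk::createVTKOutput_BlockData( *blocks, "game_of_life", uint_c( 2 ) );

// before writing, copy the current GPU data (source field) back into the CPU field
vtkOutput->addBeforeFunction(
      gpu::fieldCpyFunctor< ScalarField, gpu::GPUField<real_t> >( blocks, cpuFieldID, gpuFieldSrcID ) );

// the cell data writer itself reads the CPU field, as in the CPU tutorial
vtkOutput->addCellDataWriter( make_shared< field::VTKWriter<ScalarField> >( cpuFieldID, "GameOfLife" ) );

timeloop.addFuncAfterTimeStep( vtk::writeFiles( vtkOutput ), "VTK output" );
```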
For this tutorial we use the gpu::communication::UniformGPUScheme, which first collects all data in a buffer and sends only one message per communication step and neighbor. For the PackInfo we use the MemcpyPackInfo. It receives a buffer located on the GPU and fills it using memcpy operations. If the GPU library is built with MPI support, this buffer can be sent to other GPUs without a copy to the CPU; otherwise, the copy is done behind the scenes by the communication class.
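A sketch of the communication setup, assuming a stencil::D2Q9 neighborhood for the two-dimensional domain and gpu::GPUField<real_t> as the GPU field type; the sweep registered here is the functor sketched earlier:

```cpp
#include "gpu/communication/UniformGPUScheme.h"
#include "gpu/communication/MemcpyPackInfo.h"

gpu::communication::UniformGPUScheme< stencil::D2Q9 > commScheme( blocks );
commScheme.addPackInfo(
      make_shared< gpu::communication::MemcpyPackInfo< gpu::GPUField<real_t> > >( gpuFieldSrcID ) );

// exchange ghost layers before every Game of Life sweep
timeloop.add() << BeforeFunction( commScheme.getCommunicateFunctor(), "Communication" )
               << Sweep( GameOfLifeSweepCUDA( gpuFieldSrcID, gpuFieldDstID ), "Game of Life sweep (GPU)" );
```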
Before running the simulation, we would like to point out a few common pitfalls. Basically, it works very similarly to the CPU equivalent: since all sweeps and function calls are registered with the timeloop, we can run the simulation using timeloop.run(). However, it is important to keep in mind that kernel calls are asynchronous. Thus, for time measurement purposes, we need to make sure that all kernels have finished before stopping the timer. This can be done using gpuDeviceSynchronize(). For good measure, we also call this function right before starting the timer.
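A sketch of the timing around timeloop.run(); WcTimer is waLBerla's wall-clock timer, and the error-checking macro WALBERLA_GPU_CHECK from the gpu module is assumed here:

```cpp
WcTimer simTimer;

WALBERLA_GPU_CHECK( gpuDeviceSynchronize() );   // make sure all previous GPU work has finished
simTimer.start();

timeloop.run();

WALBERLA_GPU_CHECK( gpuDeviceSynchronize() );   // wait for all asynchronous kernels before stopping the timer
simTimer.end();

WALBERLA_LOG_INFO_ON_ROOT( "Simulation finished in " << simTimer.last() << " seconds" );
```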