Note: This tutorial required a CUDA aware MPI library. If you get a SEGFAULT when executing this tutorial, make sure that your MPI library was built with CUDA support! For instructions how to build OpenMPI with CUDA see this page.
To run a simulation on a NVIDIA graphics card, we have to allocate data on the GPU and write a CUDA kernel that operates on this data. In this tutorial we first allocate a field on the GPU and learn about functionality to transfer data between CPU and GPU fields.
Since initialization and output routines are usually not time critical, they are implemented for CPU fields only. In waLBerla we set up the complete simulation using CPU fields, copy the initialized fields over to the GPU, do the complete computation there, and, in the end, copy everything back to do the output from the CPU field. So only the time critical kernels have to be written in CUDA.
Thus the setup code of the GPU GameOfLife program is very similar to its CPU version, which was implemented in a previous tutorial ( Tutorial - Basics 3: Writing a Simple Cellular Automaton in waLBerla ). One difference is, that fields which are often transfered from/to the GPU should be allocated with a different field allocator: cuda::HostFieldAllocator . This allocator uses cudaHostAlloc() instead of "new" , such that the memory is marked "pinned", which means that it is always held in RAM and cannot be swapped out to disk. Data transfer from pinned memory is faster than from normal memory. The usage of this allocator is not mandatory, the data transfer functions work (slightly slower) also with normally allocated fields.
Now we initialize the CPU field just like in the previous tutorial tutorial_basics03 . Then two GPU fields are created: "source" and "destination" field. The helper function cuda::addGPUFieldToStorage() creates a cuda::GPUField field of the same size and layout of the given CPU field:
The contents of the new GPU fields are initialized with the contents of the given CPU field.
For a basic understanding of the CUDA support in waLBerla please read Overview of waLBerla CUDA support first.
After reading this page you should know what a FieldAccessor is and how to call CUDA kernels from cpp files. So we can now start with writing a CUDA kernel for the Game of Life algorithm. We place this in a separate file with ".cu" extension. The build system then automatically detects that this file should be compiled with the CUDA C++ compiler.
The kernel gets two field accessors as arguments, one for the source and one for the destination field. Both accessors have to be configured using the CUDA variables blockIdx and threadIdx, such that afterwards the get() and getNeighbor() functions of the accessor class can work correctly.
To call this kernel we write a thin wrapper sweep which only has to get the GPU fields out of the blockstorage and passes them to the CUDA kernel. We use the cuda::Kernel class from waLBerla here, so that we can write this sweep in a normal cpp file. Here are the contents of this sweep:
All the computations are done on the GPU. The CPU field is not updated automatically! It was just used for setup reasons.
To see if our kernel works, we copy the contents back to the CPU field after every timestep:
Of course this makes no sense for real simulations, since the transfer time is much higher than the time that was saved by doing the computation on the GPU. For production runs, one would usually transfer the field back every n'th timestep and write e.g. a VTK frame.
In waLBerla there are two types of communication: buffered and direct communication. While buffered communication first collects all data in a buffer and sends only one message per communciation step and neighbor the direct communciation strategy, which is based on MPI datatypes, uses no intermediate buffers and therefore has to send more messages than buffered communication. For details see Distributed memory communication in waLBerla .
In the tutorials up to now, only the buffered approach was used. In this tutorial, we switch to the direct communciation strategy because then we can use the CUDA support of the MPI library to directly communciate from/to GPU memory.
The usage of the two different communication schemes is very similar. Instead of creating a blockforest::communication::UniformBufferedScheme we create a blockforest::communication::UniformDirectScheme. Then we register a field::communication::UniformMPIDatatypeInfo instead of the field::communication::PackInfo.
This scheme also supports heterogenous simulations, i.e. using a CPU field on some processes and a GPU field on other processes.