Tutorial - CUDA 1: Game of Life on GPU

Note: This tutorial requires a CUDA-aware MPI library. If you get a SEGFAULT when executing this tutorial, make sure that your MPI library was built with CUDA support! For instructions on how to build OpenMPI with CUDA, see this page.

Creating Fields

To run a simulation on an NVIDIA graphics card, we have to allocate data on the GPU and write a CUDA kernel that operates on this data. In this tutorial we first allocate a field on the GPU and learn about the functionality to transfer data between CPU and GPU fields.

Since initialization and output routines are usually not time critical, they are implemented for CPU fields only. In waLBerla we set up the complete simulation using CPU fields, copy the initialized fields over to the GPU, do the complete computation there, and, in the end, copy everything back to do the output from the CPU field. So only the time critical kernels have to be written in CUDA.

Thus the setup code of the GPU GameOfLife program is very similar to its CPU version, which was implemented in a previous tutorial ( Tutorial - Basics 3: Writing a Simple Cellular Automaton in waLBerla ). One difference is that fields which are frequently transferred from/to the GPU should be allocated with a different field allocator: cuda::HostFieldAllocator . This allocator uses cudaHostAlloc() instead of "new", so that the memory is marked "pinned", which means it is always held in RAM and cannot be swapped out to disk. Data transfer from pinned memory is faster than from normal memory. The usage of this allocator is not mandatory; the data transfer functions also work (slightly slower) with normally allocated fields.

ScalarField * createField( IBlock* const block, StructuredBlockStorage* const storage )
{
   return new ScalarField (
             storage->getNumberOfXCells( *block ), // number of cells in x direction per block
             storage->getNumberOfYCells( *block ), // number of cells in y direction per block
             storage->getNumberOfZCells( *block ), // number of cells in z direction per block
             1,                                    // one ghost layer
             real_t(0),                            // initial value
             field::fzyx,                          // layout
             make_shared< cuda::HostFieldAllocator<double> >() // allocator for host pinned memory
          );
}
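
This creation function is then registered at the block storage, just as in the basics tutorial. A minimal sketch, assuming a block forest named blocks; the identifier string is illustrative, and the resulting cpuFieldID is the ID used below:

// sketch: register the CPU field at the block storage (identifier chosen for illustration)
BlockDataID cpuFieldID = blocks->addStructuredBlockData< ScalarField >( &createField, "CPU Field" );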

Now we initialize the CPU field just like in the previous tutorial ( Tutorial - Basics 3: Writing a Simple Cellular Automaton in waLBerla ). Then two GPU fields are created: a "source" and a "destination" field. The helper function cuda::addGPUFieldToStorage() creates a cuda::GPUField of the same size and layout as the given CPU field:

BlockDataID gpuFieldSrcID = cuda::addGPUFieldToStorage<ScalarField>( blocks, cpuFieldID, "GPU Field Src" );
BlockDataID gpuFieldDstID = cuda::addGPUFieldToStorage<ScalarField>( blocks, cpuFieldID, "GPU Field Dst" );

The contents of the new GPU fields are initialized with the contents of the given CPU field.
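
Data can also be copied explicitly between host and device fields later on. A minimal sketch, assuming the copy helper cuda::fieldCpy from the cuda module and the IDs from above (the exact template arguments may differ in your waLBerla version):

// sketch: explicitly copy the CPU field to the GPU source field ...
cuda::fieldCpy< cuda::GPUField<real_t>, ScalarField >( blocks, gpuFieldSrcID, cpuFieldID );
// ... and copy the GPU destination field back to the CPU field
cuda::fieldCpy< ScalarField, cuda::GPUField<real_t> >( blocks, cpuFieldID, gpuFieldDstID );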

Writing and calling CUDA kernels

For a basic understanding of the CUDA support in waLBerla please read Overview of waLBerla CUDA support first.

After reading this page you should know what a FieldAccessor is and how to call CUDA kernels from cpp files. So we can now start writing a CUDA kernel for the Game of Life algorithm. We place this in a separate file with a ".cu" extension. The build system then automatically detects that this file should be compiled with the CUDA C++ compiler.

The kernel gets two field accessors as arguments, one for the source and one for the destination field. Both accessors have to be configured using the CUDA variables blockIdx and threadIdx, such that afterwards the get() and getNeighbor() functions of the accessor class can work correctly.

__global__ void gameOfLifeKernel( cuda::FieldAccessor<double> src, cuda::FieldAccessor<double> dst )
{
   src.set( blockIdx, threadIdx );
   dst.set( blockIdx, threadIdx );

   int liveNeighbors = 0;
   if ( src.getNeighbor(  1, 0, 0 ) > 0.5 ) ++liveNeighbors;
   if ( src.getNeighbor( -1, 0, 0 ) > 0.5 ) ++liveNeighbors;

   // normal Game of Life algorithm ....
   // ...
}
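
The kernel body is only sketched here. A possible completion of the neighbor counting and update rule could look as follows; this is a sketch, assuming the get()/getNeighbor() accessor interface used above and a purely two-dimensional domain, not the tutorial's exact code:

// sketch: count all eight in-plane neighbors
int liveNeighbors = 0;
for ( int dy = -1; dy <= 1; ++dy )
   for ( int dx = -1; dx <= 1; ++dx )
      if ( ( dx != 0 || dy != 0 ) && src.getNeighbor( dx, dy, 0 ) > 0.5 )
         ++liveNeighbors;

// a live cell survives with 2 or 3 live neighbors, a dead cell becomes alive with exactly 3
if ( src.get() > 0.5 )
   dst.get() = ( liveNeighbors == 2 || liveNeighbors == 3 ) ? 1.0 : 0.0;
else
   dst.get() = ( liveNeighbors == 3 ) ? 1.0 : 0.0;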

To call this kernel we write a thin wrapper sweep which only gets the GPU fields out of the block storage and passes them to the CUDA kernel. We use the cuda::Kernel class from waLBerla here, so that we can write this sweep in a normal cpp file. Here are the contents of this sweep:

auto srcCudaField = block->getData< cuda::GPUField<real_t> > ( gpuFieldSrcID_ );
auto dstCudaField = block->getData< cuda::GPUField<real_t> > ( gpuFieldDstID_ );
auto myKernel = cuda::make_kernel( &gameOfLifeKernel );
myKernel.addFieldIndexingParam( cuda::FieldIndexing<double>::xyz( *srcCudaField ) );
myKernel.addFieldIndexingParam( cuda::FieldIndexing<double>::xyz( *dstCudaField ) );
myKernel();
srcCudaField->swapDataPointers( dstCudaField );
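
The trailing underscores in gpuFieldSrcID_ and gpuFieldDstID_ refer to members of a small functor class in which the sweep body typically lives. A minimal sketch of such a wrapper; the class and member names are illustrative, not taken from the tutorial:

class GameOfLifeSweepCUDA
{
public:
   GameOfLifeSweepCUDA( BlockDataID gpuFieldSrcID, BlockDataID gpuFieldDstID )
      : gpuFieldSrcID_( gpuFieldSrcID ), gpuFieldDstID_( gpuFieldDstID ) {}

   void operator()( IBlock * block )
   {
      // ... the sweep body shown above: fetch fields, configure cuda::Kernel, launch, swap ...
   }

private:
   BlockDataID gpuFieldSrcID_;
   BlockDataID gpuFieldDstID_;
};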

All the computations are done on the GPU. The CPU field is not updated automatically! It was just used for setup reasons.

To see if our kernel works, we copy the contents back to the CPU field after every timestep:

timeloop.add() << Sweep( cuda::fieldCpyFunctor<ScalarField, GPUField >(cpuFieldID, gpuFieldDstID) );

Of course this makes no sense for real simulations, since the transfer time is much higher than the time saved by doing the computation on the GPU. For production runs, one would usually transfer the field back only every n-th timestep and write, e.g., a VTK frame.
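
A minimal sketch of this pattern, assuming the timeloop, block storage, and field IDs from above, a hypothetical write frequency, and the copy helper cuda::fieldCpy from the cuda module:

uint_t writeFrequency = 100; // hypothetical output interval
timeloop.addFuncAfterTimeStep( [&, step = uint_t(0)]() mutable
{
   if ( ++step % writeFrequency == 0 )
   {
      // copy the GPU result back into the CPU field only when output is due
      cuda::fieldCpy< ScalarField, cuda::GPUField<real_t> >( blocks, cpuFieldID, gpuFieldDstID );
      // ... write a VTK frame of the CPU field here, e.g. via waLBerla's vtk module ...
   }
}, "GPU -> CPU copy and output" );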

Communication

In waLBerla there are two types of communication: buffered and direct communication. Buffered communication first collects all data in a buffer and sends only one message per communication step and neighbor. The direct communication strategy, which is based on MPI datatypes, uses no intermediate buffers and therefore has to send more messages than buffered communication. For details see Distributed memory communication in waLBerla .

In the tutorials up to now, only the buffered approach was used. In this tutorial, we switch to the direct communication strategy because then we can use the CUDA support of the MPI library to communicate directly from/to GPU memory.

The usage of the two different communication schemes is very similar. Instead of creating a blockforest::communication::UniformBufferedScheme we create a blockforest::communication::UniformDirectScheme. Then we register a field::communication::UniformMPIDatatypeInfo instead of the field::communication::PackInfo.

typedef blockforest::communication::UniformDirectScheme<stencil::D2Q9 > CommScheme;
CommScheme communication( blocks );
communication.addDataToCommunicate( make_shared<field::communication::UniformMPIDatatypeInfo<GPUField> > (gpuFieldSrcID) );

This scheme also supports heterogeneous simulations, i.e. using a CPU field on some processes and a GPU field on other processes.
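
Finally, the communication scheme is hooked into the timeloop in front of the compute sweep. A minimal sketch, assuming the timeloop from above and the hypothetical GameOfLifeSweepCUDA wrapper sketched earlier:

// sketch: communicate ghost layers before each application of the GPU sweep
timeloop.add() << BeforeFunction( communication, "Communication" )
               << Sweep( GameOfLifeSweepCUDA( gpuFieldSrcID, gpuFieldDstID ), "GameOfLifeSweepCUDA" );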
