Overview

In this tutorial, we will implement Conway's Game of Life, the algorithm which made cellular automata popular on graphics processing units (GPUs). This tutorial runs on NVIDIA GPUs with CUDA but can also run on AMD GPUs using HIP. waLBerla fully supports both libraries. For a basic understanding of the GPU support in waLBerla please read Introduction to GPU Programming with waLBerla first.

This tutorial is an extension of Tutorial - Basics 3: Writing a Simple Cellular Automaton in waLBerla to GPUs.

Creating Fields

To run a simulation on a graphics processing unit (GPU), we have to allocate data on the GPU and write a kernel that operates on this data. In this tutorial we first allocate a field on the GPU and learn about functionality to transfer data between CPU and GPU fields.

Since initialization and output routines are usually not time critical, they are implemented for CPU fields only. In waLBerla we set up the complete simulation using CPU fields, copy the initialized fields over to the GPU, do the complete computation there, and, in the end, copy everything back to do the output from the CPU field. So only the time critical kernels have to be written for GPU.

Thus the setup code of the GPU GameOfLife program is very similar to its CPU version, which was implemented in a previous tutorial ( Tutorial - Basics 3: Writing a Simple Cellular Automaton in waLBerla ). One difference is, that fields which are often transferred from/to the GPU should be allocated with a different field allocator: gpu::HostFieldAllocator . This allocator uses gpuHostAlloc() instead of "new" , such that the memory is marked "pinned", which means that it is always held in RAM and cannot be swapped out to disk. Data transfer from pinned memory is faster than from normal memory. The usage of this allocator is not mandatory, the data transfer functions work (slightly slower) also with normally allocated fields.

auto hostFieldAllocator = make_shared< gpu::HostFieldAllocator<real_t> >();

BlockDataID const cpuFieldID =field::addToStorage< ScalarField >(blocks, "CPU Field", real_c(0.0), field::fzyx, uint_c(1), hostFieldAllocator);

walberla::field::fzyx

@ fzyx

Value-sorted data layout (f should be outermost loop)

Definition Layout.h:34

walberla::field::addToStorage

BlockDataID addToStorage(const shared_ptr< BlockStorage_T > &blocks, const std::string &identifier, const typename GhostLayerField_T::value_type &initValue=typename GhostLayerField_T::value_type(), const Layout layout=fzyx, const uint_t nrOfGhostLayers=uint_t(1), const bool alwaysInitialize=false, const std::function< void(GhostLayerField_T *field, IBlock *const block) > &initFunction=std::function< void(GhostLayerField_T *field, IBlock *const block) >(), const Set< SUID > &requiredSelectors=Set< SUID >::emptySet(), const Set< SUID > &incompatibleSelectors=Set< SUID >::emptySet())

Definition AddToStorage.h:151

walberla::uint_c

uint_t uint_c(T t)

cast to type uint_t using "uint_c(x)"

Definition DataTypes.h:140

walberla::real_c

real_t real_c(T t)

cast to type real_t using "real_c(x)"

Definition DataTypes.h:211

Now we initialize the CPU field just like in the previous tutorial Tutorial - Basics 3: Writing a Simple Cellular Automaton in waLBerla . Then two GPU fields are created: "source" and "destination" field. The helper function gpu::addGPUFieldToStorage() creates a gpu::GPUField field of the same size and layout of the given CPU field:

BlockDataID gpuFieldSrcID = gpu::addGPUFieldToStorage<ScalarField>( blocks, cpuFieldID, "GPU Field Src" );

BlockDataID gpuFieldDstID = gpu::addGPUFieldToStorage<ScalarField>( blocks, cpuFieldID, "GPU Field Dst" );

walberla::gpu::addGPUFieldToStorage

BlockDataID addGPUFieldToStorage(const shared_ptr< StructuredBlockStorage > &bs, const std::string &identifier, uint_t fSize, const Layout layout=fzyx, uint_t nrOfGhostLayers=1, bool usePitchedMem=true)

Adds a gpu::GPUField to a StructuredBlockStorage.

Definition AddGPUFieldToStorage.impl.h:70

The contents of the new GPU fields are initialized with the contents of the given CPU field.

Writing and calling CUDA kernels

After reading this page you should know what a FieldAccessor is and how to call GPU. So we can now start with writing a kernel for the Game of Life algorithm. We place this in a separate file with ".cu" extension (This is basically the only part that is different between CUDA and HIP). The build system then automatically detects that this file should be compiled with the CUDA C++ compiler.

The kernel gets two field accessors as arguments, one for the source and one for the destination field. Both accessors have to be configured using the variables blockIdx and threadIdx from the CUDA or HIP library, such that afterwards the get() and getNeighbor() functions of the accessor class can work correctly.

__global__ void gameOfLifeKernel( gpu::FieldAccessor<double> src, gpu::FieldAccessor<double> dst  )
{
   src.set( blockIdx, threadIdx );
   dst.set( blockIdx, threadIdx );
   int liveNeighbors = 0;
   if ( src.getNeighbor(  1, 0,0 ) > 0.5 ) ++liveNeighbors;
   if ( src.getNeighbor( -1, 0,0 ) > 0.5 ) ++liveNeighbors;
   // normal Game of Life algorithm ....
   // ...
}

To call this kernel we create a gpu::FieldIndexing object that receives a pointer to GPU fields. With this the blockDim and gridDim can be obtained as well as gpuAccess objects that contain the neighbouring information needed inside the GPU kernel. The kernel can be called normally with the three angle brackets.

auto srcCudaField = block->getData< gpu::GPUField<real_t> > ( gpuFieldSrcID_ );
auto dstCudaField = block->getData< gpu::GPUField<real_t> > ( gpuFieldDstID_ );
 
auto srcIndexing = gpu::FieldIndexing<real_t>::xyz( *srcCudaField );
auto dstIndexing = gpu::FieldIndexing<real_t>::xyz( *dstCudaField );
 
auto srcAccess = srcIndexing.gpuAccess();
auto dstAccess = dstIndexing.gpuAccess();
 
const dim3 gridDim = srcIndexing.gridDim();
const dim3 blockDim = srcIndexing.blockDim();
 
gameOfLifeKernel<<<gridDim, blockDim, 0, nullptr >>>(srcAccess, dstAccess );
 
srcCudaField->swapDataPointers( dstCudaField );

All the computations are done on the GPU. The CPU field is not updated automatically! It was just used for setup reasons.

VTK Output

To see if our kernel works, we create a VTK writer. The VTK writer works on the CPU field. Thus it works exactly as in other examples. However, since our data is on GPU we need a addBeforeFunction that copies our data from host to device. This is done using the gpu::fieldCpyFunctor. Note that copying data is costly and thus we don't want to do this in every timestep usually. In this example it is only done every second timestep.

const uint_t vtkWriteFrequency = 2;
if (vtkWriteFrequency > 0)
{
   auto vtkOutput = vtk::createVTKOutput_BlockData(*blocks, "vtk", vtkWriteFrequency);
   vtkOutput->addBeforeFunction(gpu::fieldCpyFunctor<ScalarField, GPUField >(blocks, cpuFieldID, gpuFieldDstID));
 
   auto dataWriter = make_shared< field::VTKWriter< ScalarField > >(cpuFieldID, "output");
   vtkOutput->addCellDataWriter(dataWriter);
   timeloop.addFuncBeforeTimeStep(vtk::writeFiles(vtkOutput), "VTK Output");
}

Communication

For this tutorial we use the gpu::communication::UniformGPUScheme that first collects all data in a buffer and sends only one message per communication step and neighbor. For the PackInfo we use the gpu::communication::MemcpyPackInfo. It receives a buffer located on the GPU and fills it using memcpy operations. If the GPU library is built with MPI support this buffer can be sent to other GPUs without a copy to the CPU. Otherwise the copying will be done in the back by the communication class.

using CommScheme = gpu::communication::UniformGPUScheme<stencil::D2Q9 > ;
using Packing = gpu::communication::MemcpyPackInfo<GPUField> ;
const bool sendDirectlyFromGPU = false;
CommScheme commScheme(blocks, sendDirectlyFromGPU);
commScheme.addPackInfo( make_shared<Packing>(gpuFieldSrcID) );

Running the simulation

To run the simulation we would like to point out a few common pitfalls to avoid. Basically it works very similar than the CPU equivalent. Since all Sweeps and Function calls are registered by the timeloop we can run the simulation using timeloop.run();. However, it is important to point out that kernel calls are asynchronous. Thus for time measurement purpose we need to make sure that all kernels are executed before stopping the timer. This can be done using gpuDeviceSynchronize. For good measure we also run this function right before starting the timer.

WcTimer simTimer;
WALBERLA_GPU_CHECK( gpuDeviceSynchronize() )
simTimer.start();
timeloop.run();
WALBERLA_GPU_CHECK( gpuDeviceSynchronize() )
simTimer.end();
WALBERLA_LOG_INFO_ON_ROOT("Simulation finished")
auto time      = real_c(simTimer.last());
WALBERLA_LOG_RESULT_ON_ROOT("Game of life tutorial finished. Elapsed time " << time)