Overview of waLBerla CUDA support

waLBerla CUDA concepts

# Fields on GPU

## Creating GPU fields and copy them between host and device

// create a CPU field and a GPU field of same size and with same layout
GhostLayerField<double,4> h_f ( 16,20,30, 1, 42.0, field::fzyx );
cuda::GPUField<double> d_f ( 16,20,30, 4, 1, field::fzyx );
cuda::fieldCpy( d_f, h_f ); // copy from host to device
some_kernel_wrapper( d_f ); // run some kernel
cuda::fieldCpy( h_f, d_f ); // copy field data back to host

Similarities and Differences of CPU and GPU field

• cuda::GPUField corresponds to field::GhostLayerField
• fSize is a template parameter for CPU fields and a normal parameter for GPUFields
• CPU field iterators correspond to FieldAccessors (see next section)

## Writing CUDA kernels operating on GPUFields

Accessing fields in CUDA kernels

When writing a kernel that operates on a field, the first task is to distribute the data to CUDA threads and blocks. We need a function $(blockIdx, threadIdx) \rightarrow (x,y,z)$ or $(blockIdx, threadIdx) \rightarrow (x,y,z,f)$. The optimal mapping depends on many parameters: for example which layout the field has, the extends of each coordinate, hardware parameters like warp-size, etc. Thus this indexing function is abstracted. A few indexing strategies are already implemented which can be substituted by custom strategies. A indexing strategy consists of two classes: and somewhat complex Indexing class, which manages the indexing on the host-side and a lightweight Accessor class, which is passed to the CUDA kernel.

An indexing scheme is very similar to the iterator concept, it defines the bounds of the iteration, which is not necessarily the complete field but could also be a certain sub-block, for example the ghost layer in a certain direction.

Lets start to write a simple kernel that doubles all values stored in a field:

__global__ void kernel_double( cuda::FieldAccessor<double> f )
{
f.set( blockIdx, threadIdx );
f.get() *= 2.0;
}

We do not have to care about indexing, the cuda::FieldAccessor takes care of that. So this is a generic kernel that operates on double fields. Using the cuda::FieldAccessor the current and neighboring values can be accessed and manipulated.

This kernel can be called like this:

cuda::FieldIndexing<double> indexing = cuda::FieldIndexing<double>::sliceBeforeGhostLayerXYZ( field, 1, stencil::E, true );
kernel_double<<< iter.gridDim(), iter.blockDim() >>> ( iter.gpuAccess() );

In the example above we only iterate over a slice of the field. Of course we can also iterate over the complete field, there are various static member functions in a Indexing class to create certain iteration patterns. The Indexing class encapsulates the information of how to launch the kernel (blockDim and gridDim) and holds the Accessor class that is passed to the kernel.

Two indexing strategies are currently provided:

# Calling CUDA kernels from CPP files

Wrapper class around a CUDA kernel, to call kernels also from code not compiled with nvcc. Example:

// Declaration of kernel, implementation has to be in a file compiled with nvcc
void kernel_func ( double * inputData, int size );
auto kernel = make_kernel( kernel_func );
kernel.addParam<double*> ( argument1 );
kernel.addParam<int> ( 20 );
kernel.configure( dim3( 3,3,3), dim3( 4,4,4) );
kernel();
// this code is equivalent to:
kernel_func<<< dim3( 3,3,3), dim3( 4,4,4) >> ( argument1, 20 );

Why use this strange wrapper class instead of the nice kernel call syntax "<<<griddim, blockdim >>>" ??

• This syntax is nice but has to be compiled with nvcc, which does not (yet) understand C++11
• C++11 features are used all over the place in waLBerla code
• all *.cu files and headers included in *.cu files have to be "C++11 free"
• thus there should be as few code as possible in *.cu files

Drawbacks of this class compared to kernel call syntax: Type checking of parameters can only be done at runtime (is done only in Debug mode!). Consider the following example:

// Declaration of kernel, implementation has to be in a file compiled with nvcc
void kernel_func ( double * inputData, int size );
auto kernel = make_kernel( kernel_func );
kernel.addParam<float*> ( argument1 );
kernel.addParam<unsigned int> ( 40 );
kernel.configure( dim3( 3,3,3), dim3( 4,4,4) );
kernel();
// this code is equivalent to:
kernel_func<<< dim3( 3,3,3), dim3( 4,4,4) >> ( argument1, 20 );

The parameter types of the kernel and the parameters added at the cuda::Kernel class do not match. This is only detected when the code is run and was compiled in DEBUG mode!

Advantages of this class compared to kernel call syntax: Integrates nicely with waLBerlas field indexing and accessor concepts:

void kernel_func( cuda::SimpleFieldAccessor<double> f );
auto myKernel = cuda::make_kernel( &kernel_double );
myKernel.addFieldIndexingParam( cuda::SimpleFieldIndexing<double>::xyz( gpuField ) );
myKernel();

When using at least one FieldIndexingParameter configure() does not have to be called, since the thread and grid setup is done by the indexing scheme. If two FieldIndexingParameters are passed, the two indexing schemes have to be consistent.

static FieldIndexing< T > sliceBeforeGhostLayerXYZ(const GPUField< T > &f, uint_t thickness, stencil::Direction dir, bool fullSlice=false)
Definition: FieldIndexing.impl.h:176
void fieldCpy(const shared_ptr< StructuredBlockStorage > &blocks, BlockDataID dstID, ConstBlockDataID srcID)
Definition: FieldCopy.h:41
@ fzyx
Value-sorted data layout (f should be outermost loop)
Definition: Layout.h:34
Kernel< FuncPtr > make_kernel(FuncPtr funcPtr)
Definition: Kernel.h:160
@ E
East.
Definition: Directions.h:49