waLBerla CUDA concepts
Similarities and Differences of CPU and GPU Fields
When writing a kernel that operates on a field, the first task is to distribute the data to CUDA threads and blocks. We need a function $(blockIdx, threadIdx) \rightarrow (x,y,z)$ or $(blockIdx, threadIdx) \rightarrow (x,y,z,f)$. The optimal mapping depends on many parameters: the layout of the field, the extents of each coordinate, hardware parameters like the warp size, etc. Thus this indexing function is abstracted. A few indexing strategies are already implemented and can be substituted by custom strategies. An indexing strategy consists of two classes: a somewhat complex Indexing class, which manages the indexing on the host side, and a lightweight Accessor class, which is passed to the CUDA kernel.
An indexing scheme is very similar to the iterator concept: it defines the bounds of the iteration, which need not be the complete field but can also be a certain sub-block, for example the ghost layer in a certain direction.
Let's start by writing a simple kernel that doubles all values stored in a field:
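A minimal sketch of such a kernel is shown below; it assumes the `set()`/`get()` interface of cuda::FieldAccessor from the waLBerla cuda module:

```cpp
#include "cuda/FieldAccessor.h"

namespace walberla {

// Doubles every value the accessor iterates over. The accessor is
// configured on the host side by an Indexing class; inside the kernel
// we only translate (blockIdx, threadIdx) into a field position and
// then work on the resulting cell.
__global__ void kernel_double( cuda::FieldAccessor<double> f )
{
   f.set( blockIdx, threadIdx );  // map CUDA coordinates to (x,y,z,f)
   f.get() *= 2.0;                // read and modify the current value
}

} // namespace walberla
```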
We do not have to care about indexing here; the cuda::FieldAccessor takes care of that. So this is a generic kernel that operates on double fields. Using the cuda::FieldAccessor, the current and neighboring values can be accessed and manipulated.
This kernel can be called like this:
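The following sketch shows one way to launch the kernel over a one-cell-thick slice in front of the eastern ghost layer; the names `sliceBeforeGhostLayerXYZ`, `gridDim()`, `blockDim()` and `gpuAccess()` follow the waLBerla cuda module, but the exact signatures should be checked against the headers:

```cpp
#include "cuda/FieldIndexing.h"
#include "cuda/GPUField.h"
#include "stencil/Directions.h"

using namespace walberla;

void doubleSlice( cuda::GPUField<double> & gpuField )
{
   // Iterate only over the slice directly before the ghost layer
   // in direction stencil::E, one layer thick:
   auto indexing = cuda::FieldIndexing<double>::sliceBeforeGhostLayerXYZ(
                        gpuField, 1, stencil::E, /*fullSlice=*/ true );

   // The Indexing object knows how to launch the kernel ...
   kernel_double<<< indexing.gridDim(), indexing.blockDim() >>>(
        indexing.gpuAccess() );  // ... and provides the Accessor for it
}
```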
In the example above we only iterate over a slice of the field. Of course we can also iterate over the complete field; an Indexing class provides various static member functions to create certain iteration patterns. The Indexing class encapsulates the information of how to launch the kernel (blockDim and gridDim) and holds the Accessor class that is passed to the kernel.
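As an illustration, some of these static creation functions might be used as follows; the function names are taken from the waLBerla cuda module, but treat the parameter lists as assumptions and consult cuda/FieldIndexing.h for the authoritative signatures:

```cpp
// Iterate over the interior of the field (no ghost layers):
auto inner  = cuda::FieldIndexing<double>::xyz( gpuField );

// Iterate over the interior plus the ghost layers:
auto withGl = cuda::FieldIndexing<double>::withGhostLayerXYZ( gpuField, true );

// Iterate only over the ghost layer in a given direction:
auto ghost  = cuda::FieldIndexing<double>::ghostLayerOnlyXYZ(
                   gpuField, 1, stencil::W, /*fullSlice=*/ true );
```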
Two indexing strategies are currently provided:
cuda::Kernel is a wrapper class around a CUDA kernel that makes it possible to call kernels also from code not compiled with nvcc. Example:
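A sketch of the wrapper in use, assuming the `make_kernel` / `addParam` / `configure` interface of the waLBerla cuda module (`dataPointer` is a hypothetical device pointer):

```cpp
#include "cuda/Kernel.h"

// Declaration of the kernel - its definition has to live in a file
// compiled with nvcc, but this calling code can be ordinary C++:
void kernel_func( double * inputData, int size );

void runKernel( double * dataPointer )
{
   auto kernel = cuda::make_kernel( &kernel_func );
   kernel.addParam<double*>( dataPointer );
   kernel.addParam<int>    ( 20 );
   kernel.configure( dim3( 3, 3, 3 ), dim3( 4, 4, 4 ) );
   kernel();
   // roughly equivalent to the nvcc-only call syntax:
   //   kernel_func<<< dim3(3,3,3), dim3(4,4,4) >>>( dataPointer, 20 );
}
```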
Why use this wrapper class instead of the kernel call syntax `<<< gridDim, blockDim >>>`?
Drawbacks of this class compared to the kernel call syntax: type checking of the parameters can only be done at runtime (and is done only in Debug mode!). Consider the following example:
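A sketch of such a mismatch, using the assumed `make_kernel`/`addParam` interface from above (`floatPointer` is a hypothetical device pointer of the wrong type):

```cpp
#include "cuda/Kernel.h"

// The kernel expects a double*:
void kernel_func( double * inputData );

void buggyCall( float * floatPointer )
{
   auto kernel = cuda::make_kernel( &kernel_func );
   kernel.addParam<float*>( floatPointer );  // wrong: float* instead of double*
   kernel();  // compiles fine; only a Debug build reports the mismatch at runtime
}
```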
The parameter types of the kernel and the parameters added to the cuda::Kernel class do not match. This is only detected when the code is run, and only if it was compiled in Debug mode!
Advantages of this class compared to the kernel call syntax: it integrates nicely with waLBerla's field indexing and accessor concepts:
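Combining the wrapper with a field indexing scheme might look like this; the method name `addFieldIndexingParam` follows the waLBerla cuda module, and `kernel_double`/`gpuField` refer to the earlier examples:

```cpp
#include "cuda/Kernel.h"
#include "cuda/FieldIndexing.h"
#include "stencil/Directions.h"

void doubleSliceWrapped( cuda::GPUField<double> & gpuField )
{
   auto myKernel = cuda::make_kernel( &kernel_double );
   myKernel.addFieldIndexingParam(
        cuda::FieldIndexing<double>::sliceBeforeGhostLayerXYZ(
             gpuField, 1, stencil::E, /*fullSlice=*/ true ) );
   // No configure() call needed: grid and block dimensions
   // are taken from the indexing scheme.
   myKernel();
}
```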
When at least one field indexing parameter is used, configure() does not have to be called, since the thread and grid setup is done by the indexing scheme. If two field indexing parameters are passed, the two indexing schemes have to be consistent.