This tutorial demonstrates how to use pystencils and lbmpy to generate highly optimised and hardware-specific Lattice Boltzmann simulation code within the waLBerla framework. In contrast to Tutorial - Code Generation 2: Lattice Model Generation with lbmpy, we will be generating a full LBM sweep instead of a lattice model class. Furthermore, we will generate a communication pack info class and a sweep to initialise the PDF field. A hardware-specific implementation of a NoSlip boundary handler will also be generated. These components will then be combined in a waLBerla application to simulate the same shear flow scenario as in the previous tutorial.
For large-scale LB simulations, the highly parallel design of a general-purpose graphics processing unit (GPGPU) can yield significant improvements in performance. The waLBerla framework relies on CUDA to run simulations on NVIDIA GPUs. In this tutorial, we will also show how code generation can be used to generate native CUDA code for different kinds of kernels.
In this tutorial, we will be using the more advanced cumulant-based multiple-relaxation-time (MRT) collision operator. Instead of relaxing the distribution functions as a whole toward their equilibrium values, their cumulants are relaxed with individual relaxation rates. We will also use the D2Q9 velocity set. For this velocity set, the zeroth- and first-order cumulants correspond to density and momentum, which are conserved during collisions, so their relaxation rates can be set to zero. We will specify one common relaxation rate \( \omega \) for the three second-order cumulants to ensure the correct viscosity of the fluid; the higher-order cumulants will be set to their equilibrium values, which corresponds to a relaxation rate of 1.
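Written out, the collision step relaxes each cumulant \( C_i \) independently toward its equilibrium value \( C_i^{\mathrm{eq}} \):

\[
C_i^{\ast} = C_i + \omega_i \left( C_i^{\mathrm{eq}} - C_i \right),
\qquad
\omega_i =
\begin{cases}
0 & \text{zeroth- and first-order cumulants (conserved)} \\
\omega & \text{second-order cumulants} \\
1 & \text{higher-order cumulants.}
\end{cases}
\]

Setting \( \omega_i = 1 \) replaces a cumulant directly by its equilibrium value, which is why fixing the higher-order cumulants to equilibrium corresponds to a relaxation rate of 1.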
We start by defining some general parameters as well as SymPy symbols and pystencils fields for later usage. Those include the stencil (D2Q9), the memory layout (fzyx, see Tutorial - Code Generation 1: Explicit Solver for the Heat Equation) and the symbol \( \omega \) for the relaxation rate of the second-order cumulants.
For the stream-pull-collide type kernel, we need two PDF fields, which we set up using pystencils. We will use the built-in field swapping functionality of generate_sweep: it generates code that swaps the pointers of the temporary destination field and the source field after each time step. This requires us to obey a specific naming convention: the temporary field must have the same field name (passed as a string to ps.fields) as the original field, but with the suffix _tmp. Otherwise, an error will occur during compilation.
For VTK output and the initial velocity setup, we define a velocity vector field as an output field for the LB method.
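A minimal sketch of these definitions is shown below. The field names pdfs, pdfs_tmp and velocity are illustrative choices, and the LBStencil/Stencil API assumes a recent lbmpy version (older releases provide get_stencil instead):

```python
import sympy as sp
import pystencils as ps
from lbmpy import LBStencil, Stencil

stencil = LBStencil(Stencil.D2Q9)   # D2Q9 velocity set
omega = sp.Symbol('omega')          # relaxation rate of the second-order cumulants
layout = 'fzyx'

# Source PDF field and temporary destination field; the '_tmp' suffix is
# mandatory for generate_sweep's built-in field swapping.
pdfs, pdfs_tmp = ps.fields(f"pdfs({stencil.Q}), pdfs_tmp({stencil.Q}): double[2D]",
                           layout=layout)

# Velocity vector field, used as LB output field and for the initial setup.
velocity = ps.fields(f"velocity({stencil.D}): double[2D]", layout=layout)
```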
We set up the cumulant-based MRT method with relaxation rates as described above. We use create_lb_update_rule from lbmpy to derive the set of equations describing the collision operator together with the pull streaming pattern. These equations define the entire LBM sweep.
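The following sketch shows how the method and update rule could be configured, assuming lbmpy's LBMConfig/LBMOptimisation API (older lbmpy versions pass these options as plain keyword arguments to create_lb_update_rule):

```python
from lbmpy import LBMConfig, LBMOptimisation, Method
from lbmpy.creationfunctions import create_lb_update_rule

lbm_config = LBMConfig(
    stencil=stencil,
    method=Method.CUMULANT,         # cumulant-based MRT collision operator
    relaxation_rate=omega,          # common rate for the second-order cumulants;
                                    # higher-order cumulants default to equilibrium
    compressible=True,
    output={'velocity': velocity},  # write the macroscopic velocity to the output field
)

lbm_opt = LBMOptimisation(
    symbolic_field=pdfs,
    symbolic_temporary_field=pdfs_tmp,
    field_layout='fzyx',
)

# Collision operator combined with the (default) pull streaming pattern
update_rule = create_lb_update_rule(lbm_config=lbm_config, lbm_optimisation=lbm_opt)
```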
In Tutorial - Code Generation 2: Lattice Model Generation with lbmpy, we were able to use the framework built around the waLBerla lattice model template API for setting up the shear flow's initial velocity profile. Since we are not using a lattice model class this time, this API is not available to us. With lbmpy, though, we can generate a kernel which takes in scalar values or fields for the initial density and velocity and sets the initial PDF values to the corresponding equilibrium. The function macroscopic_values_setter from lbmpy.macroscopic_value_kernels returns a set of assignments for this initialisation procedure. It takes the LB method definition as an argument, as well as either symbols or pystencils field accesses for the initial density rho and the initial velocity. Lastly, it takes the PDF field's centre vector as the destination for the PDF values. We define a separate symbol for the density and use the velocity field defined above.
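A sketch of this initialisation kernel's definition, reusing the names from above and assuming the update rule exposes the derived LB method via its method attribute:

```python
import sympy as sp
from lbmpy.macroscopic_value_kernels import macroscopic_values_setter

rho = sp.Symbol('rho_0')  # separate symbol for the initial density

pdfs_setter = macroscopic_values_setter(
    update_rule.method,                # the LB method definition derived above
    density=rho,                       # scalar symbol for the initial density
    velocity=velocity.center_vector,   # initial velocity taken from the field
    pdfs=pdfs.center_vector,           # destination: the PDF field's centre vector
)
```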
Everything is now prepared to generate the actual C++ code. We create the code generation context and evaluate the ctx.cuda flag to find out whether waLBerla is configured to build GPU code. If CUDA is enabled, we set the code generation target to gpu; otherwise to cpu. The target is then passed to all code generation functions. If GPU code is to be generated, the generated classes will be implemented in *.cu files, and their sweeps will run on the GPU.
Several functions from pystencils_walberla and lbmpy_walberla are called to generate the classes (a sketch of the complete generation script follows this list):

- The LBM sweep is generated from the update rule's equations using generate_sweep. This function takes an additional parameter field_swaps, which takes a list of pairs of fields to generate the pointer swapping.
- The communication pack info is generated using generate_pack_info_from_kernel, which infers from the update rule's write accesses the PDF indices that need to be communicated. Without further specification, it assumes a pull-type kernel.
- The PDF initialisation kernel is generated from the pdfs_setter assignment collection, again using generate_sweep.
- Using generate_boundary, we generate an optimised implementation of a NoSlip boundary handler for the domain's walls.
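Putting everything together, the end of the generation script might look as follows. The class names (CumulantMRTSweep and so on) are illustrative; depending on the pystencils version, the target is either the string 'cpu'/'gpu' or a ps.Target enum value:

```python
from pystencils_walberla import (CodeGeneration, generate_sweep,
                                 generate_pack_info_from_kernel)
from lbmpy_walberla import generate_boundary
from lbmpy.boundaries import NoSlip

with CodeGeneration() as ctx:
    # Select the hardware target depending on the build configuration
    target = 'gpu' if ctx.cuda else 'cpu'

    # LBM sweep with automatic field swapping after each time step
    generate_sweep(ctx, 'CumulantMRTSweep', update_rule,
                   field_swaps=[(pdfs, pdfs_tmp)], target=target)

    # Pack info for MPI communication, inferred from the update rule
    generate_pack_info_from_kernel(ctx, 'CumulantMRTPackInfo', update_rule,
                                   target=target)

    # Kernel setting the initial PDF values to the equilibrium
    generate_sweep(ctx, 'InitialPDFsSetter', pdfs_setter, target=target)

    # Hardware-specific NoSlip boundary handler
    generate_boundary(ctx, 'CumulantMRTNoSlip', NoSlip(), update_rule.method,
                      target=target)
```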
As in Tutorial - Code Generation 2: Lattice Model Generation with lbmpy, the classes generated by the above code need to be registered with CMake using the walberla_generate_target_from_python macro. Since the source file extension differs if CUDA code is generated (*.cu instead of *.cpp), the code generation target needs to be added twice, and the correct target is selected during the build process through the surrounding if(WALBERLA_BUILD_WITH_CUDA) block. Furthermore, the application depends on waLBerla's cuda module.
We will now integrate the generated classes into a waLBerla application. After adding the code generation target as a CMake dependency, we can include their header files:
We set up typedef aliases for the generated pack info and the D2Q9 stencil. For the PDF and velocity fields, we use instances of the field::GhostLayerField template. The number of entries of the PDF field is specified by the Stencil_T::Size parameter. As our domain is two-dimensional, the velocity at each lattice node is a two-dimensional vector; we therefore give the velocity field two entries per cell by passing the stencil's dimension as a template parameter. Finally, we also define a typedef alias for our generated NoSlip boundary.
The PDF field, the velocity field and the flag field, which stores information about the cell types for boundary handling, are set up using field::addToStorage. The memory layout fzyx is explicitly set for the PDF and velocity fields. We set initial values of 0.0 for both of them, since they will be explicitly initialised later on.
The generated pack info class can be used in exactly the same way as a waLBerla-native pack info. It simply needs to be registered with the communication scheme by calling addPackInfo.
Upon construction, the generated sweep class requires the PDF field, the velocity field and the relaxation rate omega, which is read from a parameter file. After creation, the sweep functor can be passed directly to the timeloop.
The generated NoSlip boundary works independently of the native waLBerla LBM boundary handling classes, which are templated for use with lattice models. Still, we use the functions exposed by the geometry module to initialise the flag field from the parameter file. The NoSlip boundary handler is constructed by passing the block grid and the PDF field's identifier. After the flag field has been set up, the fillFromFlagField method is called to register the boundary cells with the boundary handler.
The generated boundary handler holds an internal list of tuples (x, y, dir), the so-called index vector. Each entry corresponds to a fluid cell which is a direct neighbour of a boundary cell; for each of these fluid cells, its position and the direction pointing towards the boundary cell are saved. In our scenario, the north and south domain borders are marked as NoSlip boundaries. The boundary cells are located in the northern and southern ghost layers, so the top and bottom rows of fluid cells are neighbours to these boundaries. Thus, for every cell of the bottom row, the tuples (x, 0, 2), (x, 0, 7) and (x, 0, 8) are saved in the index vector. The indices 2, 7 and 8 correspond to the South, South-East and South-West directions of the D2Q9 lattice, respectively. The boundary sweep then runs over this index vector, reflecting the outgoing populations in the stored directions off the wall according to the NoSlip rule.
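Conceptually, the bottom-row part of the index vector can be pictured as follows (a Python illustration of the stored tuples, with nx as a hypothetical number of cells in the \( x \)-direction):

```python
nx = 10  # illustrative domain width in fluid cells

# One entry per bottom-row fluid cell and per direction pointing into the
# southern boundary: 2 = S, 7 = SE, 8 = SW on the D2Q9 lattice.
index_vector = [(x, 0, d) for x in range(nx) for d in (2, 7, 8)]
```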
The NoSlip sweep can now be added to the timeloop before the LBM sweep. We also register the communication functor as a BeforeFunction of the boundary sweep, since it needs to be run first in every time step.
Setting up the shear flow's initial velocity profile consists of two parts. First, the velocity field is initialised to the shear flow. Second, the prepared velocity field is passed to the generated PDF field setter, which then initialises the equilibrium PDF values from the velocities. The density is uniformly set to unity.
The initShearFlowVelocityField function takes a pointer to the block storage, the velocity field's identifier and a handle to a parameter file block. It then reads the magnitudes of the velocity's \( x \)-component and the perturbation in the \( y \)-direction. For the random noise, a random number generator is created with a seed specified in the parameter file. The function then iterates over all blocks (distributed to MPI processes) and cells, setting each cell's velocity according to its global position.
After the velocity field has been initialised, the generated InitialPDFsSetter sweep is created and run once on each block. After this step, the PDF field will be initialised with the correct equilibrium distributions for the first iteration.
The simulation is now ready to run.
If CUDA is enabled, some implementation details need to differ from a CPU-only version. This mainly concerns the creation and management of fields, MPI communication and VTK output. Since the initialisation, LBM and NoSlip sweeps run entirely on the GPU, the PDF field has to be set up only in graphics memory. The velocity field, in contrast, is required by both the CPU and the GPU: the shear flow velocity profile is constructed by CPU code before the initialisation kernel maps it onto the PDF field on the GPU, and the VTK output routines, which run on the CPU, also need to read it. It therefore needs to be created twice: once in main memory and once in GPU memory. It is then copied on demand from the GPU to the CPU. Furthermore, we create a flag field, which is needed only on the CPU. After the initialisation, we use it to create the index vectors for the boundary handling; only these index vectors, not the entire flag field, are then transferred to the GPU.
For the most part, though, the C++ code is identical. The code snippets presented above represent only the CPU variant of the code. The GPU implementation can be found in the source file 03_AdvancedLBMCodegen.cpp, where code blocks which differ between the CPU and the GPU implementation are toggled via preprocessor conditionals.
We have now successfully implemented a waLBerla LBM simulation application with an advanced collision operator, which can be specialised for both CPU and GPU hardware targets. Other possible extensions would be the use of advanced streaming patterns like the AA-pattern or EsoTwist to reduce the simulation's memory footprint. Also, lbmpy gives us the tools to develop advanced lattice Boltzmann methods for many kinds of applications. Thus, the basic principles demonstrated in these tutorials can be used for creating more complicated simulations with optimised lattice Boltzmann code.