This tutorial demonstrates how to use pystencils and lbmpy to generate highly optimised and hardware-specific Lattice Boltzmann simulation code within the waLBerla framework. In contrast to Tutorial - Code Generation 2: Lattice Model Generation with lbmpy, we will be generating a full LBM sweep instead of a lattice model class. Furthermore, we will generate a communication pack info class and a sweep to initialise the PDF field. A hardware-specific implementation of a NoSlip boundary handler will also be generated. These components will then be combined in a waLBerla application to simulate the same shear flow scenario as in the previous tutorial.
For large-scale LB simulations, the highly parallel design of a general-purpose graphics processing unit (GPGPU) can yield significant improvements in performance. The waLBerla framework relies on CUDA to run simulations on NVIDIA GPUs. In this tutorial, we will also show how code generation can be used to generate native CUDA code for different kinds of kernels.
In this tutorial, we will be using the more advanced cumulant-based multiple-relaxation-time (MRT) collision operator. Instead of relaxing the distribution functions as a whole toward their equilibrium values, their cumulants are relaxed with individual relaxation rates. We will also use the D2Q9 velocity set. For this velocity set, the zeroth- and first-order cumulants correspond to density and momentum, which are conserved during collisions, so their relaxation rates can be set to zero. We will specify one common relaxation rate \( \omega \) for the three second-order cumulants to ensure the correct viscosity of the fluid; the higher-order cumulants will be set to their equilibrium values, which corresponds to a relaxation rate of 1.
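Written out, the collision step relaxes each cumulant \( C_i \) independently toward its equilibrium value \( C_i^{\mathrm{eq}} \):

\[
C_i^{\ast} = C_i + \omega_i \left( C_i^{\mathrm{eq}} - C_i \right),
\qquad
\omega_i =
\begin{cases}
0 & \text{zeroth- and first-order cumulants (conserved)} \\
\omega & \text{second-order cumulants} \\
1 & \text{higher-order cumulants.}
\end{cases}
\]

Setting \( \omega_i = 1 \) replaces a cumulant directly by its equilibrium value, which is why fixing the higher-order cumulants to equilibrium corresponds to a relaxation rate of 1.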
We start by defining some general parameters as well as SymPy symbols and pystencils fields for later usage. Those include the stencil (D2Q9), the memory layout (fzyx, see Tutorial - Code Generation 1: Explicit Solver for the Heat Equation) and the symbol \( \omega \) for the relaxation rate of the second-order cumulants.
For the stream-pull-collide type kernel, we need two PDF fields, which we set up using pystencils. We will use the built-in field swapping functionality of generate_sweep: it generates code that swaps the pointers of the temporary destination field and the source field after each time step. This requires us to obey a specific naming convention: the temporary field must have the same field name (passed as a string to ps.fields) as the original field, but with the suffix _tmp. Otherwise, an error will occur during compilation.
For VTK output and the initial velocity setup, we define a velocity vector field as an output field for the LB method.
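A minimal sketch of these definitions is shown below. The field names pdfs, pdfs_tmp and velocity are illustrative choices, and the LBStencil/Stencil API assumes a recent lbmpy version (older releases provide get_stencil instead):

```python
import sympy as sp
import pystencils as ps
from lbmpy import LBStencil, Stencil

stencil = LBStencil(Stencil.D2Q9)   # D2Q9 velocity set
omega = sp.Symbol('omega')          # relaxation rate of the second-order cumulants
layout = 'fzyx'

# Source PDF field and temporary destination field; the '_tmp' suffix is
# mandatory for generate_sweep's built-in field swapping.
pdfs, pdfs_tmp = ps.fields(f"pdfs({stencil.Q}), pdfs_tmp({stencil.Q}): double[2D]",
                           layout=layout)

# Velocity vector field, used as LB output field and for the initial setup.
velocity = ps.fields(f"velocity({stencil.D}): double[2D]", layout=layout)
```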
We set up the cumulant-based MRT method with relaxation rates as described above. We use create_lb_update_rule from lbmpy to derive the set of equations describing the collision operator together with the pull streaming pattern. These equations define the entire LBM sweep.
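The following sketch shows how the method and update rule could be configured, assuming lbmpy's LBMConfig/LBMOptimisation API (older lbmpy versions pass these options as plain keyword arguments to create_lb_update_rule):

```python
from lbmpy import LBMConfig, LBMOptimisation, Method
from lbmpy.creationfunctions import create_lb_update_rule

lbm_config = LBMConfig(
    stencil=stencil,
    method=Method.CUMULANT,         # cumulant-based MRT collision operator
    relaxation_rate=omega,          # common rate for the second-order cumulants;
                                    # higher-order cumulants default to equilibrium
    compressible=True,
    output={'velocity': velocity},  # write the macroscopic velocity to the output field
)

lbm_opt = LBMOptimisation(
    symbolic_field=pdfs,
    symbolic_temporary_field=pdfs_tmp,
    field_layout='fzyx',
)

# Collision operator combined with the (default) pull streaming pattern
update_rule = create_lb_update_rule(lbm_config=lbm_config, lbm_optimisation=lbm_opt)
```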
In Tutorial - Code Generation 2: Lattice Model Generation with lbmpy, we were able to use the framework built around the waLBerla lattice model template API for setting up the shear flow's initial velocity profile. Since we are not using a lattice model class this time, this API is not available to us. With lbmpy, though, we can generate a kernel which takes in scalar values or fields for the initial density and velocity and sets the initial PDF values to the corresponding equilibrium. The function macroscopic_values_setter from lbmpy.macroscopic_value_kernels returns a set of assignments for this initialisation procedure. It takes the LB method definition as an argument, as well as either symbols or pystencils field accesses for the initial density rho and the initial velocity. Lastly, it takes the PDF field's centre vector as the destination for the PDF values. We define a separate symbol for the density and use the velocity field defined above.
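A sketch of this initialisation kernel's definition, reusing the names from above and assuming the update rule exposes the derived LB method via its method attribute:

```python
import sympy as sp
from lbmpy.macroscopic_value_kernels import macroscopic_values_setter

rho = sp.Symbol('rho_0')  # separate symbol for the initial density

pdfs_setter = macroscopic_values_setter(
    update_rule.method,                # the LB method definition derived above
    density=rho,                       # scalar symbol for the initial density
    velocity=velocity.center_vector,   # initial velocity taken from the field
    pdfs=pdfs.center_vector,           # destination: the PDF field's centre vector
)
```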
Everything is now prepared to generate the actual C++ code. We create the code generation context and evaluate the ctx.cuda flag to find out whether waLBerla is configured to build GPU code. If CUDA is enabled, we set the code generation target to gpu; otherwise to cpu. The target is then passed to all code generation functions. If GPU code is to be generated, the generated classes will be implemented in *.cu files, and their sweeps will run on the GPU.
Several functions from pystencils_walberla and lbmpy_walberla are called to generate the classes (a sketch of the complete generation script follows this list):

- The LBM sweep is generated from the update rule's equations using generate_sweep. This function takes an additional parameter field_swaps, which takes a list of pairs of fields to generate the pointer swapping.
- The communication pack info is generated using generate_pack_info_from_kernel, which infers from the update rule's write accesses the PDF indices that need to be communicated. Without further specification, it assumes a pull-type kernel.
- The PDF initialisation kernel is generated from the pdfs_setter assignment collection, again using generate_sweep.
- Using generate_boundary, we generate an optimised implementation of a NoSlip boundary handler for the domain's walls.
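Putting everything together, the end of the generation script might look as follows. The class names (CumulantMRTSweep and so on) are illustrative; depending on the pystencils version, the target is either the string 'cpu'/'gpu' or a ps.Target enum value:

```python
from pystencils_walberla import (CodeGeneration, generate_sweep,
                                 generate_pack_info_from_kernel)
from lbmpy_walberla import generate_boundary
from lbmpy.boundaries import NoSlip

with CodeGeneration() as ctx:
    # Select the hardware target depending on the build configuration
    target = 'gpu' if ctx.cuda else 'cpu'

    # LBM sweep with automatic field swapping after each time step
    generate_sweep(ctx, 'CumulantMRTSweep', update_rule,
                   field_swaps=[(pdfs, pdfs_tmp)], target=target)

    # Pack info for MPI communication, inferred from the update rule
    generate_pack_info_from_kernel(ctx, 'CumulantMRTPackInfo', update_rule,
                                   target=target)

    # Kernel setting the initial PDF values to the equilibrium
    generate_sweep(ctx, 'InitialPDFsSetter', pdfs_setter, target=target)

    # Hardware-specific NoSlip boundary handler
    generate_boundary(ctx, 'CumulantMRTNoSlip', NoSlip(), update_rule.method,
                      target=target)
```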
As in Tutorial - Code Generation 2: Lattice Model Generation with lbmpy, the classes generated by the above code need to be registered with CMake using the walberla_generate_target_from_python macro. Since the source file extension differs if CUDA code is generated (*.cu instead of *.cpp), the code generation target needs to be added twice, and the correct target is selected during the build process through the surrounding if(WALBERLA_BUILD_WITH_CUDA) block. Furthermore, the application depends on waLBerla's cuda module.
We will now integrate the generated classes into a waLBerla application. After adding the code generation target as a CMake dependency, we can include their header files:
We set up typedef aliases for the generated pack info and the D2Q9 stencil. For the PDF and velocity fields, we use instances of the field::GhostLayerField template. The number of entries of the PDF field is specified by the Stencil_T::Size parameter. As our domain is two-dimensional, the velocity at each lattice node is a two-dimensional vector; we therefore give the velocity field two entries per cell by passing the stencil's dimension as a template parameter. Finally, we also define a typedef alias for our generated NoSlip boundary.
The PDF field, the velocity field and the flag field, which stores information about the cell types for boundary handling, are set up using field::addToStorage. The memory layout fzyx is explicitly set for the PDF and velocity fields. We set initial values of 0.0 for both of them, since they will be explicitly initialised later on.
The generated pack info class can be used in exactly the same way as a waLBerla-native pack info. It simply needs to be registered with the communication scheme by calling addPackInfo.
Upon construction, the generated sweep class requires the PDF field, the velocity field and the relaxation rate omega, which is read from a parameter file. After creation, the sweep functor can be passed directly to the timeloop.
The generated NoSlip boundary works independently of the native waLBerla LBM boundary handling classes, which are templated for use with lattice models. Still, we use the functions exposed by the geometry module to initialise the flag field from the parameter file. The NoSlip boundary handler is constructed by passing the block grid and the PDF field's identifier. After the flag field has been set up, the fillFromFlagField method is called to register the boundary cells with the boundary handler.
The generated boundary handler holds an internal list of tuples (x, y, dir), the so-called index vector. Each entry corresponds to a fluid cell which is a direct neighbour of a boundary cell; for each of these fluid cells, its position and the direction pointing towards the boundary cell are saved. In our scenario, the north and south domain borders are marked as NoSlip boundaries. The boundary cells are located in the northern and southern ghost layers, so the top and bottom rows of fluid cells are neighbours to these boundaries. Thus, for every cell of the bottom row, the tuples (x, 0, 2), (x, 0, 7) and (x, 0, 8) are saved in the index vector. The indices 2, 7 and 8 correspond to the South, South-East and South-West directions of the D2Q9 lattice, respectively. The boundary sweep then runs over this index vector, reflecting the outgoing populations in the stored directions off the wall according to the NoSlip rule.
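Conceptually, the bottom-row part of the index vector can be pictured as follows (a Python illustration of the stored tuples, with nx as a hypothetical number of cells in the \( x \)-direction):

```python
nx = 10  # illustrative domain width in fluid cells

# One entry per bottom-row fluid cell and per direction pointing into the
# southern boundary: 2 = S, 7 = SE, 8 = SW on the D2Q9 lattice.
index_vector = [(x, 0, d) for x in range(nx) for d in (2, 7, 8)]
```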
The NoSlip sweep can now be added to the timeloop before the LBM sweep. We also register the communication functor as a BeforeFunction of the boundary sweep, since it needs to be run first in every time step.
Setting up the shear flow's initial velocity profile consists of two parts. First, the velocity field is initialised to the shear flow. Second, the prepared velocity field is passed to the generated PDF field setter, which then initialises the equilibrium PDF values from the velocities. The density is uniformly set to unity.
The initShearFlowVelocityField function takes a pointer to the block storage, the velocity field's identifier and a handle to a parameter file block. It then reads the magnitudes of the velocity's \( x \)-component and the perturbation in the \( y \)-direction. For the random noise, a random number generator is created with a seed specified in the parameter file. The function then iterates over all blocks (distributed to MPI processes) and cells, setting each cell's velocity according to its global position.
After the velocity field has been initialised, the generated InitialPDFsSetter sweep is created and run once on each block. After this step, the PDF field will be initialised with the correct equilibrium distributions for the first iteration.
The simulation is now ready to run.
If CUDA is enabled, some implementation details need to differ from a CPU-only version. This mainly concerns the creation and management of fields, MPI communication and VTK output. Since the initialisation, LBM and NoSlip sweeps run entirely on the GPU, the PDF field has to be set up only in graphics memory. The velocity field, in contrast, is required by both the CPU and the GPU: the shear flow velocity profile is constructed by CPU code before the initialisation kernel maps it onto the PDF field on the GPU, and the VTK output routines, which run on the CPU, also need to read it. It therefore needs to be created twice: once in main memory and once in GPU memory. It is then copied on demand from the GPU to the CPU. Furthermore, we create a flag field, which is needed only on the CPU. After the initialisation, we use it to create the index vectors for the boundary handling; only these index vectors, not the entire flag field, are then transferred to the GPU.
For the most part, though, the C++ code is identical. The code snippets presented above represent only the CPU variant of the code. The GPU implementation can be found in the source file 03_AdvancedLBMCodegen.cpp, where code blocks which differ between the CPU and the GPU implementation are toggled via preprocessor conditionals.
We have now successfully implemented a waLBerla LBM simulation application with an advanced collision operator, which can be specialised for both CPU and GPU hardware targets. Other possible extensions would be the use of advanced streaming patterns like the AA-pattern or EsoTwist to reduce the simulation's memory footprint. Also, lbmpy gives us the tools to develop advanced lattice Boltzmann methods for many kinds of applications. Thus, the basic principles demonstrated in these tutorials can be used for creating more complicated simulations with optimised lattice Boltzmann code.