Introduction
Welcome to this course about high-performance numerical computing with Rust on GPU!
This course builds upon the companion course on CPU
computing, and is meant to
directly follow it. Basic concepts of the Rust programming language will
therefore not be introduced again. Instead, we will see how these concepts can
be leveraged to build high-performance GPU computations, using the
Vulkan API via the vulkano
high-level Rust binding.
These are rather uncommon technological choices in scientific computing, so you may wonder why they were chosen. Rust ecosystem support aside, Vulkan was picked as one of the few GPU APIs that manage to avoid the classic design flaws of HPC-centric GPU APIs:
- Numerical computations should aim for maximal portability by default. Nonportable programs are the open-air landfills of HPC: they may seem initially convenient, but come with huge hidden costs and leave major concerns up to future generations.1
- CPU/GPU performance portability doesn’t work. Decades of research have produced nothing but oversized frameworks of mind-boggling complexity where either CPU or GPU performance does not even remotely match that of well-optimized code on non-toy programs. A GPU-first API can be conceptually simpler, more reliable, and ease optimization; all this saves enough time to let you write a good CPU version of your computational kernel if you need one.
- Proprietary API emulation or imitation doesn’t work. Because the monopoly manufacturer controls the API and has much greater software development resources, all other hardware will always be a second-class citizen with lagging support, unstable runtimes, and poor support of advanced hardware features that the monopoly manufacturer didn’t implement.
- Relying on hardware manufacturer good will doesn’t work. Monopoly manufacturers will not help you write code that works on other hardware, and minority hardware manufacturers have little resources to dedicate to obscure HPC portability technologies with low adoption. It is more effective to force the manufacturers’ hand by basing your work on a widely adopted technology whose reach extends far beyond the relatively small HPC community.
As for the vulkano Rust binding specifically, the choice came down to general
maturity, maintenance status, broad Vulkan API coverage, high-quality
documentation, ease of installation and good alignment with the Rust design
goals of making code type/memory/thread-safe by default.
Pedagogical advice given in the introduction of the CPU course still applies:
- This course is meant to be followed in order, environment setup section aside. Each section will build upon the concepts taught and the exercise work done in earlier sections.
- The material is written to allow further self-study after the school, so it’s okay to fall a little behind the group. Good understanding is more important than full chapter coverage.
- Solutions to some exercises are provided in the top commits of the solution branch of the repository. To keep the course material maintainable, these only cover exercises where there is one obvious solution, not open-ended problems where you could go down many paths.
As in the CPU course, you can navigate between the course’s sections using several tools:
- The left-hand sidebar, which provides direct access to every page.
- If your browser window is thin, the sidebar may be hidden by default. In that case you can open (and later close) it using the top-left “triple dash” button.
- The left/right arrow buttons at the end of each page, or your keyboard’s arrow keys.
-
Problems linked to nonportable code include lack of future computation reproducibility, exploding hardware costs, reduced hardware innovation, and ecosystem fragility against unforeseen policy changes like the ongoing race to the bottom in computational precision. ↩
Environment setup
The development environment for this course will largely extend that of the CPU course. You should therefore begin by following the environment setup process for the CPU course if you have not done so already, including the final test which makes sure that your Rust development environment does work as expected for CPU programming purposes.
Once this is done, we will proceed to extend this CPU development environment into a GPU development environment by going through the following steps:
- Try to make your GPU(s) available for Vulkan development. This is the hardest part, but if this step fails it's not the end of the world: we can use a GPU emulator instead.
- Add Vulkan development tools to your Rust development environment.
- Download and unpack this course's version of the exercises/ source code.
- Test that the resulting setup is complete by running some of the course's code examples.
Host GPU setup
Before a Vulkan-based program can use your GPU, a few system preparations are needed:
- Vulkan relies on GPU hardware features that were introduced around 2012. If your system’s GPUs are older than this, then you will almost certainly need to use a GPU emulator, and can ignore everything else that is said inside of this chapter.
- Doing any kind of work with a GPU requires a working GPU driver, which, for some popular brands of GPUs, may unfortunately require some work to set up.
- Doing Vulkan work specifically additionally requires a Vulkan implementation
that knows how to communicate with your GPU driver.
- Some GPU drivers provide their own Vulkan implementation. This is common on Windows, but also seen in e.g. NVidia’s Linux drivers.
- Other GPU drivers expose a standardized interface that third-party Vulkan implementations can tap into. This is the norm on Linux and macOS.
It is important to point out that you will also need these preparations when using Linux containers, because the containers do not acquire full control of the GPU hardware. They need to go through the host system’s GPU driver, which must therefore be working.
In fact, as a word of warning, containerized setups will likely make it harder for you to get a working GPU setup.1 Given the option to do so, you should prefer using a native development environment for this course, or any other kind of coding that involves GPUs for that matter.
GPU driver
The procedure for getting a working GPU driver is, as you may imagine, fairly system-dependent. Please select your operating system using the tabs below:
macOS bundles suitable GPU drivers for all Apple-manufactured computers, and Macs should therefore require no extra GPU driver setup.2
After performing any setup step described above and rebooting, your system should have a working GPU driver. But owing to the highly system-specific nature of this step, we unfortunately won’t yet be able to check this in an OS-agnostic manner. To do that, we will install another component that you are likely to need for this course, namely a Vulkan implementation.
Vulkan implementation
As mentioned above, your GPU driver may or may not come with a Vulkan implementation. If that is not the case, we will want to install one.
Like Windows, macOS does not provide first-class Vulkan support out of the box because Apple wants to push its own proprietary GPU API called Metal.
Unlike on Windows, however, there is no easy workaround based on installing the GPU manufacturer’s driver on macOS, because Apple is the manufacturer and unsurprisingly they do not provide an optional driver with Vulkan support either.
What we will therefore need to do is layer a third-party Vulkan implementation on top of Apple's proprietary Metal API. The MoltenVK project provides the most popular such layered Vulkan implementation at the time of writing.
As the author sadly did not get the chance to experiment with a Mac while preparing the course, we cannot provide precise installation instructions for MoltenVK. So please follow the installation instructions in the README of the official code repository and ping the course author if you run into any trouble.
Update: During the 2025 edition of the school, the experience was that MoltenVK was reasonably easy to install and worked fine for the simple number-squaring GPU program presented in the first part of this course, but struggled to build the larger, full Gray-Scott simulation program. Help from expert macOS users in debugging this is welcome. If there are none in the audience, let us hope that someone else will encounter the issue and that it will resolve itself in future MoltenVK releases…
Given this preparation, your system should now be ready to run Vulkan apps that use your GPU. How do we know for sure, however? A test app will come in handy here.
Final check
The best way to check if your Vulkan setup works is to run a Vulkan application that can display a list of available devices and make sure that your GPUs are featured in that list.
The Khronos Group, which maintains the Vulkan specification, provides a simple
tool for this in the form of the vulkaninfo app, which prints a list of all
available devices along with their properties. And for once, the planets have
aligned properly and all package managers in common use have agreed to name the
package that contains this app identically. No matter if you use a Linux
distribution's built-in package manager, brew for macOS, or vcpkg for
Windows, the package that contains this utility is called vulkan-tools on
every system that the author could think of.
There is just one problem: Vulkan devices have many properties, which means that
the default level of detail displayed by vulkaninfo is unbearable. For
example, it emits more than 6000 lines of textual output on the author’s laptop
at the time of writing.
Thankfully there is an easy fix for that: add the --summary command line
option, and you will get a reasonably concise device list at the end of the
output. Here’s the output from the author’s laptop:
vulkaninfo --summary
[ ... global Vulkan implementation properties ... ]
Devices:
========
GPU0:
apiVersion = 1.4.311
driverVersion = 25.1.4
vendorID = 0x1002
deviceID = 0x1636
deviceType = PHYSICAL_DEVICE_TYPE_INTEGRATED_GPU
deviceName = AMD Radeon Graphics (RADV RENOIR)
driverID = DRIVER_ID_MESA_RADV
driverName = radv
driverInfo = Mesa 25.1.4-arch1.1
conformanceVersion = 1.4.0.0
deviceUUID = 00000000-0800-0000-0000-000000000000
driverUUID = 414d442d-4d45-5341-2d44-525600000000
GPU1:
apiVersion = 1.4.311
driverVersion = 25.1.4
vendorID = 0x1002
deviceID = 0x731f
deviceType = PHYSICAL_DEVICE_TYPE_DISCRETE_GPU
deviceName = AMD Radeon RX 5600M (RADV NAVI10)
driverID = DRIVER_ID_MESA_RADV
driverName = radv
driverInfo = Mesa 25.1.4-arch1.1
conformanceVersion = 1.4.0.0
deviceUUID = 00000000-0300-0000-0000-000000000000
driverUUID = 414d442d-4d45-5341-2d44-525600000000
GPU2:
apiVersion = 1.4.311
driverVersion = 25.1.4
vendorID = 0x10005
deviceID = 0x0000
deviceType = PHYSICAL_DEVICE_TYPE_CPU
deviceName = llvmpipe (LLVM 20.1.6, 256 bits)
driverID = DRIVER_ID_MESA_LLVMPIPE
driverName = llvmpipe
driverInfo = Mesa 25.1.4-arch1.1 (LLVM 20.1.6)
conformanceVersion = 1.3.1.1
deviceUUID = 6d657361-3235-2e31-2e34-2d6172636800
driverUUID = 6c6c766d-7069-7065-5555-494400000000
As you can see, this particular system has three Vulkan devices available:
- An AMD GPU that’s integrated into the same package as the CPU (low-power, low-performance)
- Another AMD GPU that is separate from the CPU, aka discrete (high-power, high-performance)
- A GPU emulator called llvmpipe that is useful for debugging, and as a fallback for systems where there is no easy way to get a real hardware GPU to work (e.g. continuous integration of software hosted on GitHub or GitLab).
If you see all the Vulkan devices that you expect in the output of this command, that's great! You are done with this chapter and can move to the next one. Otherwise, please go through this page's instructions slowly again, making sure that you have not forgotten anything, and if that does not help, ping the teacher and we'll try to figure it out together.
-
In addition to a working GPU driver on the host system and a working Vulkan stack inside of the container, you need to have working communication between the two. This assumes that they are compatible, which is anything but a given when e.g. running Linux containers on Windows or macOS. It also doesn't help that most container runtimes are designed to operate as a black box (with few ways for users to observe and control the inner machinery) and attempt to sandbox containers (which may prevent them from getting access to the host GPU in the default container runtime configuration). ↩
-
Unless you are using an exotic configuration like an old macOS release running on a recent computer, that is, but if you know how to get yourself into this sort of Apple-unsupported configuration, we trust you to also know how to keep its GPU driver working… :) ↩
-
NVidia's GPU drivers have historically tapped into unstable APIs of the Linux kernel that may change across even bugfix kernel releases, and this makes them highly vulnerable to breakage across system updates. To make matters worse, their software license also prevented Linux distributions from shipping these drivers in their official software repositories, which kept distributions from enforcing kernel/driver compatibility at the package manager level. The situation has recently improved for newer hardware (>= Turing generation), where a new "open-source driver" (actually a thin open-source layer over an enormous encrypted+signed binary blob running on a hidden RISC-V CPU, because this is NVidia) has been released with a license that enables distributions to ship it as a normal package. ↩
-
The unfortunate popularity of "stable" distributions like Red Hat Enterprise or Ubuntu LTS, which take pride in embalming ancient software releases and wasting thousands of developer hours on backporting bugfixes from newer releases, makes this harder than it should be. But when an old kernel gets in the way of hardware support and a full distribution upgrade is not an option, consider upgrading the kernel alone using facilities like Ubuntu's "HardWare Enablement" (-hwe) kernel packages. ↩
Development tools
The Rust development environment that was set up for the CPU computing course contains many things that are also needed for this GPU computing course. But we will also need a few extra tools that are specific to this course. More precisely…
- If you previously used containers, you must first switch to
another container (based on the CPU one) that features Vulkan development
tools. Then you can adjust your container’s execution configuration to expose
host GPUs to the containerized Linux system.
- In the 2025 edition, it was reported that Windows and macOS container runtimes like Docker Desktop struggle with exposing GPU hardware from the host to the container. Users of these operating systems should either favor native installations or accept that they may end up stuck with a GPU emulator.
- If you previously performed a native installation, then you must install Vulkan development tools alongside the Rust development tools that you already have.
Linux containers
Switching to the new source code
As you may remember, when setting up your container for the CPU course, you
started by downloading and unpacking an archive which contains a source code
directory called exercises/.
We will do mostly the same for this course, but the source code will obviously
be different. Therefore, please rename your previous exercises directory to
something else (or switch to a different parent directory), then follow the instructions below.
Provided that the curl and unzip utilities are installed, you can download
and unpack the source code in the current directory using the following sequence
of Unix commands:
if [ -e exercises ]; then
echo "ERROR: Please move or delete the existing 'exercises' subdirectory"
else
curl -LO https://numerical-rust-gpu-96deb7.pages.in2p3.fr/setup/exercises.zip \
&& unzip exercises.zip \
&& rm exercises.zip
fi
Switching to the GPU image
During the CPU course, you have used a container image with a name that has
numerical-rust-cpu in it, such as
gitlab-registry.in2p3.fr/grasland/numerical-rust-cpu/rust_light:latest. It is
now time to switch to another version of this image that has GPU tooling built
into it.
- If you used the image directly, that's easy: just replace cpu with gpu in the image name and in all associated container execution commands that you use. In the above example, you would switch to gitlab-registry.in2p3.fr/grasland/numerical-rust-gpu/rust_light:latest.
- If you built a container image of your own on top of the course's image, then you will have a bit more work to do, in the form of replaying your changes on top of your new image. Which shouldn't be too hard either… if you used a proper Dockerfile instead of raw docker commit.
But unfortunately, that’s not the end of it. Try to run vulkaninfo --summary
inside of the resulting container, and you will likely find that some of
your host GPUs are not visible inside of the container. If that's the
case, then I have bad news for you: you have some system-specific work to do if
you want to be able to use your GPUs inside of the container.
Exposing host GPUs
Please click the following tab that best describes your host system for further guidance:
In the host setup section, we mentioned that NVidia’s Linux drivers use a monolithic design. Their GPU kernel driver and Vulkan implementation are packaged together in such a way that the Vulkan implementation is only guaranteed to work if paired with the exact GPU kernel driver from the same NVidia driver package version.
As it turns out, this design is not just unsatisfying from a software engineering best practices perspective. It also becomes an unending source of pain as soon as containers get involved.
A first problem is that NVidia's GPU driver resides in the Linux kernel while the Vulkan driver is implemented as a user-space library, whereas the whole idea of Linux containers is to keep the host's kernel while replacing the user-space libraries and executables with those of a different Linux system. And unless the host and containerized systems run the same Linux distribution, the odds that they will use the exact same NVidia driver package version are low.
To work around this, many container runtimes provide an option called --gpus
(Docker, Podman) or --nv (Apptainer, Singularity) that lets you mount a bunch
of files from the user-space components of the NVidia driver of the host system.
This is pretty much the only way to get the NVidia GPU driver to work inside of a container, but it comes at a price: GPU programs inside of the container will be exposed to NVidia driver binaries that were not the ones that they were compiled and tested against, and which they may or may not be compatible with. In that sense, those container runtime options undermine the basic container promise of executing programs in a well-controlled environment.
To make matters worse, the NVidia driver package actually contains not just one, but two different Vulkan backends: one that is specialized towards X11 graphical environments, and another that works in Wayland and headless environments. As bad luck would have it, the backend selection logic gets confused by the hacks needed to get the NVidia driver to work inside of a Linux container, and wrongly selects the X11 backend. That backend won't work, as this course's containers do not have even a semblance of an X11 graphics rendering stack, because they don't need one.
That second issue can be fixed by modifying an environment variable to override
the NVidia Vulkan implementation’s default backend selection logic and select
the right one. But that will come at the expense of losing support for every
other GPU on the system including the llvmpipe GPU emulator. As this is a
high-performance computing course, and NVidia GPUs tend to be more powerful than
any other GPU featured in the same system, we will consider this as an
acceptable tradeoff.
Putting it all together, adding the following command-line option to your
docker/podman/apptainer/singularity run commands should allow you to use your
host’s NVidia GPUs from inside the resulting container:
--gpus=all --env VK_ICD_FILENAMES=/usr/share/glvnd/egl_vendor.d/10_nvidia.json
New command line arguments and container image name aside, the procedure for starting up a container will be mostly identical to that used for the CPU course. So you will want to get back to the appropriate section of the CPU course’s container setup instructions and follow the instructions for your container and system configuration again.
Once that is done, please run vulkaninfo --summary inside of a shell within
the container and check that the Vulkan device list matches what you get on the host,
driver version details aside.
Testing your setup
Your Rust development environment should now be ready for this course’s practical work. I strongly advise testing it by running the following script:
curl -LO https://gitlab.in2p3.fr/grasland/numerical-rust-gpu/-/archive/solution/numerical-rust-gpu-solution.zip \
&& unzip numerical-rust-gpu-solution.zip \
&& rm numerical-rust-gpu-solution.zip \
&& cd numerical-rust-gpu-solution/exercises \
&& echo "------" \
&& cargo run --release --bin info -- -p \
&& echo "------" \
&& cargo run --release --bin square -- -p \
&& cd ../.. \
&& rm -rf numerical-rust-gpu-solution
It performs the following actions, whose outcome should be manually checked:
- Run a Rust program that should produce the same device list as vulkaninfo --summary. This tells you that any device that gets correctly detected by a C Vulkan program also gets correctly detected by a Rust Vulkan program, as one would expect.
- Run another program that uses a simple heuristic to pick the Vulkan device that should be most performant, then uses that device to square an array of floating-point numbers, then checks the results. You should make sure that the device selection made by this program is sensible and that its final result check passed.
- If everything went well, the script will clean up after itself by deleting all previously created files.
Native installation
While containers are often lauded for making it easier to reproduce someone else's development environment on your machine, GPUs actually invert this rule of thumb. As soon as GPUs get involved, it's often easier to get something working with a native installation.
The reason why that is the case is that before we get any chance of having a working GPU setup inside of a container, we must first get a working GPU setup on the host system. And once you have taken care of that (which is often the hardest part), getting the rest of a native development environment up and running is not that much extra work.
As before, we will assume that you have already taken care of setting up a native development environment for Rust CPU development, and this documentation will therefore only focus on the changes needed to get this setup ready for native Vulkan development, which will basically boil down to installing a couple of Vulkan development tools.
Vulkan validation layers
Vulkan came in a context where GPU applications were often bottlenecked by API overheads, and one of its central design goals was to improve upon that. A particularly controversial decision taken then was to remove mandatory parameter validation from the API, instead making it undefined behavior to pass any kind of unexpected parameter value to a Vulkan function.
This may be amazing for run-time performance, but certainly does not result in a great application development experience. Therefore it was also made possible to bring such checks back as an optional "validation" layer, which is meant to be used during application development and later removed in production. As a bonus, because this layer was only meant for development purposes and operated under no performance constraint, it could also…
- Perform checks that are much more detailed than those that any GPU API performed before, finding more errors in GPU-side code and CPU-GPU synchronization patterns.
- Supplement API usage error reporting with more opinionated “best practices” and “performance” lints that are more similar to compiler warnings in spirit.
Because this package is meant to be used for development purposes, it is not a default part of Vulkan installations. Thankfully, all commonly used systems have a package for that:
- Debian/Ubuntu/openSUSE/Brew: vulkan-validationlayers
- Arch/Fedora/RHEL: vulkan-validation-layers
- Windows: best installed as part of the LunarG Vulkan SDK
shaderc
Older GPU APIs relied on GPU drivers to implement a compiler for a C-like language, which proved to be a bad idea as GPU manufacturers are terrible compiler developers (and terrible software developers in general). Applications thus experienced constant issues linked to those compilers, from uneven performance across hardware to incorrect run-time program behavior.
To get rid of this pain, Vulkan has switched to an AoT/JiT hybrid compilation model where GPU code is first compiled into a simplified assembly-like intermediate representation called SPIR-V on the developer's machine, and it is this intermediate representation that gets sent to the GPU driver for final compilation into a device- and driver-specific binary.
Because of this, our development setup is going to require a compiler that goes
from the GLSL domain-specific language (a common choice for GPU code; we'll get
into why during the course) to SPIR-V. The vulkano Rust binding that
we use is specifically designed to use
shaderc, which is a compiler that is
maintained by the Android development team.
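You will not need to call shaderc directly in this course (the vulkano-shaders crate drives it for you at build time), but to make the compilation model more concrete, here is a minimal sketch of the GLSL-to-SPIR-V step using the shaderc Rust crate directly. Treat the exact method signatures as assumptions to double-check against the crate version you install:
// Illustrative sketch only: in this course, vulkano-shaders invokes shaderc
// for you at build time. Check method names against your shaderc crate version.
fn compile_square_kernel() -> Result<Vec<u32>, Box<dyn std::error::Error>> {
    // A trivial GLSL compute shader, embedded as a string for simplicity
    let glsl = r#"#version 450
layout(local_size_x = 64) in;
layout(set = 0, binding = 0) buffer Data { float data[]; };
void main() {
    uint i = gl_GlobalInvocationID.x;
    data[i] = data[i] * data[i];
}"#;

    // Translate GLSL into the SPIR-V intermediate representation...
    let compiler = shaderc::Compiler::new().expect("failed to set up shaderc");
    let artifact = compiler.compile_into_spirv(
        glsl,
        shaderc::ShaderKind::Compute,
        "square.comp", // file name used in error messages
        "main",        // entry point
        None,          // no extra compilation options
    )?;

    // ...and this SPIR-V binary is what the GPU driver will later turn into
    // device-specific machine code.
    Ok(artifact.as_binary().to_vec())
}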
Unfortunately, shaderc is not packaged by all Linux distributions. You may
therefore need to either use the official
binaries or
build it from source. In the latter case, you are going to need…
- CMake
- Ninja
- C and C++ compilers
- Python
- git
…and once those dependencies are available, you should be able to build and
install the latest upstream-tested version of shaderc and its dependencies
using the following script:
git clone --branch=known-good https://github.com/google/shaderc \
&& cd shaderc \
&& ./update_shaderc_sources.py \
&& cd src \
&& ./utils/git-sync-deps \
&& mkdir build \
&& cd build \
&& cmake -GNinja -DCMAKE_BUILD_TYPE=Release .. \
&& ninja \
&& ctest -j$(nproc) \
&& sudo ninja install \
&& cd ../../.. \
&& rm -rf shaderc
Whether you download binaries or build from source, the resulting shaderc
installation location will likely not be in the default search path of the
associated shaderc-sys Rust bindings. We will want to fix this, otherwise the
bindings will try to be helpful by automatically downloading and building an
internal copy of shaderc. This may fail if the dependencies are
not available, and is otherwise inefficient as such a build will need to be
performed once per project that uses shaderc-sys and again if the build
directory is ever discarded using something like cargo clean.
To point shaderc-sys in the right direction, find the directory in which the
libshaderc_combined static library was installed (typically some variation of
/usr/local/lib when building from source on Unix systems). Then adjust your
Rust development environment’s configuration so that the SHADERC_LIB_DIR
environment variable is set to point to this directory.
Syntax highlighting
For an optimal GPU development experience, you will want to set up your code
editor to apply GLSL syntax highlighting to files with a .comp extension. In
the case of Visual Studio Code, this can be done by installing the
slevesque.shader extension.
Testing your setup
Your Rust development environment should now be ready for this course’s practical work. I strongly advise testing it by running the following script:
curl -LO https://gitlab.in2p3.fr/grasland/numerical-rust-gpu/-/archive/solution/numerical-rust-gpu-solution.zip \
&& unzip numerical-rust-gpu-solution.zip \
&& rm numerical-rust-gpu-solution.zip \
&& cd numerical-rust-gpu-solution/exercises \
&& echo "------" \
&& cargo run --release --bin info -- -p \
&& echo "------" \
&& cargo run --release --bin square -- -p \
&& cd ../.. \
&& rm -rf numerical-rust-gpu-solution
It performs the following actions, whose outcome should be manually checked:
- Run a Rust program that should produce the same device list as vulkaninfo --summary. This tells you that any device that gets correctly detected by a C Vulkan program also gets correctly detected by a Rust Vulkan program, as one would expect.
- Run another program that uses a simple heuristic to pick the Vulkan device that should be most performant, then uses that device to square an array of floating-point numbers, then checks the results. You should make sure that the device selection made by this program is sensible and that its final result check passed.
- If everything went well, the script will clean up after itself by deleting all previously created files.
Training-day instructions
Expectations and conventions
Welcome to this practical about high-performance GPU computing in Rust!
This course is meant to follow the previous one, which is about CPU computing. It is assumed that you have followed that course, and therefore we will not repeat anything that was said there. However, if your memory is hazy and you are unsure about what a particular construct in the Rust code examples does, please ping the teacher for guidance.
Although some familiarity with Rust CPU programming is assumed, no particular GPU programming knowledge is expected beyond basic knowledge of GPU hardware architecture. Indeed, the GPU API that we will use (Vulkan) is different enough from other (CUDA- or OpenMP-like) APIs that are more commonly used in HPC that knowledge of those APIs may cause extra confusion. The course’s introduction explains why we are using Vulkan and not these other APIs like everyone else.
Exercises source code
When you registered, you should have been directed to instructions for setting up your development environment. If you have not followed these instructions yet, now is the right time!
Now that the course has begun, we will download an up-to-date copy of the
exercises' source code and unpack it somewhere inside of your
development environment. This will create a subdirectory called exercises/ in
which we will be working during the rest of the course.
Please pick your environment below in order to get appropriate instructions:
From a shell inside of the container1, run the following sequence of commands to update the exercises source code that you have already downloaded during container setup.
Beware that any change to the previously downloaded code will be lost in the process.
cd ~
# Can't use rm -rf exercises because we must keep the bind mount alive
for f in $(ls -A exercises); do rm -rf exercises/$f; done \
&& curl -LO https://numerical-rust-gpu-96deb7.pages.in2p3.fr/setup/exercises.zip \
&& unzip -u exercises.zip \
&& rm exercises.zip \
&& cd exercises
General advice
Some exercises are based on code examples that are purposely incorrect. Therefore, if some code fails to build, it may not come from a mistake of the course author, but from some missing work on your side. The course material should explicitly point out when that is the case.
If you encounter any failure which does not seem expected, or if you otherwise get stuck, please call the trainer for guidance!
With that being said, let’s get started with actual Rust code. You can move to the next page, or any other page within the course for that matter, through the following means:
- Left and right keyboard arrow keys will switch to the previous/next page. Equivalently, arrow buttons will be displayed at the end of each page, doing the same thing.
- There is a menu on the left (not shown by default on small screen, use the top-left button to show it) that allows you to quickly jump to any page of the course. Note, however, that the course material is designed to be read in order.
- With the magnifying glass icon in the top-left corner, or the “S” keyboard shortcut, you can open a search bar that lets you look up content by keywords.
-
If you're using rust_code_server, this means using the terminal pane of the web-based VSCode editor. ↩
-
That would be a regular shell for a local Linux/macOS installation and a Windows Subsystem for Linux shell for WSL. ↩
Instance
Any API that lets developers interact with a complex system must strike a balance between flexibility and ease of use. Vulkan goes unusually far on the flexibility side of this tradeoff by providing you with many tuning knobs at every stage of an execution process that most other GPU APIs largely hide from you. It therefore requires you to acquire an unusually good understanding of the complex process through which a GPU-based program gets things done.
In the first part of this course, we will make this complexity tractable by introducing it piece-wise, in the context of a trivial GPU program that merely squares an array of floating-point numbers. In the second part of the course, you will then see that once these basic concepts of Vulkan are understood, they easily scale up to the complexity of a full Gray-Scott reaction simulation.
As a first step, this chapter will cover how you can load the Vulkan library from Rust, set up a Vulkan instance in a way that eases later debugging, and enumerate available Vulkan devices.
Introducing vulkano
The first step that we must take before we can use Vulkan in Rust code is to link our program to a Vulkan binding. This is a Rust crate that handles the hard work of linking to the Vulkan C library and exposing a Rust layer on top of it so that your Rust code may interact with it.
In this course, we will use the vulkano crate for this
purpose. This crate builds on top of the auto-generated
ash crate, which closely matches the Vulkan
C API with only minor Rust-specific API tweaks, by supplementing it with two
layers of abstraction:
- A low-level layer that re-exposes Vulkan types and functions in a manner that is more in line with Rust programmer expectations. For example, C-style free functions that operate on their first pointer parameter are replaced with Rust-style structs with methods.
- A high-level layer that automates away some common operations (like sub-allocation of GPU memory allocations into smaller chunks) and makes as many operations as possible safe (no possibility for undefined behavior).
Crucially, this layering is fine-grained (done individually for each Vulkan object type) and transparent (any high-level object lets you access the lower-level object below it). As a result, if you ever encounter a situation where the high-level layer has made design choices that are not right for your use case, you are always able to drop down to a lower-level layer and do things your own way.
This means that anything you can do with raw Vulkan API calls, you can also do
with vulkano. But vulkano will usually give you an alternate way to do
things that is easier, fast/flexible enough for most purposes, and requires a
lot less unsafe Rust code that must be carefully audited for memory/thread/type
safety. For many applications, this is a better tradeoff than using ash
directly.
The vulkano dependency has already been added to this course’s example code,
but for reference, this is how you would add it:
# You do not need to type in this command, it has already been done for you
cargo add --no-default-features --features=macros vulkano
This adds the vulkano dependency in a manner that disables the x11 feature
that enables X11 support. This feature is not needed for this course, where we
are not rendering images to X11 windows. And it won’t work in this course’s
Linux containers, which do not contain a complete X11 stack as this would
unnecessarily increase download size.
We do, however, keep the macros feature on, because we will need it in order
to use the vulkano-shaders crate later on. We’ll discuss what this crate
does and why we need it in a future chapter.
Loading the library
Now that we have the vulkano binding available, we can use it to load the
Vulkan library. In principle, you could customize this loading process to e.g.
switch between different Vulkan libraries, but in practice this is rarely needed
because, as we will see later, Vulkan provides several tools to customize the
behavior of the library.
Hence, for the purpose of this course, we will stick with the default
vulkano library-loading method, which is appropriate for almost every Vulkan
application:
use std::error::Error;
use vulkano::library::VulkanLibrary;
// Simplify error handling with type-erased errors
type Result<T> = std::result::Result<T, Box<dyn Error>>;
fn main() -> Result<()> {
// Load the Vulkan library
let library = VulkanLibrary::new()?;
// ...
Ok(())
}
Like all system operations, loading the library can fail if e.g. no Vulkan implementation is installed on the host system, and we need to handle that.
Here, we choose to do it the easy way by converting the associated error type
into a type-erased Box<dyn Error> type that can hold all error types, and
bubbling this error out of the main() function using the ? error propagation
operator. The Rust runtime will then take care of displaying the error message
and aborting the program with a nonzero exit code. This basic error handling
strategy is good enough for the simple utilities that we will be building
throughout this course.
Once errors are handled, we may query the resulting VulkanLibrary object.
For example, we can…
- Check which revision of the Vulkan specification is supported. This versioning allows the Vulkan specification to evolve by telling us which newer features can be used by our application.
- Check which Vulkan extensions are supported. Extensions allow Vulkan to support features that do not make sense on every single system supported by the API, such as the ability to display visuals in X11 and Wayland windows on Linux.
- Check which Vulkan layers are available. Layers are stackable plugins that customize
the behavior of your Vulkan library without replacing it. For example,
the popular VK_LAYER_KHRONOS_validation layer adds error checking to all Vulkan functions, allowing you to check your application's debug builds without slowing down its release builds.
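For instance, the following sketch prints these three pieces of information. It reuses the main() skeleton shown above and only calls vulkano methods that also appear later in this chapter (api_version(), supported_extensions(), layer_properties()); double-check the names against the vulkano version you use:
use std::error::Error;
use vulkano::library::VulkanLibrary;

fn main() -> Result<(), Box<dyn Error>> {
    // Load the Vulkan library as before
    let library = VulkanLibrary::new()?;

    // Highest Vulkan specification revision supported by this library
    println!("Max API version: {:?}", library.api_version());

    // Optional instance extensions that could be enabled (e.g. ext_debug_utils)
    println!("Supported extensions: {:?}", library.supported_extensions());

    // Available layers (e.g. VK_LAYER_KHRONOS_validation)
    for layer in library.layer_properties()? {
        println!("Available layer: {}", layer.name());
    }
    Ok(())
}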
Once we have learned what we need to know, we can then proceed with the next setup step, which is to set up a Vulkan API instance.
Setting up an instance
A Vulkan Instance is configured from a VulkanLibrary by specifying a
few things about our application, including which optional Vulkan library
features we want to use.
For reasons that will soon become clear, we must set up an Instance before
we can do anything else with the Vulkan API, including enumerating available
devices.
While the basic process is easy, we will take a few detours along the way to set up some optional Vulkan features that will make our debugging experience nicer later on.
vulkano configuration primer
For most configuration work, vulkano uses a recurring API design pattern that
is based on configuration structs, where most fields have a default value.
When combined with Rust’s functional struct update syntax, this API design allows you to elegantly specify only the parameters that you care about. Here is an example:
use vulkano::instance::{InstanceCreateInfo, InstanceCreateFlags};
let instance_info = InstanceCreateInfo {
flags: InstanceCreateFlags::ENUMERATE_PORTABILITY,
..InstanceCreateInfo::application_from_cargo_toml()
};
The above instance configuration struct expresses the following intent:
- We let the Vulkan implementation expose devices that do not fully conform to the Vulkan specification, but only a slightly less featureful "portability subset" thereof. This is needed for some exotic Vulkan implementations like MoltenVK, which layers on top of macOS' Metal API to work around Apple's lack of Vulkan support.
- We let vulkano infer the application name and version from our Cargo project's metadata, so that we do not need to specify the same information in two different places.
- For all other fields of the InstanceCreateInfo struct, we use the default instance configuration, which is to provide no extra information about our app to the Vulkan implementation and to enable no optional features.
Most optional Vulkan instance features are about interfacing with your operating system’s display features for rendering visuals on screen and are not useful for the kind of headless computations that we are going to study in this course. However, there are two optional Vulkan debugging features that we strongly advise enabling on every platform that supports them:
- If the VK_LAYER_KHRONOS_validation layer is available, then it is a good idea to enable it in your debug builds. This enables debugging features falling in the following categories, at a runtime performance cost:
  - Error checking for Vulkan entry points, whose invalid usage normally results in instant Undefined Behavior. vulkano's high-level layer is already meant to prevent or report such incorrect usage, but unfortunately it is not immune to the occasional bug or limitation. It is thus good to have some defense-in-depth against UB in your debug builds before you try to report a GPU driver bug that later turns out to be a vulkano bug.
  - "Best practices" linting which detects suspicious API usage that is not illegal per the spec but may e.g. cause performance issues. This is basically a code linter executing at run-time with full knowledge of the application state.
  - Ability to use printf() in GPU code in order to easily investigate its state when it behaves unexpectedly, aka "Debug Printf".
- The VK_EXT_debug_utils extension lets you send diagnostic messages from the Vulkan implementation to your favorite log output (stderr, syslog…). I would advise enabling it for both debug and release builds, on all systems that support it.
  - In addition to being heavily used by the aforementioned validation layer, these messages often provide invaluable context when you are trying to diagnose why an application refuses to run as expected on someone else's computer.
Indeed, these two debugging features are so important that vulkano provides
dedicated tooling for enabling and configuring them. Let’s look into that.
Validation layer
As mentioned above, the Vulkan validation layer has some runtime overhead and
partially duplicates the functionality of vulkano’s safe API. Therefore, it is
normally only enabled in debug builds.
We can check if the program is built in debug mode using the
cfg!(debug_assertions) expression. When that is the case, we will want to
check if the VK_LAYER_KHRONOS_validation layer is available, and if so add it
to the set of layers that we enable for our instance:
// Set up a blank instance configuration.
//
// For what we are going to do here, an imperative style will be more effective
// than the functional style shown above, which is otherwise preferred.
let mut instance_info = InstanceCreateInfo::application_from_cargo_toml();
// In debug builds...
if cfg!(debug_assertions)
// ...if the validation layer is available...
&& library.layer_properties()?
.any(|layer| layer.name() == "VK_LAYER_KHRONOS_validation")
{
// ...then enable it...
instance_info
.enabled_layers
.push("VK_LAYER_KHRONOS_validation".into());
// TODO: ...and configure it
}
// TODO: Proceed with rest of instance configuration
Back in the Vulkan 1.0 days, simply enabling the layer like this would have been enough. But as the TODO above suggests, the validation layer has since acquired optional features which are not enabled by default, largely because of their performance impact.
Because we only enable the validation layer in debug builds, where runtime
performance is not a big concern, we can enable as many of those as we like by
pushing the appropriate
flags
into the
enabled_validation_features
member of our InstanceCreateInfo struct. The only limitation that we must
respect in doing so is that GPU-assisted validation (which provides extended
error checking) is incompatible with use of printf() in GPU code. For the
purpose of this course, we will prioritize GPU-assisted validation over GPU
printf().
The availability of these fine-grained settings is signaled by support of the
VK_EXT_validation_features
layer extension.1 We can detect this extension and enable it along with
almost every feature except for GPU printf() using the following code:
use vulkano::instance::debug::ValidationFeatureEnable;
if library
.supported_layer_extensions("VK_LAYER_KHRONOS_validation")?
.ext_validation_features
{
instance_info.enabled_extensions.ext_validation_features = true;
instance_info.enabled_validation_features.extend([
ValidationFeatureEnable::GpuAssisted,
ValidationFeatureEnable::GpuAssistedReserveBindingSlot,
ValidationFeatureEnable::BestPractices,
ValidationFeatureEnable::SynchronizationValidation,
]);
}
And if we put it all together, we get the following validation layer setup routine:
/// Enable Vulkan validation layer in debug builds
fn enable_debug_validation(
library: &VulkanLibrary,
instance_info: &mut InstanceCreateInfo,
) -> Result<()> {
// In debug builds...
if cfg!(debug_assertions)
// ...if the validation layer is available...
&& library.layer_properties()?
.any(|layer| layer.name() == "VK_LAYER_KHRONOS_validation")
{
// ...then enable it...
instance_info
.enabled_layers
.push("VK_LAYER_KHRONOS_validation".into());
// ...along with most available optional features
if library
.supported_layer_extensions("VK_LAYER_KHRONOS_validation")?
.ext_validation_features
{
instance_info.enabled_extensions.ext_validation_features = true;
instance_info.enabled_validation_features.extend([
ValidationFeatureEnable::GpuAssisted,
ValidationFeatureEnable::GpuAssistedReserveBindingSlot,
ValidationFeatureEnable::BestPractices,
ValidationFeatureEnable::SynchronizationValidation,
]);
}
}
Ok(())
}
To conclude this section, it should be mentioned that the Vulkan validation
layer is not featured in the default Vulkan setup of most Linux distributions,
and must often be installed separately. For example, on Ubuntu, the
separate vulkan-validationlayers package must be installed first. This is one
reason why you should never force-enable validation layers in production Vulkan
binaries.
Logging configuration
Now that validation layer has been taken care of, let us turn our attention to the other optional Vulkan debugging feature that we highlighted as worth enabling whenever possible, namely logging of messages from the Vulkan implementation.
Vulkan logging is configured using the
DebugUtilsMessengerCreateInfo
struct. There are three main things that we must specify here:
- What message severities we want to handle.
  - As in most logging systems, a simple ERROR/WARNING/INFO/VERBOSE classification is used. But in Vulkan, enabling a certain severity does not implicitly enable higher severities, so you can e.g. handle ERROR and VERBOSE messages using different strategies without handling WARNING and INFO messages at all.
  - In typical Vulkan implementations, ERROR and WARNING messages should be an exceptional event, whereas INFO and VERBOSE messages can be sent at an unpleasantly high frequency. However, an ERROR/WARNING message is often only understandable given the context of previous INFO/VERBOSE messages. It is therefore a good idea to print ERROR and WARNING messages by default, but provide an easy way to print INFO/VERBOSE messages too when needed.
- What message types we want to handle.
  - Most Vulkan implementation messages will fall in the GENERAL category, but the validation layer may send messages in the VALIDATION and PERFORMANCE categories too. As you may guess, the latter message types report application correctness and runtime performance issues respectively.
- What we want to do when a message matches the above criteria.
  - Building such a DebugUtilsMessengerCallback is unsafe because vulkano cannot check that your messaging callback, which is triggered by Vulkan API calls, does not make any Vulkan API calls itself. Doing so is forbidden for hopefully obvious reasons.2
  - Because we are building simple programs here, where the complexity of a production-grade logging system like syslog is unnecessary, we will simply forward these messages to stderr. For our first Vulkan program, an eprintln!() call will suffice.
  - Vulkan actually uses a form of structured logging, where the logging callback does not receive just a message string, but also a bunch of associated metadata about the context in which the message was emitted. In the interest of simplicity, our callback will only print out a subset of this metadata, which should be enough for our purposes.
As mentioned above, we should expose the message severity tradeoff to the user.
We can do this using a simple clap CLI interface.
Here we will leverage clap’s Args feature, which lets us modularize our CLI
arguments into several independent structs. This will later allow us to build
multiple clap-based programs that share some common command-line arguments.
Along the way, we will also expose the ability discussed in the beginning of
this chapter to probe devices which are not fully Vulkan-compliant.
use clap::Args;
/// Vulkan instance configuration
#[derive(Debug, Args)]
pub struct InstanceOptions {
/// Increase Vulkan log verbosity. Can be specified multiple times.
#[arg(short, long, action = clap::ArgAction::Count)]
pub verbose: u8,
}
Once we have that, we can set up some basic Vulkan logging configuration…
use vulkano::instance::debug::{
DebugUtilsMessageSeverity, DebugUtilsMessageType,
DebugUtilsMessengerCallback, DebugUtilsMessengerCreateInfo
};
/// Generate a Vulkan logging configuration
fn logger_info(options: &InstanceOptions) -> DebugUtilsMessengerCreateInfo {
// Select accepted message severities
type S = DebugUtilsMessageSeverity;
let mut message_severity = S::ERROR | S::WARNING;
if options.verbose >= 1 {
message_severity |= S::INFO;
}
if options.verbose >= 2 {
message_severity |= S::VERBOSE;
}
// Accept all message types
type T = DebugUtilsMessageType;
let message_type = T::GENERAL | T::VALIDATION | T::PERFORMANCE;
// Define the callback that turns messages to logs on stderr
// SAFETY: The logging callback makes no Vulkan API call
let user_callback = unsafe {
DebugUtilsMessengerCallback::new(|severity, ty, data| {
// Format message identifiers, if any
let id_name = if let Some(id_name) = data.message_id_name {
format!(" {id_name}")
} else {
String::new()
};
let id_number = if data.message_id_number != 0 {
format!(" #{}", data.message_id_number)
} else {
String::new()
};
// Put most information into a single stderr output
eprintln!("[{severity:?} {ty:?}{id_name}{id_number}] {}", data.message);
})
};
// Put it all together
DebugUtilsMessengerCreateInfo {
message_severity,
message_type,
..DebugUtilsMessengerCreateInfo::user_callback(user_callback)
}
}
Instance and logger creation
Now that we have a logger configuration, we are almost ready to enable logging. There are just two remaining concerns to take care of:
- Logging uses the optional Vulkan VK_EXT_debug_utils extension that may not always be available. We must check for its presence and enable it if available.
- For mysterious reasons, Vulkan allows programs to use different logging configurations at the time where an Instance is being set up and afterwards. This means that we will need to set up logging twice, once at the time where we create an Instance and another time after that.
After instance creation, logging is taken care of by a separate
DebugUtilsMessenger
object, which follows the usual RAII design: as long as it is alive, messages
are logged, and once it is dropped, logging stops. If you want logging to happen
for an application’s entire lifetime (which you usually do), the easiest way to
avoid dropping this object too early is to bundle it with your other long-lived
Vulkan objects in a long-lived “context” struct.
We will now demonstrate this pattern with a struct that combines a Vulkan instance with optional logging. Its constructor sets up all aforementioned features, including logging if available:
use std::sync::Arc;
use vulkano::instance::{
debug::DebugUtilsMessenger, Instance, InstanceCreateFlags
};
/// Vulkan instance, with associated logging if available
pub struct LoggingInstance {
pub instance: Arc<Instance>,
pub messenger: Option<DebugUtilsMessenger>,
}
//
impl LoggingInstance {
/// Set up a `LoggingInstance`
pub fn new(library: Arc<VulkanLibrary>, options: &InstanceOptions) -> Result<Self> {
// Prepare some basic instance configuration from Cargo metadata, and
// enable portability subset device for macOS/MoltenVk compatibility
let mut instance_info = InstanceCreateInfo {
flags: InstanceCreateFlags::ENUMERATE_PORTABILITY,
..InstanceCreateInfo::application_from_cargo_toml()
};
// Enable validation layers in debug builds
enable_debug_validation(&library, &mut instance_info)?;
// Set up logging to stderr if the Vulkan implementation supports it
let mut log_info = None;
if library.supported_extensions().ext_debug_utils {
instance_info.enabled_extensions.ext_debug_utils = true;
let config = logger_info(options);
instance_info.debug_utils_messengers.push(config.clone());
log_info = Some(config);
}
// Set up instance, logging creation-time messages
let instance = Instance::new(library, instance_info)?;
// Keep logging after instance creation
let instance2 = instance.clone();
let messenger = log_info
.map(move |config| DebugUtilsMessenger::new(instance2, config))
.transpose()?;
Ok(LoggingInstance {
instance,
messenger,
})
}
}
…and once we have that, we can query this instance to enumerate available devices on the system, for the purpose of picking (at least) one that we will eventually run computations on. This will be the topic of the next exercise, and the next chapter after that.
Exercise
Introducing info
The exercises/ codebase that you have been provided with contains a set of
executable programs (in src/bin) that share some code via a common utility
library (at the root of src/). Most of the code introduced in this chapter is
located in the instance module of this utility library.
The info executable, whose source code lies in src/bin/info.rs, lets you
query some properties of your system’s Vulkan setup. You can think of it as a
simplified version of the classic vulkaninfo utility from the Linux
vulkan-tools package, with a less overwhelming default configuration.
You can run this executable using the following Cargo command…
cargo run --bin info
…and if your Vulkan implementation is recent enough, you may notice that the validation layer is already doing its job by displaying some warnings:
Click here for example output
[WARNING VALIDATION VALIDATION-SETTINGS #2132353751] vkCreateInstance(): Both GPU Assisted Validation and Normal Core Check Validation are enabled, this is not recommend as it will be very slow. Once all errors in Core Check are solved, please disable, then only use GPU-AV for best performance.
[WARNING VALIDATION BestPractices-specialuse-extension #1734198062] vkCreateInstance(): Attempting to enable extension VK_EXT_debug_utils, but this extension is intended to support use by applications when debugging and it is strongly recommended that it be otherwise avoided.
[WARNING VALIDATION BestPractices-deprecated-extension #-628989766] vkCreateInstance(): Attempting to enable deprecated extension VK_EXT_validation_features, but this extension has been deprecated by VK_EXT_layer_settings.
[WARNING VALIDATION BestPractices-specialuse-extension #1734198062] vkCreateInstance(): Attempting to enable extension VK_EXT_validation_features, but this extension is intended to support use by applications when debugging and it is strongly recommended that it be otherwise avoided.
Vulkan instance ready:
- Max API version: 1.3.281
- Physical devices:
[WARNING VALIDATION WARNING-GPU-Assisted-Validation #615892639] vkGetPhysicalDeviceProperties2(): Internal Warning: Setting VkPhysicalDeviceVulkan12Properties::maxUpdateAfterBindDescriptorsInAllPools to 32
[WARNING VALIDATION WARNING-GPU-Assisted-Validation #615892639] vkGetPhysicalDeviceProperties2(): Internal Warning: Setting VkPhysicalDeviceVulkan12Properties::maxUpdateAfterBindDescriptorsInAllPools to 32
0. AMD Radeon Pro WX 3200 Series (RADV POLARIS12)
* Device type: DiscreteGpu
1. llvmpipe (LLVM 20.1.6, 256 bits)
* Device type: Cpu
Thankfully, these warnings are mostly inconsequential:
- The VALIDATION-SETTINGS warning complains that we are using an unnecessarily exhaustive validation configuration, which can have a strongly adverse effect on runtime performance. It suggests running the program multiple times with less extensive validation. This is cumbersome, though, which is why in this course we just let debug builds be slow.
- The BestPractices-specialuse-extension warnings complain about our use of debugging-focused extensions. But we do that on purpose to make debugging easier.
- The BestPractices-deprecated-extension warning complains about a genuine issue (we are using an old extension to configure the validation layer), however we can't easily fix it right now (vulkano does not support the new configuration mechanism yet).
- The WARNING-GPU-Assisted-Validation warnings complain about an internal implementation detail of GPU-assisted validation over which we have no control. They suggest a possible bug in GPU-assisted validation that should be reported at some point.
Other operating modes
By running a release build of the program instead, we see that the warnings go away, highlighting the fact that validation layers are only enabled in debug builds:
cargo run --release --bin info
Click here for example output
Vulkan instance ready:
- Max API version: 1.3.281
- Physical devices:
0. AMD Radeon Pro WX 3200 Series (RADV POLARIS12)
* Device type: DiscreteGpu
1. llvmpipe (LLVM 20.1.6, 256 bits)
* Device type: Cpu
…however, if you increase the Vulkan log verbosity by specifying the -v
command-line option to the output binary (which goes after a -- to separate it
from Cargo options), you will see that Vulkan logging remains enabled even in
release builds, as we would expect.
cargo run --release --bin info -- -v
Click here for example output
[INFO GENERAL Loader Message] No valid vk_loader_settings.json file found, no loader settings will be active
[INFO GENERAL Loader Message] Searching for implicit layer manifest files
[INFO GENERAL Loader Message] In following locations:
[INFO GENERAL Loader Message] /home/hadrien/.config/vulkan/implicit_layer.d
[INFO GENERAL Loader Message] /home/hadrien/.config/kdedefaults/vulkan/implicit_layer.d
[INFO GENERAL Loader Message] /etc/xdg/vulkan/implicit_layer.d
[INFO GENERAL Loader Message] /etc/vulkan/implicit_layer.d
[INFO GENERAL Loader Message] /home/hadrien/.local/share/vulkan/implicit_layer.d
[INFO GENERAL Loader Message] /home/hadrien/.local/share/flatpak/exports/share/vulkan/implicit_layer.d
[INFO GENERAL Loader Message] /var/lib/flatpak/exports/share/vulkan/implicit_layer.d
[INFO GENERAL Loader Message] /usr/local/share/vulkan/implicit_layer.d
[INFO GENERAL Loader Message] /usr/share/vulkan/implicit_layer.d
[INFO GENERAL Loader Message] Found the following files:
[INFO GENERAL Loader Message] /etc/vulkan/implicit_layer.d/renderdoc_capture.json
[INFO GENERAL Loader Message] /usr/share/vulkan/implicit_layer.d/MangoHud.x86_64.json
[INFO GENERAL Loader Message] /usr/share/vulkan/implicit_layer.d/VkLayer_MESA_device_select.json
[INFO GENERAL Loader Message] Found manifest file /etc/vulkan/implicit_layer.d/renderdoc_capture.json (file version 1.1.2)
[INFO GENERAL Loader Message] Found manifest file /usr/share/vulkan/implicit_layer.d/MangoHud.x86_64.json (file version 1.0.0)
[INFO GENERAL Loader Message] Found manifest file /usr/share/vulkan/implicit_layer.d/VkLayer_MESA_device_select.json (file version 1.0.0)
[INFO GENERAL Loader Message] Searching for explicit layer manifest files
[INFO GENERAL Loader Message] In following locations:
[INFO GENERAL Loader Message] /home/hadrien/.config/vulkan/explicit_layer.d
[INFO GENERAL Loader Message] /home/hadrien/.config/kdedefaults/vulkan/explicit_layer.d
[INFO GENERAL Loader Message] /etc/xdg/vulkan/explicit_layer.d
[INFO GENERAL Loader Message] /etc/vulkan/explicit_layer.d
[INFO GENERAL Loader Message] /home/hadrien/.local/share/vulkan/explicit_layer.d
[INFO GENERAL Loader Message] /home/hadrien/.local/share/flatpak/exports/share/vulkan/explicit_layer.d
[INFO GENERAL Loader Message] /var/lib/flatpak/exports/share/vulkan/explicit_layer.d
[INFO GENERAL Loader Message] /usr/local/share/vulkan/explicit_layer.d
[INFO GENERAL Loader Message] /usr/share/vulkan/explicit_layer.d
[INFO GENERAL Loader Message] Found the following files:
[INFO GENERAL Loader Message] /usr/share/vulkan/explicit_layer.d/VkLayer_api_dump.json
[INFO GENERAL Loader Message] /usr/share/vulkan/explicit_layer.d/VkLayer_monitor.json
[INFO GENERAL Loader Message] /usr/share/vulkan/explicit_layer.d/VkLayer_screenshot.json
[INFO GENERAL Loader Message] /usr/share/vulkan/explicit_layer.d/VkLayer_khronos_validation.json
[INFO GENERAL Loader Message] /usr/share/vulkan/explicit_layer.d/VkLayer_INTEL_nullhw.json
[INFO GENERAL Loader Message] /usr/share/vulkan/explicit_layer.d/VkLayer_MESA_overlay.json
[INFO GENERAL Loader Message] /usr/share/vulkan/explicit_layer.d/VkLayer_MESA_screenshot.json
[INFO GENERAL Loader Message] /usr/share/vulkan/explicit_layer.d/VkLayer_MESA_vram_report_limit.json
[INFO GENERAL Loader Message] Found manifest file /usr/share/vulkan/explicit_layer.d/VkLayer_api_dump.json (file version 1.2.0)
[INFO GENERAL Loader Message] Found manifest file /usr/share/vulkan/explicit_layer.d/VkLayer_monitor.json (file version 1.0.0)
[INFO GENERAL Loader Message] Found manifest file /usr/share/vulkan/explicit_layer.d/VkLayer_screenshot.json (file version 1.2.0)
[INFO GENERAL Loader Message] Found manifest file /usr/share/vulkan/explicit_layer.d/VkLayer_khronos_validation.json (file version 1.2.0)
[INFO GENERAL Loader Message] Found manifest file /usr/share/vulkan/explicit_layer.d/VkLayer_INTEL_nullhw.json (file version 1.0.0)
[INFO GENERAL Loader Message] Found manifest file /usr/share/vulkan/explicit_layer.d/VkLayer_MESA_overlay.json (file version 1.0.0)
[INFO GENERAL Loader Message] Found manifest file /usr/share/vulkan/explicit_layer.d/VkLayer_MESA_screenshot.json (file version 1.0.0)
[INFO GENERAL Loader Message] Found manifest file /usr/share/vulkan/explicit_layer.d/VkLayer_MESA_vram_report_limit.json (file version 1.0.0)
[INFO GENERAL Loader Message] Searching for driver manifest files
[INFO GENERAL Loader Message] In following locations:
[INFO GENERAL Loader Message] /home/hadrien/.config/vulkan/icd.d
[INFO GENERAL Loader Message] /home/hadrien/.config/kdedefaults/vulkan/icd.d
[INFO GENERAL Loader Message] /etc/xdg/vulkan/icd.d
[INFO GENERAL Loader Message] /etc/vulkan/icd.d
[INFO GENERAL Loader Message] /home/hadrien/.local/share/vulkan/icd.d
[INFO GENERAL Loader Message] /home/hadrien/.local/share/flatpak/exports/share/vulkan/icd.d
[INFO GENERAL Loader Message] /var/lib/flatpak/exports/share/vulkan/icd.d
[INFO GENERAL Loader Message] /usr/local/share/vulkan/icd.d
[INFO GENERAL Loader Message] /usr/share/vulkan/icd.d
[INFO GENERAL Loader Message] Found the following files:
[INFO GENERAL Loader Message] /usr/share/vulkan/icd.d/radeon_icd.x86_64.json
[INFO GENERAL Loader Message] /usr/share/vulkan/icd.d/lvp_icd.x86_64.json
[INFO GENERAL Loader Message] Found ICD manifest file /usr/share/vulkan/icd.d/radeon_icd.x86_64.json, version 1.0.0
[INFO GENERAL Loader Message] Found ICD manifest file /usr/share/vulkan/icd.d/lvp_icd.x86_64.json, version 1.0.0
[INFO GENERAL Loader Message] Insert instance layer "VK_LAYER_MESA_device_select" (libVkLayer_MESA_device_select.so)
[INFO GENERAL Loader Message] vkCreateInstance layer callstack setup to:
[INFO GENERAL Loader Message] <Application>
[INFO GENERAL Loader Message] ||
[INFO GENERAL Loader Message] <Loader>
[INFO GENERAL Loader Message] ||
[INFO GENERAL Loader Message] VK_LAYER_MESA_device_select
[INFO GENERAL Loader Message] Type: Implicit
[INFO GENERAL Loader Message] Enabled By: Implicit Layer
[INFO GENERAL Loader Message] Disable Env Var: NODEVICE_SELECT
[INFO GENERAL Loader Message] Manifest: /usr/share/vulkan/implicit_layer.d/VkLayer_MESA_device_select.json
[INFO GENERAL Loader Message] Library: libVkLayer_MESA_device_select.so
[INFO GENERAL Loader Message] ||
[INFO GENERAL Loader Message] <Drivers>
Vulkan instance ready:
- Max API version: 1.3.281
- Physical devices:
[INFO GENERAL Loader Message] linux_read_sorted_physical_devices:
[INFO GENERAL Loader Message] Original order:
[INFO GENERAL Loader Message] [0] llvmpipe (LLVM 20.1.6, 256 bits)
[INFO GENERAL Loader Message] [1] AMD Radeon Pro WX 3200 Series (RADV POLARIS12)
[INFO GENERAL Loader Message] Sorted order:
[INFO GENERAL Loader Message] [0] AMD Radeon Pro WX 3200 Series (RADV POLARIS12)
[INFO GENERAL Loader Message] [1] llvmpipe (LLVM 20.1.6, 256 bits)
[INFO GENERAL Loader Message] linux_read_sorted_physical_devices:
[INFO GENERAL Loader Message] Original order:
[INFO GENERAL Loader Message] [0] llvmpipe (LLVM 20.1.6, 256 bits)
[INFO GENERAL Loader Message] [1] AMD Radeon Pro WX 3200 Series (RADV POLARIS12)
[INFO GENERAL Loader Message] Sorted order:
[INFO GENERAL Loader Message] [0] AMD Radeon Pro WX 3200 Series (RADV POLARIS12)
[INFO GENERAL Loader Message] [1] llvmpipe (LLVM 20.1.6, 256 bits)
[INFO GENERAL Loader Message] linux_read_sorted_physical_devices:
[INFO GENERAL Loader Message] Original order:
[INFO GENERAL Loader Message] [0] llvmpipe (LLVM 20.1.6, 256 bits)
[INFO GENERAL Loader Message] [1] AMD Radeon Pro WX 3200 Series (RADV POLARIS12)
[INFO GENERAL Loader Message] Sorted order:
[INFO GENERAL Loader Message] [0] AMD Radeon Pro WX 3200 Series (RADV POLARIS12)
[INFO GENERAL Loader Message] [1] llvmpipe (LLVM 20.1.6, 256 bits)
[INFO GENERAL Loader Message] linux_read_sorted_physical_devices:
[INFO GENERAL Loader Message] Original order:
[INFO GENERAL Loader Message] [0] llvmpipe (LLVM 20.1.6, 256 bits)
[INFO GENERAL Loader Message] [1] AMD Radeon Pro WX 3200 Series (RADV POLARIS12)
[INFO GENERAL Loader Message] Sorted order:
[INFO GENERAL Loader Message] [0] AMD Radeon Pro WX 3200 Series (RADV POLARIS12)
[INFO GENERAL Loader Message] [1] llvmpipe (LLVM 20.1.6, 256 bits)
0. AMD Radeon Pro WX 3200 Series (RADV POLARIS12)
* Device type: DiscreteGpu
1. llvmpipe (LLVM 20.1.6, 256 bits)
* Device type: Cpu
Hands-on
You can query the full list of available command-line flags using the standard --help option, which goes after -- like other non-Cargo options.
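For example, the following command prints the full list of flags accepted by the info program:
cargo run --bin info -- --help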
Please play around with the various available CLI options and try to use this
utility to answer the following questions:
- Is your computer's GPU correctly detected, or do you only see a llvmpipe CPU emulation device (or worse, no device at all)?
  - Please report the absence of a GPU device to the teacher: with a bit of luck we may find the right system configuration tweak to get it to work.
- What optional instance extensions and layers does your Vulkan implementation support?
- How much device-local memory do your GPUs have?
- What Vulkan extensions do your GPUs support?
- (Linux-specific) Can you tell where on disk the shared libraries featuring Vulkan drivers (known as Installable Client Drivers or ICDs in Khronos API jargon) are stored?
Once your thirst for system configuration knowledge is quenched, you may then study the source code of this program, which is admittedly not the prettiest as it prioritizes beginner readability over maximal maintainability in more than one place…
Overall, this program demonstrates how various system properties can be queried
using the VulkanLibrary and Instance APIs. But not all available
properties are exposed because the Vulkan specification is huge and we are only
going to cover a subset of it in this course. However, if any property in the
documentation linked above gets you curious, do not hesitate to adjust the code
of the info program so that it gets printed as well!
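As an illustration, here is a hedged sketch of the kind of addition you could make: a helper that prints one extra device property which info does not show by default. The property name follows vulkano's Properties struct; plug the call into wherever the program iterates over physical devices.

use std::sync::Arc;
use vulkano::device::physical::PhysicalDevice;

/// Example extra printout for the info program (illustrative sketch): the
/// maximum number of work items per compute workgroup, a limit that will
/// matter when we size workgroups later in the course.
fn print_extra_properties(device: &Arc<PhysicalDevice>) {
    let properties = device.properties();
    println!(
        "  * Max compute work-group invocations: {}",
        properties.max_compute_work_group_invocations
    );
}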
-
…which has recently been deprecated and scheduled for replacement by VK_EXT_layer_settings, but alas vulkano does not support this new layer configuration mechanism yet. ↩
-
The Vulkan messaging API allows for synchronous implementations. In such implementations, when a Vulkan API call emits a message, it is interrupted midway through its internal processing while the message is being processed. This means that the Vulkan API implementation may be in an inconsistent state (e.g. some thread-local mutex may be locked). If our message processing callback then proceeds to make another Vulkan API call, this new API call will observe that inconsistent implementation state, which can result in an arbitrarily bad outcome (e.g. a thread deadlock in the above example). Furthermore, the new Vulkan API call could later emit more messages, potentially resulting in infinite recursion. ↩
Context
In the previous chapter, we went through the process of loading the system’s Vulkan library, querying its properties, and setting up an API instance, from which you can query the set of “physical”1 Vulkan devices available on your system.
After choosing one or more2 of these devices, the next thing we will want to
do is set them up, so that we can start sending API commands to them. In this
chapter, we will show how this device setup is performed, then cover a bit of
extra infrastructure that you will also usually want in vulkano-based
programs, namely object allocators and pipeline caches.
Together, the resulting objects will form a minimal vulkano API context that
is quite general-purpose: it can easily be extracted into a common library,
shared between many apps, and later extended with additional tuning knobs if you
ever need more configurability.
Device selection
As you may have seen while going through the exercise at the end of the previous chapter, it is common for a system to expose multiple physical Vulkan devices.
We could aim for maximal system utilization and try to use all devices at the same time, but such multi-device computations are surprisingly hard to get right.3 In this introductory course, we will thus favor the simpler strategy of selecting and using a single Vulkan device.
This, however, raises the question of which device we should pick:
- We could just pick the first device that comes in Vulkan’s device list, which is effectively what OpenGL programs do. But the device list is ordered arbitrarily, so we may face issues like using a slow integrated GPU on “hybrid graphics” laptops that have a fast dedicated GPU available.
- We could ask the user which device should be used. But prompting that on every run would get annoying quickly. And making it a mandatory CLI argument would violate the basic UX principle that programs should do something sensible in their default configuration.
- We could try to pick a “best” device using some heuristics. But since this is an introductory course we don’t want to spend too much time on fine-tuning the associated logic, so we’ll go for a basic strategy that is likely to pick the wrong device on some systems.
To balance these pros and cons, we will use a mixture of strategies #2 and #3 above:
- Through an optional CLI argument, we will let users explicitly pick a device in Vulkan's device list, using the numbering exposed by the info utility, when they feel so inclined.
- When this CLI argument is not specified, we will rank devices by device type (discrete GPU, integrated GPU, CPU emulation…) and pick a device of the type that we expect to be most performant. This is enough to resolve simple4 multi-device ambiguities, such as picking between a discrete and integrated GPU or between a GPU and an emulation thereof.
This device selection strategy can be easily implemented using Rust's iterator methods. Notice how strings can be turned into errors for simple error handling.
use crate::Result;
use clap::Args;
use std::sync::Arc;
use vulkano::{
device::physical::{PhysicalDevice, PhysicalDeviceType},
instance::Instance,
};
/// CLI parameters that guide device selection
#[derive(Debug, Args)]
pub struct DeviceOptions {
/// Index of the Vulkan device that should be used
///
/// You can learn what each device index corresponds to using
/// the provided "info" program or the standard "vulkaninfo" utility.
#[arg(env, short, long)]
pub device_index: Option<usize>,
}
/// Pick a physical device
fn select_physical_device(
instance: &Arc<Instance>,
options: &DeviceOptions,
quiet: bool,
) -> Result<Arc<PhysicalDevice>> {
let mut devices = instance.enumerate_physical_devices()?;
if let Some(index) = options.device_index {
// If the user asked for a specific device, look it up
devices
.nth(index)
.inspect(|device| {
if !quiet {
eprintln!(
"Selected requested device {:?}",
device.properties().device_name
)
}
})
.ok_or_else(|| format!("There is no Vulkan device with index {index}").into())
} else {
// Otherwise, choose a device according to its device type
devices
.min_by_key(|dev| match dev.properties().device_type {
// Discrete GPUs are expected to be fastest
PhysicalDeviceType::DiscreteGpu => 0,
// Virtual GPUs are hopefully discrete GPUs exposed
// to a VM via PCIe passthrough, which is reasonably cheap
PhysicalDeviceType::VirtualGpu => 1,
// Integrated GPUs are usually much slower than discrete ones
PhysicalDeviceType::IntegratedGpu => 2,
// CPU emulation of GPUs is not known for being efficient...
PhysicalDeviceType::Cpu => 3,
// ...but it's better than other types we know nothing about
PhysicalDeviceType::Other => 4,
_ => 5,
})
.inspect(|device| {
if !quiet {
eprintln!("Auto-selected device {:?}", device.properties().device_name)
}
})
.ok_or_else(|| "No Vulkan device available".into())
}
}
Notice the quiet boolean parameter, which suppresses console printouts about the GPU device in use. This will come in handy when we benchmark context building at the end of the chapter.
Device and queue setup
Once we have selected a PhysicalDevice, we must set it up before we can use
it. There are similarities between this process and that of building an
Instance
from a
VulkanLibrary:
in both cases, after discovering what our system could do, we must specify
what it should do.
One important difference, however, is that the device setup process produces
more than just a Device object, which is used in a wide range of
circumstances from compiling GPU programs to allocating GPU resources. It also
produces a set of Queue objects, which we will later use to submit commands
for asynchronous execution.
These asynchronous commands are very important because they implement the tasks that a well-optimized Vulkan program will spend most of its GPU time doing. For example, they can be used to transfer data between CPU and GPU memory, or to execute GPU code.
We’ll give this command scheduling process the full attention it deserves in a subsequent chapter, but at this point, the main thing you need to know is that a typical GPU comes with not one, but several hardware units capable of receiving commands from the CPU and scheduling them for execution on the GPU. These command scheduling units have the following characteristics:
- They operate in parallel, but the underlying hardware resources on which submitted work eventually executes are shared between them.
- They process commands in a mostly FIFO fashion, and are thus called queues in
the Vulkan specification. But they do not fully match programmer intuition
about queues, because they also have a limited and hardware-dependent ability
to run some commands in parallel.
- For example, if a GPU program does not fully utilize available execution resources and the next command schedules execution of another GPU program, the two programs may end up running concurrently.
- Due to hardware limitations, you will often need to submit commands to several queues concurrently in order to fully utilize the GPU’s resources.
- Some queues may be specialized in executing specific kinds of commands (e.g. data transfer commands) and unable to execute other kinds of commands.
Vulkan exposes this hardware feature in the form of queue families whose basic
properties
can be queried from a
PhysicalDevice.
Each queue family represents a group of hardware queues. At device
initialization time, we must request the creation of one or more logical queues
and specify which hardware queues they should map to.
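To make this more concrete, here is a minimal sketch (not part of the exercise code) of how the queue families advertised by a physical device can be listed with vulkano; the counts and flags you see are whatever your driver reports:

use std::sync::Arc;
use vulkano::device::physical::PhysicalDevice;

/// List the queue families advertised by a physical device (illustrative sketch)
fn print_queue_families(device: &Arc<PhysicalDevice>) {
    for (index, family) in device.queue_family_properties().iter().enumerate() {
        println!(
            "Queue family {index}: {} queue(s) with flags {:?}",
            family.queue_count, family.queue_flags
        );
    }
}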
Unfortunately, the Vulkan API provides very little information about queue families, and it will often take a round trip through manufacturer documentation to understand what the various queue families represent in hardware and how multiple hardware queues should be used.
However, our introductory number-squaring program is so simple that it does not benefit that much from multiple Vulkan queues anyway. Therefore, in this first part of the course, we can take the shortcut of allocating a single queue that maps into the first queue family that supports compute operations (which, per the Vulkan specification, implies support for data transfer operations).
use vulkano::device::QueueFlags;
/// Pick the first queue family that supports compute operations
///
/// While the Vulkan specification does not mandate that such a queue family
/// exists, it does mandate that if any family supports graphics operations,
/// then at least one family must support compute operations. And a Vulkan
/// device that supports no graphics operation would be very much unexpected...
fn queue_family_index(device: &PhysicalDevice) -> usize {
device
.queue_family_properties()
.iter()
.position(|family| family.queue_flags.contains(QueueFlags::COMPUTE))
.expect("Device does not support compute (or graphics)")
}
Knowing this queue family index, setting up a device with a single queue from this family becomes rather straightforward:
use vulkano::device::{Device, DeviceCreateInfo, Queue, QueueCreateInfo};
/// Set up a device with a single command queue that can schedule computations
/// and memory transfer operations.
fn setup_device(device: Arc<PhysicalDevice>) -> Result<(Arc<Device>, Arc<Queue>)> {
let queue_family_index = queue_family_index(&device) as u32;
let (device, mut queues) = Device::new(
device,
DeviceCreateInfo {
queue_create_infos: vec![QueueCreateInfo {
queue_family_index,
..Default::default()
}],
..Default::default()
},
)?;
let queue = queues
.next()
.expect("We asked for one queue, we should get one");
Ok((device, queue))
}
As when creating an instance before, this is a place where we could enable optional Vulkan API extensions supported by the physical device. But in the case of devices, these extensions are supplemented by a related concept called features, which represent optional Vulkan API functionality that our device may or may not support.
As you may guess, the nuance between these two concepts is subtle:
- Features do not need to come from extensions, they may exist even in the core
Vulkan specification. They model optional functionality that a device may or
may not support, or that an application may or may not want to enable.
- An example of the former is the ability to perform atomic operations on floating-point data inside GPU programs. Hardware support for these operations varies widely.
- An example of the latter is the ability to make accesses to memory resources bound-checked in order to reduce avenues for undefined behavior. This is important for e.g. web browsers that execute untrusted GPU code from web pages, but comes at a performance cost that performance-sensitive apps may want to avoid.
- Extensions may want to define features even if the mere act of enabling an
extension is arguably an opt-in for optional functionality, if the
functionality of interest can be further broken down into several closely
related sub-parts.
- For example, the former VK_KHR_8bit_storage extension (now part of Vulkan 1.2 core), which specified the ability for GPU code to manipulate 8-bit integers, provided 3 separate feature flags to represent the ability to manipulate 8-bit integers from 3 different kinds of GPU memory resources (storage buffers, uniform buffers, and push constants).
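Although our number-squaring program needs neither, here is a rough sketch of how a device extension and its associated feature could be requested at device creation time. Treat this as an assumption-laden illustration rather than copy-paste material: the exact name of the features struct varies across vulkano releases (Features in older ones, DeviceFeatures in newer ones), and real code should first check physical device support.

use crate::Result;
use std::sync::Arc;
use vulkano::device::{
    physical::PhysicalDevice, Device, DeviceCreateInfo, DeviceExtensions, DeviceFeatures,
    Queue, QueueCreateInfo,
};

/// Variant of `setup_device` that requests 8-bit storage support (sketch)
fn setup_device_with_8bit_storage(
    device: Arc<PhysicalDevice>,
    queue_family_index: u32,
) -> Result<(Arc<Device>, Arc<Queue>)> {
    // Device creation fails if the extension or feature is unsupported, so a
    // robust program would query supported_extensions()/supported_features()
    // on the physical device before asking for them.
    let (device, mut queues) = Device::new(
        device,
        DeviceCreateInfo {
            queue_create_infos: vec![QueueCreateInfo {
                queue_family_index,
                ..Default::default()
            }],
            enabled_extensions: DeviceExtensions {
                khr_8bit_storage: true,
                ..DeviceExtensions::empty()
            },
            enabled_features: DeviceFeatures {
                storage_buffer8_bit_access: true,
                ..DeviceFeatures::empty()
            },
            ..Default::default()
        },
    )?;
    let queue = queues.next().expect("We asked for one queue");
    Ok((device, queue))
}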
Pipeline cache
In programming languages that favor ahead-of-time compilation, like Rust and C/++, compilers know a fair bit about the CPU ISA that the program is destined to run on, enough to emit machine code that the target CPUs can process directly. This allows pure-CPU Rust programs to execute at top speed almost instantly, without the slow starts that plague programming languages which prefer to postpone compilation work to runtime (just-in-time compilation), like Julia, Java and C#.
GPU programs, however, cannot enjoy this luxury when hardware portability is desired, because the set of GPU architectures that even a single CPU architecture can host is very large and GPU instruction sets are not designed with backwards compatibility in mind.5 As a result, just-in-time compilation is the dominant paradigm in the GPU world, and slow startup is a common issue in even slightly complex GPU programs.
Over time, various strategies have been implemented to mitigate this issue:
- Following the lead of Java and C#, GPU programming APIs have gradually replaced C-based GPU programming languages with pre-compiled intermediate representations like SPIR-V, which are closer to machine code and can be turned more quickly into an optimized binary for the target GPU hardware. This also had the desirable side-effect of improving the reliability of GPU drivers, which have a notoriously hard time correctly compiling high-level languages.
- GPU drivers have tried to avoid compilation entirely after the first program run via caching techniques, which let them reuse previously compiled binaries if the input program has not changed. Unfortunately, detecting whether a program has changed can be a surprisingly hard problem in the presence of external dependencies like those brought in by the C #include directive. And it is unwise to push such fun cache invalidation challenges onto GPU driver developers, who are not known for their attention to software quality. Furthermore, making this caching process implicit also prevents GPU applications from supplementing the just-in-time compilation process with pre-compiled binaries for common system configurations, so that programs can run fast right from the first run in some best-case scenarios.
Acknowledging the issues of the overly implicit binary caching approaches of its predecessors,6 Vulkan enforces a more explicit caching model in which applications are in direct control of the cache that holds previously compiled GPU programs. They can therefore easily flush the cache when a fresh compilation is desired, or save it to files and share it across machines as needed.
The provided code library contains a PersistentPipelineCache struct that
leverages this functionality to cache previously compiled GPU code across
program runs, by saving the pipeline cache into a standard OS location such as
the XDG ~/.cache directory on Linux. These standard locations are easily
looked up in a cross-platform manner using the
directories crate. As
vulkano’s PipelineCache API is rather basic and easy to use, this code is
mostly about file manipulation and not very interesting from a Vulkan teaching
perspective, so we will not describe it here. Please look it up in the provided
example codebase if interested, and ask any question that arises!
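For reference, the save-side logic boils down to a few lines of file manipulation. The following sketch uses hypothetical application identifiers for the directories crate and sidesteps cache construction, which the provided PersistentPipelineCache already handles:

use crate::Result;
use directories::ProjectDirs;
use std::{fs, path::PathBuf};
use vulkano::pipeline::cache::PipelineCache;

/// Compute a cache file location in the OS-standard cache directory
/// (hypothetical qualifier/organization/application names, adjust as needed)
fn cache_file_path() -> Option<PathBuf> {
    ProjectDirs::from("org", "example", "gpu-course")
        .map(|dirs| dirs.cache_dir().join("pipeline_cache.bin"))
}

/// Save the pipeline cache contents so that a later run can feed them back as
/// the cache's initial data and skip most GPU code compilation (sketch)
fn save_pipeline_cache(cache: &PipelineCache) -> Result<()> {
    if let Some(path) = cache_file_path() {
        if let Some(parent) = path.parent() {
            fs::create_dir_all(parent)?;
        }
        fs::write(path, cache.get_data()?)?;
    }
    Ok(())
}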
Allocators
Ever since the existence of absolute zero temperature was demonstrated by statistical physics, top minds in cryogenics have devoted enormous resources to getting increasingly close to it, to the point where humanity can nowadays reliably cool atom clouds down to millionths of a degree above absolute zero. But awe-inspiring as it may be, this technological prowess pales in comparison to how close GPU driver memory allocators have always been to absolute zero performance.
The performance of GPU driver memory allocators is so incredibly bad that most seasoned GPU programmers avoid calling the GPU API's memory allocator at all costs. They do so through techniques like application-side sub-allocation and automatic allocation reuse, which would be considered relatively advanced by CPU programming standards.7 Acknowledging this, vulkano supports and encourages the use of application-side memory allocators throughout its high-level API.
Vulkan differentiates three categories of memory objects that are allocated using completely different APIs, likely because they may map onto different memories in some GPUs. This unsurprisingly maps into three vulkano memory allocator objects that must be set up independently (and can be independently replaced with alternate implementations if needed):
- The StandardMemoryAllocator is used to allocate large and relatively long-lived memory resources like buffers and images. These are likely to be what first comes to your mind when thinking about GPU memory allocations.
- The StandardDescriptorSetAllocator is used to allocate descriptor sets, which are groups of the above memory resources. Resources are grouped like this so that you can attach them to GPU programs using bulk operations, instead of having to do it on a fine-grained basis, which was a common performance bottleneck of older GPU APIs.
- The StandardCommandBufferAllocator is used to allocate command buffers, which are small short-lived objects that are created every time you submit commands to the GPU. As you can imagine, this allocator is at a higher risk of becoming a performance bottleneck than the others, which is why Vulkan allows you to amortize its overhead by submitting commands in bulk, as we will see in a subsequent chapter.
Since the default configuration is fine for our purposes, setting up these
allocators is rather straightforward. There is just one API curiosity that must
be taken care of, namely that unlike every other object constructor in
vulkano’s API, the constructors of memory allocators do not automatically wrap
them in atomically reference-counted Arc pointers. This must be done before
they can be used with vulkano’s high-level safe API, so you will need to do
this on your side:
use vulkano::{
command_buffer::allocator::StandardCommandBufferAllocatorCreateInfo,
descriptor_set::allocator::StandardDescriptorSetAllocatorCreateInfo,
};
// A few type aliases that will let us more easily switch to another memory
// allocator implementation if we ever need to
pub type MemoryAllocator = vulkano::memory::allocator::StandardMemoryAllocator;
pub type CommandBufferAllocator =
vulkano::command_buffer::allocator::StandardCommandBufferAllocator;
pub type DescriptorSetAllocator =
vulkano::descriptor_set::allocator::StandardDescriptorSetAllocator;
/// Set up all memory allocators required by the high-level `vulkano` API
fn setup_allocators(
device: Arc<Device>,
) -> (
Arc<MemoryAllocator>,
Arc<DescriptorSetAllocator>,
Arc<CommandBufferAllocator>,
) {
let malloc = Arc::new(MemoryAllocator::new_default(device.clone()));
let dalloc = Arc::new(DescriptorSetAllocator::new(
device.clone(),
StandardDescriptorSetAllocatorCreateInfo::default(),
));
let calloc = Arc::new(CommandBufferAllocator::new(
device,
StandardCommandBufferAllocatorCreateInfo::default(),
));
(malloc, dalloc, calloc)
}
Putting it all together
With that, we reach the end of the Vulkan application setup that is rather problem-agnostic and could easily be shared across many applications, possibly with the addition of a few extra configuration hooks (e.g. a way to enable Vulkan extensions if our apps use them).
Let’s recap the vulkano objects that we have set up so far and will need later
in this course:
- A Device is the initialized version of a PhysicalDevice. It is involved in most API operations that optimized programs are not expected to spend a lot of time doing, like setting up compute pipelines or allocating memory resources. To keep this introductory course simple, we will only use a single (user- or heuristically-selected) device.
- At device setup time, we also request the creation of one or more Queues. These will be used to submit GPU commands that may take a while to execute, and remain frequently used after the initial application setup stage. Use of multiple queues can help performance, but is a bit of a hardware-specific black art, so we will not discuss it much.
- To avoid recompiling GPU code on each application startup, it is good practice to set up a PipelineCache and make sure that its contents are saved on application shutdown and reloaded on application startup. We provide a simple PersistentPipelineCache abstraction that handles this in a manner that honors OS-specific cache storage recommendations.
- Because GPU driver allocators are incredibly slow, supplementing them with an application-side allocator that calls into them as rarely as possible is necessary for optimal performance. We will need one for GPU memory resources, one for descriptor sets (i.e. sets of memory resources), and one for command buffers. For this course's purpose, the default allocators provided by vulkano will do this job just fine without any special settings tweaks.
- And finally, we must keep around the DebugUtilsMessenger that we set up in the previous chapter, which ensures that any diagnostic message emitted by the Vulkan implementation will still pop up in our terminal for easy debugging.
To maximally streamline the common setup process, we will group all these
objects into a single Context struct whose constructor takes care of all the
setup details seen so far for us:
/// CLI parameters for setting up a full `Context`
#[derive(Debug, Args)]
pub struct ContextOptions {
/// Instance configuration parameters
#[command(flatten)]
pub instance: InstanceOptions,
/// Device selection parameters
#[command(flatten)]
pub device: DeviceOptions,
}
/// Basic Vulkan setup that all our example programs will share
pub struct Context {
pub device: Arc<Device>,
pub queue: Arc<Queue>,
pipeline_cache: PersistentPipelineCache,
pub mem_allocator: Arc<MemoryAllocator>,
pub desc_allocator: Arc<DescriptorSetAllocator>,
pub comm_allocator: Arc<CommandBufferAllocator>,
_messenger: Option<DebugUtilsMessenger>,
}
//
impl Context {
/// Set up a `Context`
pub fn new(options: &ContextOptions, quiet: bool) -> Result<Self> {
let library = VulkanLibrary::new()?;
let mut logging_instance = LoggingInstance::new(library, &options.instance)?;
let physical_device =
select_physical_device(&logging_instance.instance, &options.device, quiet)?;
let (device, queue) = setup_device(physical_device)?;
let pipeline_cache = PersistentPipelineCache::new(device.clone())?;
let (mem_allocator, desc_allocator, comm_allocator) = setup_allocators(device.clone());
let _messenger = logging_instance.messenger.take();
Ok(Self {
device,
queue,
pipeline_cache,
mem_allocator,
desc_allocator,
comm_allocator,
_messenger,
})
}
/// Get a handle to the pipeline cache
pub fn pipeline_cache(&self) -> Arc<PipelineCache> {
self.pipeline_cache.cache.clone()
}
}
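As a usage sketch, a standalone binary that only needs this shared setup could look as follows. The exercises crate name is a placeholder here; use whatever name the provided utility library declares in its Cargo.toml:

// Hypothetical standalone binary using the shared context; `exercises` is a
// placeholder for the actual name of the provided utility library crate.
use clap::Parser;
use exercises::{Context, ContextOptions, Result};

#[derive(Debug, Parser)]
struct Args {
    #[command(flatten)]
    context: ContextOptions,
}

fn main() -> Result<()> {
    let args = Args::parse();
    let context = Context::new(&args.context, false)?;
    println!(
        "Context ready on device {:?}",
        context.device.physical_device().properties().device_name
    );
    Ok(())
}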
Exercise
For now, the square binary does nothing but set up a basic Vulkan context as
described above. Run a debug build of it with the following command…
cargo run --bin square
…and make sure that it executes without errors. A few warnings from the validation layers are expected. Some were discussed in the previous chapter, while most of the new ones warn you that the GPU-assisted validation layer has force-enabled a few optional Vulkan features that we do not need, but that its own implementation does need.
Once this is done, take a moment to look at the definition of the Context
struct above, and make sure you have a basic understanding of what its
components are doing or will later be useful for. Do not hesitate to quickly
review the previous chapters and the vulkano
documentation as necessary.
If you are curious and relatively ahead of the group in terms of progress,
consider also checking out the constructors of the various vulkano objects
involved in order to learn more about the many optional features and
configuration tunables that we could have used, but chose not to.
-
Vulkan physical devices may sadly not map onto a physical piece of hardware in your computer. For example, Linux users will often see the llvmpipe GPU emulator in their physical device list. The reason why Vulkan calls them physical devices anyway is that some API naming trick was needed in order to distinguish these uninitialized devices, which can just be queried for properties, from the initialized device objects that we will spend most of our time using later on. ↩
-
Part of the reason why Vulkan makes device selection explicit, instead of arbitrarily picking one device by default like most GPU APIs do, is that it makes multi-GPU workflows easier. Since you always specify which device you are using as a parameter to your Vulkan commands, refactoring a program that uses a single GPU to use multiple ones is easier when using Vulkan. This is great because single-device programs are easier to write and test, and therefore best for initial prototyping. ↩
-
Among other things, multi-GPU programs may require load balancing between devices of unequal performance capabilities, more complex profiling and debugging workflows, careful balance between the goals of using all available computing power and avoiding slow cross-device communication… and these are just the most obvious issues. More advanced concerns include the inefficiency of using a CPU-based GPU emulation compared to an optimized CPU implementation, and thermal throttling issues that arise when firing up multiple devices that share a common heatsink like a CPU and its integrated GPU. ↩
-
One example of a system environment where this simple strategy is not good enough would be a worker node in an HPC center running an older version of the Slurm scheduler. These nodes typically contain a number of nearly-identical GPUs that only differ by PCI bus address and UUID. Older versions of Slurm would expose all GPUs to your program, but tell it which GPUs were allocated to your job using an environment variable whose name and syntax is specific to the underlying GPU vendor. Vendor-specific compute runtimes like NVidia CUDA and AMD ROCm would then parse these environment variables and adjust their implicit device selection strategy accordingly. As you can imagine, implementing this sort of vendor-specific hackery does not amuse the Vulkan programmer, but thankfully newer versions of Slurm have finally learned how to hide unallocated GPUs using
cgroups. ↩
-
Even binary format compatibility is not guaranteed, so a GPU driver update can be all it takes to break binary compatibility with previously compiled GPU programs. ↩
-
To be fair, an attempt was made in previous GPU APIs like OpenGL and OpenCL to allow programmers to export and manage pre-compiled GPU modules and programs. But it was later discovered that this feature had to be supplemented with extra compilation and caching on the GPU driver side, which defeated its purpose. Indeed, the most optimized version of a GPU program could depend on some specifics of how memory resources were bound to it, and in legacy APIs this was not known until resource binding time, which would typically occur after unsuspecting developers had already exported their GPU binaries. This is why the notion of graphics and compute pipelines, which we will cover soon, was introduced into Vulkan. ↩
-
Largely because any self-respecting libc memory allocator implementation already features these optimizations, which means that it is only in relatively niche use cases that programmers will benefit from re-implementing these optimizations themselves, without also coming to the realization that they are doing a lot more memory allocations than they should and could achieve a much greater speedup by rethinking their memory management strategy entirely. ↩
Pipeline
Now that we have set up some generic Vulkan infrastructure, we are ready to
start working on our specific problem, namely squaring an array of
floating-point numbers. The first step in this journey will be to set up a
ComputePipeline,
which is a GPU program that performs the squaring operation on some unspecified block of GPU memory. As you will see in this chapter, this process already involves a surprisingly large number of steps.
Choosing a language
GPU-side code has traditionally been written using a domain-specific programming language. Each major GPU API would provide its own language, so for a long time the top players in the portable GPU API space were GLSL for OpenGL, OpenCL C for OpenCL and HLSL for Direct3D.
More recently these old-timers have been joined by MSL for Metal and WGSL for WebGPU. But most importantly Khronos APIs have moved away from ingesting GPU programs written in a specific programmer-facing language, and are instead defined in terms of an assembly-like intermediate compiler representation called SPIR-V. This has several benefits:
- GPU drivers become simpler, as they no longer need to bundle a full compiler for a C-like language. This gives application developers faster GPU code compilation and fewer driver bugs.
- GPU programs go through a first round of compilation during the application-building process. This provides opportunities for faster compile-time error reporting (before the application starts) and program optimizations that benefit all GPU drivers.
- Interoperability between GPU APIs becomes easier because translating each GPU DSL to SPIR-V is easier than translating from one DSL to another.
- Introducing new GPU programming languages like Slang, or adapting CPU-oriented programming languages like Rust for GPU programming, becomes easier.
The last point raises the question: should this course keep using the traditional GLSL programming language from Khronos, embrace a more modern GPU programming language like Slang, or leverage the rust-gpu project to get rid of the cross-language interface and be able to write all code in Rust? For this edition, we chose to keep using GLSL for a few reasons:
- Vulkan is specified in terms of SPIR-V, but given SPIR-V’s assembly-like nature, writing this course’s code examples directly in SPIR-V would not be a pedagogically sensible option.
- Because the Khronos Group maintains all of the Vulkan, SPIR-V and GLSL specifications, they are quick to extend GLSL with any feature that gets added to SPIR-V. This means that any new GPU programming feature that gets added to Vulkan and SPIR-V will be usable from GLSL first, before it gets added to any other GPU programming language.
- Large amounts of existing Vulkan code and training material is written in terms of GLSL programs. So if you need help with your Vulkan code, you are more likely to find it with GLSL than if you use another language that compiles to SPIR-V.
- GLSL is a rather easy language to learn. Being purpose-built for GPU programming, it also naturally integrates several GPU hardware features and limitations that feel quite out of place in a general-purpose CPU programming language like Rust or C/++.1
- As rust-gpu specifically is neither very mature nor well-documented, integrating it into a vulkano-based application involves jumping through a few hoops. In contrast, GLSL enjoys good first-party documentation and direct integration into vulkano (via vulkano-shaders) that make it particularly easy to use in vulkano-based applications.
That being said, the rust-gpu team keeps building more and more advanced
demos, and the
burn team have been achieving pretty amazing
performance with their
Rust-based CubeCL domain-specific
language. So the Rust-based GPU programming ecosystem is far from being set in
stone, and you should keep watching new developments in this area.
Number-squaring shader
Like OpenGL before it, Vulkan supports two different styles of GPU programs or pipelines:
- Graphics pipelines are designed for traditional 3D graphics applications. They typically2 render textured triangles to a bitmap image target (like a screen) through a multi-stage pipeline, where some stages are customizable via user-defined code, and others are implemented using specialized hardware. The user-defined pipeline hooks are commonly called shaders (vertex shader, tessellation shader, fragment shader…), likely because the final output is a shade of color.
- Compute pipelines were introduced much later, in the early 2010s, following the availability of increasingly general-purpose GPU hardware on which triangle-rendering became a special case rather than a core hardware feature. They greatly simplify the aforementioned multi-stage pipeline into a single compute shader stage, which is more appropriate for computations that do not naturally bend into a triangle-rendering shape.
Because it is focused on general-purpose numerical computations, this course will exclusively discuss compute pipelines and shaders. Our first number-squaring program will therefore be implemented as a GLSL compute shader.
Unlike other programming languages, GLSL makes language version requirements something that is directly specified by the program, rather than indirectly requested through e.g. compiler flags. Our GLSL program thus starts by stating which GLSL specification revision it is written against:
#version 460
We then specify how our GPU code will interface with CPU-side code. This is a danger zone: any change to this part will often need to be accompanied by matching changes to the CPU-side code.
First of all, we begin by specifying how our GPU program will exchange data with the outside world. CPU-GPU interfaces are specified using GLSL interface blocks, and the particular kind of interface block that we are using here is called a shader storage block.
// Shader storage block used to feed in the input/output data
layout(set = 0, binding = 0) buffer DataBuffer {
float data[];
} Data;
Let’s break down this unfamiliar GLSL syntax:
- With the buffer keyword, we tell the GLSL compiler that before we run this program, we are going to attach a buffer to it. Buffers represent blocks of GPU-accessible memory with a user-defined data layout, and are one of the two basic kinds of Vulkan memory resources.3
- To bind a buffer to this shader storage block, we will need to refer to it using some identifier on the CPU side. In GLSL, integer identifiers are specified inside of a layout() clause for this purpose. Vulkan uses hierarchical identifiers composed of two numbers, a set number and a relative binding identifier within that set, which allows resources to be bound at the granularity of entire sets. In this way, the overhead of resource binding operations can be amortized, which is good as these were a common performance bottleneck in pre-Vulkan GPU APIs.
- Interface blocks must have a block name (here DataBuffer), which in the case of compute pipelines4 is only used by CPU-side tooling like error messages and debugging tools. They may also have an optional instance name (here Data), which is used to scope the inner data members. Without the latter, members of the storage block will reside at global scope.
- Finally, in a pair of curly braces between the block name and the instance name, a set of data members is defined, with a syntax similar to that of a C struct declaration.5 As in C, the last member can be a dynamically sized array, which is how we express that our buffer contains an array of single-precision floating-point numbers of unspecified length.
After specifying our shader’s input/output data configuration, we then specify its execution configuration by setting a default workgroup size and creating a specialization constant6 that can be used to change the workgroup size from the CPU side.
// 1D shader workgroups default to 64 work items, this can be reconfigured
layout(local_size_x = 64) in;
layout(local_size_x_id = 0) in;
Again, this warrants some explanations:
- As we will later see, a compute shader executes as a one- to three-dimensional grid of workgroups, each of which contains an identically sized chunk of work items, which are sequential tasks that are relatively independent from each other.7
- The size of workgroups represents the granularity at which work items are distributed across the GPU's compute units and awaited for completion. In more advanced GPU programs, it also controls the granularity at which work items may easily synchronize with each other. Because this parameter affects many aspects of compute shader execution, shaders will often execute most efficiently at a certain hardware- and workload-dependent workgroup size that is hard to predict ahead of time and best tuned through empirical benchmarking.
- Because of this, and because the correctness of a compute shader depends on how many work items are spawned while the Vulkan API for executing compute pipelines specifies how many workgroups are spawned, it is best if the size of workgroups is controlled from the CPU side and easily tunable for some particular hardware, rather than hardcoded in GLSL (we will sketch the corresponding CPU-side arithmetic right after this list).
- Therefore, although GLSL only mandates that a default workgroup size be specified in the shader via the layout(local_size_x = ...) in; syntax, we additionally use the layout(local_size_x_id = ...) in; syntax to define a specialization constant associated with the workgroup size. We will later use it to change the workgroup size from CPU code.
- We only need to specify the first workgroup dimension (x) in this one-dimensional computation. GLSL will infer the remaining dimensions to be equal to 1.
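To make the CPU-side consequence of this concrete, the dispatch code that we will write in a later chapter will need to round the number of work items up to a whole number of workgroups, along these lines (a sketch of the arithmetic only):

/// Number of workgroups needed so that every data element gets a work item,
/// given that work items can only be spawned in whole workgroups (sketch)
fn workgroup_count(num_elements: u32, workgroup_size: u32) -> u32 {
    // Integer ceiling division: leftover elements get one extra workgroup,
    // whose surplus work items are discarded by the shader's bounds check.
    // e.g. 1000 elements with a workgroup size of 64 -> 16 workgroups.
    num_elements.div_ceil(workgroup_size)
}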
Finally, after specifying how the shader interfaces with the CPU, we can write the entry point that performs the expected number squaring:
void main() {
const uint index = gl_GlobalInvocationID.x;
if (index < Data.data.length()) {
Data.data[index] *= Data.data[index];
}
}
Here again, although this code snippet is short, it has several aspects worth highlighting:
- Following an old C tradition, GLSL mandates that our entry point be called main(), take no parameters and return no output value.
- Inside of the entry point, we specify the work that each of the work items is meant to do, treating that work item as a sequential task running in parallel with all other work items. In our case, each work item squares a single floating-point number8 from the data array that we have previously declared as part of the Data shader storage block.
- To know which work item we are dealing with, we use the gl_GlobalInvocationID built-in variable from GLSL. This provides a 3D coordinate of the active work item within the overall 3D grid of all work items. Here our problem is inherently one-dimensional, so we treat that 3D grid as a 1D grid by setting its second and third dimensions to 1. Thus we only care about the first dimension (x coordinate) of the gl_GlobalInvocationID integer vector.
- In an ideal world, we would like to execute this compute shader in a configuration that has exactly one work item per floating-point number in the Data.data array, and that would be the end of it. But in the real world, we cannot do so, because we can only spawn a number of work items that is a multiple of our workgroup size, and said workgroup size must be large enough (typically a multiple of 64)9 for us to achieve good execution efficiency. Thus we will need to spawn a few extra work items and cut them out of the computation using the kind of bounds checking featured in the code above.
- Although designed to look like C, GLSL is rather different from C from a semantics point of view. In particular it has no pointers, few implicit conversions, and even dynamically sized arrays have a known size that can be queried using a .length() built-in method.
Putting it all together, our float-squaring compute shader looks like this:
#version 460
// Shader storage block used to feed in the input/output data
layout(set = 0, binding = 0) buffer DataBuffer {
float data[];
} Data;
// 1D shader workgroups default to 64 work items, this can be reconfigured
layout(local_size_x = 64) in;
layout(local_size_x_id = 0) in;
void main() {
const uint index = gl_GlobalInvocationID.x;
if (index < Data.data.length()) {
Data.data[index] *= Data.data[index];
}
}
We will now proceed to save it into a source file. File extension .comp is
advised for easy compatibility with text editor plugins like GLSL linters, so we
propose to save it at location exercises/src/square.comp. And once this is
done, we will be able to proceed with the next step, which is to load this GPU
code into our Rust program and start building a compute pipeline out of it.
SPIR-V interface
As mentioned earlier, Vulkan cannot directly use GLSL shaders. They must first
be compiled into an assembly-like intermediate representation called SPIR-V.
This entails the use of tools like
shaderc during application compilation,
which slightly complicates the build process.
Because we use the higher-level vulkano Vulkan bindings, however, we have
access to its optional vulkano-shaders
component, which makes usage of GLSL shaders a fair bit easier. To be more
specific, vulkano-shaders currently provides the following functionality:
- Find the shaderc GLSL-to-SPIR-V compiler if it is installed on the host system, or download and build an internal copy otherwise.
- Use shaderc at compilation time to translate the application's GLSL shaders to SPIR-V, then bundle the resulting SPIR-V into the application executable, automatically updating it whenever the original GLSL source code changes.
- Generate a load() function to turn the raw SPIR-V binary into a higher-level ShaderModule Vulkan object. If the shader uses optional GLSL/SPIR-V features, this function will also check that the target device supports them along the way.
- Translate struct definitions from the GLSL code into Rust structs with identical member names and memory layout. In the case of GLSL linear algebra types like mat3, the translation can be customized to use types from various popular Rust linear algebra libraries.
To use this package, we must first add vulkano-shaders as a dependency. This
has already been done for you in the provided source code:
# You do not need to type in this command, it has been done for you
cargo add --features "shaderc-debug" vulkano-shaders
Notice that we enable the optional shaderc-debug feature, which ensures that
our shaders are compiled with debug information. This is useful when running GPU
debugging or profiling tools on our programs, so you would normally only disable
this feature in production builds.
After this is done, you can create a new Rust code module dedicated to our new
compute pipeline. To this end, you can first declare a new module in the
toplevel exercises/src/lib.rs source file…
pub mod square;
…then create the associated exercises/src/square.rs source file with the
following content:
//! Number-squaring compute pipeline
/// Compute shader used for number squaring
mod shader {
vulkano_shaders::shader! {
ty: "compute",
path: "src/square.comp"
}
}
Unfortunately, vulkano-shaders lacks a feature for automatically generating
Rust-side constants matching the integer identifiers in the shader’s interface.
As a fallback, it is a good idea to have some Rust constants that mirror them.
This is not great as we will have to update those constants anytime we change
the matching GLSL code, but it is better than guessing the meaning of hardcoded
integer identifiers throughout our Rust codebase whenever the GLSL code changes.
/// Descriptor set that is used to bind a data buffer to the shader
const DATA_SET: u32 = 0;

/// Binding within `DATA_SET` that is used for the data buffer
const DATA_BINDING: u32 = 0;

/// Specialization constant that is used to adjust the workgroup size
const WORKGROUP_SIZE: u32 = 0;
Specializing the code
As mentioned earlier, it is generally wiser to specify the compute shader’s workgroup size from CPU code. This can be done using the SPIR-V specialization constant mechanism, and we have set up a suitable specialization constant on the GLSL side to allow this.
Given this previous preparation, we can now add a suitable CLI parameter to our program and use it to specialize our SPIR-V shader to the workgroup size that we would like. Here is a way to do so, taking some extra care to detect GLSL/Rust code desynchronization along the way:
use crate::Result;
use clap::Args;
use std::{num::NonZeroU32, sync::Arc};
use vulkano::{
device::Device,
shader::{SpecializationConstant, SpecializedShaderModule},
};
/// CLI parameters that guide pipeline creation
#[derive(Debug, Args)]
pub struct PipelineOptions {
/// 1D workgroup size
///
/// Vulkan guarantees support of any workgroup size from 1 to 1024, but
/// a multiple of 64 is best for real-world hardware.
#[arg(short, long, default_value = "64")]
pub workgroup_size: NonZeroU32,
}
/// Set up a specialized shader module with a certain work-group size
fn setup_shader_module(
device: Arc<Device>,
options: &PipelineOptions,
) -> Result<Arc<SpecializedShaderModule>> {
// Build a shader module from our SPIR-V code, checking device support
let module = shader::load(device)?;
// Check default specialization constant values match expectations
//
// This allows us to detect some situations in which the GLSL interface has
// changed without a matching CPU code update, which can otherwise result
// in remarkably weird application bugs.
let mut constants = module.specialization_constants().clone();
assert_eq!(
constants.len(),
1,
"there should be only one specialization constant"
);
let workgroup_size = constants
.get_mut(&WORKGROUP_SIZE)
.expect("there should be a workgroup size specialization constant");
assert!(
matches!(workgroup_size, SpecializationConstant::U32(_)),
"the workgroup size constant should be a GLSL uint = u32 in Rust",
);
// Specify the shader workgroup size
*workgroup_size = SpecializationConstant::U32(options.workgroup_size.get());
// Specialize the shader module accordingly
Ok(module.specialize(constants)?)
}
Entry point and pipeline stage
The word shader unfortunately has a slightly overloaded meaning in the GPU programming community. GPU programmers like this course’s author commonly use it to refer to the implementation of a particular stage of a graphics or compute pipeline, which corresponds to a single GLSL/SPIR-V entry point. But the Vulkan specification actually calls “shader” a GPU compilation unit that is allowed to contain multiple entry points.
This generalized definition is useful when implementing graphics pipelines, which have multiple stages. It means you can implement all pipeline stages inside of a single GLSL source file and have them easily share common type, constant and interface definitions, without performing redundant compilation of shared GLSL interface blocks.
On the compute side of things, however, compute pipelines only have a single stage, so compute shader modules with multiple entry points do not exist in GLSL. Yet SPIR-V still allows graphics shaders and non-GLSL compute shaders to have multiple entry points, and thus we will need one more step to locate the single entry point from our GLSL shader:
let entry_point = module
.single_entry_point()
.expect("a compute shader module should have a single entry point");
This entry point can then be turned into a pipeline stage by specifying, as you may have guessed, optional pipeline stage configuration parameters.
At the time of writing, the only thing we can configure at this stage is the subgroup size, which is roughly the SIMD granularity with which the device processes work items during compute shader execution. This is not configurable on all Vulkan devices, and when it is not configured, a sane default is picked, so we will stick with the default here.
use vulkano::pipeline::PipelineShaderStageCreateInfo;
/// Set up a compute stage from a previously specialized shader module
fn setup_compute_stage(module: Arc<SpecializedShaderModule>) -> PipelineShaderStageCreateInfo {
let entry_point = module
.single_entry_point()
.expect("a compute shader module should have a single entry point");
PipelineShaderStageCreateInfo::new(entry_point)
}
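For completeness, here is a rough sketch of what the non-default configuration mentioned above could look like. The required_subgroup_size field name matches recent vulkano versions, and actually forcing a subgroup size also requires device support that we have not checked for in this course, so treat this as an illustrative assumption rather than code you should need:
/// Hypothetical variant of the above: force a specific subgroup size on
/// devices that support subgroup size control (not used in this course)
fn setup_compute_stage_with_subgroup_size(
    module: Arc<SpecializedShaderModule>,
    subgroup_size: u32,
) -> PipelineShaderStageCreateInfo {
    let entry_point = module
        .single_entry_point()
        .expect("a compute shader module should have a single entry point");
    // Start from the default stage configuration, then override the subgroup size
    PipelineShaderStageCreateInfo {
        required_subgroup_size: Some(subgroup_size),
        ..PipelineShaderStageCreateInfo::new(entry_point)
    }
}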
Pipeline layout
We are reaching the end of the compute pipeline building process and there is only one remaining configuration step to take care of, namely pipeline layout configuration.
To understand what this step is about, we need to know that a Vulkan compute pipeline combines two things that used to be separate in earlier GPU APIs, namely a GPU program (entry point) and some long-lived metadata that tells the GPU compiler ahead of time how memory resources are going to be bound to this GPU program (pipeline layout).
The latter notion is newly exposed in Vulkan, as in earlier GPU APIs this information used to be inferred by the GPU driver from the actual resource-binding pattern used by the application. This meant the GPU driver could end up having to recompile GPU code while the application was running, if the actual resource-binding pattern turned out to be different from what the driver expected at compile time. In a graphical rendering context, such late recompilation would result in short application freezes, known as stutter, as rendering would momentarily stop while the GPU driver was waiting for shader recompilation to finish.
Stutter is generally speaking unwelcome in real-time graphics rendering and it is particularly problematic for some applications that Vulkan was designed to handle, such as Virtual Reality (VR) where it can induce motion sickness. Pipeline layouts were thus introduced as a way for the application to specify the required metadata ahead of time, so that the GPU driver can compile the compute shader correctly at initialization time without any need for later recompilation.
Sadly, there is a price to pay for this more precise control over when GPU
programs get compiled: we now need to repeat, at compute pipeline compilation
time, information that is already available elsewhere in GLSL or Rust code, and
which could fall out of sync with the rest of the program. That is why vulkano
provides a quick way to configure a pipeline layout with sensible default
settings inferred from the SPIR-V code’s interface blocks:
use vulkano::pipeline::layout::PipelineDescriptorSetLayoutCreateInfo;
let layout_info = PipelineDescriptorSetLayoutCreateInfo::from_stages([stage_info]);
In this introductory course, we will not need to deviate from this automatically generated configuration, because most of the non-default pipeline layout settings are targeted at Vulkan programs that have a resource binding performance bottleneck, and that will not be our case.
However, what we can do is to introspect the resulting auto-configuration to quickly make sure that the GLSL interface is actually what our CPU-side Rust code expects:
use vulkano::descriptor_set::layout::DescriptorType;
// Check that the pipeline layout meets our expectation
//
// Otherwise, the GLSL interface was likely changed without updating the
// corresponding CPU code, and we just avoided rather unpleasant debugging.
assert_eq!(
layout_info.set_layouts.len(),
1,
"this program should only use a single descriptor set"
);
let set_info = &layout_info.set_layouts[DATA_SET as usize];
assert_eq!(
set_info.bindings.len(),
1,
"the only descriptor set should contain a single binding"
);
let binding_info = set_info
.bindings
.get(&DATA_BINDING)
.expect("the only binding should be at the expected index");
assert_eq!(
binding_info.descriptor_type,
DescriptorType::StorageBuffer,
"the only binding should be a storage buffer binding"
);
assert_eq!(
binding_info.descriptor_count, 1,
"the only binding should contain a single descriptor"
);
assert!(
layout_info.push_constant_ranges.is_empty(),
"this program shouldn't be using push constants"
);
As before, this is not necessary for our program to work, but it improves its error reporting in the face of desynchronization between the interface declarations on the CPU and GPU sides. Given how mind-boggling and hard-to-debug the symptoms of such desynchronization can otherwise be, a bit of defensive programming doesn’t hurt here.
Finally, once our paranoia is satisfied, we can proceed to build the compute pipeline layout:
use vulkano::pipeline::layout::PipelineLayout;
let layout_info = layout_info.into_pipeline_layout_create_info(device.clone())?;
let layout = PipelineLayout::new(device, layout_info)?;
Putting it all together, we get the following pipeline layout setup process:
use vulkano::{
descriptor_set::layout::DescriptorType,
pipeline::layout::{PipelineDescriptorSetLayoutCreateInfo, PipelineLayout},
};
/// Set up the compute pipeline layout
fn setup_pipeline_layout(
device: Arc<Device>,
stage_info: &PipelineShaderStageCreateInfo,
) -> Result<Arc<PipelineLayout>> {
// Auto-generate a sensible pipeline layout config
let layout_info = PipelineDescriptorSetLayoutCreateInfo::from_stages([stage_info]);
// Check that the pipeline layout meets our expectation
//
// Otherwise, the GLSL interface was likely changed without updating the
// corresponding CPU code, and we just avoided rather unpleasant debugging.
assert_eq!(
layout_info.set_layouts.len(),
1,
"this program should only use a single descriptor set"
);
let set_info = &layout_info.set_layouts[DATA_SET as usize];
assert_eq!(
set_info.bindings.len(),
1,
"the only descriptor set should contain a single binding"
);
let binding_info = set_info
.bindings
.get(&DATA_BINDING)
.expect("the only binding should be at the expected index");
assert_eq!(
binding_info.descriptor_type,
DescriptorType::StorageBuffer,
"the only binding should be a storage buffer binding"
);
assert_eq!(
binding_info.descriptor_count, 1,
"the only binding should contain a single descriptor"
);
assert!(
layout_info.push_constant_ranges.is_empty(),
"this program shouldn't be using push constants"
);
// Finish building the pipeline layout
let layout_info = layout_info.into_pipeline_layout_create_info(device.clone())?;
let layout = PipelineLayout::new(device, layout_info)?;
Ok(layout)
}
Compute pipeline
We have finally reached the end of this chapter, and all the pieces of our compute pipeline are now ready. A drop of glue code is all it will take to make them work together:
use crate::context::Context;
use vulkano::pipeline::compute::{ComputePipeline, ComputePipelineCreateInfo};
/// Number-squaring compute pipeline with associated layout information
#[derive(Clone)]
pub struct Pipeline {
compute: Arc<ComputePipeline>,
layout: Arc<PipelineLayout>,
}
//
impl Pipeline {
/// Set up a number-squaring pipeline
pub fn new(options: &PipelineOptions, context: &Context) -> Result<Self> {
let shader_module = setup_shader_module(context.device.clone(), options)?;
let stage_info = setup_compute_stage(shader_module);
let layout = setup_pipeline_layout(context.device.clone(), &stage_info)?;
let pipeline_info = ComputePipelineCreateInfo::stage_layout(stage_info, layout.clone());
let compute = ComputePipeline::new(
context.device.clone(),
Some(context.pipeline_cache()),
pipeline_info,
)?;
Ok(Self { compute, layout })
}
}
Notice that a struct-based setup is used so that the pipeline layout information is kept around after pipeline creation. We will need it later, when the time comes to bind resources to this pipeline.
As you may guess from the sight of the
ComputePipelineCreateInfo
struct, a few bits of Vulkan configurability that we do not need have been swept
under the metaphorical rug here. The new settings available in this struct allow
us to…
- Build our compute pipelines without optimization. This is useful when aiming for faster application startup at the expense of runtime performance (e.g. debug builds) or when investigating a bug that might be affected by GPU compiler optimizations.
- Mark a compute pipeline as a derivative of another, which might10 enable faster builds when building sets of compute pipelines that are closely related to each other, for example ones that only differ by specialization constant values. (A sketch of both settings follows below.)
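As a hedged illustration of those two settings, here is roughly what the pipeline creation code from above could look like with explicit creation flags. The flag and field names below match recent vulkano versions (PipelineCreateFlags and the flags field of ComputePipelineCreateInfo), but since we do not need either setting in this course, consider this a sketch rather than something to copy into the exercise:
use vulkano::pipeline::PipelineCreateFlags;

// Hypothetical variant of the pipeline creation code shown earlier
let mut pipeline_info = ComputePipelineCreateInfo::stage_layout(stage_info, layout.clone());
// Skip GPU compiler optimizations, e.g. for faster startup in debug builds
pipeline_info.flags |= PipelineCreateFlags::DISABLE_OPTIMIZATION;
// Allow other pipelines to be derived from this one later on
pipeline_info.flags |= PipelineCreateFlags::ALLOW_DERIVATIVES;
let compute = ComputePipeline::new(
    context.device.clone(),
    Some(context.pipeline_cache()),
    pipeline_info,
)?;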
Conclusion
If you followed through all of this, congratulations! You now know about all the basic steps involved in the process of building a Vulkan compute pipeline. To summarize, you must…
- Pick a shading language:
- GLSL for optimal community and tooling support.
- Other languages that compile to SPIR-V (Slang,
rust-gpu, CubeCL…).
- Write a shader, paying close attention to the CPU-GPU interface:
- Memory resources: buffers, images, etc.
- Workgroup size control & other specialization constants.
- Handling of out-of-bounds work items.
- Get a device-specific shader module into your code:
- Compile the shader into SPIR-V at application build time.
- Load the SPIR-V binary into your program at compile- or run-time.
- Build a device-specific shader module.
- In Rust, vulkano-shaders can do all these for you + check device requirements.
- We advise naming your interface IDs for clarity and maintainability.
- Turn that device shader module into a pipeline stage:
- Apply specialization constants (can check CPU/GPU interface consistency).
- Select the shader entry point that we are going to use.
- Configure the pipeline stage if needed.
- Configure the pipeline’s layout:
- Consider PipelineDescriptorSetLayoutCreateInfo for simple programs.
- You can do MANY binding performance optimizations here that we will not cover.
- You can also check CPU/GPU binding interface consistency here.
- Build the compute pipeline.
In the next chapter, we will set up some memory resources, which we will later bind to this pipeline.
Exercise
The Vulkan compute pipeline setup process has many parts to it, and as a result
this chapter contained a lot of information. It is advisable to quickly review
it and the matching vulkano documentation, making sure you have a decent
understanding of what’s going on before proceeding with the rest of the course.
After doing so, please fill in the square.comp and square.rs files and
modify the lib.rs file using the instructions provided at the end of each part
of this chapter.
Then modify the bin/simulate.rs program so that it allows specifying a
workgroup size and creating a compute pipeline, and finally give it a test run
to make sure that the resulting program passes all runtime checks in addition to
compile-time ones.
-
The C++ Language Extensions and C++ Language Support chapters of the CUDA programming guide should give you a quick taste of the challenges that are involved when taking a programming language that was designed for CPU programming and adapting it for GPU programming through a mixture of extensions and restrictions. And this is the most advanced attempt at such language adaptation, building on decades of effort from the richest GPU company in the world, and enjoying the luxury of only needing to support a single GPU vendor. As you can imagine, language adaptation projects that aim for cross-vendor portability with a more modest development team will have a harder time getting there. ↩
-
Ignoring a few emerging variations of the traditional graphics pipeline like raytracing and mesh shading, which may become the norm in the future if 1/all hardware in common use ends up supporting them and 2/they become undisputedly superior to the current vertex/tessellation/fragment graphics pipeline standard for all common rendering use cases. ↩
-
Beyond buffers, Vulkan also has images and samplers, which provide an interpolated view of one- to three-dimensional datasets with automated handling of image format conversions and out-of-bounds accesses. These opaque resources, whose memory layout is not under user control, provide a way to leverage the GPU’s hardware texturing units. ↩
-
Graphics pipelines use block names to match interface blocks between pipeline stages, so that the shader associated with one pipeline stage can send data to the shader associated with the next pipeline stage. But compute pipelines only have a single stage, so this feature does not apply to them. ↩
-
The buffer memory layout rules used by Vulkan are actually a little different from those of struct members in C, which means that matching struct definitions on the CPU side must be written with care. But this is not an issue with our current buffer, which contains just an array of floats (laid out as in C). And we will see that vulkano-shaders makes such CPU/GPU data layout matching easy. ↩
-
Specialization constants are a powerful Vulkan feature, which leverages the fact that GPU programs are compiled just-in-time in order to let you modify some compilation constants within the SPIR-V program before it is compiled into a device binary. This allows you to set parameters that must be known at compile time (e.g. stack-allocated array sizes, workgroup sizes) from the CPU side, as well as to tell the GPU compiler about application parameters that are known to the CPU at startup time (e.g. CLI parameters) so the GPU compiler can specialize the output binary for these specific parameter values. ↩
-
CUDA practitioners may also know work items as threads and workgroups as blocks. Overall, the realm of GPU computing is very talented at coming up with many different names for the same concept, resulting in a confusing terminology mess. You may find the author’s inter-API glossary handy. ↩
-
This may sound like an overly small amount of work for a GPU work item, and it probably is. However, we must keep in mind that modern high-end GPUs contain many compute units that run in parallel, each of which executes SIMD instructions in a superscalar manner and leverages simultaneous multithreading for latency hiding. As a result, millions of work items are often required to fully utilize GPU resources. This is why the standard recommendation to GPU programmers, which we follow here, is to start by spawning as many work items as possible, and only later experiment with alternate configurations that spawn fewer work items that each handle a larger amount of work (which must be done with care due to GPU architecture details that we have no time to get into). Because GPU workgroup schedulers are very fast, it is expected that this optimization will only provide modest benefits in real-world GPU programs where each work item does more work than a single floating-point multiplication. ↩
-
This magical 64 factor comes from the fact that GPU workgroups hide SIMD and superscalar execution behind a common abstraction. If our workgroup size is not a multiple of the hardware SIMD width, then some SIMD lanes will go partially unused, resulting in reduced execution efficiency. The hardware SIMD width is almost always a power of two, and the largest SIMD width in common use on GPU hardware is the 64-wide wavefront used by many AMD GPUs (GCN and older). ↩
-
Given the number of pessimistic comments in the documentation of prominent GPU vendors, it looks like this Vulkan feature was designed with very specific implementations in mind, which are not the most commonly used ones. Therefore, unless some vendor’s documentation explicitly tells you to use them, it is probably in your best interest to ignore the existence of Vulkan pipeline derivatives. ↩
Resources
Following the work of the previous chapter, we now have a GPU compute pipeline that can be used to square an array of numbers. Before we can use it, however, we will need a second important thing, namely an array of numbers that can be bound to this pipeline.
In this chapter, we will see how such an array can be allocated, initialized, and bundled into a descriptor set that can in turn be bound to our compute pipeline. Along the way, we will also start covering how data can be exchanged between the CPU and the GPU, though our treatment of this topic will not be complete until the next chapter.
Vulkan memory primer
Barring (important) exceptions discussed in the memory profiling course, the
standard CPU programming infrastructure is good at providing the illusion that
your system contains only one kind of RAM that you can allocate with malloc()
and release with free().
But Vulkan is about programming GPUs, which make different tradeoffs than CPUs in the interest of cramming more number-crunching power per square centimeter of silicon. One of them is that real-world GPU hardware can access different types of memory, which must be carefully used together to achieve optimal performance. Here are some examples:
- High-performance GPUs typically have dedicated RAM, called Video RAM or VRAM, that is separate from the main system RAM. VRAM usually has ~10x higher bandwidth than system RAM, at the expense of a larger access latency and coarser data transfer granularity.1
- To speed up CPU-GPU data exchanges, some chunks of system RAM may be GPU-accessible, and some chunks of VRAM may be CPU-accessible. Such memory accesses must typically go through the PCI-express bus, which makes them very slow.2 But for single-use data, in-place accesses can be faster than CPU-GPU data transfer commands. And such memory may also be a faster source/destination when data transfer commands do get involved.
- More advanced applications benefit from cache coherence guarantees. But these guarantees are expensive to provide in a CPU/GPU distributed memory setup, and they are therefore not normally provided by default. Instead, such memory must be explicitly requested, usually at the expense of reducing performance of normal memory accesses.
- Integrated GPUs that reside on the same package as a CPU make very different tradeoffs with respect to the typical setup described above. Sometimes they only see a single memory type corresponding to system RAM, sometimes a chunk of RAM is reserved out of system RAM to reduce CPU-GPU communication. Usually these GPUs enjoy faster CPU-GPU communication at the expense of reduced GPU performance.
While some of those properties emerge from the use of physically distinct hardware, others originate from memory controller configuration choices that can be dynamically made on a per-allocation basis. Vulkan acknowledges this hardware reality by exposing two different sets of physical device metadata, namely memory types and memory heaps:
- A memory heap represents a pool of GPU-accessible memory out of which storage blocks can be allocated. It has a few intrinsic properties exposed as memory heap flags, and can host allocations of one or more memory types.
- A memory type is a particular memory allocation configuration that a memory heap supports. It has a number of properties that affect possible usage patterns and access performance, some of which are exposed to Vulkan applications via memory property flags.
In vulkano, memory types and heaps can be queried using the
memory_properties()
method of the PhysicalDevice struct. This course’s basic info utility will
display some of this information at device detail level 2 and above, while the
standard vulkaninfo will display all of it at the expense of a much more
verbose output. Let’s look at an abridged version of vulkaninfo’s output for
the GPU of the author’s primary work computer:
$ vulkaninfo
[ ... lots of verbose info ... ]
Device Properties and Extensions:
=================================
GPU0:
VkPhysicalDeviceProperties:
---------------------------
apiVersion = 1.4.311 (4210999)
driverVersion = 25.1.3 (104861699)
vendorID = 0x1002
deviceID = 0x6981
deviceType = PHYSICAL_DEVICE_TYPE_DISCRETE_GPU
deviceName = AMD Radeon Pro WX 3200 Series (RADV POLARIS12)
pipelineCacheUUID = a7ef6108-0550-e213-559b-1bf8cda454df
[ ... more verbose info ... ]
VkPhysicalDeviceMemoryProperties:
=================================
memoryHeaps: count = 2
memoryHeaps[0]:
size = 33607798784 (0x7d32e5000) (31.30 GiB)
budget = 33388290048 (0x7c618e000) (31.10 GiB)
usage = 0 (0x00000000) (0.00 B)
flags:
None
memoryHeaps[1]:
size = 4294967296 (0x100000000) (4.00 GiB)
budget = 2420228096 (0x9041c000) (2.25 GiB)
usage = 0 (0x00000000) (0.00 B)
flags: count = 1
MEMORY_HEAP_DEVICE_LOCAL_BIT
memoryTypes: count = 7
memoryTypes[0]:
heapIndex = 1
propertyFlags = 0x0001: count = 1
MEMORY_PROPERTY_DEVICE_LOCAL_BIT
usable for:
IMAGE_TILING_OPTIMAL:
color images
FORMAT_D16_UNORM
FORMAT_D32_SFLOAT
FORMAT_S8_UINT
FORMAT_D16_UNORM_S8_UINT
FORMAT_D32_SFLOAT_S8_UINT
IMAGE_TILING_LINEAR:
color images
memoryTypes[1]:
heapIndex = 1
propertyFlags = 0x0001: count = 1
MEMORY_PROPERTY_DEVICE_LOCAL_BIT
usable for:
IMAGE_TILING_OPTIMAL:
None
IMAGE_TILING_LINEAR:
None
memoryTypes[2]:
heapIndex = 0
propertyFlags = 0x0006: count = 2
MEMORY_PROPERTY_HOST_VISIBLE_BIT
MEMORY_PROPERTY_HOST_COHERENT_BIT
usable for:
IMAGE_TILING_OPTIMAL:
color images
FORMAT_D16_UNORM
FORMAT_D32_SFLOAT
FORMAT_S8_UINT
FORMAT_D16_UNORM_S8_UINT
FORMAT_D32_SFLOAT_S8_UINT
IMAGE_TILING_LINEAR:
color images
memoryTypes[3]:
heapIndex = 1
propertyFlags = 0x0007: count = 3
MEMORY_PROPERTY_DEVICE_LOCAL_BIT
MEMORY_PROPERTY_HOST_VISIBLE_BIT
MEMORY_PROPERTY_HOST_COHERENT_BIT
usable for:
IMAGE_TILING_OPTIMAL:
color images
FORMAT_D16_UNORM
FORMAT_D32_SFLOAT
FORMAT_S8_UINT
FORMAT_D16_UNORM_S8_UINT
FORMAT_D32_SFLOAT_S8_UINT
IMAGE_TILING_LINEAR:
color images
memoryTypes[4]:
heapIndex = 1
propertyFlags = 0x0007: count = 3
MEMORY_PROPERTY_DEVICE_LOCAL_BIT
MEMORY_PROPERTY_HOST_VISIBLE_BIT
MEMORY_PROPERTY_HOST_COHERENT_BIT
usable for:
IMAGE_TILING_OPTIMAL:
None
IMAGE_TILING_LINEAR:
None
memoryTypes[5]:
heapIndex = 0
propertyFlags = 0x000e: count = 3
MEMORY_PROPERTY_HOST_VISIBLE_BIT
MEMORY_PROPERTY_HOST_COHERENT_BIT
MEMORY_PROPERTY_HOST_CACHED_BIT
usable for:
IMAGE_TILING_OPTIMAL:
color images
FORMAT_D16_UNORM
FORMAT_D32_SFLOAT
FORMAT_S8_UINT
FORMAT_D16_UNORM_S8_UINT
FORMAT_D32_SFLOAT_S8_UINT
IMAGE_TILING_LINEAR:
color images
memoryTypes[6]:
heapIndex = 0
propertyFlags = 0x000e: count = 3
MEMORY_PROPERTY_HOST_VISIBLE_BIT
MEMORY_PROPERTY_HOST_COHERENT_BIT
MEMORY_PROPERTY_HOST_CACHED_BIT
usable for:
IMAGE_TILING_OPTIMAL:
None
IMAGE_TILING_LINEAR:
None
[ ... more verbose info, other Vulkan devices ... ]
As you can see, this AMD Radeon WX 3200 GPU can access memory that is allocated from two memory heaps, which together support seven memory types:
- The first memory heap corresponds to half of the available system RAM, and represents its GPU-accessible subset. It supports three memory types that are all visible from the CPU (HOST_VISIBLE) and coherent with CPU caches (HOST_COHERENT). The latter means, among other things, that when the CPU writes to these memory regions the change will eventually become GPU-visible without using any special command.
- Memory type 2 is not CPU-cached. This means that on the CPU side only sequential writes will perform well. But the GPU also needs less caution when accessing this memory, and thus better CPU-to-GPU transfer performance may be observed.
- Memory type 5 is CPU-cached, which improves CPU read and random access performance, at the risk of reducing the performance of GPU accesses.
- Memory type 6 is similar to memory type 5, but unlike the other types it cannot be used for image allocations. Images are opaque memory objects used to leverage the GPU’s texturing units, which are beyond the scope of this introductory course.3
- The second memory heap corresponds to the GPU’s dedicated VRAM, and comes with a DEVICE_LOCAL flag that indicates it should be faster to access from the GPU. This memory heap supports four memory types that cover all possible combinations of the “can be accessed from the host/CPU” and “can be used for images” boolean properties.
- Memory type 0 is not host-visible and can be used for images.
- Memory type 1 is not host-visible and cannot be used for images.
- Memory type 3 is host-visible, host-coherent, and can be used for images.
- Memory type 4 is host-visible, host-coherent, and cannot be used for images.
You may be surprised by the way memory types are numbered, jumping from one memory heap to another. This ordering is unlikely to have been picked at random. Indeed, Vulkan requires that memory types be ordered by expected access performance, allowing applications to pick a good type with a simple “iterate over memory types and return the first one that fits the intended purpose” search loop. That is likely part of4 the intent behind this ordering.
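To make that search loop idea more concrete, here is a minimal sketch of how it could be written on top of vulkano’s memory_properties(). The helper name and the idea of passing in a memory_type_bits mask (as reported by a resource’s memory requirements) are illustrative assumptions; in everyday vulkano code, the memory allocator and its MemoryTypeFilter perform this kind of selection for you:
use std::sync::Arc;
use vulkano::{device::physical::PhysicalDevice, memory::MemoryPropertyFlags};

/// Hypothetical helper: find the first memory type that is both allowed by a
/// resource's memory requirements and provides all requested property flags.
/// Thanks to Vulkan's ordering guarantee, the first match should also be a
/// reasonable performance choice.
fn first_suitable_memory_type(
    physical_device: &Arc<PhysicalDevice>,
    memory_type_bits: u32,
    required: MemoryPropertyFlags,
) -> Option<u32> {
    physical_device
        .memory_properties()
        .memory_types
        .iter()
        .enumerate()
        .find(|(index, memory_type)| {
            // The resource's requirements expose allowed types as a bitmask...
            let allowed = memory_type_bits & (1u32 << index) != 0;
            // ...and we also want all of the requested memory properties
            allowed && memory_type.property_flags.contains(required)
        })
        .map(|(index, _)| index as u32)
}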
In any case, now that we’ve gone through Vulkan memory heaps and types, let us start thinking about how our application might use them.
GPU data setup
Strategy
Our number-squaring program expects some initial data as input. Because this is a toy example, we could pick a simple input pattern that is easy to generate on the GPU (e.g. all-zero bytes).
But this is a special-purpose optimization as many real-world inputs can only come from the CPU side (think about e.g. inputs that are read from files). In the interest of covering the most general-purpose techniques, we will thus discuss how to get CPU-generated inputs into a GPU pipeline.
Depending on which Vulkan memory types are available, we may have up to three ways to perform this CPU-to-GPU data transfer:
- Allocate a block of memory that is device-local and host-visible. Directly write to it on the CPU side, then directly read from it on the GPU side.
- Allocate a block of memory that is NOT device-local but is host-visible. Use it as in #1.
- Allocate a block of memory that is device-local and another block of memory that is host-visible. Write to the host-visible block on the CPU side, then use a Vulkan command to copy its content to the device-local block, then read from the device-local block on the GPU side.
How do these options compare?
- The Vulkan specification guarantees that a host-visible and a device-local memory type will be available, but does not guarantee that they will be the same memory type. Therefore options #2 and #3 are guaranteed to be available, but option #1 may not be available.
- Accessing CPU memory from the GPU as in option #2 may only be faster than copying it as in #3 if the data is only used once, or if the GPU code only uses a subset of it. Thus this method only makes sense for GPU compute pipelines that have specific behavior.
- Given the above, although allocating two blocks of memory and copying data from one to the other as in #3 increases the program’s memory footprint and code complexity, it can be seen as the most general-purpose approach. Alternative methods #1 and #2 can be more efficient in specific situations, and should thus be explored as possible optimizations when the memory copy of method #3 becomes a performance bottleneck.
We will thus mainly focus on the copy-based approach during this course, leaving the exploration of other memory management strategies as an exercise to the reader.
CPU buffer
We mentioned earlier that buffers are the core Vulkan abstraction for allocating and using memory blocks with a user-controlled data layout. But that was a bit of a logical shortcut. Several kinds of Vulkan objects can get involved here:
- Vulkan lets us allocate blocks of device-visible memory aka device memory.
- Vulkan lets us create buffer objects, to which device memory can be bound. They supplement their backing memory with some metadata. Among other things this metadata tells the Vulkan implementation how we intend to use the memory, enabling some optimizations.
- When manipulating images, we may also use buffer views, which are basically buffers full of image-like pixels with some extra metadata that describes the underlying pixel format.
As we have opted not to cover images in this course, we will not discuss buffer views further. But that still leaves us with the matter of allocating device memory and buffers with consistent properties (e.g. do not back a 1 MiB buffer with 4 KiB of device memory) and making sure that a buffer never outlives the device memory that backs it.
The vulkano API resolves these memory-safety issues by re-exposing the above
Vulkan concepts through a stack of abstractions with slightly different naming:
- RawBuffers exactly match Vulkan buffers and do not own their backing device memory. They are not meant to be used in everyday code, but can support advanced optimizations where the higher-level API does not fit. Their use comes with memory safety hazards and therefore involves unsafe operations in your Rust code.
- A Buffer combines a RawBuffer with some backing device memory, making sure that the two cannot go out of sync in a manner that results in memory safety issues. It is the first memory-safe layer of the vulkano abstraction stack that can be used without unsafe.
- A Subbuffer represents a subset of a Buffer defined by an offset and a size. It models the fact that most buffer-based Vulkan APIs also accept offset and range information, and again makes sure that this extra metadata is consistent with the underlying buffer object and device memory allocation. This is the object type that we will most often manipulate when working with buffers using vulkano (see the short sketch below).
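As a small illustration of that last point, a Subbuffer can be cheaply re-sliced into smaller Subbuffers without copying any data. The helper below is a hypothetical example, not something the exercise needs:
use vulkano::buffer::Subbuffer;

/// Hypothetical example: split a Subbuffer into two halves that refer to the
/// same underlying Buffer and device memory allocation
fn split_in_half(data: Subbuffer<[f32]>) -> (Subbuffer<[f32]>, Subbuffer<[f32]>) {
    let mid = data.len() / 2;
    (data.clone().slice(..mid), data.slice(mid..))
}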
By combining these abstractions with the
rand crate for random number generation,
we can create a CPU-visible buffer full of randomly generated numbers in the
following manner:
use rand::{distr::Uniform, prelude::*};
use std::num::NonZeroUsize;
use vulkano::{
buffer::{Buffer, BufferCreateInfo, BufferUsage, Subbuffer},
memory::allocator::{AllocationCreateInfo, MemoryTypeFilter},
};
/// CLI parameters that guide input generation
#[derive(Debug, Args)]
pub struct InputOptions {
/// Number of numbers to be squared
#[arg(short, long, default_value = "1000")]
pub len: NonZeroUsize,
/// Smallest possible input value
#[arg(long, default_value_t = 0.5)]
pub min: f32,
/// Largest possible input value
#[arg(long, default_value_t = 2.0)]
pub max: f32,
}
/// Set up a CPU-side input buffer with some random initial values
pub fn setup_cpu_input(options: &InputOptions, context: &Context) -> Result<Subbuffer<[f32]>> {
// Configure the Vulkan buffer object
let create_info = BufferCreateInfo {
usage: BufferUsage::TRANSFER_SRC,
..Default::default()
};
// Configure the device memory allocation
let allocation_info = AllocationCreateInfo {
memory_type_filter: MemoryTypeFilter::PREFER_HOST | MemoryTypeFilter::HOST_SEQUENTIAL_WRITE,
..Default::default()
};
// Set up random input generation
let mut rng = rand::rng();
let range = Uniform::new(options.min, options.max)?;
let numbers_iter = std::iter::repeat_with(|| range.sample(&mut rng)).take(options.len.get());
// Put it all together by creating the vulkano Subbuffer
let subbuffer = Buffer::from_iter(
context.mem_allocator.clone(),
create_info,
allocation_info,
numbers_iter,
)?;
Ok(subbuffer)
}
The main things that we specify here are that…
- The buffer must be usable as the source of a Vulkan data transfer command.
- The buffer should be allocated on the CPU side for optimal CPU memory access speed, in a way that is suitable for sequential writes (i.e. uncached memory is fine here).
But as you may imagine after having been exposed to Vulkan APIs for a while, there are many other things that we could potentially configure here:
- On the BufferCreateInfo side, which controls creation of the Vulkan buffer object…
- We could make this a sparse buffer, which means that it can be backed by multiple non-contiguous blocks of device memory that can dynamically change over time.
- We could allow it to be used with multiple queue families at the same time, which is forbidden by default as it may come with a performance penalty.
- We could back it with memory allocated using other GPU APIs, which enables interoperability between GPU APIs (e.g. between Vulkan and DirectX or CUDA).
- On the AllocationCreateInfo side, which controls allocation of device memory…
- We could specify which Vulkan memory types should be used for the backing storage through a mixture of “must”, “should” and “should not” constraints (see the sketch after this list).
- We could hint the allocator towards or away from using dedicated device memory allocations, as opposed to sub-allocating from previously allocated device memory blocks. This can affect performance in some obscure edge cases.
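Here is a hedged sketch of the first of these AllocationCreateInfo knobs, adjusting the three flag fields of MemoryTypeFilter (field names from recent vulkano versions). We will not need anything like this in the course, so treat it purely as an illustration:
use vulkano::memory::MemoryPropertyFlags;

// Hypothetical allocation configuration: start from the PREFER_HOST preset,
// then express must/should/should-not constraints by adjusting the filter's
// three flag fields (require host-visible, prefer host-cached, avoid
// device-local memory)
let mut memory_type_filter = MemoryTypeFilter::PREFER_HOST;
memory_type_filter.required_flags |= MemoryPropertyFlags::HOST_VISIBLE;
memory_type_filter.preferred_flags |= MemoryPropertyFlags::HOST_CACHED;
memory_type_filter.not_preferred_flags |= MemoryPropertyFlags::DEVICE_LOCAL;
let allocation_info = AllocationCreateInfo {
    memory_type_filter,
    ..Default::default()
};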
GPU buffer
Our input data is now stored in a memory region that the GPU can access, but likely with suboptimal efficiency. The next step in our copy-based strategy will therefore be to allocate another buffer of matching characteristics from the fastest available device memory type. After that we may use a Vulkan copy command to copy our inputs from the slow “CPU side” to the fast “GPU side”.
Allocating the memory is not very interesting in and of itself, as we will just
use a different Buffer
constructor
that lets us allocate an uninitialized memory block:
/// Set up an uninitialized GPU-side data buffer
pub fn setup_gpu_data(options: &InputOptions, context: &Context) -> Result<Subbuffer<[f32]>> {
let usage = BufferUsage::TRANSFER_DST | BufferUsage::STORAGE_BUFFER | BufferUsage::TRANSFER_SRC;
let subbuffer = Buffer::new_slice(
context.mem_allocator.clone(),
BufferCreateInfo {
usage,
..Default::default()
},
AllocationCreateInfo::default(),
options.len.get() as u64,
)?;
Ok(subbuffer)
}
The only thing worth noting here is that we are using buffer usage flags that
anticipate the need to later bind this buffer to our compute pipeline
(STORAGE_BUFFER) and get the computations’ outputs back into a CPU-accessible
buffer at the end using another copy command (TRANSFER_SRC).
As you will see in the next chapter, however, CPU-to-GPU copies will involve some new concepts…
Descriptor set
After a copy from the CPU side to the GPU side has been carried out (a process that we will not explain yet because it involves concepts covered in the next chapter), the GPU data buffer will contain a copy of our input data. We will then want to bind this data buffer to our compute pipeline, before we can execute this pipeline to square the inner numbers.
However, because Vulkan is not OpenGL, we cannot directly bind a data buffer to a compute pipeline. Instead, we will first need to build a descriptor set for this purpose.
We briefly mentioned descriptor sets in the previous chapter. To recall their purpose, they are Vulkan’s attempt to address a performance problem of earlier GPU APIs, where memory resources used to be bound to compute and graphics pipelines one by one just before scheduling pipeline execution. The associated API calls often ended up becoming a performance bottleneck,5 which is why Vulkan improved upon them in two ways:
- Resource binding API calls support batches, so that an arbitrarily large amount of resources (up to ~millions on typical hardware) can be bound to a GPU pipeline with a single API call.
- Applications can prepare resource bindings in advance during their initialization stage, so that actual binding calls perform as little work as possible later on.
The product of these improvements is the descriptor set, which is a set of
resources that is ready to be bound to a particular compute pipeline.6 And as
usual, vulkano makes them rather easy to build and safely use compared to raw
Vulkan:
use vulkano::descriptor_set::{DescriptorSet, WriteDescriptorSet};
/// Set up a descriptor set for binding the GPU buffer to the compute pipeline
pub fn setup_descriptor_set(
context: &Context,
pipeline: &Pipeline,
buffer: Subbuffer<[f32]>,
) -> Result<Arc<DescriptorSet>> {
// Configure which pipeline descriptor set this will bind to
let set_layout = pipeline.layout.set_layouts()[DATA_SET as usize].clone();
// Configure what resources will attach to the various bindings
// that this descriptor set is composed of
let descriptor_writes = [WriteDescriptorSet::buffer(DATA_BINDING, buffer)];
// Set up the descriptor set accordingly
let descriptor_set = DescriptorSet::new(
context.desc_allocator.clone(),
set_layout,
descriptor_writes,
[],
)?;
Ok(descriptor_set)
}
As you may guess by now, the empty array that is passed as a fourth parameter to
the
DescriptorSet::new()
constructor gives us access to a Vulkan API feature that we will not use here.
That feature lets us efficiently copy resource
bindings
from one descriptor set to another, which improves efficiency and ergonomics in
situations where one needs to build descriptor sets that share some content but
differ in other ways.7
Another vulkano-supported notion that we will not cover further in this course
is that of variable descriptor set bindings. This maps onto a SPIR-V/GLSL
feature that allows a descriptor set binding to contain a number of descriptors
that is not fixed at shader compilation time. That way, GPU programs can access
an array of resources whose length varies from one execution to another.
Output buffer
After some number squaring has been carried out (which, again, will be the topic of the next chapter), we could go on and perform more computations on the GPU side, without ever getting any data back to the CPU side until the very end (or never, if the end result is a real-time visualization).
This is good because CPU-GPU data transfers are relatively slow and can easily become a performance bottleneck. But here our goal is to keep our first program example simple, so we will just get data back to the CPU side right away.
For this purpose, we will set up a dedicated output buffer on the CPU side:
/// Set up an uninitialized CPU-side output buffer
pub fn setup_cpu_output(options: &InputOptions, context: &Context) -> Result<Subbuffer<[f32]>> {
let create_info = BufferCreateInfo {
usage: BufferUsage::TRANSFER_DST,
..Default::default()
};
let allocation_info = AllocationCreateInfo {
memory_type_filter: MemoryTypeFilter::PREFER_HOST | MemoryTypeFilter::HOST_RANDOM_ACCESS,
..Default::default()
};
let subbuffer = Buffer::new_slice(
context.mem_allocator.clone(),
create_info,
allocation_info,
options.len.get() as u64,
)?;
Ok(subbuffer)
}
This may leave you wondering why we are not reusing the CPU buffer that we have
set up earlier for input initialization. With a few changes to our
BufferCreateInfo and AllocationCreateInfo, we could set up a buffer that
is suitable for both purposes, but there is an underlying tradeoff. Let’s look
into the pros and cons of each approach:
- Using separate input and output buffers, as we do here, consumes twice the amount of GPU-accessible system memory compared to using only one buffer.
- Using separate input and output buffers lets us set fewer BufferUsage flags on each buffer, which may enable the implementation to perform more optimizations.
- Using separate input and output buffers lets us leverage uncached host memory on the input side (corresponding to vulkano’s MemoryTypeFilter::HOST_SEQUENTIAL_WRITE allocation configuration), which may in turn enable faster data transfers from the CPU to the GPU.
- And perhaps most importantly, using separate input and output buffers lets us check result correctness at the end, which is important in any kind of pedagogical material :)
Overall, we could have done it both ways (and you can experiment with the other way as an exercise). But in the real world, the choice between these two approaches will depend on your performance priorities (data transfer speed vs memory utilization) and what benefits you will measure from the theoretically superior dual-buffer configuration on your target hardware.
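If you do want to try the single-buffer variant as an exercise, a hypothetical configuration could look like the following. The only changes with respect to the two buffers above are the combined usage flags and the switch to HOST_RANDOM_ACCESS, since the CPU would now read results back from the same memory it wrote inputs to:
// Hypothetical single CPU-side staging buffer, usable both as the source of
// the initial upload and as the destination of the final download
let create_info = BufferCreateInfo {
    usage: BufferUsage::TRANSFER_SRC | BufferUsage::TRANSFER_DST,
    ..Default::default()
};
let allocation_info = AllocationCreateInfo {
    memory_type_filter: MemoryTypeFilter::PREFER_HOST | MemoryTypeFilter::HOST_RANDOM_ACCESS,
    ..Default::default()
};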
In any case, the actual copy operation used to get data from the GPU buffer to this buffer will be covered in the next chapter, because as mentioned above copy commands use Vulkan command submission concepts that we have not introduced yet.
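As a small preview, once that copy has executed and the CPU has waited for the GPU to finish (both covered in the next chapter), reading the results back is comparatively simple, because host-visible Subbuffers can be accessed through a lock-like read guard. A minimal sketch, assuming the cpu_output buffer from above:
// Hedged sketch: read a host-visible Subbuffer from the CPU side once the GPU
// is done writing to it (attempting this too early returns an error)
let output = cpu_output.read()?;
println!("first squared value: {}", output[0]);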
Conclusion
In this chapter, we have explored how Vulkan memory management works under the
hood, and what vulkano does to make it easier on the Rust side. In particular,
we have introduced the various ways we can get data in and out of the GPU. And
we have seen how GPU-accessible buffers can be packaged into descriptor sets for
the purpose of later binding them to a compute pipeline.
This paves the way for the last chapter, where we will finally put everything together into a working number-squaring computation. The main missing piece that we will cover there is the Vulkan command submission and synchronization model, which will allow us to perform CPU-GPU data copies, bind resources and compute pipelines, execute said pipelines, and wait for GPU work.
Exercise
As you have seen in this chapter, the topics of Vulkan resource management and command scheduling are heavily intertwined, and any useful Vulkan-based program will feature a combination of both. The code presented in this chapter should thus be considered a work in progress, and it is not advisable to try executing and modifying it at this point. We have not yet introduced the right tools to make sure it works and assess its performance characteristics.
What you can already do, however, is copy the functions that have been presented
throughout this chapter into the exercises/src/square.rs code module, and add
some InputOptions to the Options struct of exercises/src/square.rs so that
you are ready to pass in the right CLI arguments later.
Then stop and think. Vulkan is about choice, there is never only one way to do something. What other ways would you have to get data in and out of the GPU? How should they compare? And how would they affect the resource allocation code that is presented in this chapter?
As a hint to check how far along you are, a skim through this chapter should already give you 4 ways to initialize GPU buffers, 4 ways to exploit the results of a GPU computation, and 2 ways to set up CPU staging buffers in configurations where copies to and from the GPU are required.
Of course, going through this thought experiment will not give you an exhaustive list of all possible ways to perform these operations (which would include specialized tools like buffer clearing commands and system-specific extensions like RDMA). But it should provide you with good coverage of the general-purpose approaches that are available on most Vulkan-supported systems.
-
This is partly the result of using a different memory technology, GDDR or HBM instead of standard DDR, and partly the result of GPUs having non-replaceable VRAM. The latter means that RAM chips can be soldered extremely close to compute chips and enjoy extra bandwidth by virtue of using a larger amount of shorter electrical connection wires. Several CPU models use a variation of this setup (Apple Mx, Intel Rapids series, …), but so far the idea of a computer having its RAM capacity set in stone for its entire lifetime has not proved very popular in the laptop/workstation/server market. ↩
-
The “express” in PCI-express is relative to older versions of the PCI bus. This common CPU-GPU interconnect is rather low-bandwidth and high-latency when compared to CPU-RAM interconnects. ↩
-
A former version of this course used to leverage images because they make the GPU side of 2D/3D calculations nicer and enable new opportunities for hardware acceleration. But it was later discovered that this limited use of GPU texturing units is so much overkill that on many common GPUs it results in a net performance penalty compared to careful use of GPU buffers. Given that the use of images also adds a fair bit of complexity to the CPU-side setup code, this edition of the course decided to remove all uses of images in the joint interest of performance and CPU code simplicity. ↩
-
In this particular case, there is likely more to this story because the way AMD chose to enumerate their VRAM memory types means that no application allocation should ever end up using memory types 1 and 4. Indeed, these memory types can be used for buffers and not images, but they are respectively ordered after the memory types 0 and 3 that can be used for both buffers and images, and do not differ from types 1 and 4 in any other Vulkan-visible way. Memory allocations using the “first memory type that fits” approach should thus end up using memory types 0 and 3 always. One possibility is that as with the “duplicate” queue families that we discussed before, there might be another property that distinguishes these two memory types, which cannot be queried from Vulkan but can be learned about by exploring manufacturer documentation. But at the time of writing, the author sadly does not have the time to perform such an investigation, so he will leave this mystery for another day. ↩
-
Single-resource binding calls may seem reasonable at first glance, and are certainly good enough for the typical numerical computing application that only binds a couple of buffers per long-running compute pipeline execution. But the real-time 3D rendering workloads that Vulkan was designed for operate on tight real-time budgets (given a 120Hz monitor, a new video frame must be rendered every 8.3ms), may require thousands to millions of resource bindings, and involve complex multipass algorithms that require resource rebinding between passes. For such applications, it is easy to see how even a small per-binding cost in the microsecond range can balloon up into an unacceptable amount of API overhead. ↩
-
To be precise, descriptor sets can be bound to any pipeline that has the same descriptor set layout. Advanced Vulkan users can leverage this nuance by sharing descriptor set layouts across several compute and graphics pipelines. This allows them to amortize the API overheads of pipeline layout setup, but most importantly reduces the need to later set up and bind redundant descriptor sets when the same resources are bound to several related compute and graphics pipelines. ↩
-
If the pipeline executions that share some bindings run in succession, a more efficient alternative to this strategy is to extract the shared subset of the original descriptor set into a different descriptor set. This way, you can keep around the descriptor set that corresponds to the common subset of resource bindings, and only rebind the descriptor sets that correspond to resource bindings that do change. ↩
Execution
It has been quite a long journey, but we are reaching the end of our Vulkan compute tour!
In the last chapter, we have supplemented the GPU compute pipeline that we have set up previously with a set of memory resources. We will use those to initialize our dataset on the CPU side, move it to the fastest kind of GPU memory available, bind that fast memory to our compute pipeline, and bring results back to the CPU side once the computation is over.
There is just one missing piece before we turn these various components into a fully functional application. But it is quite a big one. How do we actually ask the GPU to perform time-consuming work, like copying data or executing compute pipelines? And how do we know when that work is over? In other words, how do we submit and synchronize with GPU work?
In this chapter, we will finally answer that question, which will allow us to put everything together into a basic GPU application, complete with automated testing and performance benchmarks.
Vulkan execution primer
Problem statement
Because they are independent pieces of hardware, CPUs and GPUs have a natural ability to work concurrently. As the GPU is processing some work, nothing prevents the CPU from doing other work on its side. This can be useful even in “pure GPU” computations where that CPU work serves no other purpose than to collect and save GPU results or prepare the execution of more GPU work.
Competent GPU programmers know this and will leverage this concurrent execution capability whenever it can help the application at hand, which is why synchronous GPU APIs that force the CPU to wait for the GPU when it doesn’t need to are a performance crime. In the world of GPU programming, any GPU command that can take a nontrivial amount of time to process, and is expected to be used regularly throughout an application’s lifetime, should be asynchronous.
But experience with Vulkan’s predecessor OpenGL, whose implementors already understood the importance of asynchronism, revealed that there is a lot more to GPU command execution performance than making every major API command asynchronous:
- Sending commands from the CPU to the GPU comes at a significant cost. If the API does not allow applications to amortize this cost by sending commands in batches, then GPU drivers will have to batch commands on their own, resulting in unpredictable delays between the moment where applications call API commands and the moment where they start executing.
- Scheduling GPU work also involves some CPU work, which can accumulate into significant overhead in applications that need lots of short-running GPU commands. Distributing this CPU work across multiple CPU threads could give applications more headroom before it becomes a performance bottleneck… if OpenGL’s global state machine design did not get in the way.
- Assuming several CPU threads do get involved, funneling all commands through a single command submission interface can easily become a bottleneck. So GPU hardware provides multiple command submission interfaces, but the only way to use them efficiently is for GPU APIs to expose them in such a way that each CPU thread can get one.
- Once we accept the idea of multiple submission interfaces, it is not that much of a stretch to introduce specialized submission interfaces for commands that have a good chance to execute in parallel with other ones. This way, GPU hardware and drivers do not need to look far ahead in the command stream to find such commands, and prove that executing them earlier than expected won’t break the application’s sequential command execution logic.
- Speaking of sequential command execution, the promise made by older GPU APIs that GPU commands will execute one after another without observable overlap is fundamentally at odds with how modern pipelined and cache-incoherent GPU hardware works. Maintaining this illusion requires GPU drivers to automatically inject many pipeline and memory barriers, causing reduced hardware utilization and some CPU overhead. It would be great if applications could control this mechanism to selectively allow the kinds of command execution overlap and cache incoherence that do not affect their algorithm’s correctness.
- But there is more to GPU work synchronization than command pipelining
control. Sometimes, CPU threads do need to wait for GPU commands to be done
executing some work. And now that we have multiple channels for GPU work
submission, we also need to think about inter-command dependencies across
those channels, or even across multiple GPUs. OpenGL provided few tools to do
this besides the glFinish() sledgehammer, which waits for all previously submitted work to complete, thus creating a humongous GPU pipeline bubble while imposing a lot more waiting than necessary on the CPU. For any application of even mild complexity, finer-grained synchronization is highly desirable.
What Vulkan provides
Building upon decades of OpenGL application and driver experience, Vulkan set out to devise a more modern GPU command submission and synchronization model that resolves all of the above problems, at the expense of a large increase in conceptual complexity:
- Commands are not directly submitted to the GPU driver, but first collected in batches called command buffers (solving problem #1). Unlike in OpenGL, recording a command buffer does not involve modifying any kind of global GPU/driver state, so threads can easily record command buffers in parallel (solving problem #2).
- Multiple hardware command submission channels are exposed in the API via queues (solving problem #3), which are grouped into queue families to express specialization towards particular kinds of work that is more likely to execute in parallel (solving problem #4).
- GPUs may overlap command execution in any way they like by default, without enforcing a consistent view of GPU memory across concurrent commands. Applications can restrict this freedom whenever necessary by inserting pipeline and memory barriers between two commands within a command buffer (solving problem #5).
- While Vulkan still has a device-wide “wait for idle” operation, which is
supplemented by a threading-friendly queue-local version, it is strongly
discouraged to use such synchronization for any other purpose than debugging.
Finer-grained synchronization primitives are used instead for everyday work
(solving problem #6):
- Fences let CPU code wait for a specific batch of previously submitted commands. They are specialized for everyday “wait for CPU-accessible buffers to be filled before reading them” scenarios, and should be the most efficient tool for these use cases (a vulkano-level sketch follows after this list).
- Events allow GPU commands within a queue to await a signal that can be sent by a previous command within the same queue (as a more flexible but more expensive alternative to the barriers discussed above) or by host code.
- Semaphores provide maximal flexibility at the expense of maximal overhead. They can be signaled by the host or by a GPU command batch, and can be awaited by the host or by another GPU command batch. They are the only Vulkan synchronization primitive that allows work to synchronize across GPU queues without CPU intervention.
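To make the fence bullet above a little more concrete, here is a hedged preview of how a typical “submit a command buffer, then block until a fence signals” sequence looks with vulkano’s high-level future-based API, which we will use at the end of this chapter. Assume that device, queue and command_buffer are already available in scope:
use vulkano::sync::{self, GpuFuture};

// Submit the command buffer to a queue, ask for a fence to be signaled when
// it completes, then block the CPU until that happens
sync::now(device.clone())
    .then_execute(queue.clone(), command_buffer)?
    .then_signal_fence_and_flush()?
    .wait(None /* no timeout */)?;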
The vulkano layer
As you can imagine, all this new API complexity can take a while to master and is a common source of application correctness and performance bugs. This proved especially true in the area of command pipelining, where Vulkan’s “allow arbitrary overlap and incoherence by default” strategy has proven to be a potent cause of application developer confusion.
Other modern performance-oriented GPU APIs like Apple Metal and WebGPU have thus refused to follow this particular path. Their design rationale was that even though the latency increase caused by forbidding observable command overlap cannot be fully compensated, most of the associated throughput loss can be compensated by letting enough commands execute in parallel across device queues, and for sufficiently complex applications that should be good enough.
But that is forgetting a bit quickly that Vulkan is about choice. When the GPU API provides you with the most performant but least ergonomic way to do something, nothing prevents you from building a higher-level layer on top of it that improves ergonomics at the expense of some performance loss. Conversely, if you start from a higher-level API, making it lower-level to improve performance at the expense of ergonomics can be impossible. This is why good layered abstraction design with high-quality low-level layers matters, and in this scheme Vulkan was designed to be the ultimate low-level layer, not necessarily a high-level layer that all applications should use directly.
In the case of Vulkan programming in Rust, we have vulkano for this purpose,
and in this area like others it delivers as expected:
- As a default choice, the high-level AutoCommandBufferBuilder layer implements a simple Metal-like command queuing model, where the execution of commands that operate on related data does not overlap. This should provide good enough performance for typical numerical computing applications, with much improved ergonomics over raw Vulkan.
- If you ever face a performance problem that originates from the resulting lack of GPU command overlap, or from the overhead of CPU-side state tracking (which automatic barrier insertion entails), that is not the end of the world. All you will need to do is to reach for the lower-level RecordingCommandBuffer unsafe layer, and locally face the full complexity of the Vulkan command pipelining model in the areas of your applications that need it for performance. The rest of the application can remain largely untouched.
For the purpose of this course, we will not need lower-level control than what
the high-level safe vulkano layer provides, so the remainder of this chapter
will exclusively use that layer.
Command buffer
As mentioned above, Vulkan requires any nontrivial and potentially recurring GPU work to be packaged up into a command buffer before it can be submitted to the GPU for execution. In our first number-squaring example, the part of the work that qualifies as command buffer worthy is…
- Copying CPU-generated inputs to the fastest available kind of GPU memory
- Binding the compute pipeline so that future execution (dispatch) commands refer to it
- Binding the fast GPU buffer descriptor set so that the compute pipeline uses it
- Executing (dispatching) the compute pipeline with a suitable number of workgroups
- Copying the output back to CPU-accessible memory
…and using vulkano, we can build a command buffer that does this as follows:
use vulkano::{
command_buffer::{
auto::{AutoCommandBufferBuilder, PrimaryAutoCommandBuffer},
CommandBufferUsage, CopyBufferInfo,
},
pipeline::PipelineBindPoint,
};
/// Build a command buffer that does all the GPU work
pub fn build_command_buffer(
pipeline_options: &PipelineOptions,
context: &Context,
gpu_pipeline: Pipeline,
cpu_input: Subbuffer<[f32]>,
gpu_data: Subbuffer<[f32]>,
gpu_data_desc: Arc<DescriptorSet>,
cpu_output: Subbuffer<[f32]>,
) -> Result<Arc<PrimaryAutoCommandBuffer>> {
// Set up a primary command buffer
let mut builder = AutoCommandBufferBuilder::primary(
context.comm_allocator.clone(),
context.queue.queue_family_index(),
CommandBufferUsage::OneTimeSubmit,
)?;
// Copy CPU inputs to the GPU side
builder.copy_buffer(CopyBufferInfo::buffers(cpu_input, gpu_data.clone()))?;
// Bind the compute pipeline for future dispatches
builder.bind_pipeline_compute(gpu_pipeline.compute)?;
// Bind memory to the compute pipeline
builder.bind_descriptor_sets(
PipelineBindPoint::Compute,
gpu_pipeline.layout,
DATA_SET,
gpu_data_desc,
)?;
// Execute the compute pipeline with an appropriate number of work groups
let num_work_groups = cpu_output
.len()
.div_ceil(pipeline_options.workgroup_size.get() as u64);
// SAFETY: GPU shader has been checked for absence of undefined behavior
// given a correct execution configuration, and this is one
unsafe {
builder.dispatch([num_work_groups as u32, 1, 1])?;
}
// Retrieve outputs back to the CPU side
builder.copy_buffer(CopyBufferInfo::buffers(gpu_data, cpu_output))?;
// Finalize the command buffer object
Ok(builder.build()?)
}
As usual, a few things should be pointed out about this code:
- That’s a lot of function parameters! Which comes from the fact that this function asks the GPU to do many different things. Our example code is written that way because it allows us to introduce Vulkan concepts in a more logical order, but real-world Vulkan apps would benefit from spreading the command recording process across more functions that each take an &mut AutoCommandBufferBuilder as a parameter.
  - Generally speaking, functions should favor taking an &mut AutoCommandBufferBuilder over building command buffers internally until you are ready to submit work to the GPU (see the sketch after this list). This allows you to pack your GPU work into as few command buffers as possible, which may improve command execution efficiency.1
- We are building a primary command buffer, which can be directly submitted to the GPU. This is in contrast with secondary command buffers, which can be inserted into primary command buffers. The latter can be used to avoid repeatedly recording recurring commands,2 and it also combines really well with a Vulkan graphics rendering feature called render passes that falls outside of the scope of this compute-focused introductory course.
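To make the first point more concrete, here is a minimal sketch of what such a helper could look like, using the same imports and types as the build_command_buffer() function above. The function name is made up for illustration, and the exact generic parameters of AutoCommandBufferBuilder may vary across vulkano versions:
/// Hypothetical helper that records the input upload into a command buffer
/// that the caller is in the process of building
fn record_input_upload(
    builder: &mut AutoCommandBufferBuilder<PrimaryAutoCommandBuffer>,
    cpu_input: Subbuffer<[f32]>,
    gpu_data: Subbuffer<[f32]>,
) -> Result<()> {
    // Record the CPU-to-GPU copy, then leave the command buffer open so that
    // the caller can keep recording more commands into it
    builder.copy_buffer(CopyBufferInfo::buffers(cpu_input, gpu_data))?;
    Ok(())
}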
Execution
The high-level vulkano API generally tries to stay close to the underlying
Vulkan C API, using identical concepts and naming. Deviations from raw Vulkan
must be motivated by the desire to be memory-, type- and thread-safe by default,
in line with Rust’s design goals. This is good as it makes it easier to take
documentation about the Vulkan C API and apply it to vulkano-based programs.
However, there is one area where vulkano’s high-level API strays rather far
from the concepts of its Vulkan backend, and that is command execution and
synchronization. This makes sense because that part of the vulkano API needs
to guard against safety hazards related to GPU hardware asynchronously reading
from and writing to CPU-managed objects, which is quite difficult.3
Instead of closely matching the Vulkan functions used for command buffer
submission (like vkQueueSubmit()) and synchronization (like vkWaitForFences()),
the high-level vulkano API for command submission and synchronization
therefore currently4 works as follows:
- Prepare to send a PrimaryAutoCommandBuffer to a Queue using its execute() method. This produces a CommandBufferExecFuture, which is a special kind of GpuFuture. GPU future objects model events that will eventually occur, in this case the moment where commands within this particular command buffer will be done executing.
  - It is very important to understand that at this point, the command buffer has not been sent to the GPU. Indeed, vulkano must somehow expose the fact that Vulkan lets us send multiple command buffers to the GPU with a single vkQueueSubmit() operation in order to be as efficient as the Vulkan C API.
  - Other GPU future objects are available, representing things like the signal that our system is ready to render a new frame in real-time graphics.
- Chain as many of these futures as desired using methods like then_execute() and join(), which respectively represent sequential and concurrent execution. This will produce a GpuFuture object that represents an arbitrarily complex graph of GPU tasks that may or may not be linked by sequential execution dependencies.
- Indicate points where Vulkan synchronization objects (fences and semaphores) should be signaled using GPU future methods like then_signal_fence() and then_signal_semaphore().
- Submit all previously scheduled work to the GPU whenever desired using the flush() method. This does not destroy the associated GPU future object, so that you can keep scheduling more work after the work that has just started executing.
  - One will often want to do this at points of the asynchronous task graph that are observable through signaling of a Vulkan synchronization object, which is why vulkano provides then_signal_fence_and_flush() and then_signal_semaphore_and_flush() API shortcuts for this use case.
- When the CPU momentarily runs out of work to submit to the GPU, await completion of some previous work using the wait() method of the particular GPU future type that is returned by the then_signal_fence() operation. Because making a CPU wait for GPU work for an unbounded amount of time is bad, polling and timeout options are also available here.
In the context of our number-squaring example, we can use this API as follows:
use vulkano::{
command_buffer::PrimaryCommandBufferAbstract,
sync::future::GpuFuture,
};
/// Synchronously execute the previously prepared command buffer
pub fn run_and_wait(context: &Context, commands: Arc<PrimaryAutoCommandBuffer>) -> Result<()> {
commands
.execute(context.queue.clone())?
.then_signal_fence_and_flush()?
.wait(None)?;
Ok(())
}
Notice that we need to bring a few traits into scope in order to use the
execute()
method (whose implementation is shared across all kinds of primary command
buffers using the
PrimaryCommandBufferAbstract
trait) and the
GpuFuture
trait’s methods (whose implementation is also shared across all kinds of GPU
futures).
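As an aside, these same traits let us chain more elaborate submissions than our single command buffer. Here is a minimal sketch (not part of the exercises) of how two hypothetical command buffers, commands_a and commands_b, could be executed back to back on a queue with a single flush, assuming the same imports as run_and_wait() plus vulkano::device::Queue:
/// Hypothetical example: execute two command buffers in sequence, then wait
fn run_two_and_wait(
    queue: Arc<Queue>,
    commands_a: Arc<PrimaryAutoCommandBuffer>,
    commands_b: Arc<PrimaryAutoCommandBuffer>,
) -> Result<()> {
    commands_a
        .execute(queue.clone())?
        // Schedule a second batch that runs after the first one
        .then_execute(queue, commands_b)?
        // Request a fence signal at the end, then submit everything to the GPU
        .then_signal_fence_and_flush()?
        // Block until both command buffers are done executing
        .wait(None)?;
    Ok(())
}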
Exercises
Final executable
Now that we have all parts of the computation ready, we can put them all
together into a complete program by adding all of the above functions to
exercises/src/square.rs, then rewriting the exercises/src/bin/square.rs
binary’s source into the following:
use clap::Parser;
use grayscott_exercises::{
context::{Context, ContextOptions},
square::{self, Pipeline, PipelineOptions},
Result,
};
/// This program generates and squares an array of numbers
#[derive(Parser, Debug)]
#[command(version, author)]
struct Options {
/// Vulkan context configuration
#[command(flatten)]
context: ContextOptions,
/// Compute pipeline configuration
#[command(flatten)]
pipeline: PipelineOptions,
/// Input data configuration
#[command(flatten)]
input: InputOptions,
}
fn main() -> Result<()> {
// Decode CLI arguments
let options = Options::parse();
// Set up a generic Vulkan context
let context = Context::new(&options.context)?;
// Set up a compute pipeline
let pipeline = Pipeline::new(&options.pipeline, &context)?;
// Set up memory resources
let cpu_input = square::setup_cpu_input(&options.input, &context)?;
let gpu_data = square::setup_gpu_data(&options.input, &context)?;
let gpu_data_desc = square::setup_descriptor_set(&context, &pipeline, gpu_data.clone())?;
let cpu_output = square::setup_cpu_output(&options.input, &context)?;
// Build a command buffer
let commands = square::build_command_buffer(
&options.pipeline,
&context,
pipeline,
cpu_input.clone(),
gpu_data,
gpu_data_desc,
cpu_output.clone(),
)?;
// Synchronously execute the command buffer
square::run_and_wait(&context, commands)?;
// Check computation results
let cpu_input = cpu_input.read()?;
let cpu_output = cpu_output.read()?;
assert!((cpu_input.iter())
.zip(cpu_output.iter())
.all(|(input, output)| *output == input.powi(2)));
println!("All numbers have been squared correctly!");
// Save the Vulkan pipeline cache
context.save_pipeline_cache()?;
Ok(())
}
Notice the use of the
Subbuffer::read()
method at the end, which is needed in order to allow vulkano to check for
absence of data races between the CPU and the GPU.
Execute this binary in debug mode, then in release mode, while measuring execution times:
cargo build --bin square >/dev/null \
&& time cargo run --bin square \
&& cargo build --release --bin square >/dev/null \
&& time cargo run --release --bin square
Do you understand the outcome? If not, think about what you are measuring for a bit to see if you can figure it out for yourself, before moving to the next part.
Benchmarks
While timing complete programs like square can be nice for early performance
exploration, as soon as you start getting into the realm of performance
optimization it is good to figure out which part of the program is most critical
to performance and measure it separately.
A benchmark has been set up to this end in exercises/benches/square.rs. Once
you are done with the above exercise, it should compile and run. Try to run it
with cargo bench --bench square, and use it to determine…
- Which parts of the process are slowest on the various devices exposed by your system.
- How their performance scales with the various tunable parameters.
Note that the cargo bench command also accepts a regular expression argument that can be used to only run selected benchmarks. It can be used like this:
cargo bench --bench square -- "(input1000|workgroup64)/"
These benchmarks are pretty long-running so…
- If you are running them on a laptop, plug in the power adapter
- If you are running them on an HPC center, run them on a worker node
- Do not hesitate to stop them ahead of time once you have seen what you want to see, then re-run them in a more restricted configuration.
One thing you can also do is to modify this program so that
build_command_buffer schedules multiple compute shader dispatches instead of
one (i.e. instead of only squaring numbers, you elevate them to the power of 2,
4, 8, 16…). Note that this does not require you to repeatedly re-bind the
compute pipeline and data descriptor set.
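As a sketch of the kind of change being suggested (with a hypothetical num_dispatches parameter), the dispatch section of build_command_buffer() could be turned into a loop, while the pipeline and descriptor set binding commands before it stay unchanged:
// Execute the compute pipeline several times in a row: each dispatch squares
// the numbers again, raising the original inputs to the powers 2, 4, 8, 16...
for _ in 0..num_dispatches {
    // SAFETY: same reasoning as in the single-dispatch version above
    unsafe {
        builder.dispatch([num_work_groups as u32, 1, 1])?;
    }
}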
Modify the program to measure performance at various numbers of compute shader dispatches. What do you observe, and what does this tell you about the most likely GPU-side performance bottleneck(s) of this number-squaring program?
Optimizations
This last exercise may take you a significant amount of time and should only be worked on if you finished well ahead of the expected end of the course.
Get back your list of CPU-GPU data handling strategies from the exercise of the last chapter, implement one of them (preferably one of those that are easy to implement), and use the benchmark to measure how it affects execution performance.
Because of the large number of variations of this exercise that you can do,
corrections are not provided in the solution branch of the course’s
repository. Ask the teacher if you need a review!
-
Ignoring for a moment the question of how expensive it is to create and build a command buffer (cheap but not free as usual), some Vulkan implementations are known to insert pipeline barriers between consecutive command buffers, most likely in an attempt to simplify the implementation of synchronization primitives that operate at command buffer boundaries like fences. ↩
-
…but reusing command buffers like this comes at the cost of losing some GPU driver optimizations. It is therefore recommended that applications first attempt to resolve command recording bottlenecks by spreading the command recording load across multiple CPU threads, before falling back to secondary command buffers if that is not enough. ↩
-
Rust’s compile-time safety proofs build on the fact that the compiler has full knowledge of the program execution and data flow, which is not true in the presence of asynchronous GPU work execution. This problem is also encountered when designing CPU threading APIs, where compile-time safety is commonly restored by introducing synchronization points that wait for all tasks to finish executing and release data references (a design known as fork-join or structured concurrency). But as previously discussed, such synchronization points are unwelcome in GPU programming, which is why vulkano has instead gone for a run-time safety tracking mechanism that was largely custom-built for its use case. ↩
-
The GpuFuture abstraction has a number of known flaws, and there is work ongoing in the vulkano-taskgraph crate with the aim of eventually replacing it. But that work is not ready for prime time yet. ↩
Gray-Scott introduction
After going through a first tour of Vulkan on a simple number-squaring problem, it is time for us to take a step up in complexity and go for a full Gray-Scott reaction simulation.
To this end, the first (unoptimized) version of the CPU simulation has been copied into the course’s codebase. And throughout most of this chapter, we will see how our current CPU simulation code and GPU infrastructure can be modified in order to get a first GPU simulation.
After that, if time permits, we will also see what kind of changes we can make to this basic GPU program in order to make the simulation more efficient.
Instance & Context
Earlier, I mentioned that it is possible to share lots of Vulkan context-building code between your applications, provided that you are ready to add extra configuration points to the context-building code whenever necessary in order to accommodate new needs.
In this chapter, we will provide a first example by adding a new configuration
point to our instance-building code. This is necessary because in the main
Gray-Scott simulation binary, we are using an indicatif-based textual
progress bar. Which is good for user experience, but bad for developer
experience, as it breaks one of the most powerful of all debugging tools:
println!().1
Mixing indicatif with console output
Thankfully, the authors of indicatif are aware of the great sacrifice that
fellow developers have to make when they use this library, so they tried to
ease the pain by providing a
println()
method on the ProgressBar object that eases the migration of code that
previously used println!().
However this method is not quite enough for our purposes, as we would like to
follow Unix convention by sending our log messages to stderr, not stdout. So
we will instead go for its more powerful cousin, the suspend() method. Its
callback-based design lets us execute arbitrarily complex text output code and
more generally use stdout and stderr in any way we like, without suffering
the progress bar visual corruption that could otherwise ensue:
progress_bar.suspend(|| {
println!("Can use stdout here...");
eprintln!("...and stderr too...");
println!(
"...all that without losing access to {:?}, or needing to allocate strings",
"the println mini-language"
);
});
We can then leverage the fact that ProgressBar uses an
Arc-like cloning model,
which means that we can make as many clones of the initial ProgressBar
object as we need, send them anywhere needed, and all resulting ProgressBar
objects will operate on the same progress bar.
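As a minimal illustration of this cloning model (with a made-up progress bar length), every clone drives the same underlying bar:
use indicatif::ProgressBar;

let progress_bar = ProgressBar::new(100);
// A clone can be sent to whatever code needs to report progress...
let worker_bar = progress_bar.clone();
// ...and advancing either handle advances the same on-screen progress bar
worker_bar.inc(1);
progress_bar.inc(1);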
And by combining these two aspects of indicatif’s API, we can devise a
strategy that will give us back a correct terminal display of Vulkan logs with
minimal effort:
- Send a copy of the ProgressBar to any code that needs to do some text output. If we were feeling lazy, we could even make it a global variable, as we’re unlikely to ever need multiple progress bars or progress bar-related tests.
- In the code that does the text output, wrap all existing text output into suspend().
- Repeat the process every time new text output needs to be added.
Exercise
This is actually the only part of our Vulkan instance- and context-building code that needs to change in order to accommodate the needs of our Gray-Scott reaction simulation.
From the above information, we can infer a reasonably small code refactor that eliminates all risks of inelegant progress bar visual corruption:
- Add a new Option<ProgressBar> parameter to the logger_info() function in exercises/src/instance.rs, with the following semantics.
  - If this parameter is None, then there is no progress bar and we can just send output to the console directly the way we did before.
  - If it is Some(progress_bar), then wrap our Vulkan logging into a progress_bar.suspend(move || { /* ... logging goes here ... */ }) callback.
- Modify callers of logger_info()2 in order to give this parameter an appropriate value.
  - In the beginning, you will just want to add a similar extra parameter to the caller function, so that it also takes such a parameter and simply passes it down, repeating the process as many times as necessary until…
  - …at some point you will reach a top-level binary or benchmark that does not need to use indicatif. You will then be able to stop “bubbling up” optional parameters as described above, and instead simply pass None as an argument.
  - You will notice that examples/benches/simulate.rs does not need adjustments here (and does not compile yet). This is expected: that benchmark is pre-written in such a way that it will be valid by the time you reach the end of the Gray-Scott section.
- Finally, modify examples/bin/simulate.rs so that it sets up a LoggingInstance in an appropriate manner. For now, do not try to wire this object down through the rest of the Gray-Scott simulation. Just leave it unused and ignore the resulting compiler warning.
If you did not have any Rust experience before this course, the following notes should help you with this exercise:
To be able to follow step 1, you will need a language feature known as pattern matching. We have not covered it in this course due to lack of time and conflicting priorities, but here is a simple code example that should give you a good starting point:
#![allow(unused)]
fn main() {
    fn option_demo(value: Option<String>) -> Option<String> {
        // Can check if an Option contains Some or None nondestructively...
        if let Some(x) = &value {
            println!("Received a string value: {x}");
        } else {
            println!("Did not receive a value");
        }
        // ...in the sense that if `&` is used as above, `x` is not a `String` but
        // a `&String` reference, and therefore the above code does not move
        // `value` away and we can still use it.
        value
    }
}
Still at step 1, you will also need to know that by default, anonymous functions
aka lambdas capture surrounding variables from the environment by reference, and
you need to add the move keyword to force them to capture surrounding
variables by value:
#![allow(unused)]
fn main() {
    let s = String::from("Hello world");
    // This function captures s by reference
    let f_ref = || println!("{s}");
    f_ref(); // ...so s can still be used after this point...
    // This one captures it by value i.e. moves it...
    let f_mv = move || println!("{s}");
    f_mv(); // ...which means s cannot be used anymore starting here
}
At step 3, you will run into trouble with a function that returns an
hdf5::Result. This result type is no longer general-purpose enough, as it can only contain HDF5 errors, whereas we now also need to handle Vulkan errors. Replacing this specialized HDF5 result type with the more general
grayscott_exercises::Result type will resolve the resulting compiler error.
Once you are done with the above refactoring, proceed to modify the
Context::new() constructor to also support this new feature.
Then change examples/bin/simulate.rs to create a context instead of a raw
instance, and adjust any other code that calls into Context::new() as needed.
While doing so, you will likely find that you need to adjust the Gray-Scott
simulation CLI arguments defined at exercises/src/grayscott/options.rs, in
order to let users of the simulate binary configure the Vulkan context
creation process on the command line, much like they already can when running
the square binary that we worked on in the previous course section.
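As a rough sketch of what this CLI adjustment could look like, mirroring the Options struct of the square binary shown earlier (the exact module path of ContextOptions and the layout of the existing simulation options are assumptions that you should check against the actual code):
use crate::context::ContextOptions;
use clap::Parser;

#[derive(Debug, Parser)]
#[command(version)]
pub struct RunnerOptions {
    /// Vulkan context configuration
    #[command(flatten)]
    pub context: ContextOptions,

    // [ ... existing simulation options remain as they were ... ]
}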
And that will be it. For the first version of our Vulkan-based Gray-Scott
reaction simulation, we are not going to need any other change to the Context
and Instance setup code.
-
…and print!(), and eprintln!()… basically any kind of textual application output over stdout and stderr will break indicatif’s progress bar rendering along with that of any other kind of live terminal ASCII art that you may think of, which is a great shame. ↩
-
On the Linux/macOS command line, you can find these by calling the grep logger_info command at the root of the exercises/ directory. ↩
Pipelines
Unlike the GPU context building code, which is rather generic, our previous GPU pipeline was specific to the number-squaring task at hand. So it will take quite a few changes to our pipeline-building procedure before we get to a working Gray-Scott reaction simulation.
Code module
To avoid making our code too complicated for Rust beginners, we will not attempt to fully deduplicate pipeline setup code between the number-squaring and Gray-Scott computations.
Instead, we will create a new dedicated pipeline code module inside of our
grayscott module, within which we will copy and paste relevant code from the
square pipeline as appropriate.
This can be done by going through the following steps:
- Add a pub mod pipeline item to the exercises/src/grayscott/mod.rs file, which represents the root of the Gray-Scott reaction specific code.
- Create an empty exercises/src/grayscott/pipeline.rs file inside of this directory, which will host compute pipeline handling code.
Common GLSL infrastructure
Data interface
After the mandatory GLSL version declaration…
#version 460
…we need to think a bit about what our CPU-GPU data interface is going to look like.
- We want to have two input buffers, from which we are going to read data representing the initial concentration of the U and V chemical species.
- We want to have two output buffers, into which we are going to write updated values of the chemical species concentrations.
- These 4 buffers will be re-bound together (in a pattern where they alternate between input and output roles), and can thus be grouped into a single descriptor set for efficiency.
This data interface can be expressed via the following GLSL code…
// Input and output data buffers
layout(set = 0, binding = 0) restrict readonly buffer InputBuffer {
float[] data;
} Input[2];
layout(set = 0, binding = 1) restrict writeonly buffer OutputBuffer {
float[] data;
} Output[2];
// Indexing convention for Input and Output arrays
const uint U = 0;
const uint V = 1;
…which leverages a few GLSL features that we have not used so far:
- Like C pointers, GLSL data interfaces can be annotated with the restrict keyword. This allows the compiler to assume that they are the only way to read or write the data of interest, resulting in improved optimizations when matters like SIMD get involved (which is the case on all GPU hardware in common use).
- When it comes to input and output interfaces, GLSL replaces the C/++ const notion with a more symmetrical scheme based on read and write access. In particular, we can declare buffers as readonly and writeonly when that is our intent, to make sure that using them otherwise becomes a compilation error.
  - This feature is only useful for programmer error avoidance and is unlikely to yield any performance benefits, because in a programming language without separate compilation like GLSL, modern optimizing compilers are perfectly capable of figuring out if a particular buffer is only read from or written to.
- Because GLSL inherits the C/++ struct limitation of only allowing one flexible array member at the end of a shader storage block, we cannot e.g. model our inputs as a buffer containing both float[] u; and float[] v;. We resolve this with arrays of storage blocks, coupled with helpful consts that let us tell which index of these arrays represents which chemical species.
  - const in GLSL has a meaning that sits somewhere in between those of const and constexpr in C++. If a GLSL const is just a numerical literal, specialization constant, or combination thereof, then it automatically becomes a compilation constant, and can be used for constexpr-like purposes such as defining the size of struct array members. GLSL doesn’t have global variables, and therefore all consts declared at the top-level scope of a GLSL program must be compilation constants.
Specialization constants
Now, if you remember the previous example of GLSL code that we have gone through, you will know that we will also want a set of specialization constants, for two different reasons:
- It improves code maintainability by reducing the volume of information that we need to duplicate on the CPU and GPU side of the interface. Such duplication creates a risk that information goes out of sync as we modify either side, resulting in tricky program bugs.
- It allows us to configure the GPU code from the CPU side right at the time
where the GPU code is compiled (at application startup time), and thus allows
us to…
- Avoid run-time configuration, which is a little more cumbersome in Vulkan than in other GPU APIs because the underlying evil hardware mechanics are not hidden.
- Benefit from compiler optimizations that leverage knowledge of all application parameters, so that our GPU code gets specialized for this set of parameters.
What specialization constants are we going to need then?
- As before, we will want to have a way to set the execution workgroup size. Because our problem is now two-dimensional, we will want to allow two-dimensional workgroups in addition to one-dimensional ones, as those may come in handy.
- Vulkan storage buffers are one-dimensional,1 so right now we only know the total number of elements of our 2D chemical species tables, and not their aspect ratio. We’ll need this information to correctly perform our stencil computation, so if we don’t hardcode it into the shader, we’ll need to pass it from the CPU side to the GPU side somehow. Specialization constants are one easy way to do this.
- Our computation has a number of parameters that are known either at Rust code compilation time or at GPU pipeline building time, which from the perspective of GPU programming makes no difference. Specialization constants can also be used to pass such constants to the GPU compiler for the sake of making it know as much as possible about our computation.
We can encode this set of specializations in GLSL as follows:
// Configurable workgroup size, as before
layout(local_size_x = 8, local_size_y = 8) in;
layout(local_size_x_id = 0, local_size_y_id = 1) in;
// Concentration table width
layout(constant_id = 2) const uint UV_WIDTH = 1920;
// "Scalar" simulation parameters
layout(constant_id = 3) const float FEED_RATE = 0.014;
layout(constant_id = 4) const float KILL_RATE = 0.054;
layout(constant_id = 5) const float DELTA_T = 1.0;
// 3x3 Laplacian stencil
//
// Unfortunately, SPIR-V does not support setting matrices via specialization
// constants at this point in time, so we'll need to hack our way into this
layout(constant_id = 6) const float STENCIL_WEIGHT_11 = 0.25;
layout(constant_id = 7) const float STENCIL_WEIGHT_12 = 0.5;
layout(constant_id = 8) const float STENCIL_WEIGHT_13 = 0.25;
layout(constant_id = 9) const float STENCIL_WEIGHT_21 = 0.5;
layout(constant_id = 10) const float STENCIL_WEIGHT_22 = 0.0;
layout(constant_id = 11) const float STENCIL_WEIGHT_23 = 0.5;
layout(constant_id = 12) const float STENCIL_WEIGHT_31 = 0.25;
layout(constant_id = 13) const float STENCIL_WEIGHT_32 = 0.5;
layout(constant_id = 14) const float STENCIL_WEIGHT_33 = 0.25;
//
// This function call will be inlined by any competent GPU compiler and
// will therefore not introduce any run-time overhead
mat3 stencil_weights() {
return mat3(
// CAUTION: GLSL matrix constructors are column-major, which is the
// opposite of the convention used by Rust and C/++. Let's make
// the life of our CPU code easier by having it provide
// specialization constants using its standard convention,
// then performing the transpose inside of the GPU compiler.
vec3(STENCIL_WEIGHT_11, STENCIL_WEIGHT_21, STENCIL_WEIGHT_31),
vec3(STENCIL_WEIGHT_12, STENCIL_WEIGHT_22, STENCIL_WEIGHT_32),
vec3(STENCIL_WEIGHT_13, STENCIL_WEIGHT_23, STENCIL_WEIGHT_33)
);
}
// Finally, the DIFFUSION_RATE constants are best exposed as a vector, following
// our general design of treating (U, V) pairs as arrays or vectors of size 2
layout(constant_id = 15) const float DIFFUSION_RATE_U = 0.1;
layout(constant_id = 16) const float DIFFUSION_RATE_V = 0.05;
//
vec2 diffusion_rate() {
return vec2(DIFFUSION_RATE_U, DIFFUSION_RATE_V);
}
By now, you should hopefully agree with the author that manually numbering
specialization constants like this is error-prone, and something that machines
should be doing automatically instead of leaving it up to manual human work. But
sadly, the author knows of no way around it today. Hopefully future GPU
languages or vulkano updates will improve upon those pesky GLSL binding number
ergonomics someday…
Data layout
At this point, we have our CPU/GPU interface fully specified, and can start writing some GPU compute shaders that our CPU code can later call into.
Notice the plural above. For this simulation, we will do a few things
differently from before, so that you can get exposed to a few more ways to
perform common tasks in Vulkan. Compared to the previous square example, one
thing that we are going to do differently is to use two compute shaders instead
of one:
- One compute shader, which we will call init, will be used to initialize the U and V arrays directly on the GPU. This way we won’t need to set up expensive CPU-to-GPU transfers just to get a basic initial data pattern that can easily be GPU-generated.
- After this is done, another compute shader, which we will call step, will be repeatedly used to perform the desired number of Gray-Scott simulation steps.
We will also use a different approach to handling stencil edges. Instead of handling these via irregularly shaped input data windows (with smaller windows on the edges of the simulation domain), as we did at the start of the CPU chapter, we will start with the alternate way of padding the dataset with one line of zeroed values on each edge that encodes boundary conditions.
[Diagram: the simulation dataset, with simulated concentration values in an inner dark gray rectangle surrounded by a one-element-wide outer light gray frame of zero padding]
This approach to edge handling has pros and cons:
- Our code logic will be simpler, which seems good as GPU hardware tends to sacrifice some ability to handle fancy code logic in the name of increased number-crunching power.
- We will need to be more careful with our data layout computations, adding/subtracting 1 to positions and 2 to storage width/height when appropriate. Basically, there is now a difference between working in the space of simulated concentration values (inner dark gray rectangle in diagram above) and the space of actual storage buffers (outer light gray rectangle), and when we switch between the two in our code we need to perform a coordinate transform.
- We may or may not get memory access alignment issues that can reduce our computational performance on some hardware. If we get them, we can resolve them through clever use of extra unused padding floats. But as GPU hardware is known to be more tolerant of unaligned SIMD accesses than CPU hardware, we will not attempt to resolve this issue unless a GPU profiler tells us that we are having it on some particular hardware of interest.
To handle the “careful layout” part of this tradeoff, we will set up a few GLSL utilities that let us share more code between our two compute shaders, so that at least we only need to write the tricky data layout concern once, and can more easily change the data layout later if needed:
// Data padding control and handling
const uint PADDING_PER_SIDE = 1;
const uint PADDED_UV_WIDTH = UV_WIDTH + 2 * PADDING_PER_SIDE;
//
// Unlike the above constants, these functions will not be zero cost. However
// all their data inputs are either constant across an entire compute
// dispatch's lifetime (data length()) or compilation constants, so the compiler
// should be able to deduplicate multiple calls to them given enough inlining.
uint padded_uv_height() { return Input[0].data.length() / PADDED_UV_WIDTH; }
uint uv_height() { return padded_uv_height() - 2 * PADDING_PER_SIDE; }
// First/last output position that corresponds to an actual data location and
// not a padding value that should always be zero, followed by other useful
// special 2D indices within the simulation domain
const uvec2 DATA_START_POS = uvec2(PADDING_PER_SIDE, PADDING_PER_SIDE);
uvec2 padded_end_pos() {
return uvec2(PADDED_UV_WIDTH, padded_uv_height());
}
uvec2 data_end_pos() {
return padded_end_pos() - uvec2(PADDING_PER_SIDE);
}
// Convert a 2D location into a linear buffer index
uint pos_to_index(uvec2 pos) {
return pos.x + pos.y * PADDED_UV_WIDTH;
}
// Read an (U, V) pair from a particular input location
//
// pos starts at (0, 0) for the upper-left padding value, with (1, 1)
// corresponding to the first actual data value.
vec2 read(uvec2 pos) {
const uint index = pos_to_index(pos);
return vec2(
Input[U].data[index],
Input[V].data[index]
);
}
// Write an (U, V) to a particular output location, pos works as in read()
void write(uvec2 pos, vec2 value) {
const uint index = pos_to_index(pos);
Output[U].data[index] = value.x;
Output[V].data[index] = value.y;
}
And that will be it for the code that is shared between our two compute shaders.
You can now save all of the above GLSL code, except for the initial #version 460 directive (we’ll get back to this), into a file at location
exercises/src/grayscott/common.comp. And once that is done, we will start
writing some actual compute shaders.
Initialization shader
Now that we have some common utilities to interface with our datasets and configure the simulation, let us write our initialization compute shader.
This will mostly be a straightforward GLSL translation of our CPU data initialization code. But because this is GPU code, we need to make a new decision, which is the way GPU work items and workgroups will map onto the work to be done.
We have decided to go with the simple mapping illustrated by the following diagram:
[Diagram: workgroups (purple squares) covering the full padded dataset (red zone) with one work item per data point; padding elements marked “0” surround the central data region, and out-of-bounds work items fall in the outer purple area]
The CPU command buffer that will eventually execute this compute shader will dispatch enough workgroups (purple squares) to cover the full padded simulation dataset (red zone) with one work item per data point. But work item tasks will vary:
- Padding elements (denoted “0”) will be initialized to zero as they should.
- Non-padding elements (inside the zero padding) will be initialized as in the CPU version.
- Out-of-bounds work items (outer purple area) will exit early without doing anything.
This general scheme of having work-items at different positions perform different kinds of work will reduce execution efficiency a bit on SIMD GPU hardware, however…
- The impact should be minor at the target dataset size of full HD images (1920x1080 concentration values), where edge elements and out-of-bounds work items should only have a small contribution to the overall execution time.
- The initialization compute shader will only execute once per full simulation run, so unless we have reasons to care about the performance of short simulation runs with very few simulation steps, we should not worry about the performance of this shader that much.
All said and done, we can implement the initialization shader using the following GLSL code…
#version 460
#include "common.comp"
// Polyfill for the standard Rust saturating_sub utility
uint saturating_sub(uint x, uint y) {
if (x >= y) {
return x - y;
} else {
return 0;
}
}
// Data initialization entry point
void main() {
// Map work items into 2D padded buffer, discard out-of-bounds work items
const uvec2 pos = uvec2(gl_GlobalInvocationID.xy);
if (any(greaterThanEqual(pos, padded_end_pos()))) {
return;
}
// Fill in zero boundary condition at edge of simulation domain
if (
any(lessThan(pos, DATA_START_POS))
|| any(greaterThanEqual(pos, data_end_pos()))
) {
write(pos, vec2(0.0));
return;
}
// Otherwise, replicate standard Gray-Scott pattern in central region
const uvec2 data_pos = pos - DATA_START_POS;
const uvec2 pattern_start = uvec2(
7 * UV_WIDTH / 16,
saturating_sub(7 * uv_height() / 16, 4)
);
const uvec2 pattern_end = uvec2(
8 * UV_WIDTH / 16,
saturating_sub(8 * uv_height() / 16, 4)
);
const bool pattern = all(greaterThanEqual(data_pos, pattern_start))
&& all(lessThan(data_pos, pattern_end));
write(pos, vec2(1.0 - float(pattern), float(pattern)));
}
…which should be saved at location exercises/src/grayscott/init.comp.
As is customary in this course, we will point out a few things about the above code:
- Like C, GLSL supports #include preprocessor directives that can be used in order to achieve a limited form of software modularity. Here we are using it to make our two compute shaders share a common CPU-GPU interface and a few common utility constants/functions.
- …but for reasons that are not fully clear to the course’s author (dubious GLSL design decision or shaderc compiler bug?), #version directives cannot be extracted into a common GLSL source file and must be present in the source code of each individual compute shader.
- GLSL provides built-in vector and matrix types, which we use here in an attempt to make our 2D computations a little clearer. Use of these types may sometimes be required for performance (especially when small data types like 8-bit integers and half-precision floating point numbers get involved), but here we only use them for expressivity and concision.
Simulation shader
In the initialization shader that we have just covered, we needed to initialize the entire GPU dataset, padding edges included. And the most obvious way to do this was to map each GPU work item into one position of the full GPU dataset, padding zeroes included.
When it comes to the subsequent Gray-Scott reaction simulation, however, the mapping between work items and data that we should use is less immediately obvious. The two simplest approaches would be to use one work item per input data point (which would include padding, as in our initialization algorithm) or one work item per updated output U/V value (excluding padding). But these approaches entail different tradeoffs:
- Using one work item per input data point allows us to expose a bit more concurrent work to the GPU (one extra work item per padding element), but as mentioned earlier the influence of such edges should be negligible when computing a larger image of size 1920x1080.
- Using one work item per output data point means that each GPU work item can load all of its data inputs without cooperating with other work items, and is the only writer of its data output. But this comes at the expense of each input value being redundantly loaded ≤8 more times by the Laplacian computations associated with all output data points at neighboring 2D positions. This may or may not be automatically handled by the GPU’s cache hierarchy.
- In contrast, with one work item per input data point, we will perform no redundant data loading work, but will need to synchronize work items with each other in order to perform the Laplacian computation, because a Laplacian computation’s inputs now come from multiple work items. Synchronization comes at the expense of extra code complexity, and also adds some overhead that may negate the benefits of avoiding redundant memory loads.
Since there is no obviously better approach here, it is best to try both and compare their performance. Therefore, in the initial version of our Gray-Scott GPU implementation we will start with the simplest code of using one work item per output data point, which is illustrated below. And later on, after we get a basic simulation working, we will discuss optimizations that reduce the costs of redundant Laplacian input loading or eliminate such redundant loading entirely.
The following diagram summarizes the resulting execution and data access strategy:
[Diagram: workgroups (purple squares) covering the central region of the simulation dataset (red rectangle) with one work item per output data point; each work item reads a 3x3 neighborhood of input values (blue square) around its output location (red square)]
CPU command buffers that will execute the simulation compute shader will
dispatch enough workgroups (purple squares) to cover the central region of the
simulation dataset (red rectangle) with one work item per output data point.
On the GPU side, work items that map into a padding or out-of-bounds
location (purple area) will be discarded by exiting main() early.
Each remaining work item will then proceed to compute the updated (U, V) pair associated with its output location, as illustrated by the concentric blue and red squares:
- The work-item will load the current (U, V) value associated with its assigned output location (red square) and all neighboring input values (blue square) from the input buffers.
- It will then perform computations based on these inputs that will eventually produce an updated (U, V) value, which will be written down to the matching location of the output buffers.
This results in the following GLSL code…
#version 460
#include "common.comp"
// Simulation step entry point
void main() {
// Map work items into 2D central region, discard out-of-bounds work items
const uvec2 pos = uvec2(gl_GlobalInvocationID.xy) + DATA_START_POS;
if (any(greaterThanEqual(pos, data_end_pos()))) {
return;
}
// Load central value
const vec2 uv = read(pos);
// Compute the diffusion gradient for U and V
const uvec2 topleft = pos - uvec2(1);
const mat3 weights = stencil_weights();
vec2 full_uv = vec2(0.0);
for (int y = 0; y < 3; ++y) {
for (int x = 0; x < 3; ++x) {
const vec2 stencil_uv = read(topleft + uvec2(x, y));
full_uv += weights[x][y] * (stencil_uv - uv);
}
}
// Deduce the change in U and V concentration
const float u = uv.x;
const float v = uv.y;
const float uv_square = u * v * v;
const vec2 delta_uv = diffusion_rate() * full_uv + vec2(
FEED_RATE * (1.0 - u) - uv_square,
uv_square - (FEED_RATE + KILL_RATE) * v
);
write(pos, uv + delta_uv * DELTA_T);
}
…which should be saved at location exercises/src/grayscott/step.comp.
SPIR-V interface
Now that the GLSL is taken care of, it is time to work on the Rust side. Inside
of exercises/src/grayscott/pipeline.rs, let’s ask vulkano to build the
SPIR-V shader modules and create some Rust-side constants mirroring the GLSL
specialization constants as we did before…
/// Shader modules used for the compute pipelines
mod shader {
vulkano_shaders::shader! {
shaders: {
init: {
ty: "compute",
path: "src/grayscott/init.comp"
},
step: {
ty: "compute",
path: "src/grayscott/step.comp"
},
}
}
}
/// Descriptor set that is used to bind input and output buffers to the shader
pub const INOUT_SET: u32 = 0;
// Descriptor array bindings within INOUT_SET, in (U, V) order
pub const IN: u32 = 0;
pub const OUT: u32 = 1;
// === Specialization constants ===
//
// Workgroup size
const WORKGROUP_SIZE_X: u32 = 0;
const WORKGROUP_SIZE_Y: u32 = 1;
//
/// Concentration table width
const UV_WIDTH: u32 = 2;
//
// Scalar simulation parameters
const FEED_RATE: u32 = 3;
const KILL_RATE: u32 = 4;
const DELTA_T: u32 = 5;
//
/// Start of 3x3 Laplacian stencil
const STENCIL_WEIGHT_START: u32 = 6;
//
// Diffusion rates of U and V
const DIFFUSION_RATE_U: u32 = 15;
const DIFFUSION_RATE_V: u32 = 16;
…which will save us from the pain of figuring out magic numbers in the code later on.
Notice that in the code above, we use a variation of the default
vulkano_shaders syntax, which allows us to build multiple shaders at once.
This makes some things more convenient, for example auto-generated Rust structs
will be deduplicated, and it is possible to set some vulkano_shaders options
once for all the shaders that we are compiling.
Specialization
As in the CPU version, our GPU compute pipeline will be tunable via a set of CLI options:
- We will retain the UpdateOptions used to tune the CPU simulation, which also apply here.
- We will supplement these with a pair of options that control the GPU workgroup size, by creating a new PipelineOptions struct that encompasses everything.
- We will modify the global RunnerOptions to feature PipelineOptions instead of UpdateOptions, so that all CLI options are available in the final program.
Overall, this results in the following code changes:
// Add pipeline options to pipeline.rs...
use super::options::UpdateOptions;
use clap::Args;
use std::num::NonZeroU32;
/// CLI parameters that guide pipeline creation
#[derive(Debug, Args)]
pub struct PipelineOptions {
/// Number of rows in a workgroup
#[arg(short = 'R', long, default_value = "8")]
pub workgroup_rows: NonZeroU32,
/// Number of columns in a workgroup
#[arg(short = 'C', long, default_value = "8")]
pub workgroup_cols: NonZeroU32,
/// Options controlling simulation updates
#[command(flatten)]
pub update: UpdateOptions,
}
// ...then integrate these options into the RunnerOptions of options.rs
use super::pipeline::PipelineOptions;
#[derive(Debug, Parser)]
#[command(version)]
pub struct RunnerOptions {
// [ ... taking the place of UpdateOptions ... ]
/// Options controlling the simulation pipeline
#[command(flatten)]
pub pipeline: PipelineOptions,
}
Once this is done, the RunnerOptions will contain all the information we need
to specialize our GPU shader modules, which we will do using the following
function:
// Back to pipeline.rs
use super::options::{self, RunnerOptions, STENCIL_WEIGHTS};
use crate::Result;
use std::sync::Arc;
use vulkano::{
shader::{ShaderModule, SpecializationConstant, SpecializedShaderModule},
};
/// Set up a specialized shader module with a certain workgroup size
fn setup_shader_module(
options: &RunnerOptions,
module: Arc<ShaderModule>,
) -> Result<Arc<SpecializedShaderModule>> {
// Set specialization constants. We'll be less careful this time because
// there are so many of them in this kernel
let mut constants = module.specialization_constants().clone();
assert_eq!(
constants.len(),
17,
"unexpected amount of specialization constants"
);
use SpecializationConstant::{F32, U32};
//
let pipeline = &options.pipeline;
*constants.get_mut(&WORKGROUP_SIZE_X).unwrap() = U32(pipeline.workgroup_cols.get());
*constants.get_mut(&WORKGROUP_SIZE_Y).unwrap() = U32(pipeline.workgroup_rows.get());
//
*constants.get_mut(&UV_WIDTH).unwrap() = U32(options.num_cols as _);
//
let update = &pipeline.update;
*constants.get_mut(&FEED_RATE).unwrap() = F32(update.feedrate);
*constants.get_mut(&KILL_RATE).unwrap() = F32(update.killrate);
*constants.get_mut(&DELTA_T).unwrap() = F32(update.deltat);
//
for (offset, weight) in STENCIL_WEIGHTS.into_iter().flatten().enumerate() {
*constants
.get_mut(&(STENCIL_WEIGHT_START + offset as u32))
.unwrap() = F32(weight);
}
//
*constants.get_mut(&DIFFUSION_RATE_U).unwrap() = F32(options::DIFFUSION_RATE_U);
*constants.get_mut(&DIFFUSION_RATE_V).unwrap() = F32(options::DIFFUSION_RATE_V);
// Specialize the shader module accordingly
Ok(module.specialize(constants)?)
}
Careful readers of the square code will notice that the API design here is a
little different from what we had before. We used to load the shader module
inside of setup_shader_module(), whereas now we ask the caller to load the
shader module and pass it down.
The reason for this change is that now we have two different compute shaders to take care of (one initialization shader and one simulation shader), and we want to apply the same specialization constant setup to both of them. The aforementioned API design change lets us do that.
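For example, the caller side could look roughly like the following, assuming that the multi-shader vulkano_shaders invocation above generates one load_init()/load_step() function per shader, and that Device and the shader module types are in scope (the helper name is made up):
/// Hypothetical helper: load both shader modules, then specialize them the same way
fn setup_specialized_modules(
    options: &RunnerOptions,
    device: Arc<Device>,
) -> Result<(Arc<SpecializedShaderModule>, Arc<SpecializedShaderModule>)> {
    let init = setup_shader_module(options, shader::load_init(device.clone())?)?;
    let step = setup_shader_module(options, shader::load_step(device)?)?;
    Ok((init, step))
}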
Multiple shaders, single layout
As in the previous square example, our two compute shaders will each have a
single entry point. Which means that we can reuse the previous
setup_compute_stage() function:
/// Set up a compute stage from a previously specialized shader module
fn setup_compute_stage(module: Arc<SpecializedShaderModule>) -> PipelineShaderStageCreateInfo {
let entry_point = module
.single_entry_point()
.expect("a compute shader module should have a single entry point");
PipelineShaderStageCreateInfo::new(entry_point)
}
However, if we proceed to do everything else as in the square example, we will
end up with two compute pipeline layouts, which means that we will need two
versions of each resource descriptor set we create, one per compute pipeline.
This sounds a little ugly and wasteful, so we would like to get our two compute
pipelines to share a common pipeline layout.
Vulkan supports this by allowing a pipeline’s layout to describe a superset of
the resources that the pipeline actually uses. If we consider our current GPU
code in light of this rule, this means that a pipeline layout for the step
compute shader can also be used with the init compute shader, because the set
of resources that init uses (Output storage block) is a subset of the set of
resources that step uses (Input and Output storage blocks).
But from a software maintainability perspective, we would rather not hardcode
the assumption that the step pipeline will forever use a strict superset of
the resources used by all other GPU pipelines, as we might later want to e.g.
adjust the definition of init in a manner that uses resources that step
doesn’t need. Thankfully we do not have to go there.
The
PipelineDescriptorSetLayoutCreateInfo
convenience helper from vulkano that we have used earlier is not limited to
operating on a single GPU entry point. Its constructor actually accepts an
iterable set of
PipelineShaderStageCreateInfo,
and if you provide multiple ones it will attempt to produce a pipeline layout
that is compatible with all of the underlying entry points by computing the
union of their layout requirements.
Obviously, this layout requirements union computation will only work if the entry points do not have incompatible layout requirements (e.g. one declares that set 0, binding 0 maps into a buffer while the other declares that it is an image). But there is no risk of this happening to us here as both compute
shaders share the same CPU-GPU interface specification from common.comp. So we
can safely use this vulkano functionality as follows:
use vulkano::{
descriptor_set::layout::DescriptorType,
pipeline::layout::{PipelineDescriptorSetLayoutCreateInfo, PipelineLayout},
};
/// Set up the compute pipeline layout
fn setup_pipeline_layout<const N: usize>(
device: Arc<Device>,
stages: [&PipelineShaderStageCreateInfo; N],
) -> Result<Arc<PipelineLayout>> {
// Auto-generate a sensible pipeline layout config
let layout_info = PipelineDescriptorSetLayoutCreateInfo::from_stages(stages);
// Check that the pipeline layout meets our expectation
//
// Otherwise, the GLSL interface was likely changed without updating the
// corresponding CPU code, and we just avoided rather unpleasant debugging.
assert_eq!(
layout_info.set_layouts.len(),
1,
"this program should only use a single descriptor set"
);
let set_info = &layout_info.set_layouts[INOUT_SET as usize];
assert_eq!(
set_info.bindings.len(),
2,
"the only descriptor set should contain two bindings"
);
let input_info = set_info
.bindings
.get(&IN)
.expect("an input data binding should be present");
assert_eq!(
input_info.descriptor_type,
DescriptorType::StorageBuffer,
"the input data binding should be a storage buffer binding"
);
assert_eq!(
input_info.descriptor_count, 2,
"the input data binding should contain U and V data buffer descriptors"
);
let output_info = set_info
.bindings
.get(&OUT)
.expect("an output data binding should be present");
assert_eq!(
output_info.descriptor_type,
DescriptorType::StorageBuffer,
"the output data binding should be a storage buffer binding"
);
assert_eq!(
output_info.descriptor_count, 2,
"the output data binding should contain U and V data buffer descriptors"
);
assert!(
layout_info.push_constant_ranges.is_empty(),
"this program shouldn't be using push constants"
);
// Finish building the pipeline layout
let layout_info = layout_info.into_pipeline_layout_create_info(device.clone())?;
let layout = PipelineLayout::new(device, layout_info)?;
Ok(layout)
}
Exercise
We now have all the building blocks that we need in order to build Vulkan
compute pipelines for our data-initialization and simulation shaders, with a
shared layout that will later allow us to have common descriptor sets for all
pipelines. Time to put it all together into a single struct:
use crate::context::Context;
use vulkano::pipeline::compute::ComputePipeline;
/// Initialization and simulation pipelines with common layout information
#[derive(Clone)]
pub struct Pipelines {
/// Compute pipeline used to initialize the concentration tables
pub init: Arc<ComputePipeline>,
/// Compute pipeline used to perform a simulation step
pub step: Arc<ComputePipeline>,
/// Pipeline layout shared by `init` and `step`
pub layout: Arc<PipelineLayout>,
}
//
impl Pipelines {
/// Set up all the compute pipelines
pub fn new(options: &RunnerOptions, context: &Context) -> Result<Self> {
// TODO: Implement this constructor using the functions defined above
todo!()
}
}
Your goal for this chapter’s exercise will be to take inspiration from the
equivalent
struct
in the number-squaring pipeline that we studied earlier, and use this
inspiration to implement the new() constructor for our new set of compute
pipelines.
Then you will create (for now unused) Pipelines at the start of the
run_simulation() body (in exercises/src/grayscott/mod.rs). And after that
you will make sure that debug builds of said binary execute without any error
or unexpected warning from the Vulkan validation layers.
Finally, if you are a more experienced Rust developer and want to practice your
generics a bit, you may also try deduplicating the logic associated with the
init and step entry points inside of the Pipelines::new() constructor.
-
Being aware of this major shortcoming of traditional CPU programming, GPUs also support multi-dimensional image resources backed by specialized texturing hardware, which should provide better performance than 1D buffer indexing code that emulates 2D indexing. So the author of this course tried to use these… and experienced great disappointment. Ask for the full story. ↩
Data & I/O
As before, after setting up our GPU compute pipelines, we will want to set up some data buffers that we can bind to those pipelines.
This process will be quite a bit simpler than before because we will not repeat the introduction to Vulkan memory management and will be using GPU-side initialization. So we will use the resulting savings in character budget to…
- Show what it takes to integrate GPU data into our existing CPU simulation skeleton.
- Follow the suggestion made in the number-squaring chapter to avoid having a single build_command_buffer() god-function that does all command buffer building.
- Adjust our HDF5 I/O logic so that we do not need to download U concentration from the GPU.
GPU dataset
New code organization
The point of Vulkan descriptor sets is to let your application bind many resources at once; the goal is therefore to use as few of them as possible in order to reduce resource binding overhead. In the context of our Gray-Scott simulation, the lowest we can easily1 achieve is to have two descriptor sets.
- One that uses two buffers (let’s call them U1 and V1) as inputs and two other buffers (let’s call them U2 and V2) as outputs.
- Another that uses the same buffers, but flips the roles of input and output buffers. Using the above notation, U2 and V2 become the inputs, while U1 and V1 become the outputs.
Given this descriptor set usage scheme, our command buffers will alternately bind these two descriptor sets, executing the simulation compute pipeline after each descriptor set binding call. This will roughly replicate the double buffering pattern that we used on the CPU.
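To make this more concrete, here is a minimal, Vulkan-free sketch of the role-flipping scheme that we are about to implement. All names here are made up for illustration, and the “simulation step” is just an increment:
fn main() {
    // Stand-ins for the (U1, V1, U2, V2) storage buffers
    let mut storage = [vec![1.0_f32; 4], vec![1.0; 4], vec![0.0; 4], vec![0.0; 4]];
    // (input indices, output indices) for the "forward" and "reverse" configurations
    let configs = [([0_usize, 1], [2_usize, 3]), ([2, 3], [0, 1])];
    let mut reversed = false;
    for _step in 0..4 {
        let (inputs, outputs) = configs[reversed as usize];
        // Fake "simulation step": each output buffer becomes its input + 1
        for (&i, &o) in inputs.iter().zip(&outputs) {
            let updated: Vec<f32> = storage[i].iter().map(|x| x + 1.0).collect();
            storage[o] = updated;
        }
        reversed = !reversed; // the outputs of this step become the inputs of the next
    }
    println!("{storage:?}");
}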
To get there, however, we will need to redesign our inner data abstractions a bit with respect to what we used to have on the CPU side. Indeed, back in the CPU course, we used to have the following separation of concerns in our code:
- One struct called UV would represent a pair of tables of identical size and related contents, one representing the chemical concentration of species U and one representing the chemical concentration of species V.
- Another struct called Concentrations would represent a pair of UV structs and implement the double buffering logic, alternately using one of these UV structs to store inputs and the other to store outputs.
But now that we have descriptor sets that combine inputs and outputs, this program decomposition scheme doesn’t work anymore. Which is why we will have to switch to a different scheme:
- One struct called InOut will contain and manage all vulkano objects associated with one (U, V) input pair and one (U, V) output pair.
- struct Concentrations will remain around, but will be repurposed to manipulate pairs of InOut rather than pairs of UV. And users of its update() function will now only be exposed to a single DescriptorSet, instead of being exposed to a pair of UVs as in the CPU code.
Introducing InOut
Our new InOut data structure is going to look like this:
use std::sync::Arc;
use vulkano::{buffer::subbuffer::Subbuffer, descriptor_set::DescriptorSet};
/// Set of GPU inputs and outputs
struct InOut {
/// Descriptor set used by GPU compute pipelines
descriptor_set: Arc<DescriptorSet>,
/// Input buffer for the V species, used during GPU-to-CPU data transfers
input_v: Subbuffer<[Float]>,
}
As the comments point out, we are going to keep both a full input/output
descriptor set and a V input buffer around, because they are useful for
different tasks:
- Compute pipeline execution commands operate over descriptor sets
- Buffer-to-buffer data transfer commands operate over the underlying Subbuffer objects
- Because descriptor sets are a very general-purpose abstraction, going from a descriptor set to the underlying buffer objects is a rather cumbersome process.
- And because Subbuffer is just a reference-counted pointer, it does not cost much performance to skip that cumbersome process by keeping around a V buffer reference (see the small illustration after this list).
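The situation is similar to cloning a standard Arc, as in this tiny, unrelated illustration: copying the handle only touches a reference count, the underlying data is never duplicated.
use std::sync::Arc;

fn main() {
    let big_data = Arc::new(vec![0.0_f32; 1_000_000]);
    // Cheap: this only bumps a reference count, the million floats are not copied
    let second_handle = Arc::clone(&big_data);
    assert_eq!(Arc::strong_count(&big_data), 2);
    assert!(std::ptr::eq(big_data.as_ptr(), second_handle.as_ptr()));
}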
Notice that we do not keep around the Subbuffer associated with the U species’
concentration, because we do not actually need it. We will get back to this.
For now, let us look at how an InOut is constructed:
use super::{
options::RunnerOptions,
pipeline::{IN, INOUT_SET, OUT},
};
use crate::{context::Context, Result};
use vulkano::{
buffer::{Buffer, BufferCreateInfo, BufferUsage},
descriptor_set::WriteDescriptorSet,
memory::allocator::AllocationCreateInfo,
pipeline::layout::PipelineLayout,
DeviceSize,
};
/// Number of padding elements per side of the simulation domain
const PADDING_PER_SIDE: usize = 1;
/// Compute the padded version of a simulation dataset dimension (rows/cols)
fn padded(dimension: usize) -> usize {
dimension + 2 * PADDING_PER_SIDE
}
impl InOut {
/// Allocate a set of 4 buffers that can be used to store either U and V
/// species concentrations, and can serve as inputs or outputs
fn allocate_buffers(
options: &RunnerOptions,
context: &Context,
) -> Result<[Subbuffer<[Float]>; 4]> {
use BufferUsage as BU;
let padded_rows = padded(options.num_rows);
let padded_cols = padded(options.num_cols);
let new_buffer = || {
Buffer::new_slice(
context.mem_allocator.clone(),
BufferCreateInfo {
usage: BU::STORAGE_BUFFER | BU::TRANSFER_DST | BU::TRANSFER_SRC,
..Default::default()
},
AllocationCreateInfo::default(),
(padded_rows * padded_cols) as DeviceSize,
)
};
Ok([new_buffer()?, new_buffer()?, new_buffer()?, new_buffer()?])
}
/// Set up an `InOut` configuration by assigning roles to the 4 buffers that
/// [`allocate_buffers()`](Self::allocate_buffers) previously allocated
fn new(
context: &Context,
layout: &PipelineLayout,
input_u: Subbuffer<[Float]>,
input_v: Subbuffer<[Float]>,
output_u: Subbuffer<[Float]>,
output_v: Subbuffer<[Float]>,
) -> Result<Self> {
// Determine how the descriptor set will bind to the compute pipeline
let set_layout = layout.set_layouts()[INOUT_SET as usize].clone();
// Configure what resources will attach to the various bindings
// that the descriptor set is composed of
let descriptor_writes = [
WriteDescriptorSet::buffer_array(IN, 0, [input_u.clone(), input_v.clone()]),
WriteDescriptorSet::buffer_array(OUT, 0, [output_u.clone(), output_v.clone()]),
];
// Set up the descriptor set according to the above configuration
let descriptor_set = DescriptorSet::new(
context.desc_allocator.clone(),
set_layout,
descriptor_writes,
[],
)?;
// Also keep track of the V input buffer, and we're done
Ok(Self {
descriptor_set,
input_v,
})
}
}
The general idea here is that because our two InOuts will refer to the same
buffers, we cannot allocate the buffers internally inside of the InOut::new()
constructor. Instead we will need to allocate buffers inside of the code from
Concentrations that builds InOuts, then use the same buffers twice in a
different order to build the two different InOuts.
It is not so nice from an abstraction design point of view that the caller needs
to know about such a thing as the right order in which buffers should be passed.
But sadly this cannot be cleanly fixed at the InOut layer, so we will fix it
at the Concentrations layer instead.
Updating Concentrations
In the CPU simulation, the Concentrations struct used to…
- Contain a pair of UV values and a boolean that clarified their input/output role
- Offload most initialization work to the lower UV layer
- Expose an update() method whose user callback received both an immutable input (&UV) and a mutable output (&mut UV)
For the GPU simulation, as discussed earlier, we will switch to a different architecture:
- Concentrations will now contain InOuts instead of UVs
- InOut initialization will now be handled by the Concentrations layer, as it is the one that has easy access to the output buffers of each InOut
- Initialization will now be asynchronous, as it entails some Vulkan commands that must be enqueued inside of a command buffer
- The update() method will only receive a single DescriptorSet, as this contains all info needed to read inputs and write outputs
The switch to InOut is straightforward enough, and probably not worth
discussing…
/// Double-buffered chemical species concentration storage
pub struct Concentrations {
/// Compute pipeline input/output configurations
///
/// If we denote `(U1, V1, U2, V2)` the underlying storage buffers...
/// - The first "forward" configuration uses `(U1, V1)` as inputs and
/// `(U2, V2)` as outputs.
/// - The second "reverse" configuration uses `(U2, V2)` as inputs and
/// `(U1, V1)` as outputs.
///
/// By alternating between these two configurations, we can take as many
/// simulation steps as we need to, always using the output of the
/// simulation step N as the input of simulation step N+1.
inout_sets: [InOut; 2],
/// Truth that the second "reverse" input/output configuration is active
reversed: bool,
}
…however the constructor change will be quite a bit more substantial:
use super::{
pipeline::Pipelines,
CommandBufferBuilder,
};
use std::num::NonZeroU32;
use vulkano::pipeline::PipelineBindPoint;
impl Concentrations {
/// Set up GPU data storage and schedule GPU buffer initialization
///
/// GPU buffers will only be initialized after the command buffer associated
/// with `cmdbuild` has been built and submitted for execution. Any work
/// that depends on their initial value must be scheduled afterwards.
pub fn create_and_schedule_init(
options: &RunnerOptions,
context: &Context,
pipelines: &Pipelines,
cmdbuild: &mut CommandBufferBuilder,
) -> Result<Self> {
// Allocate all GPU storage buffers used by the simulation
let [u1, v1, u2, v2] = InOut::allocate_buffers(options, context)?;
// Set up input/output configurations
let inout1 = InOut::new(
context,
&pipelines.layout,
u1.clone(),
v1.clone(),
u2.clone(),
v2.clone(),
)?;
let inout2 = InOut::new(context, &pipelines.layout, u2.clone(), v2.clone(), u1, v1)?;
// Schedule initialization using the second descriptor set. The output
// buffers of this descriptor set are the input buffers of the first
// descriptor set, which will be used first.
cmdbuild.bind_pipeline_compute(pipelines.init.clone())?;
cmdbuild.bind_descriptor_sets(
PipelineBindPoint::Compute,
pipelines.layout.clone(),
INOUT_SET,
inout2.descriptor_set.clone(),
)?;
let num_workgroups = |domain_size: usize, workgroup_size: NonZeroU32| {
padded(domain_size).div_ceil(workgroup_size.get() as usize) as u32
};
let padded_workgroups = [
num_workgroups(options.num_cols, options.pipeline.workgroup_cols),
num_workgroups(options.num_rows, options.pipeline.workgroup_rows),
1,
];
// SAFETY: GPU shader has been checked for absence of undefined behavior
// given a correct execution configuration, and this is one
unsafe {
cmdbuild.dispatch(padded_workgroups)?;
}
// Schedule zero-initialization of the edges of the first output.
//
// Only the edges need to be initialized. The values at the center of
// the dataset do not matter, as these buffers will serve as simulation
// outputs at least once (which will initialize their central values)
// before they serve as a simulation input.
//
// Here we initialize the entire buffer to zero, as the Vulkan
// implementation is likely to special-case this buffer-zeroing
// operation with a high-performance implementation.
cmdbuild.fill_buffer(u2.reinterpret(), 0)?;
cmdbuild.fill_buffer(v2.reinterpret(), 0)?;
// Once the command buffer is executed, everything will be ready
Ok(Self {
inout_sets: [inout1, inout2],
reversed: false,
})
}
// [ ... more methods coming up ... ]
}
As you can see, it is now the Concentrations constructor that is responsible for allocating storage for the underlying InOut structs and assigning input and output roles to the resulting buffers.
The initialization process also becomes a bit more complex:
- The true simulation input is initialized using the init compute pipeline introduced earlier.
- The other set of (U, V) buffers must now be initialized as well, because we now use a zero-padding scheme to handle simulation domain edges. We do this by filling these buffers with zeroes, which is overkill but likely to be optimized in hardware.
- Both of these operations are asynchronous Vulkan commands, so usage of Concentrations becomes a bit more complex: it now builds a command buffer that must be submitted to the GPU and executed. We acknowledge this by switching from the standard new() constructor naming to the more explicit create_and_schedule_init() name, which highlights what the user of this function needs to do.
When it comes to accessors, shape() will be dropped as it cannot be easily
provided by our 1D GPU storage without keeping otherwise unnecessary metadata
around. But the current() accessor will trivially be migrated to the new
logic. And for reasons that will become clear later on, it can also become a
private implementation detail of the underlying data module.
impl Concentrations {
// [ ... ]
/// Current input/output configuration
fn current_inout(&self) -> &InOut {
&self.inout_sets[self.reversed as usize]
}
// [ ... ]
}
On its side, the update() operation will be easily migrated to the new logic
discussed above as well, as it is largely a simplification with respect to its
former implementation:
impl Concentrations {
// [ ... ]
/// Run a simulation step
///
/// The `step` callback will be provided with the descriptor set that should
/// be used for the next simulation step. If you need to carry out multiple
/// simulation steps, you should call `update()` once per simulation step.
pub fn update(&mut self, step: impl FnOnce(Arc<DescriptorSet>) -> Result<()>) -> Result<()> {
step(self.current_inout().descriptor_set.clone())?;
self.reversed = !self.reversed;
Ok(())
}
}
There is just one new thing that we will need for GPU computing, which is the
ability to report errors from GPU programs. This is handled by making the inner
step callback return a Result<()>.
Output retrieval & storage
While InOut and Concentrations are enough for the purpose of setting up the
simulation and running simulation steps, we are going to need one more thing for
the purpose of retrieving GPU output on the CPU side. Namely a Vulkan buffer
that the CPU can access.
We could adapt the old UV struct for this purpose, but if you pay attention to how
the simulation output is actually used, you will notice that the io module
only writes the V species’ concentration to the HDF5 file. And while passing
an entire UV struct to this module anyway was fine when direct data access was
possible, it is becoming wasteful if we now need to perform an expensive
GPU-to-CPU transfer of the full (U, V) dataset only to use the V part
exclusively later on.
Therefore, our new VBuffer abstraction will focus on retrieval of the V species’
concentration only.
The construction code is quite similar to the one seen before in
InOut::allocate_buffers() (and in fact could be deduplicated with respect to
it in a more production-grade codebase). The only thing that changed is that the
BufferUsage
and
AllocationCreateInfo
have been adjusted to make this buffer fit for the purpose of downloading data
to the CPU:
/// CPU-accessible storage buffer used to download the V species' concentration
pub struct VBuffer {
/// Buffer in which GPU data will be downloaded
buffer: Subbuffer<[Float]>,
/// Number of columns in the 2D concentration table, including zero padding
padded_cols: usize,
}
//
impl VBuffer {
/// Set up a `VBuffer`
pub fn new(options: &RunnerOptions, context: &Context) -> Result<Self> {
use vulkano::memory::allocator::MemoryTypeFilter as MTFilter;
let padded_rows = padded(options.num_rows);
let padded_cols = padded(options.num_cols);
let buffer = Buffer::new_slice(
context.mem_allocator.clone(),
BufferCreateInfo {
usage: BufferUsage::TRANSFER_DST,
..Default::default()
},
AllocationCreateInfo {
memory_type_filter: MTFilter::PREFER_HOST | MTFilter::HOST_RANDOM_ACCESS,
..Default::default()
},
(padded_rows * padded_cols) as DeviceSize,
)?;
Ok(Self {
buffer,
padded_cols,
})
}
// [ ... more methods coming up ... ]
}
After that, we can add a method to schedule a GPU-to-CPU data transfer…
use vulkano::command_buffer::CopyBufferInfo;
impl VBuffer {
// [ ... ]
/// Schedule a download of some [`Concentrations`]' current V input into
/// the internal CPU-accessible buffer of this `VBuffer`
///
/// The GPU-to-CPU download will only begin after the command buffer
/// associated with `cmdbuild` has been built and submitted to the GPU for
/// execution. You must wait for the associated GPU work to complete before
/// processing the output with the [`process()`](Self::process) method.
pub fn schedule_download(
&mut self,
source: &Concentrations,
cmdbuild: &mut CommandBufferBuilder,
) -> Result<()> {
cmdbuild.copy_buffer(CopyBufferInfo::buffers(
source.current_inout().input_v.clone(),
self.buffer.clone(),
))?;
Ok(())
}
// [ ... ]
}
…and there is just one last piece to take care of, which is to provide a way to access the inner data after the download is complete. Which will require a bit more work than you may expect.
To set the stage, let’s point out that we are trying to set up some communication between two Rust libraries with the following API designs:
- To avoid data races between the CPU and the GPU, vulkano enforces an RAII design where accesses to a Subbuffer must go through the Subbuffer::read() method. This method returns a BufferReadGuard that borrows from the underlying Subbuffer and lets vulkano know at destruction time that it is not being accessed by the CPU anymore. Under the hood, locks and checks are then used to achieve thread safety.
- We start from this BufferReadGuard, which borrows memory from the underlying Subbuffer storage like a standard Rust slice of type &[Float] could borrow from a Vec<Float>. And we want to add 2D layout information in order to turn it into an ndarray::ArrayView2<Float>, which is what the HDF5 binding that we are using ultimately expects.
Now, because the VBuffer type that we are building is logically a 2D array, it
would be good API design from our side to refrain from exposing the underlying
1D vulkano dataset in the VBuffer API and instead only provide users with the
ArrayView2 that they need for HDF5 I/O and other operations. While we are at
it, we would also rather not expose the zero padding elements to the user, as
they won’t be part of the final HDF5 file and are arguably an implementation
detail of our current Gray-Scott simulation implementation.
We can get all of those good things, as it turns out, but the simplest way for us to get there2 will be a somewhat weird callback-based interface:
use ndarray::prelude::*;
impl VBuffer {
// [ ... ]
/// Process the latest download of the V species' concentrations
///
/// Before calling this method, you will want to [schedule a
/// download](Self::schedule_download), submit the resulting command buffer,
/// and await its completion.
///
/// The provided V species concentration table will only contain active
/// elements, excluding zero-padding elements on the edge.
pub fn process(&self, callback: impl FnOnce(ArrayView2<Float>) -> Result<()>) -> Result<()> {
// Access the underlying dataset as a 1D slice
let read_guard = self.buffer.read()?;
// Create an ArrayView2 that covers the whole data, padding included
let padded_cols = self.padded_cols;
let padded_elements = read_guard.len();
assert_eq!(padded_elements % padded_cols, 0);
let padded_rows = padded_elements / padded_cols;
let padded_view = ArrayView::from_shape([padded_rows, padded_cols], &read_guard)?;
// Extract the central region of padded_view, excluding padding
let data_view = padded_view.slice(s!(
PADDING_PER_SIDE..(padded_rows - PADDING_PER_SIDE),
PADDING_PER_SIDE..(padded_cols - PADDING_PER_SIDE),
));
// We are now ready to run the user callback
callback(data_view)
}
}
The general idea here is that a user who wants to read the contents of the buffer will pass us a function (typically a lambda) that takes the current contents of the buffer (as an un-padded ArrayView2) and returns a Result<()> that tells whether the operation was successful.
On our side, we will then do everything needed to set up the two-dimensional array view, pass it to the user-specified callback function, and return the result.
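For instance, assuming a VBuffer named v_buffer whose GPU-to-CPU download has already completed, a caller could use this interface along the following lines (a usage sketch, not code that you need to copy):
// Hypothetical usage sketch: compute the total V concentration from the
// downloaded, un-padded 2D view provided by the callback
let mut total = 0.0;
v_buffer.process(|v: ArrayView2<Float>| {
    total = v.iter().sum::<Float>();
    Ok(())
})?;
println!("Total V concentration: {total}");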
HDF5 I/O refactor
As mentioned earlier, one last thing that should change with respect to our former CPU code is that we want our HDF5 I/O module to be clearer about what it wants.
Indeed, at present, HDF5Writer::write() demands a full set of (U, V) data of which it only uses the V concentration data. This was fine from a CPU programming perspective, where we don’t pay for exposing unused data access opportunities. But from a GPU programming perspective, it means downloading U concentration data that the HDF5 I/O module is not going to use.
We will fix this by making the HDF5Writer more explicit about what it wants,
and having it take the V species concentration only instead.
// In exercises/src/grayscott/io.rs
use ndarray::ArrayView2;
impl HDF5Writer {
// [ ... ]
/// Write a new V species concentration table to the file
pub fn write(&mut self, v: ArrayView2<Float>) -> hdf5::Result<()> {
// FIXME: Workaround for an HDF5 binding limitation
let v = v.to_owned();
self.dataset.write_slice(&v, (self.position, .., ..))?;
self.position += 1;
Ok(())
}
// [ ... ]
}
Notice the FIXME above. Apparently, the Rust HDF5 binding we are using does not yet handle ArrayView2s whose rows are not contiguous in memory, which means that we must create a contiguous copy of v before the binding will write it to a file.
From the author’s understanding of the HDF5 C API, it can handle this, and this is a Rust binding limitation that should be fixed. But until a fix happens, making an owned contiguous copy should be a reasonably efficient workaround, as for typical storage devices in-RAM copies are much faster than writing data to the target storage device.
As an alternative, we could also modify our GPU-to-CPU copy logic so that it does not copy the padding zero elements, saving a bit of CPU-GPU interconnect bandwidth along the way. However, this would require us to stop using standard Vulkan copy commands and use custom shaders for this purpose instead, which may in turn cause two issues:
- Performance may be worse, because the standard Vulkan copy command should have been well-optimized by the GPU vendor. Our shader would need to be optimized similarly for all GPU devices on which we want to perform well, which is a lot of work.
- We will very likely lose the ability to overlap GPU-to-CPU copies with computations, which we are not using yet but may want to use later as an optimization.
As always, tradeoffs are the name of the game in engineering… but as you will see later, this particular tradeoff is going to disappear once we introduce other optimizations anyway.
Exercise
In the data module of the Gray-Scott reaction simulation
(exercises/src/grayscott/data.rs), replace the UV and Concentrations
structs with the InOut, Concentrations and VBuffer types introduced in
this chapter.
After that is done, proceed to modify the io module of the simulation so that it works with borrowed V concentration data only, as discussed above.
You will find that the simulation does not compile at this point. This is expected, because the run_simulation() and update() functions of the simulation library have not been updated yet, and the CommandBufferBuilder type alias has not been defined yet either. We will fix that in the next chapter; for now just make sure that there is no compilation error originating from a mistake in data.rs or io.rs.
-
Without losing the benefits of GLSL’s readonly and writeonly qualifiers and introducing new Vulkan concepts like push constants, that is. ↩
-
It is possible to write a callback-free read() method that returns an object that behaves like an ArrayView2, but implementing it efficiently (without recreating the ArrayView2 on every access) involves building a type that is self-referential in the eyes of the Rust compiler’s lifetime analysis. Which means that some dirty unsafe tricks will be required. ↩
Integration
After a long journey, we are once again reaching the last mile where we almost have a complete Gray-Scott reaction simulation. In this chapter, we will proceed to walk this last mile and get everything working again, on GPU this time.
Command buffer building
As we go through the optimization chapters, we will want to build command buffers for submitting work to the GPU in a growing number of places. Because our needs are rather simple, we will always build them in the same way, which is worth extracting into a utility function and type alias:
use crate::{context::Context, Result};
use vulkano::command_buffer::{
AutoCommandBufferBuilder, CommandBufferUsage, PrimaryAutoCommandBuffer
};
/// Convenience type alias for primary command buffer builders
type CommandBufferBuilder = AutoCommandBufferBuilder<PrimaryAutoCommandBuffer>;
/// Set up a new command buffer builder
fn command_buffer_builder(context: &Context) -> Result<CommandBufferBuilder> {
let cmdbuild = CommandBufferBuilder::primary(
context.comm_allocator.clone(),
context.queue.queue_family_index(),
CommandBufferUsage::OneTimeSubmit,
)?;
Ok(cmdbuild)
}
The configuration encoded in this utility function sets up command buffers that are…
- Recorded using vulkano’s high-level and safe AutoCommandBufferBuilder API (hidden behind the CommandBufferBuilder type alias), which takes care of injecting safety-critical pipeline and memory barriers between commands for us.
- Primary command buffers, which can be submitted to the GPU as-is. This is in contrast to secondary buffers, which must be added to a primary buffer before submission.
- Private to a particular Vulkan queue. This is a straightforward decision right now as we only use a single Vulkan queue, but we will need to revisit this later on.
- Meant for one-time use only. Per GPU vendor documentation, this is the recommended default for programs that are not bottlenecked by command buffer recording, which is our case. Reusable command buffers only save command buffer recording time at the expense of increasing GPU driver overhead and/or reducing command buffer execution efficiency on the GPU side, and that is not the right tradeoff for us.
Simulation commands
In the previous chapter, we have attempted to increase separation of concerns across the simulation codebase so that one function is not responsible for all command buffer manipulation work.
Thanks to this work, we can have a simulation scheduling function that is
conceptually simpler than the build_command_buffer() function we used to have
in our number-squaring program:
use self::{
data::Concentrations,
options::RunnerOptions,
pipeline::{Pipelines, INOUT_SET},
};
use std::num::NonZeroU32;
use vulkano::pipeline::PipelineBindPoint;
/// Record the commands needed to run a bunch of simulation iterations
fn schedule_simulation(
options: &RunnerOptions,
pipelines: &Pipelines,
concentrations: &mut Concentrations,
cmdbuild: &mut CommandBufferBuilder,
) -> Result<()> {
// Determine the appropriate dispatch size for the simulation
let dispatch_size = |domain_size: usize, workgroup_size: NonZeroU32| {
domain_size.div_ceil(workgroup_size.get() as usize) as u32
};
let simulate_workgroups = [
dispatch_size(options.num_cols, options.pipeline.workgroup_cols),
dispatch_size(options.num_rows, options.pipeline.workgroup_rows),
1,
];
// Schedule the requested number of simulation steps
cmdbuild.bind_pipeline_compute(pipelines.step.clone())?;
for _ in 0..options.steps_per_image {
concentrations.update(|inout_set| {
cmdbuild.bind_descriptor_sets(
PipelineBindPoint::Compute,
pipelines.layout.clone(),
INOUT_SET,
inout_set,
)?;
// SAFETY: GPU shader has been checked for absence of undefined behavior
// given a correct execution configuration, and this is one
unsafe {
cmdbuild.dispatch(simulate_workgroups)?;
}
Ok(())
})?;
}
Ok(())
}
There are a few things worth pointing out here:
- Unlike our former build_command_buffer() function, this function does not build its own command buffer, but only adds extra commands to an existing, caller-allocated command buffer. This will allow us to handle data initialization more elegantly later.
- We are computing the compute pipeline dispatch size on each run of this function, which depending on compiler optimizations may or may not result in redundant work. The overhead of this work should be so small compared to everything else in this function that we do not expect this small inefficiency to matter. But we will check this when the time comes to profile our program’s CPU utilization.
- We are enqueuing an unbounded number of commands into our command buffer here, and the GPU will not start executing work until we are done building and submitting the associated command buffer. As we will later see in this course’s optimization section, this can become a problem in unusual execution configurations where thousands of simulation steps occur between each generated image. The way to fix this problem will be discussed in the corresponding course chapter, after taking care of higher-priority optimizations.
Output processing
In the CPU simulation, the top-level run_simulation() function would
unconditionally accept a process_v callback that receives the V species’
concentration as an &Array2<Float> and saves it to disk. We should change this
in the GPU version for two different reasons:
- Downloading the V species’ concentration from the GPU side to the CPU side can
be expensive. By allowing the caller not to do so, we can have more focused
microbenchmarks that measure our simulation’s performance in a finer-grained
way:
- One “compute” benchmark will measure the raw speed at which we perform simulation steps, taking GPU-to-CPU downloads out of the equation.
- One “compute+download” benchmark will download GPU outputs to the CPU side, without using them. By comparing the performance of this benchmark to that of the “compute” benchmark, we will see how efficiently we handle GPU-to-CPU downloads.
- One “compute+download+sum” benchmark will download GPU outputs to the CPU side and use them by computing their sum on the CPU side. By comparing the performance of this benchmark to that of the “compute+download” benchmark, we will see how well we can overlap GPU and CPU work through asynchronous GPU execution.
- …and finally it will remain possible to use the simulate binary to study the simulation’s HDF5 I/O performance on a particular machine.
- Due to the existence of padding zeroes and peculiarities of the Subbuffer API from vulkano, VBuffer::process() is unable to provide the V species concentration as an &Array2<Float> reference to an owned N-dimensional array. It must instead provide an ArrayView2<Float> over its internal CPU-accessible dataset.
Taking all this together, the run_simulation() function signature should be
changed as follows…
use self::data::Float;
use ndarray::ArrayView2;
/// Simulation runner, with a user-specified output processing function
pub fn run_simulation<ProcessV: FnMut(ArrayView2<Float>) -> Result<()>>(
options: &RunnerOptions,
context: &Context,
process_v: Option<ProcessV>,
) -> Result<()> {
// [ ... simulation logic will go here ... ]
}
…but if you have some previous Rust experience, that Option<ProcessV> function parameter, which refers to a generic type parameter in an optional manner, will make you uneasy.
Indeed, such function signatures have a nasty tendency to cause type inference
problems, because when we set the process_v parameter to None on the caller
side…
// This is an example of run_simulation() call site that you need not copy
use grayscott_exercises::run_simulation;
// The Rust compiler will reject this call as there is no way to infer ProcessV
run_simulation(options, context, None)?;
…the compiler is provided with no information to guess what the ProcessV
generic type might be and will error out as a result.
In an ideal world, we could just resolve this by giving the ProcessV parameter
of the run_simulation() function a default value. But we are not living in
this ideal world, and Rust does not allow functions to have default type
parameters yet. It has been attempted
before, but the infrastructure
was not ready at the time, so the unstable feature has been removed for now.
Failing that, one workaround we can use today is to define a type alias for the default type parameter that we would like to have…
/// Dummy `ProcessV` type, to be used when you do not specify a `process_v` hook
/// as an input to `run_simulation()`
pub type DummyProcessV = fn(ArrayView2<Float>) -> Result<()>;
…and advise callers to use this type alias as follows when needed:
// This is another example of run_simulation() call site that you need not copy
use grayscott_exercises::{DummyProcessV, run_simulation};
// This ProcessV type inference hint will make the Rust compiler happy
run_simulation::<DummyProcessV>(options, context, None)?;
There are other workarounds for this annoying language/compiler limitation, such as using dynamic dispatch instead of static dispatch, and those come with different tradeoffs. But for the purpose of this course, this particular workaround will be good enough.
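For the curious, the dynamic dispatch alternative mentioned above could look something like the following sketch. This is not the signature that this course uses, just an illustration of the tradeoff: there is no generic parameter left to infer, so callers can pass None without any annotation, at the cost of an indirect call whenever the callback is invoked.
/// Hypothetical dynamic-dispatch variant of the simulation runner entry point
pub fn run_simulation_dyn(
    options: &RunnerOptions,
    context: &Context,
    process_v: Option<&mut dyn FnMut(ArrayView2<Float>) -> Result<()>>,
) -> Result<()> {
    // [ ... the same simulation logic as run_simulation() would go here ... ]
}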
Simulation runner
Now that the signature of our run_simulation() function is fixed, it is time
to ask the question: what code should we put inside of it?
Because GPU code is more complex than CPU code, doing everything inside of the body of run_simulation() as we did in the CPU course would result in a function that is rather complex, and that would become inscrutable after a few optimizations. Therefore, we will extract most of the logic into a SimulationRunner struct that provides the following methods:
- A new() constructor sets up everything needed to run simulation steps
- A schedule_next_output() method prepares GPU commands to produce one output image
- A process_output() method is called after the work scheduled by schedule_next_output() is done executing, and handles CPU-side post-processing such as saving output data to disk
This will leave the top-level run_simulation() function focused on high-level
simulation steering logic, thus making the simulation code easier to understand
overall.
Definition
We will begin by centralizing all state needed to run the simulation
into a single struct:
use self::data::VBuffer;
/// State of the simulation
struct SimulationRunner<'run_simulation, ProcessV> {
/// Configuration that was passed to [`run_simulation()`]
options: &'run_simulation RunnerOptions,
/// Vulkan context that was passed to [`run_simulation()`]
context: &'run_simulation Context,
/// Compute pipelines used to perform simulation steps
pipelines: Pipelines,
/// Chemical concentration storage
concentrations: Concentrations,
/// Output processing logic, if enabled
output_handler: Option<OutputHandler<ProcessV>>,
/// Next command buffer to be executed
cmdbuild: CommandBufferBuilder,
}
//
/// State associated with output downloads and post-processing
struct OutputHandler<ProcessV> {
/// CPU-accessible location to which GPU outputs should be downloaded
v_buffer: VBuffer,
/// User-defined post-processing logic for this CPU data
process_v: ProcessV,
}
While largely straightforward, this pair of struct definitions uses a couple of Rust type system features that have not been presented in this course yet:
- A struct is allowed to contain references to external state, but these references must be associated with lifetime parameters, whose name starts with a single quote. Here the two references come from parameters of the run_simulation() function, and are thus associated with a single lifetime called 'run_simulation.1
- Generic Rust types do not need to specify all their trait bounds upfront. They can introduce a type parameter without any trait bound, and narrow down required trait bounds where needed later on. This is a pretty useful trick in order to avoid cluttering generic Rust code with lots of repeated trait bounds, and here we use it to avoid stating that ProcessV must be a function with a certain signature over and over again (see the small illustration after this list).
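Here is a tiny standalone illustration of that second trick, unrelated to the simulation code: the struct declares its type parameter without any bound, and the bound only appears on the impl block that actually needs to call the stored function.
/// A generic struct whose `F` parameter carries no trait bound at all
struct Repeater<F> {
    operation: F,
    times: usize,
}

// The `FnMut()` bound is only stated where it is actually needed
impl<F: FnMut()> Repeater<F> {
    fn run(&mut self) {
        for _ in 0..self.times {
            (self.operation)();
        }
    }
}

fn main() {
    let mut count = 0;
    let mut repeater = Repeater { operation: || count += 1, times: 3 };
    repeater.run();
    drop(repeater); // release the closure's borrow of `count`
    assert_eq!(count, 3);
}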
Initialization
A SimulationRunner is initialized by receiving all parameters to the
run_simulation() top level function and building all internal objects out of
them…
impl<'run_simulation, ProcessV> SimulationRunner<'run_simulation, ProcessV>
where
ProcessV: FnMut(ArrayView2<Float>) -> Result<()>,
{
/// Set up the simulation
fn new(
options: &'run_simulation RunnerOptions,
context: &'run_simulation Context,
process_v: Option<ProcessV>,
) -> Result<Self> {
// Set up the compute pipelines
let pipelines = Pipelines::new(options, context)?;
// Set up the initial command buffer builder
let mut cmdbuild = command_buffer_builder(context)?;
// Set up chemical concentrations storage and schedule its initialization
let concentrations =
Concentrations::create_and_schedule_init(options, context, &pipelines, &mut cmdbuild)?;
// Set up the logic for post-processing V concentration, if enabled
let output_handler = if let Some(process_v) = process_v {
Some(OutputHandler {
v_buffer: VBuffer::new(options, context)?,
process_v,
})
} else {
None
};
// We're now ready to perform simulation steps
Ok(Self {
options,
context,
pipelines,
concentrations,
output_handler,
cmdbuild,
})
}
// [ ... more methods coming up ... ]
}
…which, if you have been following the previous chapters, should not be terribly surprising. The main points of interest here are that…
- The internal command buffer builder initially contains the commands needed to initialize the chemical concentration storage, which have not been executed yet.
- An internal OutputHandler and its associated VBuffer are only set up if the user expressed interest in processing the output of the simulation. Otherwise, the internal output_handler member will forever remain None, which will disable GPU-to-CPU downloads and output post-processing in the rest of SimulationRunner.
Command buffer building
Now that the simulation has been set up, we are ready to start producing concentration images. Because GPU command execution is asynchronous, this will be a three-step process:
- Collect GPU commands into a command buffer.
- Submit the command buffer to the GPU and await its execution.
- Process the results on the CPU side if needed.
The schedule_next_output() method of SimulationRunner will implement the
first of these three steps in the following way:
use std::sync::Arc;
impl<'run_simulation, ProcessV> SimulationRunner<'run_simulation, ProcessV>
where
ProcessV: FnMut(ArrayView2<Float>) -> Result<()>,
{
// [ ... ]
/// Build a command buffer that will produce the next simulation output
fn schedule_next_output(&mut self) -> Result<Arc<PrimaryAutoCommandBuffer>> {
// Schedule a number of simulation steps
schedule_simulation(
self.options,
&self.pipelines,
&mut self.concentrations,
&mut self.cmdbuild,
)?;
// Schedule a download of the resulting V concentration, if enabled
if let Some(handler) = &mut self.output_handler {
handler
.v_buffer
.schedule_download(&self.concentrations, &mut self.cmdbuild)?;
}
// Extract the old command buffer builder, replacing it with a blank one
let old_cmdbuild =
std::mem::replace(&mut self.cmdbuild, command_buffer_builder(self.context)?);
// Build the command buffer
Ok(old_cmdbuild.build()?)
}
// [ ... ]
}
Again, there should be nothing terribly surprising here, given the former sections of this course:
- We schedule simulation steps in the way that was discussed earlier, after any commands initially present in the internal command buffer builder.
- If a GPU-to-CPU download must be performed, we schedule it afterwards.
- Finally, we replace our internal command buffer builder with a new one, and build the command buffer associated with the former command buffer builder.
This last step may seem a little convoluted when considered in isolation. What it gives us is the ability to seamlessly schedule simulation dataset initialization along with the first simulation steps.
We could instead save ourselves from the trouble of maintaining an internal
command buffer builder by building a new command buffer at the start of
schedule_next_output(). But then we would not be able to bundle the dataset
initialization job with the first simulation steps, and thus would need a more
complex initialization procedure with reduced execution efficiency.
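If you have not met std::mem::replace() before, here is a tiny standalone illustration of what it does, unrelated to the simulation code: it swaps a fresh value into place and hands the old one back, which is exactly what we need to keep a valid builder inside self at all times.
fn main() {
    let mut current = String::from("first batch");
    // Swap in a fresh value, get the previous one back by value
    let previous = std::mem::replace(&mut current, String::from("second batch"));
    assert_eq!(previous, "first batch");
    assert_eq!(current, "second batch");
}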
Output processing
After the command buffer that was produced by schedule_next_output() has been submitted to the GPU and is done executing, we may need to perform some CPU-side output processing steps, such as saving the output data to an HDF5 file. This work is taken care of by the third SimulationRunner method, process_output():
impl<'run_simulation, ProcessV> SimulationRunner<'run_simulation, ProcessV>
where
ProcessV: FnMut(ArrayView2<Float>) -> Result<()>,
{
// [ ... ]
/// Process the simulation output, if enabled
///
/// This method should be run after the command buffer produced by
/// [`schedule_next_output()`](Self::schedule_next_output) has been
/// submitted to the GPU and its execution has been awaited.
fn process_output(&mut self) -> Result<()> {
if let Some(handler) = &mut self.output_handler {
handler.v_buffer.process(&mut handler.process_v)?;
}
Ok(())
}
}
Putting it all together
Given these high-level building blocks, we can finally put them together by
writing the new version of the run_simulation() entry point:
use vulkano::{
command_buffer::PrimaryCommandBufferAbstract,
sync::GpuFuture,
};
/// Simulation runner, with a user-specified output processing function
pub fn run_simulation<ProcessV: FnMut(ArrayView2<Float>) -> Result<()>>(
options: &RunnerOptions,
context: &Context,
process_v: Option<ProcessV>,
) -> Result<()> {
// Set up the simulation
let mut runner = SimulationRunner::new(options, context, process_v)?;
// Produce the requested amount of concentration tables
for _ in 0..options.num_output_images {
// Prepare a GPU command buffer that produces the next output
let cmdbuf = runner.schedule_next_output()?;
// Submit the work to the GPU and wait for it to execute
cmdbuf
.execute(context.queue.clone())?
.then_signal_fence_and_flush()?
.wait(None)?;
// Process the simulation output, if enabled
runner.process_output()?;
}
Ok(())
}
For now, it is very simple. It just sets up a SimulationRunner and proceeds to
use it to produce the user-requested number of output images by repeatedly…
- Preparing a command buffer that steps the simulation and downloads the output if needed
- Submitting the command buffer to the GPU, then immediately waiting for it to execute
- Performing any user-requested post-processing on the CPU side
If you have understood the importance of asynchronous work execution in GPU programs, this simple synchronous logic may set off some performance alarm bells in your head, but don’t worry. This is just a starting point; we will improve its performance by making more things asynchronous in the subsequent optimization chapters.
For now, we are done with the parts of the simulation logic that are shared
between the main binary and the microbenchmark, so you can basically replace the
entire contents of exercises/src/grayscott/mod.rs with the code described
above.
Main simulation binary
Because we have altered the signature of run_simulation() to make GPU-to-CPU
downloads optional, we must alter the logic of the main simulation a little bit.
Its call to run_simulation() will now look like this:
use grayscott_exercises::data::Float;
use ndarray::ArrayView2;
run_simulation(
&options.runner,
&context,
Some(|v: ArrayView2<Float>| {
// Write down the current simulation output
hdf5.write(v)?;
// Update the progress bar to take note that one image was produced
progress.inc(1);
Ok(())
}),
)?;
The main thing worth noting here is that we now need to explicitly spell out the
type of data that process_v takes as input, or else type inference will pick
the wrong type and you will get a strange compiler error message about
ProcessV not being generic enough.
This is a consequence of the Rust compiler’s closure parameter type inference having a couple of very annoying bugs in its handling of references, whose full explanation goes well beyond the scope of this introductory course. We will just say that sometimes you will need to nudge closure parameter type inference in the right direction, as done here, and sometimes you will need to replace closures with something else (a true function or a trait object).
Exercise
Integrate the above code into the main simulation binary
(exercises/src/bin/simulate.rs), then…
- Do a simulation test run (cargo run --release -- -n100)
- Use mkdir -p pics && data-to-pics -o pics to convert the output data into PNG images
- Use your favorite image viewer to check that the resulting images look about right
Beyond that, the simulate benchmark (exercises/benches/simulate.rs) has been
pre-written for you in order to exercise the final simulation engine in various
configurations. Check out the code to get a general idea of how it works, then
run it for a while (cargo bench --bench simulate) and see how the various
tunable parameters affect performance.
Do not forget that you can also pass in a regular expression argument (as in
e.g. cargo bench --bench simulate -- '2048x.*compute$') in order to
only benchmark specific configurations.
-
There is a lot more to Rust lifetimes than this short description suggests. They are basically the language constructs through which a Rust API designer can express which function inputs a function output can borrow data from, so that callers can be confident that a change to a function’s implementation will not accidentally break their code without changing the function’s signature. And the fact that we can afford to use a single lifetime for two references here hides a surprising amount of complexity. ↩
Asynchronous storage
Identifying the bottleneck
Now that our Gray-Scott reaction simulation is up and running, and seems to produce sensible results, it is time to optimize it. But this raises the question: what should we optimize first?
The author’s top suggestion here would be to use a profiling tool to analyze where time is spent. But unfortunately, the GPU profiling ecosystem is messier than it should be, and there is no single tool that will work for all the GPUs that you may be using when following this course.
Therefore, we will have to use the slower method of learning things about our application’s performance by asking ourselves questions and answering them through experiments.
One first question that we can ask is whether our application is most limited by the speed at which it performs computations or writes data down to a storage device. On Linux, this question can be easily answered by comparing two timed runs of the application:
- One in the default configuration, where output data is written to the main storage device.
- One in a configuration where output data is written to RAM using the
tmpfsfilesystem.
Because RAM is much faster than nonvolatile storage devices even when used via
the slow tmpfs filesystem, a large difference between these two timings will
be a dead giveaway that our performance is limited by storage performance…
# Write output to main storage (default)
$ rm -f output.h5 \
&& cargo build --release --bin simulate \
&& time (cargo run --release --bin simulate && sync)
[ ... ]
real 2m23,493s
user 0m2,612s
sys 0m6,254s
# Write output to /dev/shm ramdisk
$ rm -f /dev/shm/output.h5 \
&& cargo build --release --bin simulate \
&& time (cargo run --release --bin simulate -- -o /dev/shm/output.h5 && sync)
[ ... ]
real 0m16,290s
user 0m2,519s
sys 0m3,592s
…and indeed, it looks like storage performance is our main bottleneck here.
By the way, notice the usage of the sync command above, which waits for
pending writes to be committed to the underlying storage. Without it, our sneaky
operating system (in this case Linux) would not reliably wait for all writes to
the target storage to be finished before declaring the job finished, which would
make our I/O timing measurements unpredictable and meaningless.
Picking a strategy
Storage performance bottlenecks can be tackled in various ways. Here are some things that we could try, in rough order of decreasing performance impact per hour of programmer effort:
- Make sure we are using the fastest available storage device
- Install a faster storage device into the machine and use it
- Store the same data more efficiently (lossless compression e.g. LZ4)
- Store less data (e.g. spend more simulation steps between two writes)
- Store lower-precision data (e.g. half-precision floats, other lossy compression)
- Offload storage access to dedicated CPU threads so it doesn’t need to wait for compute
- Tune lower-level parameters of the underlying storage I/O e.g. block size, data format…
Our performance test above was arguably already an example of strategy 1 at work: as ramdisks are almost always the fastest storage device available, they should always be considered as an option for file outputs of modest size that do not need non-volatile storage.
But because this school is focused on computation performance, we will only cover strategy 6, owing to its remarkable ease of implementation, before switching to an extreme version of strategy 4 where we will simply disable storage I/O and focus our attention on compute performance only.
Asynchronous I/O 101
One simple scheme for offloading I/O to a dedicated thread without changing output file contents is to have the compute and I/O thread communicate via a bounded FIFO queue.
In this scheme, the main compute thread will submit data to this queue as soon as it becomes available, while the I/O thread will fetch data from that queue and write it to the storage device. Depending on the relative speed at which each thread is working, two things may happen:
- If the compute thread is faster than the I/O thread, the FIFO queue will
quickly fill up until it reaches its maximal capacity, and then the compute
thread will block. As I/O tasks complete, the compute thread will be woken up
to compute more data. Overall…
- The I/O thread will be working 100% of the time, from its perspective it will look like input data is computed instantaneously. That’s the main goal of this optimization.
- The compute thread will be intermittently stopped to leave the I/O thread some time to process incoming data, thus preventing a scenario where data accumulates indefinitely, resulting in unbounded RAM footprint growth. This process, called backpressure, is a vital part of any well-designed asynchronous I/O implementation.
- If the I/O thread were faster than the compute thread, then the situation
would be reversed: the compute thread would be working 100% of the time, while
the I/O thread would intermittently block waiting for data.
- This is where we would have ended up if we implemented this optimization back in the CPU course, where the computation was too slow to saturate the I/O device.
- In this situation, asynchronous I/O is a more dubious optimization because as we will see it has a small CPU cost, which we don’t want to pay when CPU computations already are the performance-limiting factor.
Real-world apps will not perform all computations and I/O transactions at the same speed, which may lead them to alternate between these two behaviors. In that case, increasing the bounded size of the FIFO queue may be helpful:
- On the main compute thread side, it will allow compute to get ahead of I/O when it is faster by pushing more images in the FIFO queue…
- …which will later allow the I/O thread to continue uninterrupted for a while if for some reason I/O transactions speed up or CPU work slows down.
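Before diving into the real implementation, here is a minimal, simulation-free sketch of this bounded producer/consumer pattern, built on the same standard-library channel that we will use below (all names are made up for illustration):
use std::{sync::mpsc, thread, time::Duration};

fn main() {
    // A bounded queue of capacity 2: the sender blocks whenever it is full,
    // which is the backpressure mechanism described above
    let (sender, receiver) = mpsc::sync_channel::<u32>(2);
    let io_thread = thread::spawn(move || {
        for item in receiver {
            // Stand-in for a slow I/O transaction
            thread::sleep(Duration::from_millis(10));
            println!("wrote item {item}");
        }
    });
    for item in 0..8 {
        // Fast "compute" side: blocks when the queue already holds 2 items
        sender.send(item).expect("the I/O thread should still be alive");
    }
    drop(sender); // tell the I/O thread that no more data is coming
    io_thread.join().expect("the I/O thread should not panic");
}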
Implementation
As mentioned above, one tuning parameter of an asynchronous I/O implementation is the size of the bounded FIFO queue that the I/O and compute thread use to communicate. Like many performance tuning parameters, we will start by exposing it as a command-line argument:
// In exercises/src/grayscott/options.rs
/// Simulation runner options
#[derive(Debug, Args)]
pub struct RunnerOptions {
// [ ... existing options ... ]
/// I/O buffer size
///
/// Increasing this parameter will improve the application's ability to
/// handle jitter in the time it takes to perform computations or I/O without
/// interrupting the I/O stream, at the expense of increasing RAM usage.
#[arg(short = 'i', long, default_value_t = 1)]
io_buffer: usize,
}
Then, in the main simulation binary, we will proceed to extract all of our HDF5 I/O work into a dedicated thread, to which we can offload work via a bounded FIFO queue, which the Rust standard library provides in the form of synchronous Multi-Producer Single-Consumer (MPSC) channels:
// In exercises/src/bin/simulate.rs
use ndarray::Array2;
use std::{sync::mpsc::SyncSender, thread::JoinHandle};
/// `SyncSender` for V species concentration
type Sender = SyncSender<Array2<Float>>;
/// `JoinHandle` for the I/O thread
type Joiner = JoinHandle<hdf5::Result<()>>;
/// Set up an I/O thread
fn setup_io_thread(options: &Options, progress: ProgressBar) -> hdf5::Result<(Sender, Joiner)> {
let (sender, receiver) = std::sync::mpsc::sync_channel(options.io_buffer);
let mut hdf5 = HDF5Writer::create(
&options.file_name,
[options.runner.num_rows, options.runner.num_cols],
options.runner.num_output_images,
)?;
let handle = std::thread::spawn(move || {
for v in receiver {
hdf5.write(v)?;
progress.inc(1);
}
hdf5.close()?;
Ok(())
});
Ok((sender, handle))
}
Usage of MPSC channels aside, the main notable thing in the above code is the
use of the
std::thread::spawn API
to spawn an I/O thread. This API returns a
JoinHandle,
which can later be used to wait for the I/O thread to be done processing all
previously sent work.
Another thing that the astute reader will notice about the above code is that it
consumes the V species’ concentration as an owned table, rather than a borrowed
view. This is necessary because after sending the concentration data to the I/O
thread, the compute thread will not wait for I/O but will immediately proceed to
overwrite the associated VBuffer with new data. Therefore, the I/O thread
cannot simply borrow data from the compute thread and must get an owned copy.
This also means that we will always be sending owned data to our HDF5 writer, so we can drop our data-cloning workaround and redefine the writer’s interface to accept owned data instead:
// In exercises/src/grayscott/io.rs
use ndarray::Array2;
impl HDF5Writer {
// [ ... ]
/// Write a new V species concentration table to the file
pub fn write(&mut self, v: Array2<Float>) -> hdf5::Result<()> {
self.dataset.write_slice(&v, (self.position, .., ..))?;
self.position += 1;
Ok(())
}
// [ ... ]
}
And finally, we can rewrite our main function to use the new threaded I/O infrastructure…
// In exercises/src/bin/simulate.rs
fn main() -> Result<()> {
// Parse command line options
let options = Options::parse();
// Set up the progress bar
let progress = ProgressBar::new(options.runner.num_output_images as u64);
// Start the I/O thread
let (io_sender, io_handle) = setup_io_thread(&options, progress.clone())?;
// Set up the Vulkan context
let context = Context::new(&options.context, false, Some(progress.clone()))?;
// Run the simulation
run_simulation(
&options.runner,
&context,
Some(|v: ArrayView2<Float>| {
io_sender.send(v.to_owned())?;
Ok(())
}),
)?;
// Save the Vulkan pipeline cache
context.save_pipeline_cache()?;
// Signal the I/O thread that we are done writing, then wait for it to finish
std::mem::drop(io_sender);
io_handle.join().expect("the I/O thread has crashed")?;
// Declare the computation finished
progress.finish();
Ok(())
}
Most of this should be unsurprising to you if you understood the above explanations, but there is a bit of trickery at the end that is worth highlighting.
// Signal the I/O thread that we are done writing, then wait for it to finish
std::mem::drop(io_sender);
io_handle.join().expect("The I/O thread has crashed")?;
These two lines work around a surprising number of Rust standard library usability gotchas:
- To properly handle unexpected errors in Rust threads (e.g. panics due to incorrect array indexing), it is a good idea to explicitly join them…
- …but the associated join() method returns a Result type whose error type does not implement the standard Error trait, so we can only handle it via panicking.
- Rust MPSC channels have a very convenient feature which ensures that we can tell a thread that we are done sending data by simply dropping the channel’s SyncSender input interface, which happens automatically when it goes out of scope…
- …but that may be too late in the presence of an explicit .join(), as the main thread may end up waiting on the I/O thread, which itself is waiting for the main thread to stop sending data, resulting in a deadlock. To avoid this, we must explicitly drop the SyncSender somehow. Here we are using std::mem::drop() for this.
In any case, we are now ready to reap the benefits of our optimization, which
will be most visible on fast storage backends like tmpfs:
# Command
$ rm -f /dev/shm/output.h5 \
&& cargo build --release --bin simulate \
&& time (cargo run --release --bin simulate -- -o /dev/shm/output.h5 && sync)
# Before
real 0m16,290s
user 0m2,519s
sys 0m3,592s
# Now
real 0m11,217s
user 0m2,750s
sys 0m5,025s
Exercise
Implement the above optimization, and study its impact on your machine for all
storage devices that you have access to, starting with tmpfs where the effect
should be most noticeable.
On Linux, you may experience a problem where the system intermittently locks up
above a certain level of I/O pressure. If that happens, consider tuning down the
number of output images that your benchmark generates (-n argument to
simulate) in order to keep the system responsive.
Finally, try tuning the io_buffer parameter and see what effect it has. Note
that setting this parameter to 0 is meaningful and still allows I/O and
computations to overlap. It only means that the compute thread may not leave a
pre-rendered image lying around in memory while it starts rendering another one;
instead, it must wait for the I/O thread to pick up the newly rendered image
before it can start rendering the next one.
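If you are wondering what this parameter maps to under the hood: given the SyncSender interface mentioned earlier, it is presumably the capacity of the std::sync::mpsc::sync_channel that connects the two threads (an assumption about the course code, but the semantics are easy to demonstrate with a standalone sketch):
// Standalone sketch (not part of the course repository)
use std::sync::mpsc::{sync_channel, TrySendError};

fn main() {
    // With a capacity of 2, two results can sit in the channel while the I/O
    // thread is busy; trying to queue a third one does not succeed (send()
    // would block here, try_send() reports that the channel is full)
    let (sender, _receiver) = sync_channel::<usize>(2);
    sender.try_send(0).unwrap();
    sender.try_send(1).unwrap();
    assert!(matches!(sender.try_send(2), Err(TrySendError::Full(2))));
    println!("the compute thread can only run 2 results ahead of the I/O thread");
}
With a capacity of 0, the channel becomes a rendezvous channel where each send blocks until the receiver picks up the value, which is exactly the behavior described above.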
Asynchronous compute
So far, the main loop of our simulation has looked like this:
for _ in 0..options.num_output_images {
// Prepare a GPU command buffer that produces the next output
let cmdbuf = runner.schedule_next_output()?;
// Submit the work to the GPU and wait for it to execute
cmdbuf
.execute(context.queue.clone())?
.then_signal_fence_and_flush()?
.wait(None)?;
// Process the simulation output, if enabled
runner.process_output()?;
}
This logic is somewhat unsatisfactory because it forces the CPU and GPU to work in lockstep:
- At first, the CPU is preparing GPU commands while the GPU is waiting for commands.
- Then the CPU submits GPU commands and waits for them to execute, so the GPU is working while the CPU is waiting for the GPU to be done.
- At the end the CPU processes GPU results while the GPU is waiting for commands.
In other words, we expect this sort of execution timeline…

…where “C” stands for command buffer recording, “Execution” stands for execution of GPU work (simulation + results downloads) and “Results” stands for CPU-side results processing.
Command buffer recording has been purposely abbreviated on this diagram as it is expected to be faster than the other two steps, whose relative performance will depend on runtime configuration (relative speed of the GPU and storage device, number of simulation steps per output image…).
In this chapter, we will see how we can make more of these steps happen in parallel.
Command recording
Theory
Because command recording is fast with respect to command execution, allowing them to happen in parallel is not going to save much time. But it will be an occasion to introduce some optimization techniques that we will later apply on a larger scale to achieve the more ambitious goal of overlapping GPU operations with CPU result processing.
First of all, by looking back at the main simulation loop above, it should be clear to you that it is not possible for the recording of the first command buffer to happen in parallel with its execution. We cannot execute a command buffer that is still in the process of being recorded.
What we can do, however, is change the logic of our main simulation loop after this first command buffer has been submitted to the GPU:
- Instead of immediately waiting for the GPU to finish the freshly submitted work, we can start preparing a second command buffer on the CPU side while the GPU is busy working on the first command buffer that we just sent.
- Once that second command buffer is ready, we have nothing else to do on the CPU side (for now), so we will just finish executing our first main loop iteration as before: await GPU execution, then process results.
- By the time we reach the second main loop iteration, we will be able to reap the benefits of our optimization by having a second command buffer that can be submitted right away.
- And then the cycle will repeat: we will prepare the command buffer for the third main loop iteration while the GPU work associated with the second main loop iteration is executing, then we will wait, process the second GPU result, and so on.
This sort of ahead-of-time command buffer preparation will result in parallel CPU/GPU execution through pipelining, a general-purpose optimization technique that can improve execution speed at the expense of some duplicated resource allocation and reduced code clarity.
In this particular case, the resource that is being duplicated is the command buffer. Before we used to have only one command buffer in flight at any point in time. Now we intermittently have two of them, one that is being recorded while another one is executing.
And in exchange for this resource duplication, we expect to get a new execution timeline…

…where command buffer recording and GPU work execution can run in parallel as long as the simulation produces at least two images, resulting in a small performance gain.
Implementation
The design of the SimulationRunner, introduced in the previous chapter, allows
us to implement this pipelining optimization through a small modification of our
run_simulation() main loop:
// Prepare the first command buffer
let mut cmdbuf = Some(runner.schedule_next_output()?);
// Produce the requested amount of concentration tables
let num_output_images = options.num_output_images;
for image_idx in 0..num_output_images {
// Submit the command buffer to the GPU and prepare to wait for it
let future = cmdbuf
.take()
.expect("if this iteration executes, a command buffer should be present")
.execute(context.queue.clone())?
.then_signal_fence_and_flush()?;
// Prepare the next command buffer, if any
if image_idx != num_output_images - 1 {
cmdbuf = Some(runner.schedule_next_output()?);
}
// Wait for the GPU to be done
future.wait(None)?;
// Process the simulation output, if enabled
runner.process_output()?;
}
How does this work?
- Before we begin the main simulation loop, we initialize the simulation pipeline by building a first command buffer, which we do not submit to the GPU right away.
- On each simulation loop iteration, we submit a previously prepared command buffer to the GPU. But in contrast with our previous logic, we do not wait for it right away. Instead, we prepare the next command buffer (if any) while the GPU is executing work.
- Submitting a command buffer to the GPU moves it away, and the static analysis within the Rust compiler that detects use-after-move is unfortunately too simple to understand that we are always going to put another command buffer in its place before the next loop iteration, if there is a next loop iteration. We must therefore play a little trick with the Option type in order to convince the compiler that our code is correct:
  - At first, we wrap our initial command buffer into Some(), thus turning what was an Arc<PrimaryAutoCommandBuffer> into an Option<Arc<PrimaryAutoCommandBuffer>>.
  - When the time comes to submit a command buffer to the GPU, we use the take() method of the Option type to retrieve our command buffer, leaving a None in its place.
  - We then use expect() on the resulting Option to assert that we know it previously contained a command buffer, rather than a None.
  - Finally, when we know that there is going to be a next loop iteration, we prepare the associated command buffer and put it back in the cmdbuf option variable.
While this trick may sound expensive from a performance perspective, you must understand that…
- The Rust compiler’s LLVM backend is a bit more clever than its use-after-move detector, and is therefore likely to figure this out and optimize out the Option checks.
- Even if LLVM does not manage to do so, it is quite unlikely that the overhead of checking a boolean flag (testing whether an Option is Some) will have any meaningful performance impact compared to the surrounding overhead of scheduling GPU work.
Conclusion
After this coding interlude, we are ready to reach some preliminary conclusions:
- Pipelining is an optimization that can be applied when a computation has two steps A and B that execute on different hardware, and step A produces an output that step B consumes.
- Pipelining allows you to run steps A and B in parallel, at the expense of…
- Needing to juggle with multiple copies of the output of step A (which will typically come at the expense of a higher application memory footprint).
- Having a more complex initialization procedure before your main loop, in order to bring your pipeline to the fully initialized state that your main loop expects.
- Needing some extra logic to avoid unnecessary work at the end of the main loop, if you have real-world use cases where the number of main loop iterations is small enough that this extra work has measurable overhead.
As for performance benefits, you were warned at the beginning of this section that command buffer recording is only pipelined here because it provides an easy introduction to pipelining, and not because the author considers the performance gain to be worthwhile.
And indeed, even in a best-case microbenchmarking scenario, the asymptotic performance benefit will be below our performance measurements’ sensitivity threshold…
run_simulation/workgroup16x16/domain2048x1024/total512/image1/compute
time:   [136.74 ms 137.80 ms 138.91 ms]
thrpt:  [7.7300 Gelem/s 7.7921 Gelem/s 7.8527 Gelem/s]
change:
time:   [-3.2317% -1.4858% +0.3899%] (p = 0.15 > 0.05)
thrpt:  [-0.3884% +1.5082% +3.3396%]
No change in performance detected.
Still, this pipelining optimization does not harm much either, and serves as a gateway to more advanced ones. So we will keep it for now.
Results processing
Theory
Encouraged by the modest but tangible performance improvements that pipelined command buffer recording brought, you may now try to achieve full CPU/GPU execution pipelining, in which GPU-side work execution and CPU-side results processing can overlap…

…but your first attempt will likely end with a puzzling compile-time or
run-time error, which you will stare at blankly for a few minutes of
incomprehension, before you figure it out and thank rustc or vulkano for
saving you from yourself.
Indeed there is a trap with this form of pipelining, and one that is easy to fall into: if you are not careful, you are likely to end up trying to access a simulation result on the CPU side that the GPU could simultaneously be overwriting with a newer result. Which is the textbook example of a variety of undefined behavior known as a data race.
To avoid this data race, we will need to add double buffering to our CPU-side
VBuffer abstraction1, so that our CPU code can read result N at the same
time as our GPU code is busy producing result N+1 and transferring it to the CPU
side. And the logic behind our main simulation loop is going to become a bit
more complicated again, as we now need to…
- Make sure that by the time we enter the main simulation loop, a result is already available or in the process of being produced. Indeed, the clearest way to write pipelined code is to write each iteration of our main loop under the assumption that the pipeline is already operating at full capacity, taking any required initialization step to get there before the looping begins.
- Rethink our CPU-GPU synchronization strategy so that the CPU code waits for a GPU result to be available before processing it, but does not start processing a result before having scheduled the production of the next result.
Double-buffered VBuffer
As mentioned above, we are going to need some double-buffering inside of
VBuffer, which we are therefore going to rename to VBuffers. By now, you
should be familiar with the basics of this pattern: we duplicate data storage
and add a data member whose purpose is to track the respective role of each of
our two buffers…
/// CPU-accessible double buffer used to download the V species' concentration
pub struct VBuffers {
/// Buffers in which GPU data will be downloaded
buffers: [Subbuffer<[Float]>; 2],
/// Truth that the second buffer of the `buffers` array should be used
/// for the next GPU-to-CPU download
current_is_1: bool,
/// Number of columns in the 2D concentration table, including zero padding
padded_cols: usize,
}
//
impl VBuffers {
/// Set up `VBuffers`
pub fn new(options: &RunnerOptions, context: &Context) -> Result<Self> {
use vulkano::memory::allocator::MemoryTypeFilter as MTFilter;
let padded_rows = padded(options.num_rows);
let padded_cols = padded(options.num_cols);
let new_buffer = || {
Buffer::new_slice(
context.mem_allocator.clone(),
BufferCreateInfo {
usage: BufferUsage::TRANSFER_DST,
..Default::default()
},
AllocationCreateInfo {
memory_type_filter: MTFilter::PREFER_HOST | MTFilter::HOST_RANDOM_ACCESS,
..Default::default()
},
(padded_rows * padded_cols) as DeviceSize,
)
};
Ok(Self {
buffers: [new_buffer()?, new_buffer()?],
padded_cols,
current_is_1: false,
})
}
/// Buffer where data has been downloaded two
/// [`schedule_download_and_flip()`] calls ago, and where data will be
/// downloaded again on the next [`schedule_download_and_flip()`] call.
///
/// [`schedule_download_and_flip()`]: Self::schedule_download_and_flip
fn current_buffer(&self) -> &Subbuffer<[Float]> {
&self.buffers[self.current_is_1 as usize]
}
// [ ... more methods coming up ... ]
}
…and then we need to think about when we should alternate between our two
buffers. As a reminder, back when it contained a single buffer, the VBuffer
type exposed two methods:
- schedule_download(), which prepared a GPU command whose purpose is to transfer the current GPU V concentration input to the VBuffer.
- process(), which was called after the previous command was done executing, and took care of executing CPU post-processing work.
Between the points where these two methods are called, a VBuffer could not be
used, because it was in the process of being overwritten by the GPU. Which, if we
transpose this to our new double-buffered design, sounds like a good point to
flip the roles of our two buffers: while the GPU is busy writing to one of the
internal buffers, we can do something else with the other buffer:
impl VBuffers {
[ ... ]
/// Schedule a download of some [`Concentrations`]' current V input into
/// [`current_buffer()`](Self::current_buffer), then switch to the other
/// internal buffer.
///
/// Intended usage is...
///
/// - Schedule two simulation updates and output downloads,
/// keeping around the associated futures F1 and F2
/// - Wait for the first update+download future F1
/// - Process results on the CPU using [`process()`](Self::process)
/// - Schedule the next update+download, yielding a future F3
/// - Wait for F2, process results, schedule F4, etc
pub fn schedule_download_and_flip(
&mut self,
source: &Concentrations,
cmdbuild: &mut CommandBufferBuilder,
) -> Result<()> {
cmdbuild.copy_buffer(CopyBufferInfo::buffers(
source.current_inout().input_v.clone(),
self.current_buffer().clone(),
))?;
self.current_is_1 = !self.current_is_1;
Ok(())
}
[ ... ]
}
As the doc comment indicates, we can use the schedule_download_and_flip() method
as follows…
1. Schedule a first and second GPU update in short succession.
   - Both buffers are now in use by the GPU.
   - current_buffer() points to the buffer V1 that is undergoing the first GPU update and will be ready for CPU processing first.
2. Wait for the first GPU update to finish.
   - Current buffer V1 is now ready for CPU readout.
   - Buffer V2 is still being processed by the GPU.
3. Perform CPU-side processing on the current buffer V1.
4. We don’t need V1 anymore after this is done, so schedule a third GPU update.
   - current_buffer() now points to buffer V2, which is still undergoing the second GPU update we scheduled.
   - Buffer V1 is now being processed by the third GPU update.
5. Wait for the second GPU update to finish.
   - Current buffer V2 is now ready for CPU readout.
   - Buffer V1 is still being processed by the third GPU update.
6. We are now back at step 3, but with the roles of V1 and V2 reversed. Repeat steps 3 to 5, reversing the roles of V1 and V2 each time, until all desired outputs have been produced.
…which means that the logic of the process() function will basically not change,
aside from the fact that it will now use the current_buffer() instead of the
former single internal buffer:
impl VBuffers {
[ ... ]
/// Process the latest download of the V species' concentrations
///
/// See [`schedule_download_and_flip()`] for intended usage.
///
/// [`schedule_download_and_flip()`]: Self::schedule_download_and_flip
pub fn process(&self, callback: impl FnOnce(ArrayView2<Float>) -> Result<()>) -> Result<()> {
// Access the underlying dataset as a 1D slice
let read_guard = self.current_buffer().read()?;
// Create an ArrayView2 that covers the whole data, padding included
let padded_cols = self.padded_cols;
let padded_elements = read_guard.len();
assert_eq!(padded_elements % padded_cols, 0);
let padded_rows = padded_elements / padded_cols;
let padded_view = ArrayView::from_shape([padded_rows, padded_cols], &read_guard)?;
// Extract the central region of padded_view, excluding padding
let data_view = padded_view.slice(s!(
PADDING_PER_SIDE..(padded_rows - PADDING_PER_SIDE),
PADDING_PER_SIDE..(padded_cols - PADDING_PER_SIDE),
));
// We are now ready to run the user callback
callback(data_view)
}
}
Simulation runner changes
Within the implementation of SimulationRunner, our OutputHandler struct
which governs simulation output downloads must now be updated to contain
VBuffers instead of a VBuffer…
/// State associated with output downloads and post-processing, if enabled
struct OutputHandler<ProcessV> {
/// CPU-accessible location to which GPU outputs should be downloaded
v_buffers: VBuffers,
/// User-defined post-processing logic for this CPU data
process_v: ProcessV,
}
…and the SimulationRunner::new() constructor must be adjusted accordingly.
impl<'run_simulation, ProcessV> SimulationRunner<'run_simulation, ProcessV>
where
ProcessV: FnMut(ArrayView2<Float>) -> Result<()>,
{
/// Set up the simulation
fn new(
options: &'run_simulation RunnerOptions,
context: &'run_simulation Context,
process_v: Option<ProcessV>,
) -> Result<Self> {
// [ ... ]
// Set up the logic for post-processing V concentration, if enabled
let output_handler = if let Some(process_v) = process_v {
Some(OutputHandler {
v_buffers: VBuffers::new(options, context)?,
process_v,
})
} else {
None
};
// [ ... ]
}
// [ ... ]
}
But that is the easy part. The slightly harder part is that in order to achieve
the desired degree of pipelining, we should revise the responsibility of the
schedule_next_output() method so that instead of simply building a command
buffer, it also submits it to the GPU for eager execution:
impl<'run_simulation, ProcessV> SimulationRunner<'run_simulation, ProcessV>
where
ProcessV: FnMut(ArrayView2<Float>) -> Result<()>,
{
// [ ... ]
/// Submit a GPU job that will produce the next simulation output
fn schedule_next_output(&mut self) -> Result<FenceSignalFuture<impl GpuFuture + 'static>> {
// Schedule a number of simulation steps
schedule_simulation(
self.options,
&self.pipelines,
&mut self.concentrations,
&mut self.cmdbuild,
)?;
// Schedule a download of the resulting V concentration, if enabled
if let Some(handler) = &mut self.output_handler {
handler
.v_buffers
.schedule_download_and_flip(&self.concentrations, &mut self.cmdbuild)?;
}
// Extract the old command buffer builder, replacing it with a blank one
let old_cmdbuild =
std::mem::replace(&mut self.cmdbuild, command_buffer_builder(self.context)?);
// Build the command buffer and submit it to the GPU
let future = old_cmdbuild
.build()?
.execute(self.context.queue.clone())?
.then_signal_fence_and_flush()?;
Ok(future)
}
// [ ... ]
}
…and being nice people, we should also warn users of the process_output() method
that although its implementation has not changed much, its usage contract has
become more complicated:
impl<'run_simulation, ProcessV> SimulationRunner<'run_simulation, ProcessV>
where
ProcessV: FnMut(ArrayView2<Float>) -> Result<()>,
{
// [ ... ]
/// Process the simulation output, if enabled
///
/// This method is meant to be used in the following way:
///
/// - Initialize the simulation pipeline by submitting two simulation jobs
/// using [`schedule_next_output()`](Self::schedule_next_output)
/// - Wait for the first simulation job to finish executing
/// - Call this method to process the output of the first job
/// - Submit a third simulation job
/// - Wait for the second simulation job to finish executing
/// - Call this method to process the output of the second job
/// - ...and so on, until all simulation outputs have been processed...
fn process_output(&mut self) -> Result<()> {
if let Some(handler) = &mut self.output_handler {
handler.v_buffers.process(&mut handler.process_v)?;
}
Ok(())
}
}
Once this is done, migrating run_simulation() to the new pipelined logic
becomes straightforward:
/// Simulation runner, with a user-specified output processing function
pub fn run_simulation<ProcessV: FnMut(ArrayView2<Float>) -> Result<()>>(
options: &RunnerOptions,
context: &Context,
process_v: Option<ProcessV>,
) -> Result<()> {
// Set up the simulation
let mut runner = SimulationRunner::new(options, context, process_v)?;
// Schedule the first simulation update
let mut current_future = Some(runner.schedule_next_output()?);
// Produce the requested amount of concentration tables
let num_output_images = options.num_output_images;
for image_idx in 0..num_output_images {
// Schedule the next simulation update, if any
let next_future = (image_idx < num_output_images - 1)
.then(|| runner.schedule_next_output())
.transpose()?;
// Wait for the GPU to be done with the previous update
current_future
.expect("if this loop iteration executes, a future should be present")
.wait(None)?;
current_future = next_future;
// Process the simulation output, if enabled
runner.process_output()?;
}
Ok(())
}
As previously hinted, this is quite similar to the previous logic, where we used to prepare a new command buffer while the previous command buffer was executing, except now we take it further and submit a full GPU job while another full GPU job is already executing.
And as before, some tricks must be played with Option in order to convince the
Rust compiler that even though we do not schedule a new computation on the last
loop iteration, our Option of FenceSignalFuture will always be Some when we
wait on it (or else we will panic at runtime), and therefore no use-after-move
of GPU futures can happen.
Exercise
Implement the pipelining techniques described above, then adapt the simulation executable and benchmark, and check the impact on performance.
You should observe a noticeable performance improvement when running the main
simulation binary over a fast storage device like /dev/shm, as the previous
version of the simulation spent too much time waiting for the GPU to be able to
keep this storage device busy.
Microbenchmarks will see more modest performance improvements except for
compute+download+sum, which will see a great improvement in scenarios that
produce many concentration tables because its CPU post-processing can now run in
parallel with GPU work.
1. Another example of how pipelining requires duplication of (in this case memory) resources.
Async download
Introduction
During the last chapter, we set up the required pipelining infrastructure for the CPU and GPU steps of computation to execute in parallel with each other. This took us from a situation where the CPU and GPU kept waiting for each other…

…to a situation where only the faster device waits for the slower one:

However, the above diagrams are simplifications of a more complex hardware reality, where both the CPU and GPU work are composed of many independent components that can operate in parallel. By introducing more parallelism on either side, we should be able to improve our performance further. But how should we focus this effort?
On the CPU side, we mentioned multiple times that command buffer recording is relatively cheap. Which means we do not expect a massive performance benefit from offloading it to a different thread, and because of thread synchronization costs it could be a net loss. Beyond that, the other CPU tasks that we could consider parallelizing are all IO-related:
- NVMe storage and ramdisks like /dev/shm are fast enough to benefit from the use of multiple I/O threads. But unfortunately the HDF5 library does not know how to correctly leverage multithreading for I/O, so we are stuck here given our choice of file format.
- What we do control is the copy of GPU results from the main CPU thread to the I/O thread, which is currently performed sequentially. But parallelizing memory-bound tasks like this is rather difficult, and we only expect significant benefits in I/O-bound workloads that write to a ramdisk, which are arguably a bit of an edge case.
Taking all this into consideration, further CPU-side parallelization does not look very promising at this point. But what about GPU-side parallelization? That’s a deep topic, so in this chapter we will start with the first stage of GPU execution, which is the submission of commands from the CPU.
Right from this stage, a typical high-end modern GPU provides at least 2-3 flavors of independent hardware command submission interfaces, which get exposed as queue families in Vulkan:
- “Main” queues support all operations including traditional live 3D rendering.
- “Asynchronous compute” queues can execute compute commands and data transfer operations in parallel with the main queues when enough resources are available.
- Dedicated “DMA” queues, if available, can only execute data transfer commands but have a higher chance to be able to operate in parallel with the other queues.
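If you want to check what your own GPU exposes, you can list its queue families and their flags. The helper below is a hypothetical sketch, not part of the course code; it assumes you already hold a PhysicalDevice, e.g. the one returned by the course's device selection logic.
// Hypothetical inspection helper (not part of the course repository)
use vulkano::device::physical::PhysicalDevice;

fn print_queue_families(device: &PhysicalDevice) {
    for (idx, family) in device.queue_family_properties().iter().enumerate() {
        // Print each family's capabilities and how many queues it provides
        println!(
            "family {idx}: {:?}, {} queue(s)",
            family.queue_flags, family.queue_count
        );
    }
}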
Knowing this, we would now like to leverage this extra hardware parallelism, if available, in order to go from our current sequential GPU execution model…

…to a state that leverages the hardware concurrent data transfer capabilities if available:

The purpose of this chapter will be to take us from here to there. From this optimization, we expect to improve the simulation’s performance by up to 2x in the best-case scenario where computation steps and GPU-to-CPU data transfers take roughly the same amount of time, and the application performance is limited by the speed at which the GPU does its work (as opposed to e.g. the speed at which data gets written to a storage device).
Transfer queue
To perform asynchronous data transfers, the first thing that we are going to
need is some access to the GPU’s dedicated DMA (if available) or asynchronous
compute queues.1 To get there, we will need to break an assumption that has
been hardcoded in our codebase for a while, namely that a single
Queue ought
to be enough for anybody.
The first step in this journey will be for us to rename our queue family selection function to highlight its compute-centric nature:
fn compute_queue_family_index(device: &PhysicalDevice) -> u32 {
device
.queue_family_properties()
.iter()
.position(|family| family.queue_flags.contains(QueueFlags::COMPUTE))
.expect("device does not support compute (or graphics)") as u32
}
We will then write a second version of this function that attempts to pick up the device’s dedicated DMA queue (if available) or its asynchronous compute queue (otherwise). This is hard to do in a hardware-agnostic way, but one fairly reliable trick is to pick the queue family that supports data transfers and as few other Vulkan operations as possible:
/// Pick up the queue family that has the highest chance of supporting
/// asynchronous data transfers, and is distinct from the main compute queue
fn transfer_queue_family_index(device: &PhysicalDevice, compute_idx: u32) -> Option<u32> {
use QueueFlags as QF;
device
.queue_family_properties()
.iter()
.enumerate()
.filter(|(idx, family)| {
// Per the Vulkan specification, if a queue family supports graphics
// or compute, then it is guaranteed to support data transfers and
// does not need to advertise support for it. Why? Ask Khronos...
*idx != compute_idx as usize
&& family
.queue_flags
.intersects(QF::TRANSFER | QF::COMPUTE | QF::GRAPHICS)
})
.min_by_key(|(idx, family)| {
// Among the queue families that support data transfers, we prioritize
// the queue families that...
//
// - Explicitly advertise support for data transfers
// - Advertise support for as few other operations as possible
// - Come as early in the queue family list as possible
let flags = family.queue_flags;
let flag_priority = if flags.contains(QF::TRANSFER) { 0 } else { 1 };
let specialization = flags.count();
(flag_priority, specialization, *idx)
})
.map(|(idx, _family)| idx as u32)
}
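In case the min_by_key key above looks mysterious: Rust tuples implement Ord lexicographically, so earlier fields dominate and later fields only break ties. A tiny standalone sketch with made-up candidate values illustrates why the (flag_priority, specialization, idx) key prefers explicit TRANSFER support first, then the most specialized family, then the earliest index:
// Standalone sketch (not part of the course repository)
fn main() {
    // Each tuple is a made-up (flag_priority, specialization, idx) key
    let candidates = [(1, 4, 0), (0, 5, 1), (0, 3, 2), (0, 3, 3)];
    // Lexicographic ordering picks the lowest first field, then breaks ties
    assert_eq!(candidates.iter().min(), Some(&(0, 3, 2)));
    println!("selected candidate: {:?}", candidates.iter().min().unwrap());
}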
We will then modify our device setup code to allocate one queue from each of these families…
/// Initialized `Device` with associated queues
struct DeviceAndQueues {
device: Arc<Device>,
compute_queue: Arc<Queue>,
transfer_queue: Option<Arc<Queue>>,
}
//
impl DeviceAndQueues {
/// Set up a device and associated queues
fn new(device: Arc<PhysicalDevice>) -> Result<Self> {
// Prepare to request queues
let mut queue_create_infos = Vec::new();
let queue_create_info = |queue_family_index| QueueCreateInfo {
queue_family_index,
..Default::default()
};
// Create a compute queue, and a transfer queue if distinct
let compute_family = compute_queue_family_index(&device);
queue_create_infos.push(queue_create_info(compute_family));
if let Some(transfer_family) = transfer_queue_family_index(&device, compute_family) {
queue_create_infos.push(queue_create_info(transfer_family))
}
// Set up the device and queues
let (device, mut queues) = Device::new(
device,
DeviceCreateInfo {
queue_create_infos,
..Default::default()
},
)?;
// Get the compute queue, and the transfer queue if available
let compute_queue = queues
.next()
.expect("we asked for a compute queue, we should get one");
let transfer_queue = queues.next();
Ok(Self {
device,
compute_queue,
transfer_queue,
})
}
}
…and finally, we are going to integrate this new nuance into our context struct.
/// Basic Vulkan setup that all our example programs will share
pub struct Context {
pub device: Arc<Device>,
pub compute_queue: Arc<Queue>,
pub transfer_queue: Option<Arc<Queue>>,
pipeline_cache: PersistentPipelineCache,
pub mem_allocator: Arc<MemoryAllocator>,
pub desc_allocator: Arc<DescriptorSetAllocator>,
pub comm_allocator: Arc<CommandBufferAllocator>,
_messenger: Option<DebugUtilsMessenger>,
}
//
impl Context {
/// Set up a `Context`
pub fn new(
options: &ContextOptions,
quiet: bool,
progress: Option<ProgressBar>,
) -> Result<Self> {
let library = VulkanLibrary::new()?;
let mut logging_instance = LoggingInstance::new(library, &options.instance, progress)?;
let physical_device =
select_physical_device(&logging_instance.instance, &options.device, quiet)?;
let DeviceAndQueues {
device,
compute_queue,
transfer_queue,
} = DeviceAndQueues::new(physical_device)?;
let pipeline_cache = PersistentPipelineCache::new(device.clone())?;
let (mem_allocator, desc_allocator, comm_allocator) = setup_allocators(device.clone());
let _messenger = logging_instance.messenger.take();
Ok(Self {
device,
compute_queue,
transfer_queue,
pipeline_cache,
mem_allocator,
desc_allocator,
comm_allocator,
_messenger,
})
}
// [ ... same as before ... ]
}
After this is done, we can let compiler errors guide us into replacing every
occurrence of context.queue in our program with context.compute_queue. And
with that, we will get a program that still only uses a single Vulkan compute
queue… but now possibly allocates another data transfer queue that we will be
able to leverage next.
Triple buffering
Theory
Having reached the point where we do have access to a data transfer queue, we
could proceed to adjust our command buffer creation and queue submission logic
to perform asynchronous data transfers. But then we would once again get a
rustc or vulkano error, get puzzled by that error, think it through, and
after a while thank these tools for saving us from ourselves again.
Indeed, asynchronous data transfers are yet another application of the general concept of pipelining optimizations. And like all pipelining optimizations, they require some kind of resource duplication in order to work. Without such duplication, we will once again face a data race:

In case the above diagram does not make it apparent, the problem with our
current memory management strategy is that if we try to download the UV1 version
of the V species’ concentration while running more simulation steps, and our
download takes more time than one simulation step, then subsequent simulation
steps will eventually end up overwriting the data that we are in the process of
transferring to the CPU, resulting in yet another data race.
We can avoid this data race by allocating a third set of (U, V) concentration
buffers on the GPU side. Which will allow us to have not just one, but three
pairs of (U, V) datasets, where each pair excludes one of the three
(U, V) datasets…
- The (UV2, UV3) pair does not contain UV1
- The (UV1, UV3) pair does not contain UV2
- The (UV1, UV2) pair does not contain UV3
…and thus, by leveraging each of these pairs at the right time during our
simulation steps, we will be able to keep computing new simulation steps on the
GPU without overwriting the version of the (U, V) concentration data that is
in the process of being transferred to the CPU side:

But needless to say, getting there will require us to make “a few” changes to the logic that we use to manage our GPU-side concentration data storage. Which will be the topic of the next section.
Implementation
Previously, our Concentrations type was effectively a double buffer of
InOut, where each InOut contained a Vulkan descriptor set (for compute
pipeline execution purposes) and a V input buffer (for output downloading
purposes).
Now that we are doing triple buffering, however, we would need to have six such
InOut structs, with redundant V input buffers, and that starts to feel
wasteful. So instead of doing that, we will extract the descriptor set setup
logic out of InOut…
/// Set up a descriptor set that uses a particular `(U, V)` dataset as
/// inputs and another `(U, V)` dataset as outputs
fn setup_descriptor_set(
context: &Context,
layout: &PipelineLayout,
in_u: Subbuffer<[Float]>,
in_v: Subbuffer<[Float]>,
out_u: Subbuffer<[Float]>,
out_v: Subbuffer<[Float]>,
) -> Result<Arc<DescriptorSet>> {
// Determine how the descriptor set will bind to the compute pipeline
let set_layout = layout.set_layouts()[INOUT_SET as usize].clone();
// Configure what resources will attach to the various bindings
// that the descriptor set is composed of
let descriptor_writes = [
WriteDescriptorSet::buffer_array(IN, 0, [in_u, in_v]),
WriteDescriptorSet::buffer_array(OUT, 0, [out_u, out_v]),
];
// Set up the descriptor set according to the above configuration
let descriptor_set = DescriptorSet::new(
context.desc_allocator.clone(),
set_layout,
descriptor_writes,
[],
)?;
Ok(descriptor_set)
}
…and trash the rest of InOut, replacing it with a new DoubleUV struct that
is just a double buffer of descriptor sets that does not even handle buffer
allocation duties anymore:
/// Double-buffered chemical species concentration storage
///
/// This manages Vulkan descriptor sets associated with a pair of `(U, V)`
/// chemical concentrations datasets, such that...
///
/// - At any point in time, one `(U, V)` dataset serves as inputs to the active
/// GPU compute pipeline, and another `(U, V)` dataset serves as outputs.
/// - A normal simulation update takes place by calling the
/// [`update()`](Self::update) method, which invokes a user-specified callback
/// with the descriptor set associated with the current input/output
/// configuration, then flips the role of the buffers (inputs become outputs
/// and vice versa).
/// - The [`current_descriptor_set()`](Self::current_descriptor_set) method can be used to query the
/// current descriptor set _without_ flipping the input/output roles. This is
/// useful for the triple buffering logic discussed below.
struct DoubleUV {
/// If we denote `(U1, V1)` the first `(U, V)` concentration dataset and
/// `(U2, V2)` the second concentration dataset...
///
/// - The first "forward" descriptor set uses `(U1, V1)` as its inputs and
/// `(U2, V2)` as its outputs.
/// - The second "reverse" descriptor set uses `(U2, V2)` as its inputs and
///   `(U1, V1)` as its outputs.
descriptor_sets: [Arc<DescriptorSet>; 2],
/// Truth that the "reverse" descriptor set is being used
reversed: bool,
}
//
impl DoubleUV {
/// Set up a `DoubleUV`
fn new(
context: &Context,
layout: &PipelineLayout,
u1: Subbuffer<[Float]>,
v1: Subbuffer<[Float]>,
u2: Subbuffer<[Float]>,
v2: Subbuffer<[Float]>,
) -> Result<Self> {
let forward = setup_descriptor_set(
context,
layout,
u1.clone(),
v1.clone(),
u2.clone(),
v2.clone(),
)?;
let reverse = setup_descriptor_set(context, layout, u2, v2, u1, v1)?;
Ok(Self {
descriptor_sets: [forward, reverse],
reversed: false,
})
}
/// Currently selected descriptor set
fn current_descriptor_set(&self) -> Arc<DescriptorSet> {
self.descriptor_sets[self.reversed as usize].clone()
}
/// Run a simulation step and flip the input/output roles
fn update(&mut self, step: impl FnOnce(Arc<DescriptorSet>) -> Result<()>) -> Result<()> {
step(self.current_descriptor_set())?;
self.reversed = !self.reversed;
Ok(())
}
}
All the complexity of triple buffering will be handled by the top-level
Concentrations struct, which is the only one that has all the information
needed to make correct triple-buffering decisions. As the associated logic is
sophisticated, we will help future maintenance with a fair amount of doc comments:
/// Triple-buffered chemical species concentration storage
///
/// This manages three `(U, V)` chemical concentration datasets in such a way
/// that the following properties are guaranteed:
///
/// - At any point in time, a normal simulation update can be performed using
/// the [`update()`] method. This method will...
/// 1. Invoke a user-specified callback with a descriptor set that goes from
/// one of the inner `(U, V)` datasets (let's call it `(Ui, Vi)`) to
/// another dataset (let's call it `(Uj, Vj)`).
/// 2. Update the `Concentrations` state in such a way that the next time
/// [`update()`] is called, the descriptor set that will be provided will
/// use the former output dataset (called `(Uj, Vj)` above) as its input,
/// and another dataset as its output (can either be the same `(Ui, Vi)`
/// dataset used as input above, or the third `(Uk, Vk)` dataset).
/// - At any point in time, the [`lock_current_v()`](Self::lock_current_v)
/// method can be called as a preparation for downloading the current V
/// species concentration. This method will...
/// 1. Return the [`Subbuffer`] associated with the current V species
/// concentration, i.e. the buffer that the next [`update()`] call will use
/// as its V concentration input.
/// 2. Update the `Concentrations` state in such a way that subsequent calls
/// to [`update()`] will not use this buffer as their V species
/// concentration output. The other two `(U, V)` datasets will be used in a
/// double-buffered fashion instead.
///
/// The underlying logic assumes that the user will wait for the current data
/// download to complete before calling `lock_current_v()` again.
///
/// [`update()`]: Self::update
pub struct Concentrations {
/// `Vi` buffer from each of the three `(Ui, Vi)` datasets that we manage.
v_buffers: [Subbuffer<[Float]>; 3],
/// Double buffers that leverage a pair of `(U, V)` datasets, such that the
/// double buffer at index `i` within this array **does not** use the `(Ui,
/// Vi)` dataset per the `v_buffers` indexing convention.
///
/// For example, the double buffer at index 1 does not use the `(U1, V1)`
/// dataset, it alternates between the `(U0, V0)` and `(U2, V2)` datasets.
double_buffers: [DoubleUV; 3],
/// Currently selected entry of `double_buffers`, used for `update()` calls
/// after any `transitional_set` has been taken care of.
double_buffer_idx: usize,
/// Descriptor set used to perform the transition between two entries of the
/// `double_buffers` array.
///
/// Most of the time, simulation updates are fully managed by one
/// [`DoubleUV`], i.e. they keep going from one `(Ui, Vi)` dataset to
/// another `(Uj, Vj)` dataset and back, while the third `(Uk, Vk)` dataset
/// is unused. However, when [`lock_current_v()`] gets called...
///
/// - If we denote `(Ui, Vi)` the dataset currently used as input, it must
/// not be modified as long as the download scheduled by
/// [`lock_current_v()`] is ongoing. Therefore, the next simulation
/// updates must be performed using another double buffer composed of the
/// other two `(Uj, Vj)` and `(Uk, Vk)` datasets.
/// - Before this can happen, we need to take one simulation step that uses
/// `(Ui, Vi)` as input and one of `(Uj, Vj)` or `(Uk, Vk)` as output, so
/// that one of these can become the next simulation input.
///
/// `transitional_set` is the descriptor set that is used to perform the
/// simulation step described above, before we can switch back to our
/// standard double buffering logic.
///
/// [`lock_current_v()`]: Self::lock_current_v
transitional_set: Option<Arc<DescriptorSet>>,
}
The implementation of this new Concentrations struct starts with the buffer
allocation stage, which is largely unchanged compared to what
InOut::allocate_buffers() used to do. We simply allocate 6 buffers instead of
4, because now we need three U storage buffers and three V storage buffers:
impl Concentrations {
/// Allocate the set of buffers that we are going to use
fn allocate_buffers(
options: &RunnerOptions,
context: &Context,
) -> Result<[Subbuffer<[Float]>; 6]> {
use BufferUsage as BU;
let padded_rows = padded(options.num_rows);
let padded_cols = padded(options.num_cols);
let new_buffer = || {
Buffer::new_slice(
context.mem_allocator.clone(),
BufferCreateInfo {
usage: BU::STORAGE_BUFFER | BU::TRANSFER_DST | BU::TRANSFER_SRC,
..Default::default()
},
AllocationCreateInfo::default(),
(padded_rows * padded_cols) as DeviceSize,
)
};
Ok([
new_buffer()?,
new_buffer()?,
new_buffer()?,
new_buffer()?,
new_buffer()?,
new_buffer()?,
])
}
// [ ... ]
}
The new version of the create_and_schedule_init() constructor will not be that
different from its predecessor at a conceptual level. But it will have a few
more buffers to initialize, and will also need to set the stage for triple
buffering shenanigans to come.
impl Concentrations {
// [ ... ]
/// Set up GPU data storage and schedule GPU buffer initialization
///
/// GPU buffers will only be initialized after the command buffer associated
/// with `cmdbuild` has been built and submitted for execution. Any work
/// that depends on their initial value must be scheduled afterwards.
pub fn create_and_schedule_init(
options: &RunnerOptions,
context: &Context,
pipelines: &Pipelines,
cmdbuild: &mut CommandBufferBuilder,
) -> Result<Self> {
// Allocate all GPU storage buffers used by the simulation
let [u0, v0, u1, v1, u2, v2] = Self::allocate_buffers(options, context)?;
// Keep around the three V concentration buffers from these datasets
let v_buffers = [v0.clone(), v1.clone(), v2.clone()];
// Set up three (U, V) double buffers such that the double buffer at
// index i uses the 4 buffers that do not include (Ui, Vi), and the
// "forward" direction of each double buffer goes from the dataset of
// lower index to the dataset of higher index.
let layout = &pipelines.layout;
let double_buffers = [
DoubleUV::new(
context,
layout,
u1.clone(),
v1.clone(),
u2.clone(),
v2.clone(),
)?,
DoubleUV::new(
context,
layout,
u0.clone(),
v0.clone(),
u2.clone(),
v2.clone(),
)?,
DoubleUV::new(context, layout, u0.clone(), v0.clone(), u1, v1)?,
];
// Schedule the initialization of the first simulation input
//
// - We will initially work with the double buffer at index 0, which is
// composed of datasets (U1, V1) and (U2, V2).
// - We will initially access it in order, which means that the (U1, V1)
// dataset will be the first simulation input.
// - Due to the above, to initialize this first simulation input, we
// need to bind a descriptor set that uses (U1, V1) as its output.
// - This is true of the reverse descriptor set within the first double
// buffer, in which (U2, V2) is the input and (U1, V1) is the output.
let double_buffer_idx = 0;
cmdbuild.bind_pipeline_compute(pipelines.init.clone())?;
cmdbuild.bind_descriptor_sets(
PipelineBindPoint::Compute,
pipelines.layout.clone(),
INOUT_SET,
double_buffers[double_buffer_idx].descriptor_sets[1].clone(),
)?;
let num_workgroups = |domain_size: usize, workgroup_size: NonZeroU32| {
padded(domain_size).div_ceil(workgroup_size.get() as usize) as u32
};
let padded_workgroups = [
num_workgroups(options.num_cols, options.pipeline.workgroup_cols),
num_workgroups(options.num_rows, options.pipeline.workgroup_rows),
1,
];
// SAFETY: GPU shader has been checked for absence of undefined behavior
// given a correct execution configuration, and this is one
unsafe {
cmdbuild.dispatch(padded_workgroups)?;
}
// Any dataset other than the (U1, V1) initial input must have its edges
// initialized to zeros.
//
// Only the edges need to be initialized. The values at the center of
// the dataset do not matter, as these buffers will serve as simulation
// outputs at least once (which will initialize their central values)
// before they serve as a simulation input.
//
// Here we initialize the entire buffer to zero, as the Vulkan
// implementation is likely to special-case this buffer-zeroing
// operation with a high-performance implementation.
cmdbuild.fill_buffer(u0.reinterpret(), 0)?;
cmdbuild.fill_buffer(v0.reinterpret(), 0)?;
cmdbuild.fill_buffer(u2.reinterpret(), 0)?;
cmdbuild.fill_buffer(v2.reinterpret(), 0)?;
// Once the command buffer is executed, everything will be ready
Ok(Self {
v_buffers,
double_buffers,
double_buffer_idx,
transitional_set: None,
})
}
// [ ... ]
}
For reasons that will become clear later on, we will drop the idea of having a
current_inout() function. Instead, we will only have…
- An update() method that is used when performing simulation steps, as before.
- A new lock_current_v() method that is used when scheduling a V concentration download.
We will describe the latter first, as its internal logic will motivate the
existence of the new transitional_set data member of the Concentrations
struct, which the update() method will later need to use in an appropriate
manner.
As mentioned above, lock_current_v() should be called after scheduling a set
of simulation steps, in order to lock the current (U, V) dataset for the
duration of the final GPU-to-CPU download. This will prevent the associated V
buffer from being used as a simulation output for the duration of the download,
and provide the associated Subbuffer to the caller so that it can initiate the
download:
impl Concentrations {
// [ ... ]
/// Lock the current V species concentrations for a GPU-to-CPU download
fn lock_current_v(&mut self) -> Subbuffer<[Float]> {
// [ ... ]
}
// [ ... ]
}
The implementation of lock_current_v() starts with a special case, which we
will not discuss yet, that handles the scenario where two GPU-to-CPU downloads
are initiated in quick succession without any simulation step in between. After
this comes the code for the general case, where some simulation steps have
occurred since the previous GPU-to-CPU download.
First of all, we use the index of the current double buffer and its reversed
flag in order to tell what the indices of the current input and output buffers are,
using the indexing convention introduced by the
Concentrations::create_and_schedule_init() constructor:
// [ ... handle special case of two consecutive downloads ... ]
let initial_double_buffer_idx = self.double_buffer_idx;
let (mut initial_input_idx, mut initial_output_idx) = match initial_double_buffer_idx {
0 => (1, 2),
1 => (0, 2),
2 => (0, 1),
_ => unreachable!("there are only three double buffers"),
};
let initial_double_buffer = &self.double_buffers[initial_double_buffer_idx];
if initial_double_buffer.reversed {
std::mem::swap(&mut initial_input_idx, &mut initial_output_idx);
}
From this, we trivially deduce which buffer within the
Concentrations::v_buffers array is our current V input buffer. We keep a copy
of it, which will be the return value of lock_current_v():
let input_v = self.v_buffers[initial_input_idx].clone();
The remainder of lock_current_v()’s implementation will then be concerned
with ensuring that this V buffer is not used as a simulation output again as
long as a GPU-to-CPU download is in progress. The indexing convention of the
Concentrations::double_buffers array has been chosen such that this is just a
matter of switching to the double-buffer at index initial_input_idx…
let next_double_buffer_idx = initial_input_idx;
…but before we start using this double buffer, we must first perform a
simulation step that takes the simulation input from our current input buffer
(which is not part of the new double buffer) to another buffer (which will be
part of the new double buffer, since there are only three (U, V) datasets).
For example, assuming we are initially using the first double buffer (whose
members are datasets (U1, V1) and (U2, V2)) and our current input buffer is
(U1, V1), we will want to move to the second double buffer (whose members are
datasets (U0, V0) and (U2, V2)) by performing one simulation step that goes
from (U1, V1) to (U0, V0) or (U2, V2).
It should be apparent that the active descriptor set within the current double
buffer (which, in the above example, performs a simulation step from dataset
(U1, V1) to dataset (U2, V2)) will always go to one of the datasets within
the new double buffer, so we can use it to perform the transition…
self.transitional_set = Some(initial_double_buffer.current_descriptor_set().clone());
…but we must then configure the new double buffer’s reversed flag so that
the first simulation step that uses it goes in the right direction. In the above
example, after the transitional step that uses (U1, V1) as input and (U2, V2) as output, we want the next simulation step to go from (U2, V2) to
(U0, V0), and not from (U0, V0) to (U2, V2).
This can be done by figuring out what the indices of the input and output datasets will be in the first step that uses the new double buffer…
let next_input_idx = initial_output_idx;
let next_output_idx = match (next_double_buffer_idx, next_input_idx) {
(0, 1) => 2,
(0, 2) => 1,
(1, 0) => 2,
(1, 2) => 0,
(2, 0) => 1,
(2, 1) => 0,
_ => {
unreachable!(
"there are only three double buffers, each of which has \
one input and one output dataset whose indices differ from \
the double buffer's index"
)
}
};
…and setting up the reversed flag of the new double buffer accordingly:
self.double_buffer_idx = next_double_buffer_idx;
self.double_buffers[next_double_buffer_idx].reversed = next_output_idx < next_input_idx;
Once the Concentrations is set up in this way, we can return the input_v
that we previously recorded to the caller of lock_current_v(). And this is the
end of the implementation of the lock_current_v() function.
The update() method’s implementation is then written in such a way that any
pending transitional_set step is taken first, before using the current double buffer, to
make sure that any double buffer transition scheduled by lock_current_v()
is performed before the new double buffer is used…
impl Concentrations {
// [ ... ]
/// Run a simulation step
///
/// The `step` callback will be provided with the descriptor set that should
/// be used for the next simulation step. If you need to carry out multiple
/// simulation steps, you should call `update()` once per simulation step.
pub fn update(&mut self, step: impl FnOnce(Arc<DescriptorSet>) -> Result<()>) -> Result<()> {
if let Some(transitional_set) = self.transitional_set.take() {
// If we need to transition from a freshly locked (U, V) dataset to
// another double buffer, do so...
step(transitional_set)
} else {
// ...otherwise, keep using the current double buffer
self.double_buffers[self.double_buffer_idx].update(step)
}
}
}
…and because this uses Option::take(), there is a guarantee that the
transitional simulation step will only be carried out once. After that,
Concentrations will resume using a simple double-buffering logic as it did before the current V buffer was locked.
There is just one edge case to take care of. If two GPU-to-CPU downloads are
started in quick succession, with no simulation step in between, then all the
buffer-locking work has already been carried out by the lock_current_v() call
that initiated the first GPU-to-CPU download.
As a result, we do not need to do anything to lock the current input buffer
(which has been done already), and can simply figure out the input_v buffer
associated with the current V concentration input and return it right away.
That’s what the edge case handling at the beginning of lock_current_v() does,
completing the implementation of this function:
impl Concentrations {
// [ ... ]
/// Lock the current V species concentrations for a GPU-to-CPU download
fn lock_current_v(&mut self) -> Subbuffer<[Float]> {
// If no update has been carried out since the last lock_current_v()
// call, then we do not need to change anything to the current
// Concentrations state, and can simply return the current V input
// buffer again. Said input buffer is easily identified as the one that
// future simulation updates will be avoiding.
if self.transitional_set.is_some() {
return self.v_buffers[self.double_buffer_idx].clone();
}
// Otherwise, determine the index of the input and output buffer of the
// currently selected descriptor set
let initial_double_buffer_idx = self.double_buffer_idx;
let (mut initial_input_idx, mut initial_output_idx) = match initial_double_buffer_idx {
0 => (1, 2),
1 => (0, 2),
2 => (0, 1),
_ => unreachable!("there are only three double buffers"),
};
let initial_double_buffer = &self.double_buffers[initial_double_buffer_idx];
if initial_double_buffer.reversed {
std::mem::swap(&mut initial_input_idx, &mut initial_output_idx);
}
// The initial simulation input is going to be downloaded...
let input_v = self.v_buffers[initial_input_idx].clone();
// ...and therefore we must refrain from using it as a simulation output
// in the future, which we do by switching to the double buffer that
// does not use this dataset (by definition of self.double_buffers)
let next_double_buffer_idx = initial_input_idx;
// To perform this transition correctly, we must first perform one
// simulation step that uses the current dataset as input and another
// dataset as output. The current descriptor set within the double
// buffer that we used before does take us from our initial (Ui, Vi)
// input dataset to another (Uj, Vj) output dataset, and is therefore
// appropriate for this purpose.
self.transitional_set = Some(initial_double_buffer.current_descriptor_set().clone());
// After this step, we will land into a (Uj, Vj) dataset that is part of
// our new double buffer, but we do not know if it is the first or the
// second dataset within this double buffer. And we need to know that in
// order to set the "reversed" flag of the double buffer correctly.
//
// We already know what is the index of the dataset that serves as a
// transitional output, and from that and the index of our new double
// buffer, we can tell what is the index of the third dataset, i.e. the
// other dataset within our new double buffer...
let next_input_idx = initial_output_idx;
let next_output_idx = match (next_double_buffer_idx, next_input_idx) {
(0, 1) => 2,
(0, 2) => 1,
(1, 0) => 2,
(1, 2) => 0,
(2, 0) => 1,
(2, 1) => 0,
_ => {
unreachable!(
"there are only three double buffers, each of which has \
one input and one output dataset whose indices differ from \
the double buffer's index"
)
}
};
// ...and given that, we can easily tell if the first step within our
// next double buffer, after the transitional step, will go in the
// forward or reverse direction.
self.double_buffer_idx = next_double_buffer_idx;
self.double_buffers[next_double_buffer_idx].reversed = next_output_idx < next_input_idx;
input_v
}
// [ ... ]
}
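This index juggling is easy to get wrong, so it can be worth checking the state machine in isolation before wiring it to Vulkan. The following standalone sketch is a hypothetical model, not part of the course repository: it mirrors double_buffer_idx, the reversed flags and the transitional step with plain integers, then checks two invariants of the scheme described above: each step's input is the previous step's output, and a dataset that lock_current_v() just returned is never used as an output before the next lock_current_v() call.
// Standalone model of the triple-buffering state machine (datasets are 0, 1, 2)
#[derive(Default)]
struct TripleBufferModel {
    /// Index of the double buffer in use; double buffer `i` excludes dataset `i`
    double_buffer_idx: usize,
    /// `reversed` flag of each double buffer (forward = lower to higher index)
    reversed: [bool; 3],
    /// Pending transitional step, as an (input, output) dataset pair
    transitional: Option<(usize, usize)>,
}

impl TripleBufferModel {
    /// Datasets of double buffer `i`, in (forward input, forward output) order
    fn pair(i: usize) -> (usize, usize) {
        match i {
            0 => (1, 2),
            1 => (0, 2),
            2 => (0, 1),
            _ => unreachable!("there are only three double buffers"),
        }
    }

    /// Current (input, output) datasets of double buffer `i`
    fn in_out(&self, i: usize) -> (usize, usize) {
        let (lo, hi) = Self::pair(i);
        if self.reversed[i] { (hi, lo) } else { (lo, hi) }
    }

    /// Model of `update()`: returns the (input, output) datasets of the step
    fn update(&mut self) -> (usize, usize) {
        if let Some(step) = self.transitional.take() {
            step
        } else {
            let step = self.in_out(self.double_buffer_idx);
            self.reversed[self.double_buffer_idx] ^= true;
            step
        }
    }

    /// Model of `lock_current_v()`: returns the locked dataset index
    fn lock_current_v(&mut self) -> usize {
        if self.transitional.is_some() {
            return self.double_buffer_idx;
        }
        let (input, output) = self.in_out(self.double_buffer_idx);
        self.transitional = Some((input, output));
        let (next_db, next_input) = (input, output);
        let next_output = 3 - next_db - next_input;
        self.reversed[next_db] = next_output < next_input;
        self.double_buffer_idx = next_db;
        input
    }
}

fn main() {
    for updates_between_locks in 0..4 {
        // The default state matches create_and_schedule_init(): double buffer 0,
        // forward direction, first simulation input is dataset 1
        let mut model = TripleBufferModel::default();
        let (mut locked, mut last_output) = (None, None);
        for _ in 0..10 {
            for _ in 0..updates_between_locks {
                let (input, output) = model.update();
                // Each step must consume the previous step's output...
                if let Some(last) = last_output {
                    assert_eq!(input, last, "broken simulation data flow");
                }
                // ...and must never overwrite the dataset being downloaded
                assert_ne!(Some(output), locked, "locked dataset was overwritten");
                last_output = Some(output);
            }
            locked = Some(model.lock_current_v());
        }
        println!("invariants hold with {updates_between_locks} update(s) between downloads");
    }
}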
VBuffers changes
To be able to use the new version of Concentrations, a couple of changes to
the VBuffers::schedule_download_and_flip() method are required:
- This method must now receive &mut Concentrations, not just &Concentrations, as it now needs to modify the Concentrations triple buffering configuration using lock_current_v(), instead of simply querying its current input buffer for the V concentration.
- In exchange, the output of lock_current_v() tells it what the current V input buffer is, which is what it needs in order to initiate the GPU-to-CPU download.
Overall, the new implementation looks like this:
pub fn schedule_download_and_flip(
&mut self,
source: &mut Concentrations,
cmdbuild: &mut CommandBufferBuilder,
) -> Result<()> {
cmdbuild.copy_buffer(CopyBufferInfo::buffers(
source.lock_current_v(),
self.current_buffer().clone(),
))?;
self.current_is_1 = !self.current_is_1;
Ok(())
}
Simulation runner changes
Now that our data management infrastructure is ready for concurrent computations and GPU-to-CPU downloads, the last part of the work is to update our top-level simulation scheduling logic so that it actually uses a dedicated data transfer queue (if available) to run simulation steps and GPU-to-CPU data transfers concurrently.
To do this, we must first generalize our command buffer builder constructor so that it can work with any queue, not just the compute queue.
fn command_buffer_builder(context: &Context, queue: &Queue) -> Result<CommandBufferBuilder> {
let cmdbuild = CommandBufferBuilder::primary(
context.comm_allocator.clone(),
queue.queue_family_index(),
CommandBufferUsage::OneTimeSubmit,
)?;
Ok(cmdbuild)
}
Then we modify the definition of SimulationRunner to clarify that its internal
command buffer builder is destined to host compute queue operations only…
/// State of the simulation
struct SimulationRunner<'run_simulation, ProcessV> {
// [ ... members other than "cmdbuild" are unchanged ... ]
/// Next command buffer to be executed on the compute queue
compute_cmdbuild: CommandBufferBuilder,
}
…which will lead to some mechanical renamings inside of the SimulationRunner
constructor:
impl<'run_simulation, ProcessV> SimulationRunner<'run_simulation, ProcessV>
where
ProcessV: FnMut(ArrayView2<Float>) -> Result<()>,
{
/// Set up the simulation
fn new(
options: &'run_simulation RunnerOptions,
context: &'run_simulation Context,
process_v: Option<ProcessV>,
) -> Result<Self> {
// Set up the compute pipelines
let pipelines = Pipelines::new(options, context)?;
// Set up the initial command buffer builder
let mut compute_cmdbuild = command_buffer_builder(context, &context.compute_queue)?;
// Set up chemical concentrations storage and schedule its initialization
let concentrations = Concentrations::create_and_schedule_init(
options,
context,
&pipelines,
&mut compute_cmdbuild,
)?;
// Set up the logic for post-processing V concentration, if enabled
let output_handler = if let Some(process_v) = process_v {
Some(OutputHandler {
v_buffers: VBuffers::new(options, context)?,
process_v,
})
} else {
None
};
// We're now ready to perform simulation steps
Ok(Self {
options,
context,
pipelines,
concentrations,
output_handler,
compute_cmdbuild,
})
}
// [ ... ]
}
We then extract some of the code of the schedule_next_output() method into a
new submit_compute() utility method which is in charge of scheduling execution
of this command buffer builder on the compute queue…
impl<'run_simulation, ProcessV> SimulationRunner<'run_simulation, ProcessV>
where
ProcessV: FnMut(ArrayView2<Float>) -> Result<()>,
{
// [ ... ]
/// Submit the internal compute command buffer to the compute queue
fn submit_compute(&mut self) -> Result<CommandBufferExecFuture<impl GpuFuture + 'static>> {
// Extract the old command buffer builder, replacing it with a blank one
let old_cmdbuild = std::mem::replace(
&mut self.compute_cmdbuild,
command_buffer_builder(self.context, &self.context.compute_queue)?,
);
// Submit the resulting compute commands to the compute queue
let future = old_cmdbuild
.build()?
.execute(self.context.compute_queue.clone())?;
Ok(future)
}
// [ ... ]
}
…and we use that to implement a new version of schedule_next_output() that
handles all possible configurations with minimal code duplication:
- User requested GPU-to-CPU downloads and…
- …a dedicated transfer queue is available.
- …data transfers should run on the same compute queue as other commands.
- No GPU-to-CPU downloads were requested, only simulation steps should be scheduled.
In terms of code, it looks like this:
impl<'run_simulation, ProcessV> SimulationRunner<'run_simulation, ProcessV>
where
ProcessV: FnMut(ArrayView2<Float>) -> Result<()>,
{
// [ ... ]
/// Submit a GPU job that will produce the next simulation output
fn schedule_next_output(&mut self) -> Result<FenceSignalFuture<impl GpuFuture + 'static>> {
// Schedule a number of simulation steps
schedule_simulation(
self.options,
&self.pipelines,
&mut self.concentrations,
&mut self.compute_cmdbuild,
)?;
// If the user requested that the output be downloaded...
let future = if let Some(handler) = &mut self.output_handler {
// ...then we must add the associated data transfer command to a
// command buffer, that will later be submitted to a queue
let mut schedule_transfer = |cmdbuild: &mut CommandBufferBuilder| {
handler
.v_buffers
.schedule_download_and_flip(&mut self.concentrations, cmdbuild)
};
// If we have access to a dedicated DMA queue...
if let Some(transfer_queue) = &self.context.transfer_queue {
// ...then build a dedicated data transfer command buffer.
let mut transfer_cmdbuild = command_buffer_builder(self.context, transfer_queue)?;
schedule_transfer(&mut transfer_cmdbuild)?;
let transfer_cmdbuf = transfer_cmdbuild.build()?;
// Schedule the compute commands to execute on the compute
// queue, then signal a semaphore, which will finally start the
// execution of the data transfer on the DMA queue
self.submit_compute()?
.then_signal_semaphore()
.then_execute(transfer_queue.clone(), transfer_cmdbuf)?
.boxed()
} else {
// If there is no dedicated DMA queue, make the data transfer
// execute on the compute queue after the simulation steps.
schedule_transfer(&mut self.compute_cmdbuild)?;
self.submit_compute()?.boxed()
}
} else {
// If there is no data transfer, then submit the compute commands
// to the compute queue without any extra.
self.submit_compute()?.boxed()
};
// Ask for a fence to be signaled once everything is done
Ok(future.then_signal_fence_and_flush()?)
}
// [ ... ]
}
…and that’s it. No changes to SimulationRunner::process_output() or the top-level run_simulation() entry point are necessary.
One thing worth pointing out above is the use of the boxed() adapter method from the GpuFuture trait of vulkano. This method will turn our concrete GpuFuture implementations (which are arbitrarily complex type-level expressions) into a single dynamically dispatched and heap-allocated Box<dyn GpuFuture> type, so that all branches of our complex conditional logic return an object of the same type, which is required in a statically typed programming language like Rust.
If the compiler does not manage to optimize this out, it will result in some slight runtime overhead along the lines of that experienced when using class-based polymorphism in object-oriented languages like C++ and Java. But by the usual argument that such overheads should be rather small compared to those of calling a complex GPU API like Vulkan, we will ignore them for now, until profiling of our CPU usage eventually tells us to do otherwise.
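To illustrate the underlying Rust pattern outside of any GPU context, here is a minimal standalone sketch (plain Rust, nothing from vulkano) of how boxing lets two branches that produce different concrete types return a single common type:
use std::fmt::Display;

// Without boxing, this if expression would not compile, because its two
// branches have different concrete types (u32 and &'static str). Boxing
// erases the concrete type behind a dynamically dispatched Box<dyn Display>,
// much like boxed() erases our concrete GpuFuture types behind a
// Box<dyn GpuFuture>.
fn pick(first: bool) -> Box<dyn Display> {
    if first {
        Box::new(42_u32)
    } else {
        Box::new("hello")
    }
}

fn main() {
    // Both calls yield the same type, at the cost of one heap allocation and
    // dynamic dispatch per value
    println!("{} {}", pick(true), pick(false));
}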
Exercise
Implement the above changes in your Gray-Scott simulation implementation, and benchmark the effect on performance.
You should expect a mixed bag of performance improvements and regressions, depending on which hardware you are running on (availability/performance of dedicated DMA queues) and how much your simulation configuration is bottlenecked by GPU-to-CPU data transfers.
- It should be noted that by allowing tasks on a command queue to overlap, the Vulkan specification technically allows implementations to perform this optimization automatically. But at the time of writing, in the author’s experience, popular Vulkan implementations do not reliably perform this optimization. So we still need to give the hardware some manual pipelining help in order to get portable performance across all Vulkan-supported systems. ↩
Shared memory
In the previous chapters, we have been implementing code optimizations that make more parts of our simulation execute in parallel through a mixture of asynchronous execution and pipelining.
As a result, we went from a rather complex situation where our simulation speed was limited by various hardware performance characteristics depending on the configuration in which we executed it, to a simpler situation where our simulation is more and more often bottlenecked by the raw speed at which we perform simulation steps.
This is shown by the fact that in our reference configuration, where our simulation domain contains about 2 billion pixels and we perform 32 simulation steps per generated image, the simulation speed that is measured by our microbenchmark does not change that much anymore as we go from a pure GPU compute scenario to a scenario where we additionally download data from the GPU to the CPU and post-process it on the CPU:
run_simulation/workgroup32x16/domain2048x1024/total512/image32/compute
time: [92.885 ms 93.815 ms 94.379 ms]
thrpt: [11.377 Gelem/s 11.445 Gelem/s 11.560 Gelem/s]
run_simulation/workgroup32x16/domain2048x1024/total512/image32/compute+download
time: [107.79 ms 110.13 ms 112.11 ms]
thrpt: [9.5777 Gelem/s 9.7497 Gelem/s 9.9610 Gelem/s]
run_simulation/workgroup32x16/domain2048x1024/total512/image32/compute+download+sum
time: [108.12 ms 109.96 ms 111.59 ms]
thrpt: [9.6220 Gelem/s 9.7653 Gelem/s 9.9306 Gelem/s]
To go faster, we will therefore need to experiment with ways to make our simulation steps faster. Which is what the next optimization chapters of this course will focus on.
Memory access woes
Most optimizations of number-crunching code with a simple underlying logic, like our Gray-Scott reaction simulation, tend to fall into three categories:
- Exposing enough concurrent tasks to saturate hardware parallelism (SIMD, multicore…)
- Breaking up dependency chains or using hardware multi-threading to avoid latency issues
- Improving data access patterns to make the most of the cache/memory hierarchy
From this perspective, GPUs lure programmers into the pit of success compared to CPUs, by virtue of having a programming model that encourages the use of many concurrent tasks and providing a very high degree of hardware multi-threading (like hyperthreading on x86 CPUs) to reduce the need for manual latency reduction optimizations. This is why there will be fewer discussions of compute throughput and latency hiding optimizations in this GPU course: one advantage of GPU programming models is that our starting point is already pretty good in these respects (though it can still be improved in a few ways, as we will see).
But all this hardware multi-threading comes at the expense of putting an enormous amount of pressure on the cache/memory hierarchy, whose management requires extra care on GPUs compared to CPUs. In the case of our Gray-Scott reaction simulation, this is particularly true because we only perform a few floating-point operations per data point that we load from memory:
- The initial diffusion gradient computation has 18 inputs (a 3x3 grid of (U, V) pairs). For each input it loads, it only performs two floating-point computations (one multiplication and one addition), which on modern hardware can be combined into a single FMA instruction.
- The rest of the computation only contains 13 extra arithmetic operations, many of which come in (multiplication, addition) pairs that could also be combined into FMA instructions (but as seen in the CPU course, doing so may not help performance due to latency issues).
- And at the end, we have two values that we need to store back to memory.
So overall, we are talking about 49 basic floating-point operations for 20 memory operations, and the number of floating-point operations can be cut by up to half if FMA is used. For all modern compute hardware, this falls squarely into the realm of memory-bound execution.
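As a quick sanity check of this claim, we can compute the resulting arithmetic intensity. The numbers come straight from the operation counts above, plus one assumption that is not spelled out there: concentration values are 32-bit floats, i.e. 4 bytes each.
fn main() {
    // Per-output-point operation counts from the discussion above
    let flops = 49.0_f64; // down to ~25 if every (mul, add) pair becomes an FMA
    let memory_ops = 20.0_f64; // 18 input loads + 2 output stores
    let bytes_per_op = 4.0_f64; // assumption: 32-bit floating-point data
    let intensity = flops / (memory_ops * bytes_per_op);
    // Prints an arithmetic intensity of ~0.61 FLOP/byte, far below the several
    // FLOP per byte of memory traffic that modern hardware needs in order to
    // keep its floating-point units busy
    println!("arithmetic intensity ≈ {intensity:.2} FLOP/byte");
}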
For example, on the AMD RDNA GPU architecture used by the author’s RX 5600M GPU, even in the absolute best-case scenario where most memory loads are serviced by the L0 cache (which is tiny at 16 kB shared between up to 1280 concurrent tasks, so getting there already requires significant work), that cache’s throughput is already only 1/4 of the FMA execution throughput.
Because of this, we will prioritize memory access optimizations before attempting any other optimization, starting with the optimization tool that has been a standard part of Vulkan since version 1.0: shared memory.
Shared memory potential
Recall that the step compute pipeline, in which we spend most of our time,
currently works by assigning one GPU work-item to one output data point, as
illustrated by the diagram below:

This mapping from work-items to computations has one advantage, which is that GPU work-items work independently of each other. But it has one drawback, which is that work-items perform redundant input loading work. Indeed, most of the inputs that a particular work-item loads (blue region) serve as the central data point (red square) of some other work-item, which loads them again for its own computation.
If we are lucky, the GPU’s cache hierarchy will notice this repeating load pattern and optimize it by turning these slow video RAM accesses into fast cache accesses. But because work-item execution is unordered, there is no guarantee that such caching will happen with good efficiency, especially when one considers that GPU caches are barely larger than their CPU counterparts, while being shared between many more concurrent tasks.
And because GPUs are as memory bandwidth challenged as CPUs, any optimization that has a chance to reduce the amount of memory accesses that are not served by fast local caches and must go all the way to VRAM is worth trying out.
On GPUs, the preferred mechanism for this is shared memory, which lets you reserve a fraction of a compute unit’s cache for the purpose of manually managing it in whatever way you fancy. And in this chapter we will leverage this feature in order to get a stronger guarantee that most input data loads that a workgroup performs go through cache, not VRAM.
Leveraging shared memory
Because GPU shared memory is local to a certain workgroup and GPUs are bad at conditional and irregular logic, the best way to leverage shared memory here is to rethink how we map our simulation work to the GPU’s work-items.
Instead of associating each GPU work-item with one simulation output data point, we are now going to set up workgroups such that each work-item is associated with one input data point:

On the diagram above, a GPU workgroup now maps onto the entire colored region of the simulation domain, red and blue surfaces included.
As this workgroup begins to execute, each of its work-items will proceed to load
the (U, V) data point associated with its current location from VRAM, then
save it to a shared memory location that every other work-item in the workgroup
has access to.
A workgroup barrier will then be used to wait for all work-items to be done. And after this, the work-items within the blue region will be done and exit the compute shader early. Meanwhile, work-items within the red region will proceed to load all their inputs from shared memory, then use these to compute the updated data point associated with their designated input location as before.
If we zoom out from this single workgroup view and try to get a picture of how the entire simulation domain maps into workgroups and work-items, it will look like this:

On the diagram above, each workgroup aims to produce a tile of outputs within the simulation domain (red square), and to this end it will first load a larger set of input values (blue region) into shared memory. Much like in our initial algorithm, workgroups on the right and bottom edges of the simulation domain will extend beyond the end of it (dashed lines) and will need special care to avoid out-of-bounds memory reads and writes.
This picture lets you see that our new use of shared memory does not only come with advantages, but also has some drawbacks:
- The input regions of our work-groups (blue squares) overlap with each other (darker blue regions). This will result in some redundant input loads from VRAM.
- The outermost work-items of our work-groups exit early after loading simulation inputs. This will result in reduced execution efficiency as only a subset of each workgroup will participate in the rest of the computation.
Thankfully, this waste of memory and execution resources only concerns the outermost work-items of each work-group. Therefore, we expect that…
- Workgroups with a larger number of work-items should execute more efficiently, as they have more work-items in the central region and fewer in the edge region.
- Workgroups with a square-ish aspect ratio should execute more efficiently, as a square shape is mathematically optimal from the perspective of putting more work-items at the center of the workgroup and fewer on the edges.
But due to the conflicting constraints of GPU hardware (mapping of work-items to SIMD units, memory access granularity, maximal number of resources per workgroup, generation-dependent ability of NVidia hardware to reallocate idle work-items to other workgroups…), the reality is going to be more nuanced than “large square workgroups are better”, and we expect a certain hardware-specific workgroup size to be optimal. It is thus best to keep workgroup size as a tuning parameter that can be adjusted through empirical microbenchmarking, as we have done so far.
First implementation
Before we begin to use shared memory, we must first allocate it. The easiest and most performant way to do so, when available, is to allocate it at GLSL compile time. We can do this here by naming the specialization constants that control our workgroup size…
// In exercises/src/grayscott/common.comp
// This code looks stupid, but alas the rules of GLSL do not allow anything else
layout(local_size_x = 8, local_size_y = 8) in;
layout(local_size_x_id = 0, local_size_y_id = 1) in;
layout(constant_id = 0) const uint WORKGROUP_COLS = 8;
layout(constant_id = 1) const uint WORKGROUP_ROWS = 8;
const uvec2 WORKGROUP_SIZE = uvec2(WORKGROUP_COLS, WORKGROUP_ROWS);
…and allocating a matching amount of shared memory in which we will later
store (U, V) values:
// In exercises/src/grayscott/step.comp
// Shared memory for exchanging (U, V) pairs between work-items
shared vec2 uv_cache[WORKGROUP_COLS][WORKGROUP_ROWS];
To make access to this shared memory convenient, we will then add a pair of
functions. One lets a work-item store its (U, V) input at its designated
location. The other lets a work item load data from shared memory at a certain
relative offset with respect to its designated location.
// Save the (U, V) value that this work-item loaded so that the rest of the
// workgroup may later access it efficiently
void set_uv_cache(vec2 uv) {
uv_cache[gl_LocalInvocationID.x][gl_LocalInvocationID.y] = uv;
}
// Get an (U, V) value that was previously saved by a neighboring work-item
// within this workgroup. Remember to use a barrier() first!
vec2 get_uv_cache(ivec2 offset) {
const ivec2 shared_pos = ivec2(gl_LocalInvocationID.xy) + offset.xy;
return uv_cache[shared_pos.x][shared_pos.y];
}
Now that this is done, we must rethink our mapping from GPU work-items to
positions within the simulation domain, in order to acknowledge the fact that
our work-groups overlap with each other. This concern will be taken care of by
another new work_item_pos() GLSL function. And while we are at it, we will
also add another is_workgroup_output() function, which tells if our work-item
is responsible for producing an output data point.
// Size of the "border" at the edge of the work-group, where work-items
// only read inputs and do not write any outputs
const uint BORDER_SIZE = 1;
// Size of the output region of a work-group, accounting for the input border
const uvec2 WORKGROUP_OUTPUT_SIZE = WORKGROUP_SIZE - 2 * uvec2(BORDER_SIZE);
// Truth that this work-item is within the output region of the workgroup
//
// Note that this is a necessary condition for emitting output data, but not a
// sufficient one. work_item_pos() must also fall inside of the output region
// of the simulation dataset, before data_end_pos().
bool is_workgroup_output() {
const uvec2 OUTPUT_START = uvec2(BORDER_SIZE);
const uvec2 OUTPUT_END = OUTPUT_START + WORKGROUP_OUTPUT_SIZE;
const uvec2 item_id = gl_LocalInvocationID.xy;
return all(greaterThanEqual(item_id, OUTPUT_START))
&& all(lessThan(item_id, OUTPUT_END));
}
// Position of the simulation dataset this work-item should read data from,
// and potentially write data to if it is in the central region of the workgroup
uvec2 work_item_pos() {
const uvec2 workgroup_topleft = gl_WorkGroupID.xy * WORKGROUP_OUTPUT_SIZE;
return workgroup_topleft + gl_LocalInvocationID.xy;
}
Given this, we are now ready to rewrite the beginning of our compute shader entry point, where inputs are loaded from VRAM. As before, we begin by finding out which position of our simulation domain our work-item maps into.
void main() {
// Map work-items into a position within the simulation domain
const uvec2 pos = work_item_pos();
// [ ... ]
But after that, we are not allowed to immediately discard work-items which fall
out of the simulation domain. Indeed, the barrier() GLSL function, which is
used to wait until all work-items are done writing to shared memory, mandates
that all work-items within the workgroup be still executing.
We must therefore keep around out of bounds work-items for now, using conditional logic to ensure that they do not perform out-of-bounds memory accesses.
// Load and share our designated (U, V) value, if any
vec2 uv = vec2(0.0);
if (all(lessThan(pos, padded_end_pos()))) {
uv = read(pos);
set_uv_cache(uv);
}
After this, we can wait for the workgroup to finish initializing its shared memory cache…
// Wait for the shared (U, V) cache to be ready
barrier();
…and it is only then that we will be able to discard out-of-bounds GPU work-items:
// Discard work-items that are out of bounds for output production work
if (!is_workgroup_output() || any(greaterThanEqual(pos, data_end_pos()))) {
return;
}
Finally, we can adapt our diffusion gradient computation to use our new shared memory cache…
// Compute the diffusion gradient for U and V
const mat3 weights = stencil_weights();
vec2 full_uv = vec2(0.0);
for (int rel_y = -1; rel_y <= 1; ++rel_y) {
for (int rel_x = -1; rel_x <= 1; ++rel_x) {
const vec2 stencil_uv = get_uv_cache(ivec2(rel_x, rel_y));
const float weight = weights[rel_x + 1][rel_y + 1];
full_uv += weight * (stencil_uv - uv);
}
}
…and the rest of the step compute shader GLSL code will not change.
The CPU side of the computation will not change much, but it does need two adjustments: one on the user input validation side and one on the simulation scheduling side.
Our new simulation logic, where workgroups have input-loading edges and an output-producing center, is incompatible with workgroups that contain fewer than 3 work-items along either dimension. We should therefore error out when the user requests such a workgroup configuration, and we can do so by adding the following check at the beginning of the Pipelines::new() constructor:
// In exercises/src/grayscott/pipeline.rs
// Error out when an incompatible workgroup size is specified
let pipeline = &options.pipeline;
let check_side = |side: NonZeroU32| side.get() >= 3;
assert!(
check_side(pipeline.workgroup_cols) && check_side(pipeline.workgroup_rows),
"GPU workgroups must have at least three work-items in each dimension"
);
The schedule_simulation() function, which schedules the execution of the step compute pipeline, also needs to change, because our mapping from workgroups to simulation domain elements has changed. This is done by adjusting the initial simulate_workgroups computation so that it subtracts 2 from each workgroup dimension, taking the input-loading edges out of the equation:
let dispatch_size = |domain_size: usize, workgroup_size: NonZeroU32| {
domain_size.div_ceil(workgroup_size.get() as usize - 2) as u32
};
let simulate_workgroups = [
dispatch_size(options.num_cols, options.pipeline.workgroup_cols),
dispatch_size(options.num_rows, options.pipeline.workgroup_rows),
1,
];
First benchmark
Unfortunately, this first implementation of the shared memory optimization does not perform as well as we would hope. Here is its effect on performance on the author’s AMD Radeon 5600M GPU:
run_simulation/workgroup8x8/domain2048x1024/total512/image32/compute
time: [122.48 ms 122.61 ms 122.82 ms]
thrpt: [8.7421 Gelem/s 8.7572 Gelem/s 8.7670 Gelem/s]
change:
time: [-5.2863% -3.7811% -2.3117%] (p = 0.00 < 0.05)
thrpt: [+2.3664% +3.9297% +5.5814%]
Performance has improved.
run_simulation/workgroup16x8/domain2048x1024/total512/image32/compute
time: [124.89 ms 125.15 ms 125.41 ms]
thrpt: [8.5621 Gelem/s 8.5795 Gelem/s 8.5978 Gelem/s]
change:
time: [+21.130% +22.781% +24.736%] (p = 0.00 < 0.05)
thrpt: [-19.831% -18.554% -17.444%]
Performance has regressed.
run_simulation/workgroup16x16/domain2048x1024/total512/image32/compute
time: [197.14 ms 197.42 ms 197.68 ms]
thrpt: [5.4318 Gelem/s 5.4388 Gelem/s 5.4466 Gelem/s]
change:
time: [+76.474% +79.543% +82.293%] (p = 0.00 < 0.05)
thrpt: [-45.143% -44.303% -43.334%]
Performance has regressed.
run_simulation/workgroup32x16/domain2048x1024/total512/image32/compute
time: [192.35 ms 192.78 ms 193.23 ms]
thrpt: [5.5567 Gelem/s 5.5696 Gelem/s 5.5824 Gelem/s]
change:
time: [+105.24% +107.32% +109.64%] (p = 0.00 < 0.05)
thrpt: [-52.300% -51.765% -51.277%]
Performance has regressed.
run_simulation/workgroup32x32/domain2048x1024/total512/image32/compute
time: [222.67 ms 223.16 ms 223.40 ms]
thrpt: [4.8063 Gelem/s 4.8116 Gelem/s 4.8221 Gelem/s]
change:
time: [+124.76% +129.62% +135.86%] (p = 0.00 < 0.05)
thrpt: [-57.602% -56.450% -55.508%]
Performance has regressed.
There are two things that are surprising about these results, and not in a good way:
- We expected use of shared memory to improve our simulation’s performance, or at least not change it much if the GPU cache already knew how to handle the initial logic well. But instead our use of shared memory causes a major performance regression.
- We expected execution and memory access inefficiencies related to the use of shared memory to have a lower impact on performance as the workgroup size gets larger. But instead the larger our workgroups are, the worse the performance regression gets.
Overall, it looks as if the more intensively we exercised shared memory with a larger workgroup, the more we ran into a severe hardware performance bottleneck related to our use of shared memory. Which raises the question: what could that hardware performance bottleneck be?
Bank conflicts
To answer this question, we need to look into how GPU caches, which shared memory is made of, are organized at the hardware level.
Like any piece of modern computing hardware, GPU caches are accessed via several independent load/store units that can operate in parallel. But sadly the circuitry associated with these load/store units is complex enough that hardware manufacturers cannot afford to provide one load/store unit per accessible memory location of the cache.
Manufacturers therefore go for a compromise where some cache accesses can be processed in parallel and others must be serialized. Following an old hardware tradition, load/store units are organized such that memory accesses targeting consecutive memory locations are faster, and this is how we get the banked cache organization illustrated below:

The toy cache example from this illustration possesses 4 independent storage banks (colored table columns), each of which controls 3 data storage cells (table rows). Memory addresses (denoted #N in the diagram) are then distributed across these storage cells in such a way that the first accessible address belongs to the first bank, the second location belongs to the second bank, and so on until we run out of banks. Then we start over at the first bank.
This storage layout ensures that if work is distributed across GPU work-items in such a way that the first work item processes the first shared memory location, the second work item processes the second shared memory location, and so on, then all cache banks will end up being evenly utilized, leading to maximal memory access parallelism and thus maximal performance.
However, things only work out so well as long as cached data are made of primitive 32-bit GPU data atoms. And as you may recall, this is not the case for our current shared memory cache, which is composed of two-dimensional vectors of such data:
// Shared memory for exchanging (U, V) pairs between work-items
shared vec2 uv_cache[WORKGROUP_COLS][WORKGROUP_ROWS];
Because of this, our (U, V) data cache ends up being organized in the
following manner…

…which means that when the U component of the dataset is accessed via a first 32-bit memory load/store instruction, only one half of the cache banks end up being used…

…and when the V component of the dataset later ends up being accessed by a second 32-bit memory load/store instruction, only the other half of the cache banks ends up being used.

It gets worse than that, however. Our shared memory cache is also laid out as a two-dimensional array where columns of data are contiguous in memory and rows of data are not.
This is a problem because in computer graphics, 2D images are conventionally laid out in memory such that rows of data are contiguous in memory, and columns are not. And because contiguous memory accesses are more efficient, GPUs normally optimize for this common case by distributing work-items over SIMD units in such a way that work-items on the same row (same Y coordinate, consecutive X coordinates) end up being allocated to the same SIMD unit and processed together.
In our shared memory cache, the data associated with those work-items ends up
being stored inside of storage cells whose memory locations are separated by
2 * WORKGROUP_ROWS 32-bit elements. And when WORKGROUP_ROWS is large enough
and a power of two, this number is likely to end up being a multiple of the
number of cache memory data banks.
We will then end up with worst-case scenarios where cache bank parallelism is extremely under-utilized and all memory accesses from a given hardware SIMD instruction end up being sequentially processed by the same cache bank. For example, accesses to the U coordinate of the first row of data could be processed by the first cache bank…

…accesses to the V coordinate of the first row could be processed by the second bank…

…accesses to the U coordinate of the second row could be processed by the third bank…

…and so on. This scenario, where cache bandwidth is under-utilized because memory accesses that could be processed in parallel are processed sequentially instead, is called a cache bank conflict.
Avoiding bank conflicts
While GPU cache bank conflicts can have a serious impact on runtime performance, they are thankfully easy to resolve once identified.
All we need to do here is to…
- Switch from our former “array-of-structs” data layout, where uv_cache is an array of two-dimensional vectors of data, to an alternate “struct-of-arrays” data layout where uv_cache begins with U concentration data and ends with V concentration data.
- Flip the 2D array axes so that work-items with consecutive X coordinates end up accessing shared memory locations with consecutive memory addresses.
In terms of code, it looks like this:
// Shared memory for exchanging (U, V) pairs between work-items
shared float uv_cache[2][WORKGROUP_ROWS][WORKGROUP_COLS];
// Save the (U, V) value that this work-item loaded so that the rest of the
// workgroup may later access it efficiently
void set_uv_cache(vec2 uv) {
uv_cache[U][gl_LocalInvocationID.y][gl_LocalInvocationID.x] = uv.x;
uv_cache[V][gl_LocalInvocationID.y][gl_LocalInvocationID.x] = uv.y;
}
// Get an (U, V) value that was previously saved by a neighboring work-item
// within this workgroup. Remember to use a barrier() first!
vec2 get_uv_cache(ivec2 offset) {
const ivec2 shared_pos = ivec2(gl_LocalInvocationID.xy) + offset.xy;
return vec2(
uv_cache[U][shared_pos.y][shared_pos.x],
uv_cache[V][shared_pos.y][shared_pos.x]
);
}
After applying this simple data layout transform, our (U, V) data cache will
end up being organized in the following manner…

…where data points associated with the U species concentration for the first row of data are contiguous, and thus evenly spread across banks…

…and this remains true as we switch across rows, access the V species’ concentration, etc.

Because of this, we expect our new shared memory layout to result in a significant performance improvement, where individual shared memory accesses are sped up by up to 2x. Which will have a smaller, but hopefully still significant impact on the performance of the overall simulation.
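If you would like to convince yourself of this bank conflict analysis without touching any GPU code, the following toy CPU-side model may help. It is only a sketch under simplifying assumptions that are not taken from this course's code: 32 banks of 4 bytes each (a common configuration, but not a universal one) and one SIMD instruction serving 32 work-items that share a Y coordinate and have consecutive X coordinates.
const NUM_BANKS: usize = 32;
const SIMD_WIDTH: usize = 32;
const ROWS: usize = 16; // plays the role of WORKGROUP_ROWS
const COLS: usize = 32; // plays the role of WORKGROUP_COLS

/// Number of distinct banks touched when each lane of one SIMD instruction
/// accesses the 32-bit word whose index is given by float_index_of_lane
/// (more banks touched = more memory parallelism, 1 = worst-case conflict)
fn banks_touched(float_index_of_lane: impl Fn(usize) -> usize) -> usize {
    let mut touched = [false; NUM_BANKS];
    for lane_x in 0..SIMD_WIDTH {
        touched[float_index_of_lane(lane_x) % NUM_BANKS] = true;
    }
    touched.iter().filter(|&&bank_used| bank_used).count()
}

fn main() {
    let y = 0; // any row gives the same answer
    // Array-of-structs layout vec2 uv_cache[COLS][ROWS]: U component of (x, y)
    let aos = banks_touched(|x| 2 * (x * ROWS + y));
    // Struct-of-arrays layout float uv_cache[2][ROWS][COLS]: U plane at (x, y)
    let soa = banks_touched(|x| y * COLS + x);
    // With these toy parameters, this prints "AoS: 1 bank(s), SoA: 32 bank(s)",
    // i.e. a full 32-way bank conflict before the layout change and none after
    println!("AoS: {aos} bank(s), SoA: {soa} bank(s)");
}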
Exercise
Implement this optimization and measure its performance impact on your hardware.
The improved shared memory layout that avoids bank conflicts should provide a net performance improvement over the original array-of-structs memory layout. But when comparing to the previous chapter, the performance tradeoff of manually managing a shared memory cache vs letting the GPU’s cache do this job is intrinsically hardware-dependent.
So don’t be disappointed if it turns out that on your GPU, using shared memory is not worthwhile even when cache bank conflicts are taken out of the equation.
Subgroups
On the author’s AMD Radeon 5600M GPU, avoiding shared memory bank conflicts is a major performance improvement. But not to the point where using shared memory ends up becoming an unambiguous step up in performance from the previous simulation implementation:
run_simulation/workgroup8x8/domain2048x1024/total512/image32/compute
time: [119.78 ms 120.16 ms 120.56 ms]
thrpt: [8.9060 Gelem/s 8.9356 Gelem/s 8.9641 Gelem/s]
change:
time: [-7.0218% -5.5135% -4.0137%] (p = 0.00 < 0.05)
thrpt: [+4.1815% +5.8352% +7.5521%]
Performance has improved.
run_simulation/workgroup16x8/domain2048x1024/total512/image32/compute
time: [101.57 ms 101.85 ms 102.21 ms]
thrpt: [10.505 Gelem/s 10.542 Gelem/s 10.571 Gelem/s]
change:
time: [-1.6314% -0.1677% +1.4719%] (p = 0.85 > 0.05)
thrpt: [-1.4505% +0.1680% +1.6584%]
No change in performance detected.
run_simulation/workgroup16x16/domain2048x1024/total512/image32/compute
time: [100.44 ms 100.61 ms 100.82 ms]
thrpt: [10.650 Gelem/s 10.672 Gelem/s 10.690 Gelem/s]
change:
time: [-10.130% -8.5750% -7.1731%] (p = 0.00 < 0.05)
thrpt: [+7.7274% +9.3793% +11.271%]
Performance has improved.
run_simulation/workgroup32x16/domain2048x1024/total512/image32/compute
time: [96.656 ms 97.088 ms 97.459 ms]
thrpt: [11.017 Gelem/s 11.059 Gelem/s 11.109 Gelem/s]
change:
time: [+3.3377% +4.4153% +5.6007%] (p = 0.00 < 0.05)
thrpt: [-5.3036% -4.2286% -3.2299%]
Performance has regressed.
run_simulation/workgroup32x32/domain2048x1024/total512/image32/compute
time: [112.13 ms 112.37 ms 112.62 ms]
thrpt: [9.5339 Gelem/s 9.5555 Gelem/s 9.5757 Gelem/s]
change:
time: [+13.281% +15.723% +18.845%] (p = 0.00 < 0.05)
thrpt: [-15.857% -13.587% -11.724%]
Performance has regressed.
It looks like for this particular GPU at least, either of the following is true:
- The GPU’s cache handled our initial data access pattern very well, and thus manually managed shared memory is not a major improvement. And if it’s not an improvement, it can easily end up hurting due to the extra hardware instructions, barriers, etc.
- We are still hitting some kind of hardware bottleneck related to our use of shared memory, that we could perhaps alleviate through further program optimization.
Unfortunately, to figure out what’s going on and get further optimization ideas if the second hypothesis is true, we need to use vendor profiling tools, and we will not be able to do so until a later chapter of this course. Therefore, it looks like we will be stuck with this somewhat disappointing shared memory performance for now.
What we can do, however, is experiment with a different Vulkan mechanism for exchanging data between work-items, called subgroup operations. These operations are available on most (but not all) Vulkan-supported GPUs, and their implementation typically bypasses the cache entirely, which allows them to perform faster than shared memory for some applications.
From architecture to API
For a long time, GPU APIs have focused on providing developers with the illusion that GPU programs execute over a large flat array of independent processing units, each having equal capabilities and an equal ability to communicate with other units.
However, this symmetric multiprocessing model is quite far from the reality of modern GPU hardware, which is implemented using a hierarchy of parallel execution and synchronization features. Indeed, since about 2010, GPU microarchitectures have standardized on many common decisions, and if you take a tour of any modern GPU you are likely to find…
- A variable amount of compute units1 (from a couple to hundreds). Much like CPU cores, GPU compute units are mostly independent from each other, but do share some common resources like VRAM, last-level caches, and graphics-specific ASICs. These shared resources must be handled with care as they can easily become a performance bottleneck.
- Within each compute unit, there is a set of SIMD units2 that share a fast cache. Each SIMD unit has limited superscalar execution capabilities, and most importantly can manage several concurrent instruction streams. The latter are used to improve SIMD ALU utilization in the face of high-latency operations like memory accesses: instead of waiting for the slow instruction from the current stream to complete, the SIMD unit switches to another instruction stream.3
- Within each concurrent SIMD instruction stream, we find instructions that, like all things SIMD, are designed to perform the same hardware operation over a certain amount of data points and to efficiently access memory in aligned and contiguous blocks. However GPU SIMD units have several features4 that make it easy to translate any combination of N parallel tasks into a sequence of SIMD instructions of width N, even when their behavior is far from regular SIMD logic, but at the expense of a performance degradation in such cases.
A naive flat parallelism model can be mapped onto such a hardware architecture by mapping each user-specified work-item to one SIMD lane within a SIMD unit and distributing the resulting SIMD instruction streams over the various compute units of the GPU. But by leveraging the specific features of each layer of this hierarchy, better GPU code performance can be achieved.
Sometimes this microarchitectural knowledge can be leveraged within the basic model given some knowledge/assumptions about how GPU drivers will map GPU API work to hardware. For example, seasoned GPU programmers know that work-items on a horizontal line (same Y and Z coordinates, consecutive X coordinates) are likely to be processed by the same SIMD unit, and should therefore be careful with divergent control flow and strive to access contiguous memory locations.
But there are microarchitectural GPU features that can only be leveraged if they are explicitly exposed in the API. For example, GPU compute shaders expose workgroups because these let programmers leverage some hardware features of GPU compute units:
- Hardware imposes API limits on how big workgroups can be and how much shared memory they can allocate, in order to ensure that a workgroup’s register and shared memory state can stay resident inside of a single GPU compute unit’s internal memory resources.
- Thanks to these restrictions, shared memory can be allocated out of the shared cache of the compute unit that all work-items in the workgroup have quick access to. And workgroup barriers can be efficiently implemented because 1/all associated communication stays local to a compute unit and 2/workgroup state is small enough to stay resident in a compute unit’s local memory resources while the barrier is being processed.
Subgroup operations are a much more recent addition to portable GPU APIs5 which aims to pick up where workgroups left off and expose capabilities of GPU SIMD units that can only be leveraged through explicit shader code changes. To give some examples of what subgroups can do, they expose the ability of many GPU SIMD units to…
- Tell if predication is currently being used, in order to let the GPU code take a fast path when operating in a pure SIMD execution regime.
- Perform an arithmetic reduction operation (sum, product, and/or/xor…) over the data points held by a SIMD vector. This is an important optimization when performing large-scale reductions, like computing the sum of all elements from an array, as it reduces the need for more expensive synchronization mechanisms like atomic operations.
- Efficiently transmit data between work-items that are processed by the same SIMD vector through shuffle operations. This is the operation that is most readily applicable to our diffusion gradient computation, and thus the one that we are going to explore in this chapter.
But of course, since this is GPU programming, everything cannot be so nice and easy…
Issues & workarounds
Hardware/driver support
Subgroups did not make the cut for the Vulkan 1.0 release. Instead, they were gradually added as optional extensions to Vulkan 1.0, then later merged into Vulkan 1.1 core in a slightly different form. This raises the question: should Vulkan code interested in subgroups solely focus on the Vulkan 1.1 version of the feature, or also support the original Vulkan 1.0 extensions?
To answer this sort of question, we can use the https://vulkan.gpuinfo.org database, where GPU vendors and individuals regularly upload the Vulkan support status of a particular (hardware, driver) combination. The data has various known issues: for example, it does not reflect device market share (all devices have equal weight), and the web UI groups devices by driver-reported name (which varies across OSes and driver versions). But it is good enough to get a rough idea of how widely supported or exotic a particular Vulkan version/extension/optional feature is.
If we ignore reports older than 2 years (which reduces the device name change issue), we reach the conclusion that of 3171 reported device names, only 57 (1.8%) do not support Vulkan 1.1 as of their latest driver version. From this, we can conclude that Vulkan 1.0 extensions are probably not worth supporting, and we should focus on the subgroup support that was added to Vulkan 1.1 only.
That is not the end of it, however, because Vulkan 1.1 only mandates very minimal subgroup support from Vulkan implementations, which is insufficient for our needs:
- Subgroups only need to be supported in compute shaders. That’s good enough for us.
- Subgroups of size 1 are allowed, as a way to support hardware without driver-exposed SIMD instructions. But those are useless for our data exchange purposes, so we will not be able to use subgroup shuffle operations on such implementations.
- The only nontrivial subgroup operation that Vulkan 1.1 mandates support for is the election of a “leader” work-item. This is not enough for our purposes; we need shuffle operations too.
Therefore, before we can use subgroups, we should probe the following physical device properties (a code sketch of this check follows the list):
- Vulkan 1.1 should be supported, as we will use Vulkan 1.1 subgroup operations.
- Relative6 shuffle operations should be supported, enabling fast data exchanges.
- Subgroups should have ≥ 3 elements (minimum for left/right neighbor exchanges7).
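In terms of code, the check itself is simple; the subtle part is extracting the right values from the physical device. The sketch below therefore takes those values as plain parameters: how you obtain them is left out (in vulkano they come from the physical device's reported properties, where the subgroup-related ones are optional since they only exist on Vulkan 1.1+ devices), and the parameter names are illustrative, not actual API names.
/// Truth that a device's subgroup support is good enough for our needs
fn subgroups_usable(
    api_version: (u32, u32),         // (major, minor) reported by the device
    subgroup_size: Option<u32>,      // None if the device predates Vulkan 1.1
    supports_relative_shuffle: bool, // VK_SUBGROUP_FEATURE_SHUFFLE_RELATIVE_BIT
) -> bool {
    api_version >= (1, 1)
        && supports_relative_shuffle
        && subgroup_size.is_some_and(|size| size >= 3)
}

fn main() {
    // A device reporting Vulkan 1.3, 32-wide subgroups and relative shuffles is fine...
    assert!(subgroups_usable((1, 3), Some(32), true));
    // ...whereas a Vulkan 1.0 device is not, whatever else it reports
    assert!(!subgroups_usable((1, 0), Some(64), true));
}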
Finally, there is one last thing to take care of when it comes to hardware/driver specifics. On many GPUs, subgroup size is not a fixed hardware-imposed parameter as Vulkan 1.1 device properties would have you believe. It is rather a user-tunable parameter that can be configured at various power-of-two sizes between two bounds. For example, AMD GPUs support subgroup sizes of 32 and 64, while Apple Mx GPUs support subgroup sizes of 4, 8, 16 and 32.
Being able to adjust the subgroup size like this is useful because there is an underlying performance tradeoff. Roughly speaking…
- Larger subgroups tend to perform better in ideal conditions (homogeneous conditional logic, aligned and contiguous memory accesses, few barriers, abundant concurrency). But smaller subgroups tend to be more resilient to less ideal conditions.
- On hardware where large subgroups are implemented through multi-cycle execution over narrower SIMD ALUs, like current AMD GPUs, shuffle operations may perform better at smaller subgroup sizes that match the true hardware SIMD unit width.
Subgroup size control was historically exposed via the
VK_EXT_subgroup_size_control extension to Vulkan 1.1, which was integrated
into Vulkan 1.3 core as an optional feature. But a quick GPUinfo query suggests
that at the time of writing, there are many more devices/drivers that support
the extension than devices/drivers that support Vulkan 1.3 core. Therefore, as
of 2025, it is a good idea to support both ways of setting the subgroup size.
Absence of this functionality is not a dealbreaker, but when it is there, it is good practice to benchmark at all available subgroup sizes in order to see which one works best.
Undefined 2D layout
We have been using 2D workgroups with a square-ish aspect ratio so far because they map well into our 2D simulation domain and guarantee some degree of hardware cache locality.
However, there is a problem with such workgroups when subgroups get involved, which is that Vulkan does not specify how 2D workgroups will be decomposed into 1D subgroups. In other words, all of the ways of decomposing a 4x4 workgroup into 4-element subgroups that are illustrated below are valid in the eyes of the Vulkan 1.1 specification:

This would not be an issue if we were using subgroup operations for data reduction, where the precise order in which input data points are aggregated into a final result generally does not matter. But it is an issue for our intended use of subgroup shuffles, which is to exchange input concentration data with neighboring work-items.
Indeed, if we don’t know the spatial layout of subgroups, then we cannot know ahead of time whether, say, the previous work-item in the subgroup lies at our left, on top, or someplace else in the simulation domain that may not be a nearest neighbor at all.
We could address this by probing the subgroup layout at runtime. But given that we are effectively trying to optimize the overhead of a pair of memory loads, the computational overhead of such a runtime layout check is quite unlikely to be acceptable.
We will therefore go for the less elegant alternative of exclusively using one-dimensional workgroups that lie over a horizontal line, as there is only one logical decomposition of these into subgroups…

…and thus we know that when performing relative subgroup shuffles, the previous work-item (if any) will be our left neighbor, and the next work-item (if any) will be our right neighbor.
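To make this concrete, here is a minimal compute shader sketch, which is not this course's final code but merely an illustration of the relative shuffle built-ins provided by the GL_KHR_shader_subgroup_shuffle_relative extension:
#version 460
#extension GL_KHR_shader_subgroup_shuffle_relative : require

layout(local_size_x = 64) in;

void main() {
    // Stand-in for the real (U, V) load that our simulation would perform
    vec2 uv = vec2(float(gl_GlobalInvocationID.x), 0.0);
    // Value held by the previous work-item in the subgroup, i.e. our left
    // neighbor given a 1D horizontal workgroup (undefined for the first
    // work-item of the subgroup)
    vec2 left_uv = subgroupShuffleUp(uv, 1u);
    // Value held by the next work-item in the subgroup, i.e. our right
    // neighbor (undefined for the last work-item of the subgroup)
    vec2 right_uv = subgroupShuffleDown(uv, 1u);
    // ...these values would then feed the diffusion stencil computation...
}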
Speaking of which, there is another matter that we need to take care of before we start using relative subgroup shuffles. Namely figuring out a sensible way to handle the first work-item in a subgroup (which has no left neighbor within the subgroup) and the last work-item in a subgroup (which has no right neighbor within the subgroup).
Undefined shuffle outputs
As mentioned before, our end goal here is to use relative subgroup shuffle operations. These effectively shift data forward or backward within the subgroup by N places, as illustrated below…

…but as the ? character in the illustration shows, there is a problem at the edges of the subgroup, where we are effectively pulling data out of non-existent subgroup work-items before the first one and after the last one.
The Vulkan specification declares that these data values are undefined, and we certainly do not want to use them in our computation. Instead, we basically have two sane options:
- We use subgroups in the standard way by making each element of the subgroup produce one output data point, but the first and last work-item in the subgroup do not follow the same logic as other work-items and pull their left/right input value from memory.
- As we did with work-groups back in the shared memory chapter, we make subgroups process overlapping regions of the simulation domain, such that the first/last work-item of each subgroup only participates in input data loading and exits early after the shuffles instead of participating in the rest of the computation.
The first option may sound seductive, as it allows all work-items within the subgroup to participate in the final simulation computation. But it does not actually work well on GPU hardware, especially when applied within a single subgroup, because subgroups are implemented using SIMD. Which means that if the work-items at the edge of a subgroup do something slow, other work-items in the subgroup will end up waiting for them and become equally slow as a result.
Therefore, we will take the second option: even though the spatial distribution of subgroups within our grid of work-items looks like this from the GPU API’s perspective…

…we will actually decompose each subgroup such that its edge elements only participate in input loading and shuffling, leaving the rest of the simulation work to central elements only:

And then we will make the input region of subgroups overlap with each other, so that their output regions end up being contiguous, as with our previous 2D workgroup layout for shared memory:

At this point, you may reasonably wonder whether this is actually a performance improvement. After all, in the above illustrations, we turned subgroups that produced 4 data points into subgroups that only produce 2 data points by removing one element on each edge, and thus lost ~half of our computing resources during a large fraction of the simulation computation!
But the above schematics were made for clarity, not realism. A quick trip to the GPUinfo database will tell you that in the real world, ~96% of Vulkan implementations that are true GPUs (not CPU emulations thereof) use a subgroup size of 32 or more, with 32 being by far the most common case.
Given subgroups of 32 work-items, reserving two work-items on the edges for input loading tasks only sacrifices <10% of available computing throughput…

…and thus a visual representation of two 32-element subgroups that overlap with each other, while less readable and less amenable to detailed explanations, certainly looks a lot less concerning:

Knowing this, we will hazard the guess that if using subgroups offers a significant speedup over loading data from VRAM and the GPU cache hierarchy (which remains to be proven), it will likely be worth the loss of two work items per subgroup on most common hardware.
And if benchmarking on less common configurations with small subgroups later reveals that this subgroup usage pattern does not work for them, as we suspect, then we will be able to handle this by simply bumping the hardware requirements of our subgroup-based algorithms, so that they are only used when sufficiently large subgroups are available.
Ill-defined reconvergence
Finally, there is one last subgroup gotcha that will not be a very big problem for us, but spells trouble for enough other subgroup-based algorithms that it is worth discussing here.
Consider the following GLSL conditional execution pattern:
void main() {
// Entire workgroup is active here
if (sg_condition) {
// Some subgroups are inactive here, others are fully active
// i.e. sg_condition has the same value for all work-items of each subgroup
if (item_condition_1) {
// Some work-items of a subgroup are masked out by item_condition_1 here
// i.e. item_condition_1 is _not_ homogeneous across some subgroups
if (item_condition_2) {
// More work-items are masked out by item_condition_2 here
// i.e. item_condition_2 is _not_ homogeneous even across those
// work-items for which item_condition_1 is true.
}
// Now let's call this code location #1...
}
// ...this code location #2...
}
// ...and this code location #3
}
As this compute shader executes, subgroups within a compute workgroup will reach
the first condition sg_condition. Subgroups for which this condition is false
will skip over this if statement and move directly to code location #3.
So far, this is not a very concerning situation, because SIMD units remain fully utilized without predication masking out lanes. The only situation where such conditional branching could possibly be a problem is if sg_condition is only true for a few subgroups, which can lead to compute units becoming under-utilized as most subgroups exit early while the remaining ones keep blocking the GPU from moving on to the next (dependent) compute dispatch.
But after this first conditional statement, subgroups for which sg_condition
is true reach the item_condition_1 condition, which can only be handled
through SIMD predication. And then…
- SIMD unit utilization efficiency decreases because while the SIMD lanes for which item_condition_1 is true keep working, the remaining SIMD lanes stay idle, waiting for the end of the if (item_condition_1) block to be reached before resuming execution.
- Some subgroup operations like relative shuffles start behaving weirdly because they operate on data from SIMD lanes that have been masked out, resulting in undefined results. Generally speaking, subgroup operations must be used with care in this situation.
In the GPU world, this scenario is called divergent execution, and it is a very common source of performance problems. But as you can see, in the presence of subgroup operations, it can become a correctness problem too.
In any case, execution then proceeds and we reach item_condition_2, which makes the problem worse but does not fundamentally change the situation.
But the most interesting question is what happens next, at the end of the if
statements?
Many GPU programmers new to subgroups naively expect that as soon as an if
statement ends, the subgroup will immediately get back to its state before the
if statement started, with previously disabled work-items re-enabled, as
illustrated below…

…but the Vulkan 1.1 specification does not actually guarantee this and allows implementations to keep subgroups in a divergent state until the end of the first if statement that was taken uniformly by the entire workgroup. Which in our case is program location #3, after the end of the initial if (sg_condition) statement.
Until this point of the program is reached, Vulkan implementations are free to keep executing statements multiple times with divergent control flow if they so desire, as illustrated below:

The reason why Khronos decided to allow this behavior in Vulkan 1.1 is not fully clear. Perhaps there are devices on which switching subgroups between a predicated and non-predicated state is expensive, in which case the above scheme could be more efficient in situations where there is little code at locations #1 and #2? But in any case, this implementor freedom makes usage of subgroup operations inside of conditional statements more fraught with peril than it should be.
In an attempt to reduce this peril, several Vulkan extensions were therefore proposed, each exposing a path towards saner subgroup reconvergence behavior on hardware that supports it.
At first, in 2020, the VK_KHR_shader_subgroup_uniform_control_flow extension
was introduced. This extension guarantees that if conditional statements in the
GLSL code are manually annotated as being uniform across entire subgroups, then
subgroup control flow will be guaranteed to reconverge no later than at the end
of the first control flow statement that was not subgroup-uniform. In other
words, when using this extension with suitable code annotations, we have a
guarantee that subgroups will reconverge at code location #2 as illustrated
below:

For some subgroup-based algorithms, however, that guarantee is still not enough,
which is why in 2021 a stricter VK_KHR_shader_maximal_reconvergence extension
was released. This new extension guarantees that if it is enabled, and entire
SPIR-V shader modules are annotated with a suitable attribute, then within these
shader modules subgroups will reconverge as much as possible at the end of each
conditional statement, getting us back to the programmer intuition we started
with:

However, as of 2025, these two Vulkan extensions are quite far from having universal hardware & driver support, and therefore all Vulkan code that aims for broad hardware portability should only use them as a fast path while still providing a fallback code path that is very careful about using subgroup operations inside of conditional statements.
And in doing so, one should also keep in mind that although the above explanation was given in terms of nested if statements, the same principles hold for any kind of control flow divergence across a subgroup, including switch statements with a value that varies across work-items, for loops with a number of iterations that varies across work-items, and a ? b : c ternary operators.
Thankfully, as you will see later in this chapter, this particular Vulkan specification issue will not affect our Gray-Scott reaction simulation much, because it has little conditional logic, and what little it has will not do much harm. It is only mentioned in this course as an important subgroup gotcha that you should be aware of when you consider using Vulkan subgroup operations in more complex GPU computations.
First implementation: 1D workgroups
Mapping GPU work-items to data
To make the migration towards subgroups easier, we will start with a simple mapping of the dataset to GPU workgroups. First of all, we will cut the output region of the simulation domain into lines that are one data point tall…

…and have each output line be processed by a workgroup that also has a line shape:

These 1D workgroups will then be further subdivided by the GPU into 1D subgroups, each of which we will make responsible for producing one chunk of the workgroup's output.

However, we will not follow the naive chunking of workgroups into subgroups that the GPU's standard work-item grid suggests. As explained earlier, our subgroups will overlap with each other, as the input region of each subgroup will extend one data point left and right into the output region of the neighboring subgroup. This will let us get our left and right neighbor values through relative shuffles, without conditional memory loads that would ruin our SIMD efficiency.

As in previous simulation implementations, we will need to be careful with subgroups that reside on the edge of the simulation domain:
- The rightmost work-items may reside beyond the edges of the simulation domain and should not load input data from the (U, V) input dataset.
- If this happens, the work-item that plays the role of the right input neighbor and is only used for data loading may not be the last work-item of the subgroup.
- With larger workgroups, we may even get entire subgroups that extend outside of the simulation domain. We will have to make those exit early without using them at all.

Finally, our diffusion gradient computation will need top and bottom data points, which we can get through a variation of the same strategy:
- Load a whole line of inputs above/below the output region using the entire subgroup. This will give each work-item that produces an output value its top/bottom neighbor value.
- Use relative shuffle operations to get topleft/topright values from the neighboring top value, and bottomleft/bottomright neighbor values from the neighboring bottom value.

Overall, the full picture will look like the illustration below: rows of output that are hierarchically subdivided into workgroups then subgroups, with supporting memory loads around the edge and some out-of-bounds work-items to take care of on the right edge (but not on the bottom edge).

Forcing the use of 1D workgroups
To implement the above scheme, we must first modify our simulation so that it only allows the use of 1D workgroups. This can be done in three steps:
- Revert changes from the shared memory chapter. This 1D workgroup layout is incompatible with our current shared memory scheme, which requires workgroups to have at least three rows. Later in this chapter, we will see how to use shared memory again without needing 2D workgroups of undefined subgroup layout.
- Remove the workgroup_rows field of the PipelineOptions struct, replacing each use of it in the code with the literal integer 1 or removing it as appropriate. While we are at it, consider also renaming the workgroup_cols field to workgroup_size and adjusting each use in the code accordingly; this will come in handy later on.
- Rework the top-level loop of the simulate benchmark so that instead of iterating over [workgroup_cols, workgroup_rows] pairs, it iterates over one-dimensional workgroup sizes of [64, 128, 256, 512, 1024]. You will also need to adjust the benchmark group name format string to "run_simulation/workgroup{workgroup_size}/domain{num_cols}x{num_rows}/total{total_steps}".
After this is done, consider doing a microbenchmark run to see how switching to 1D workgroups affected benchmark performance. If your GPU is like the author's Radeon 5600M, you should observe that when going from 2D square-ish workgroups to 1D workgroups with the same number of work-items, the performance impact ranges from neutral to positive.
The fact that on this particular GPU/driver combination, 2D workgroups are not actually helpful to performance, in spite of having better theoretical cache locality properties, suggests that AMD GPUs happen to schedule workgroups in such a way that good cache locality is achieved anyway.8 But we will later see how to guarantee good cache locality without relying on lucky scheduling like this.
On the solution branch, we also performed a GPU pipeline refactor to remove
the now-unused WORKGROUP_ROWS GLSL specialization constant, rename
WORKGROUP_COLS to WORKGROUP_SIZE, and shift specialization constant IDs by
one place to make them contiguous again…
// Configurable workgroup size, as before
layout(local_size_x = 64, local_size_y = 1) in;
layout(local_size_x_id = 0) in;
layout(constant_id = 0) const uint WORKGROUP_SIZE = 64;
// Concentration table width
layout(constant_id = 1) const uint UV_WIDTH = 1920;
// "Scalar" simulation parameters
layout(constant_id = 2) const float FEED_RATE = 0.014;
layout(constant_id = 3) const float KILL_RATE = 0.054;
layout(constant_id = 4) const float DELTA_T = 1.0;
// [ ... subtract 1 from all other constant_id, adapt host code ... ]
…but that’s a bit annoying for only a slight improvement in code readability,
so do not feel forced to do it. If you don’t do it, the only bad consequence is
that you will need to replace future occurrences of WORKGROUP_SIZE in the
code samples with the old WORKGROUP_COLS name.
New compute pipeline
Subgroup operations are not part of the core GLSL specification; they are part
of the GL_KHR_shader_subgroup
extension.
To be more precise, we are going to need the
GL_KHR_shader_subgroup_shuffle_relative GLSL extension, which itself is
defined by the GL_KHR_shader_subgroup GLSL extension specification.
We can check if the GLSL extension is supported by our GLSL compiler
using a GLSL preprocessor define with the same name, and if so enable it using
the #extension directive…
#ifdef GL_KHR_shader_subgroup_shuffle_relative
#extension GL_KHR_shader_subgroup_shuffle_relative : require
#endif
…but unfortunately, such checking is not very useful in a Vulkan world, as it only checks for extension support within our GLSL compiler. This compiler is only used at application build time to generate a SPIR-V binary that the device driver will later consume at runtime, and it is the feature support of this device that we are actually interested in, but cannot probe from the GLSL side.
In principle, we could work around this by keeping two versions of our compute shader around, one that uses subgroups and one that doesn't, but given that…
- Vulkan 1.1 is supported by 98.2% of devices whose Vulkan support was reported to GPUinfo in the past 2 years. Of those, 99.3% support relative shuffle operations and 98.9% have a subgroup size of 4 or more that is compatible with our minimal requirements.
- Juggling between two versions of a shader depending on which features a Vulkan device supports, where one shader version will almost never be exercised and is thus likely to bitrot over future software maintenance, feels excessive for a pedagogical example.
…we will bite the bullet and unconditionally require relative subgroup shuffle support.
// To be added to exercises/src/grayscott/step.comp, after #version
#extension GL_KHR_shader_subgroup_shuffle_relative : require
However, in the interest of correctness, we will later check on the host that the device meets our requirements, and abort the simulation if it doesn’t.
In any case, as in the shared memory chapter, we will need some logic to check whether our work-item is tasked with producing an output data point, and if so where that data point is located within the simulation domain. The mathematical formula changes, but the idea remains the same.
// To be added to exercises/src/grayscott/step.comp, after #include
// Number of input-only work-items on each side of a subgroup
const uint BORDER_SIZE = 1;
// Maximal number of output data points per subgroup
//
// Subgroups on the edge of the simulation domain may end up processing fewer
// data points.
uint subgroup_output_size() {
return gl_SubgroupSize - 2 * BORDER_SIZE;
}
// Truth that this work-item is within the output region of the subgroup
//
// Note that this is a necessary condition for emitting output data, but not a
// sufficient one. work_item_pos() must also fall inside of the output region
// of the simulation dataset, before data_end_pos().
bool is_subgroup_output() {
return (gl_SubgroupInvocationID >= BORDER_SIZE)
&& (gl_SubgroupInvocationID < BORDER_SIZE + subgroup_output_size());
}
// Maximal number of output data points per workgroup (see above)
uint workgroup_output_size() {
return gl_NumSubgroups * subgroup_output_size();
}
// Position of the simulation dataset that corresponds to the central value
// that this work-item will eventually write data to, if it ends up writing
uvec2 work_item_pos() {
const uint first_x = DATA_START_POS.x - BORDER_SIZE;
const uint item_x = gl_WorkGroupID.x * workgroup_output_size()
+ gl_SubgroupID * subgroup_output_size()
+ gl_SubgroupInvocationID
+ first_x;
const uint item_y = gl_WorkGroupID.y + DATA_START_POS.y;
return uvec2(item_x, item_y);
}
What is new to this chapter, however, is that we will now also have a pair of
utility functions that let us shuffle (U, V) data pairs from one work-item of
the subgroup to another, using the only relative shuffle patterns that we care
about in this simulation:
// To be added to exercises/src/grayscott/step.comp, between above code & main()
// Data from the work-item on the right in the current subgroup
//
// If there is no work-item on the right, or if it is inactive, the result will
// be an undefined value.
vec2 shuffle_from_right(vec2 data) {
return vec2(
subgroupShuffleDown(data.x, 1),
subgroupShuffleDown(data.y, 1)
);
}
// Data from the work-item on the left in the current subgroup
//
// If there is no work-item on the left, or if it is inactive, the result will
// be an undefined value.
vec2 shuffle_from_left(vec2 data) {
return vec2(
subgroupShuffleUp(data.x, 1),
subgroupShuffleUp(data.y, 1)
);
}
After this, our simulation shader entry point will, as usual, begin by assigning a spatial location to the active work-item and loading the associated data value.
Since we have stopped using shared memory for now, we can make work-items that cannot load data from their assigned location exit early.
void main() {
// Map work items into 2D dataset, discard those outside input region
// (including padding region).
const uvec2 pos = work_item_pos();
if (any(greaterThanEqual(pos, padded_end_pos()))) {
return;
}
// Load central value
const vec2 uv = read(pos);
// [ ... rest of the shader follows ... ]
After this, we will begin the diffusion gradient computation, iterating over lines of input as usual.
But instead of iterating over columns immediately after that, we will proceed to set up storage for a full line of three inputs (left neighbor, central value, and right neighbor), and load the central value at our work item’s assigned horizontal position.
// Compute the diffusion gradient for U and V
const uvec2 top = pos - uvec2(0, 1);
const mat3 weights = stencil_weights();
vec2 full_uv = vec2(0.0);
for (int y = 0; y < 3; ++y) {
// Read the top/center/bottom value
vec2 stencil_uvs[3];
stencil_uvs[1] = read(top + uvec2(0, y));
// [ ... rest of the diffusion gradient computation follows ... ]
The left and right neighbor values will then be obtained via shuffle operations. As the comment above these operations highlights, these shuffles will return invalid values for some work-items, but that's fine because these work-items will not produce outputs. They are only used for the purpose of loading and shuffling inputs.
// Get the left/right value from neighboring work-items
//
// This will be garbage data for the work-items on the left/right edge
// of the subgroup, but that's okay, we'll filter these out with an
// is_subgroup_output() condition later on.
//
// It can also be garbage on the right edge of the simulation domain
// where the right neighbor has been rendered inactive by the
// padded_end_pos() filter above, but if we have no right neighbor we
// are only loading data from the zero edge of the simulation dataset,
// not producing an output. So our garbage result will be masked out by
// the other pos < data_end_pos() condition later on.
stencil_uvs[0] = shuffle_from_left(stencil_uvs[1]);
stencil_uvs[2] = shuffle_from_right(stencil_uvs[1]);
// [ ... rest of the diffusion gradient computation follows ... ]
The diffusion gradient computation will then proceed as usual for the current line of inputs, and the whole cycle will repeat for our three lines of input…
// Add associated contributions to the stencil
for (int x = 0; x < 3; ++x) {
full_uv += weights[x][y] * (stencil_uvs[x] - uv);
}
}
// [ ... rest of the computation follows ... ]
…at which point we will be done with the diffusion gradient computation.
After this computation, we know we won't need the left/right work-items from the subgroup anymore (which computed garbage from undefined shuffle outputs anyway), so we can dispose of them.
// Discard work items that are out of bounds for output production work, or
// that received garbage at the previous shuffling stage.
if (!is_subgroup_output() || any(greaterThanEqual(pos, data_end_pos()))) {
return;
}
// [ ... rest of the computation is unchanged ... ]
And beyond that, the rest of the simulation shader will not change.
If you try to compile the simulation program at this stage, you will notice that
our shaderc GLSL-to-SPIR-V compiler is unhappy because we are using subgroup operations, which require a newer SPIR-V version than what vulkano requests by default.
It is okay to change this in pipeline.rs: in doing so, we are not bumping our device requirements further than we already did when we decided to mandate Vulkan 1.1 earlier:
// To be changed in exercises/src/grayscott/pipeline.rs
/// Shader modules used for the compute pipelines
mod shader {
vulkano_shaders::shader! {
spirv_version: "1.3", // <- This is new
shaders: {
init: {
ty: "compute",
path: "src/grayscott/init.comp"
},
step: {
ty: "compute",
path: "src/grayscott/step.comp"
},
}
}
}
New device requirements
By going the lazy route of mandating subgroup shuffle support, we gained a few
device support requirements. We will therefore encode them into a utility
function at the root of the grayscott code module…
// To be added to exercises/src/grayscott/mod.rs
use vulkano::{
device::physical::{PhysicalDevice, SubgroupFeatures},
Version,
};
/// Check that the selected Vulkan device meets our requirements
pub fn is_suitable_device(device: &PhysicalDevice) -> bool {
let properties = device.properties();
device.api_version() >= Version::V1_1
&& matches!(
properties.subgroup_supported_operations,
Some(ops) if ops.contains(SubgroupFeatures::SHUFFLE_RELATIVE)
)
&& matches!(
properties.subgroup_size,
Some(size) if size >= 3
)
}
…check that these requirements are met at the beginning of
run_simulation()…
// To be changed in exercises/src/grayscott/mod.rs
/// Simulation runner, with a user-specified output processing function
pub fn run_simulation<ProcessV: FnMut(ArrayView2<Float>) -> Result<()>>(
options: &RunnerOptions,
context: &Context,
process_v: Option<ProcessV>,
) -> Result<()> {
// Check that the selected Vulkan device meets our requirements
assert!(
is_suitable_device(context.device.physical_device()),
"Selected Vulkan device does not meet the Gray-Scott simulation requirements"
);
// [ ... rest of the function is unchanged ... ]
}
…and modify the Vulkan context’s default device selection logic (when the user does not explicitly pick a Vulkan device) to select a device that meets those requirements:
// To be changed in exercises/src/context.rs
/// Pick a physical device
fn select_physical_device(
instance: &Arc<Instance>,
options: &DeviceOptions,
quiet: bool,
) -> Result<Arc<PhysicalDevice>> {
let mut devices = instance.enumerate_physical_devices()?;
if let Some(index) = options.device_index {
// [ ... user-specified device code path is unchanged ... ]
} else {
// Otherwise, choose a device according to its device type
devices
// This filter() iterator adaptor is new
.filter(|dev| crate::grayscott::is_suitable_device(dev))
.min_by_key(|dev| match dev.properties().device_type {
// [ ... device ordering is unchanged ... ]
})
.inspect(|device| {
// [ ... and logging is unchanged ... ]
})
.ok_or_else(|| "no Vulkan device available".into())
}
}
Dispatch configuration changes
Finally, we need to adjust the number of work-groups that are spawned in order
to match our new launch configuration. So we adjust the start of the
schedule_simulation() function to compute the new number of workgroups on the
X axis…
// To be changed in exercises/src/grayscott/mod.rs
/// Record the commands needed to run a bunch of simulation iterations
fn schedule_simulation(
options: &RunnerOptions,
context: &Context, // <- Need to add this parameter
pipelines: &Pipelines,
concentrations: &mut Concentrations,
cmdbuild: &mut CommandBufferBuilder,
) -> Result<()> {
// Determine the appropriate dispatch size for the simulation
let subgroup_size = context
.device
.physical_device()
.properties()
.subgroup_size
.expect("Should have checked for subgroup supports before we got here");
let subgroup_output_size = subgroup_size - 2;
let workgroup_size = options.pipeline.workgroup_size.get();
assert!(
workgroup_size % subgroup_size == 0,
"Workgroup size should be a multiple of the subgroup size"
);
let subgroups_per_workgroup = workgroup_size / subgroup_size;
let outputs_per_workgroup = subgroups_per_workgroup * subgroup_output_size;
let workgroups_per_row = options.num_cols.div_ceil(outputs_per_workgroup as usize) as u32;
let simulate_workgroups = [workgroups_per_row, options.num_rows as u32, 1];
// [ ... rest of the simulation scheduling is unchanged ... ]
}
…and this means that schedule_simulation() now needs to have access to the
Vulkan context, so we pass that new parameter too in the caller of
schedule_simulation(), which is SimulationRunner::schedule_next_output():
// To be changed in exercises/src/grayscott/mod.rs
impl<'run_simulation, ProcessV> SimulationRunner<'run_simulation, ProcessV>
where
ProcessV: FnMut(ArrayView2<Float>) -> Result<()>,
{
// [ ... constructor is unchanged ... ]
/// Submit a GPU job that will produce the next simulation output
fn schedule_next_output(&mut self) -> Result<FenceSignalFuture<impl GpuFuture + 'static>> {
// Schedule a number of simulation steps
schedule_simulation(
self.options,
self.context, // <- This parameter is new
&self.pipelines,
&mut self.concentrations,
&mut self.compute_cmdbuild,
)?;
// [ ... rest is unchanged ... ]
Notice that we have introduced a new condition that the workgroup size be a multiple of the subgroup size. That’s because Vulkan normally allows workgroup sizes not to be a multiple of the subgroup size, by starting one or more subgroups in a state where some SIMD lanes are disabled (in the case of our 1D workgroup, it will be the last lanes). But we don’t want to allow it because…
- It’s inefficient, since the affected subgroup(s) don’t execute at full SIMD throughput.
- It’s hard to support, in our case we would need to replace our simple
is_subgroup_output()GLSL logic with a more complicated one that checks whether we are in the last workgroup and if so adjusts our expectations about what the last subgroup lane should be.
In any case, however well-intentioned, this change affects the semantics of a user-visible interface for workgroup size tuning, so we should…
- Document that the workgroup size should be a multiple of the subgroup size in the CLI help text about PipelineOptions::workgroup_size.
- Adjust the default value of this CLI parameter so that it is guaranteed to always be a multiple of the device's subgroup size. Since the maximal subgroup size supported by Vulkan is 128, and Vulkan subgroup sizes are guaranteed to be powers of two, this means that our default workgroup size should be 128 or a multiple thereof.
Overall, the updated PipelineOptions definition looks like this…
// Change in exercises/src/grayscott/pipeline.rs
/// CLI parameters that guide pipeline creation
#[derive(Debug, Args)]
pub struct PipelineOptions {
/// Number of work-items in a workgroup
///
/// Must be a multiple of the Vulkan device's subgroup size.
#[arg(short = 'W', long, default_value = "128")]
pub workgroup_size: NonZeroU32,
// [ ... rest is unchanged ... ]
}
…and you should also remove the microbenchmark configuration for workgroup sizes of 64, which has never performed well anyway.
With that, we are ready to run our first subgroups-enabled simulation. But we still are at the mercy of the device-selected subgroup size for now, which may not be the one that works best for our algorithm. Let’s see how we can fix that next.
Using subgroup size control
As mentioned earlier, subgroup size control was first specified as a
VK_EXT_subgroup_size_control extension to Vulkan 1.1, and later integrated
into the core Vulkan 1.3 specification. However, at the time of writing, the
vulkano Rust bindings that we are using only support the Vulkan 1.3 version of
this specification. We will therefore only support subgroup size control for
Vulkan 1.3 devices.
There are two key parts to using Vulkan 1.3 subgroup size control:
- Because it is an optional device feature, we must first enable it at the time when a logical Device is created, via the enabled_features field of the DeviceCreateInfo struct.
  - As this is a Vulkan 1.3 specific feature, and Vulkan 1.3 is far from being universally supported at the time of writing, we will not want to force this feature on. It should either be enabled by explicit user request, or opportunistically when the device supports it.
- Given a Vulkan device for which subgroup size control has been enabled, the subgroup size of a compute pipeline can then be set at construction time by setting the required_subgroup_size field of the associated PipelineShaderStageCreateInfo. In doing so, we must respect the rules outlined in the documentation of required_subgroup_size:
  - The user-specified subgroup size must be a power of two ranging between the min_subgroup_size and max_subgroup_size device limits. These device limits were added to the device properties by Vulkan 1.3, which from a vulkano perspective means that the associated optional fields of DeviceProperties will always be set on Vulkan 1.3 compliant devices.
  - Compute pipeline workgroups must not contain more subgroups than dictated by the Vulkan 1.3 max_compute_workgroup_subgroups device limit. This new limit supplements the max_compute_work_group_invocations limit that had been present since Vulkan 1.0 by acknowledging the fact that on some devices, the maximum workgroup size is specified as a number of subgroups, and is thus subgroup size dependent.
Enabling device support + context building API refactor
Having to enable a device feature first is a bit of an annoyance in the
exercises codebase layout, because it means that our new subgroup size setting
(which is a parameter of compute pipeline building in the Gray-Scott simulation)
will end up affecting device building (which is handled by our Vulkan context
building code, which we’ve been trying to keep quite generic so far).
And this in turn means that we will once again need to add a parameter to the
Context::new() constructor and adapt all of its clients, including our old
square program that does not use subgroup size control. Clearly, this is not
great API design, so let’s give our Context constructor a quick round of API
redesign that reduces the need for such client adaptations in the future.
All of the Context constructor parameters that we have been adding over time
represent optional Vulkan context features that have an obvious default setting:
- The quiet parameter inhibits logging for microbenchmarks. Outside of microbenchmarks, logging should always be enabled, so that is a fine default setting.
- The progress parameter makes logging interoperate nicely with the indicatif progress bars, which are only used by the simulate binary. Outside of the simulate binary and other future programs with indicatif progress bars, there is no need to accommodate indicatif's constraints and we should log to stderr directly.
As you’ve seen before in the vulkano API, a common way to handle this sort of
“parameters with obvious defaults” in Rust is to pack them into a struct with a
default value…
/// `Context` configuration that is decided by the code that constructs the
/// context, rather than by the user (via environment or CLI)
#[derive(Clone, Default)]
pub struct ContextAppConfig {
/// When set to true, this boolean flag silences all warnings
pub quiet: bool,
/// When this is configured with an `indicatif::ProgressBar`, any logging
/// from `Context` will be printed in an `indicatif`-aware manner
pub progress: Option<ProgressBar>,
// [ ... subgroup size control enablement coming up ... ]
}
…then make it a parameter to the function of interest and later, on the client side, override only the parameters of interest and use the default value for all other fields:
// Example of ContextAppConfig configuration from a microbenchmark
ContextAppConfig {
quiet: true,
..Default::default()
}
The advantage of doing things this way is that as we later add support for new context features, like subgroup size control, existing clients will not need to be changed as long as they do not need the new feature. In other words, API backwards compatibility is preserved.
In our case, we know ahead of time that we are going to want an extra parameter to control subgroup size control, so let’s do it right away:
// Add to exercises/src/context.rs
/// `Context` configuration that is decided by the code that constructs the
/// context, rather than by the user (via environment or CLI)
#[derive(Clone, Default)]
pub struct ContextAppConfig {
/// When set to true, this boolean flag silences all warnings
pub quiet: bool,
/// When this is configured with an `indicatif::ProgressBar`, any logging
/// from `Context` will be printed in an `indicatif`-aware manner
pub progress: Option<ProgressBar>,
/// How the Vulkan 1.3 `subgroupSizeControl` feature should be handled
/// (enabling it requires a Vulkan 1.3 device with support for this feature)
pub subgroup_size_control: SubgroupSizeControlConfig,
}
/// Truth that the `subgroupSizeControl` Vulkan extension should be enabled
#[derive(Copy, Clone, Default, Eq, PartialEq)]
pub enum SubgroupSizeControlConfig {
/// Do not enable subgroup size control
///
/// This is the default and requires no particular support from the device.
/// But it prevents `subgroupSizeControl` from being used.
#[default]
Disable,
/// Enable subgroup size control if supported
///
/// Code that uses this configuration must be prepared to receive a device
/// that may or may not support subgroup size control.
IfSupported,
/// Always enable subgroup size control
///
/// Device configuration will fail if the user requests a device that does
/// not support subgroup size control, or if autoconfiguration does not find
/// a device that supports it.
Enable,
}
As you can see, we are going to want to support three subgroup size control configurations:
- Do not enable subgroup size control. This is the default, and it suits old clients that don't use subgroup size control, like the square computation, very well.
- Enable subgroup size control if it is supported, otherwise ignore it. This configuration is perfect for microbenchmarks, which want to exercise multiple subgroup sizes if available, without dropping support for devices that do not support Vulkan 1.3.
- Unconditionally enable subgroup size control. This is the right thing to do if the user requested that a specific subgroup size be used.
We will then refactor our Context::new() constructor to use the new
ContextAppConfig struct instead of raw parameters, breaking its API one last
time, and adjust downstream functions to pass down this struct or individual
members thereof as appropriate.
// To be changed in exercises/src/context.rs
impl Context {
/// Set up a `Context`
pub fn new(options: &ContextOptions, app_config: ContextAppConfig) -> Result<Self> {
let library = VulkanLibrary::new()?;
let mut logging_instance =
LoggingInstance::new(library, &options.instance, app_config.progress.clone())?;
let physical_device =
select_physical_device(&logging_instance.instance, &options.device, &app_config)?;
let DeviceAndQueues {
device,
compute_queue,
transfer_queue,
} = DeviceAndQueues::new(physical_device, &app_config)?;
// [ ... rest of context building does not change ... ]
}
}
This means that select_physical_device() now takes a full app_config instead
of a single quiet boolean parameter:
/// Pick a physical device
fn select_physical_device(
instance: &Arc<Instance>,
options: &DeviceOptions,
app_config: &ContextAppConfig,
) -> Result<Arc<PhysicalDevice>> {
// [ ... change all occurences of `quiet` to `app_config.quiet` here ... ]
}
And there is a good reason for this, which is that if the user explicitly
required subgroup size control with SubgroupSizeControlConfig::Enable, our
automatic device selection logic within this function must adapt. Otherwise, we
could accidentally select a device that does not support subgroup size control
even if there’s another device on the machine that does, resulting in an
automatic device configuration that errors out when it could not do so.
We handle that by adding a new function that tells whether a Vulkan device supports subgroup size control for compute shaders…
// Add to exercises/src/context.rs
use vulkano::{shader::ShaderStages, Version};
/// Check for compute shader subgroup size control support
pub fn supports_compute_subgroup_size_control(device: &PhysicalDevice) -> bool {
if device.api_version() < Version::V1_3 {
return false;
}
if !device.supported_features().subgroup_size_control {
return false;
}
device
.properties()
.required_subgroup_size_stages
.expect("Checked for Vulkan 1.3 support above")
.contains(ShaderStages::COMPUTE)
}
…and adding this check to our device filter within select_physical_device(),
which formerly only checked that subgroup support was good enough for our
Gray-Scott simulation:
// Change in exercises/src/context.rs
/// Pick a physical device
fn select_physical_device(
instance: &Arc<Instance>,
options: &DeviceOptions,
app_config: &ContextAppConfig,
) -> Result<Arc<PhysicalDevice>> {
let mut devices = instance.enumerate_physical_devices()?;
if let Some(index) = options.device_index {
// [ ... known-device logic is unchanged ... ]
} else {
// Otherwise, choose a device according to its device type
devices
.filter(|dev| {
if !crate::grayscott::is_suitable_device(dev) {
return false;
}
if app_config.subgroup_size_control == SubgroupSizeControlConfig::Enable
&& !supports_compute_subgroup_size_control(dev)
{
return false;
}
true
})
// [ ... rest of the iterator pipeline is unchanged ... ]
}
}
Finally, we make DeviceAndQueues::new() enable subgroup size control as
directed…
impl DeviceAndQueues {
/// Set up a device and associated queues
fn new(device: Arc<PhysicalDevice>, app_config: &ContextAppConfig) -> Result<Self> {
// [ ... queue selection is unchanged ... ]
// Decide if the subgroupSizeControl Vulkan feature should be enabled
let subgroup_size_control = match app_config.subgroup_size_control {
SubgroupSizeControlConfig::Disable => false,
SubgroupSizeControlConfig::IfSupported => {
supports_compute_subgroup_size_control(&device)
}
SubgroupSizeControlConfig::Enable => true,
};
let enabled_features = DeviceFeatures {
subgroup_size_control,
..Default::default()
};
// Set up the device and queues
let (device, mut queues) = Device::new(
device,
DeviceCreateInfo {
queue_create_infos,
enabled_features,
..Default::default()
},
)?;
// [ ... rest is unchanged ... ]
}
}
…and as far as the context-building code is concerned, we are all set.
Finally, we switch the square binary and microbenchmark to the new
Context::new() API for what will hopefully be the last time.
// Change in exercises/src/bin/square.rs
use grayscott_exercises::context::ContextAppConfig;
fn main() -> Result<()> {
// Parse command line options
let options = Options::parse();
// Set up a Vulkan context
let context = Context::new(&options.context, ContextAppConfig::default())?;
// [ ... rest is unchanged ... ]
}
// Change in exercises/benches/square.rs
use grayscott_exercises::context::ContextAppConfig;
// Benchmark for various problem sizes
fn criterion_benchmark(c: &mut Criterion) {
// Iterate over device numbers, stopping on the first failure (which we
// take as evidence that no such device exists)
for device_index in 0.. {
let context_options = ContextOptions {
instance: InstanceOptions { verbose: 0 },
device: DeviceOptions {
device_index: Some(device_index),
},
};
let app_config = ContextAppConfig {
quiet: true,
..Default::default()
};
let Ok(context) = Context::new(&context_options, app_config.clone()) else {
break;
};
// Benchmark context building
let device_name = &context.device.physical_device().properties().device_name;
c.bench_function(&format!("{device_name:?}/Context::new"), |b| {
b.iter(|| {
Context::new(&context_options, app_config.clone()).unwrap();
})
});
// [ ... rest is unchanged ... ]
}
}
Let’s not adapt the Gray-Scott binary and benchmark to this API yet, though. We
have other things to do on that side before we can set that new
subgroup_size_control parameter correctly.
Simulation subgroup size control
In the Gray-Scott simulation’s pipeline module, we will begin by adding a new
optional simulation CLI parameter for explicit subgroup size control, which can
also be set via environment variables for microbenchmarking convenience. This is
just a matter of adding a new field to the PipelineOptions struct with the
right annotations:
// Change in exercises/src/grayscott/pipeline.rs
/// CLI parameters that guide pipeline creation
#[derive(Debug, Args)]
pub struct PipelineOptions {
/// Number of work-items in a workgroup
///
/// Must be a multiple of the Vulkan device's subgroup size.
#[arg(short = 'W', long, default_value = "128")]
pub workgroup_size: NonZeroU32,
/// Enforce a certain number of work-items in a subgroup
///
/// This configuration is optional and may only be set on Vulkan devices
/// that support the Vulkan 1.3 `subgroupSizeControl` feature.
#[arg(env, short = 'S', long)]
pub subgroup_size: Option<NonZeroU32>,
// [ ... rest is unchanged ... ]
}
We will then adjust our compute stage setup code so that it propagates this configuration to our compute pipelines when it is present…
/// Set up a compute stage from a previously specialized shader module
fn setup_compute_stage(
module: Arc<SpecializedShaderModule>,
options: &PipelineOptions, // <- Need this new parameter
) -> PipelineShaderStageCreateInfo {
let entry_point = module
.single_entry_point()
.expect("a compute shader module should have a single entry point");
PipelineShaderStageCreateInfo {
required_subgroup_size: options.subgroup_size.map(NonZeroU32::get),
..PipelineShaderStageCreateInfo::new(entry_point)
}
}
…which means the pipeline options must be passed down to
setup_compute_stage() now, so the Pipelines::new() constructor needs a small
adjustment:
impl Pipelines {
pub fn new(options: &RunnerOptions, context: &Context) -> Result<Self> {
fn setup_stage_info(
options: &RunnerOptions,
context: &Context,
load: impl FnOnce(
Arc<Device>,
)
-> std::result::Result<Arc<ShaderModule>, Validated<VulkanError>>,
) -> Result<PipelineShaderStageCreateInfo> {
// [ ... beginning is unchanged ... ]
Ok(setup_compute_stage(module, &options.pipeline))
}
// [ ... rest is unchanged ... ]
}
}
We must then adjust the simulation scheduling code so that it accounts for the subgroup size that is actually being used, which is not necessarily the default subgroup size from Vulkan 1.1 device properties anymore…
// Change in exercises/src/grayscott/mod.rs
/// Record the commands needed to run a bunch of simulation iterations
fn schedule_simulation(
options: &RunnerOptions,
context: &Context,
pipelines: &Pipelines,
concentrations: &mut Concentrations,
cmdbuild: &mut CommandBufferBuilder,
) -> Result<()> {
// Determine the appropriate dispatch size for the simulation
let subgroup_size = if let Some(subgroup_size) = options.pipeline.subgroup_size {
subgroup_size.get()
} else {
context
.device
.physical_device()
.properties()
.subgroup_size
.expect("Should have checked for subgroup supports before we got here")
};
// [ ... rest is unchanged ... ]
}
…and we must adjust the main simulation binary so that during the Vulkan
context setup stage, it requests support for subgroup size control if the user
has asked for it via the compute pipeline options. We will obviously also
migrate it to the new ContextAppConfig API along the way:
// Change in exercises/src/bin/simulate.rs
use grayscott_exercises::context::ContextAppConfig;
fn main() -> Result<()> {
// [ ... beginning is unchanged ... ]
// Set up the Vulkan context
let context = Context::new(
&options.context,
ContextAppConfig {
progress: Some(progress.clone()),
subgroup_size_control: if options.runner.pipeline.subgroup_size.is_some() {
SubgroupSizeControlConfig::Enable
} else {
SubgroupSizeControlConfig::Disable
},
..Default::default()
},
)?;
// [ ... rest is unchanged ... ]
}
Microbenchmarking at all subgroup sizes
Finally, let us migrate our simulation microbenchmark to use subgroup size control. Here, the behavior that we want is a bit more complex than for the main simulation binary:
- By default, we want to benchmark at all available subgroup sizes if subgroup size control is available. If it is not available, on the other hand, we only benchmark at the default Vulkan 1.1 subgroup size as we did before. This lets us explore the full capabilities of newer devices without increasing our minimal device requirements on the other end.
- Probing all subgroup sizes increases benchmark duration by a multiplicative
factor, which can be very large on some devices (there are embedded GPUs out
there that support all subgroup sizes from 4 to 128). Therefore, once the
optimal subgroup size has been determined, we let the user benchmark at this
subgroup size only via the SUBGROUP_SIZE environment variable (which controls the subgroup_size field of our PipelineOptions through clap magic). Obviously, in that case, we will need to force subgroup size control on, as in any other case where the user specifies the desired subgroup size.
This behavior can be implemented in a few steps. First of all, we adjust the
Vulkan context setup code so that subgroup size control is always enabled if the
user specified an explicit subgroup size, and enabled if supported otherwise.
Along the way, we migrate our Vulkan context setup to the new ContextAppConfig
API:
// Change in exercises/benches/simulate.rs
use grayscott_exercises::context::{
ContextAppConfig,
SubgroupSizeControlConfig,
};
fn criterion_benchmark(c: &mut Criterion) {
// Start from the default context and runner options
let context_options = default_args::<ContextOptions>();
let mut runner_options = default_args::<RunnerOptions>();
// Set up the Vulkan context
let context = Context::new(
&context_options,
ContextAppConfig {
quiet: true,
subgroup_size_control: if runner_options.pipeline.subgroup_size.is_some() {
SubgroupSizeControlConfig::Enable
} else {
SubgroupSizeControlConfig::IfSupported
},
..Default::default()
},
)
.unwrap();
// [ ... other changes coming up next ... ]
}
We then set up a Vec of optional explicit subgroup sizes to be benchmarked:
- If subgroup size control is not supported, then there is a single None entry, which models the default subgroup size from Vulkan 1.1.
- Otherwise there is one Some(size) entry per subgroup size that should be exercised.
That approach is admittedly not very elegant, but it will make it easier to support presence or absence of subgroup size control with a single code path later on.
We populate this table according to the logic discussed above: user-specified subgroup size takes priority, then subgroup size control if supported, otherwise default subgroup size.
use grayscott_exercises::context;
// [ ... after Vulkan context setup ... ]
// Determine which subgroup size configurations will be benchmarked
//
// - If the user specified a subgroup size via the SUBGROUP_SIZE environment
// variable, then we will only try this one.
// - Otherwise, if the device supports subgroup size control, then we will
// try all supported subgroup sizes.
// - Finally, if the device does not support subgroup size control, we have
// no choice but to use the default subgroup size only.
let mut subgroup_sizes = Vec::new();
if runner_options.pipeline.subgroup_size.is_some() {
subgroup_sizes.push(runner_options.pipeline.subgroup_size);
} else if context::supports_compute_subgroup_size_control(context.device.physical_device()) {
let device_properties = context.device.physical_device().properties();
let mut subgroup_size = device_properties.min_subgroup_size.unwrap();
let max_subgroup_size = device_properties.max_subgroup_size.unwrap();
while subgroup_size <= max_subgroup_size {
subgroup_sizes.push(Some(NonZeroU32::new(subgroup_size).unwrap()));
subgroup_size *= 2;
}
} else {
subgroup_sizes.push(None);
}
// [ ... more to come ... ]
Within the following loop over workgroup sizes, we then inject a new loop over benchmarked subgroup sizes. When an explicit subgroup size is specified, our former device support test is extended to also take the new max_compute_workgroup_subgroups limit into account. And if the selected subgroup size is accepted, we inject it into the PipelineOptions used by the benchmark.
// [ ... after subgroup size enumeration ... ]
// For each workgroup size of interest...
for workgroup_size in [128, 256, 512, 1024] {
// ...and for each subgroup size of interest...
for subgroup_size in subgroup_sizes.iter().copied() {
// Check if the device supports this workgroup configuration
let device_properties = context.device.physical_device().properties();
if workgroup_size > device_properties.max_compute_work_group_size[0]
|| workgroup_size > device_properties.max_compute_work_group_invocations
{
continue;
}
if let Some(subgroup_size) = subgroup_size
&& workgroup_size.div_ceil(subgroup_size.get())
> device_properties.max_compute_workgroup_subgroups.unwrap()
{
continue;
}
// Set up the pipeline
runner_options.pipeline = PipelineOptions {
workgroup_size: NonZeroU32::new(workgroup_size).unwrap(),
subgroup_size,
update: runner_options.pipeline.update,
};
// [ ... more to come ... ]
}
}
}
Finally, when we configure our criterion benchmark group, we indicate any
explicitly configured subgroup size in the benchmark name so that benchmarks
with different subgroup sizes can be differentiated from each other and filtered
via a regex.
// [ ... at the top of the loop over total_steps ... ]
// Benchmark groups model a certain total amount of work
let mut group_name = format!("run_simulation/workgroup{workgroup_size}");
if let Some(subgroup_size) = subgroup_size {
write!(group_name, "/subgroup{subgroup_size}").unwrap();
}
write!(
group_name,
"/domain{num_cols}x{num_rows}/total{total_steps}"
)
.unwrap();
let mut group = c.benchmark_group(group_name);
group.throughput(Throughput::Elements(
(num_rows * num_cols * total_steps) as u64,
));
// [ ... rest is unchanged ... ]
And that’s it. We are now ready to benchmark our subgroups-enabled simulation at all subgroup sizes supported by our GPU.
First benchmark results
TODO: Check output in wave32 mode and benchmark wrt async-download-1d.
Guaranteeing 2D cache locality
As hinted above, our initial mapping of GPU work-items to data was meant to be easy to understand and to ease the introduction of subgroup operations, but it has several flaws:
- From a memory bandwidth savings perspective, subgroup operations give us good reuse of every input data points across left/right neighbors. But our long line-like 1D workgroups do not guarantee much reuse of input datapoints across top/bottom neighbors, hampering cache locality. We may get some cache locality back if the GPU work scheduler happens to schedule workgroups with the same X position and consecutive Y positions on the same compute unit, but that’s just being lucky as nothing in the Vulkan spec guarantees this behavior.
- From an execution efficiency perspective, workgroups that extend far across one spatial direction are not great either, because it is likely that on the right edge they will extend far beyond the end of the simulation domain, and this will result in lots of wasted scheduling work (the Vulkan device will spawn lots of subgroups/work-items that cannot even load input data and must exit immediately).
Due to these two issues, it would be better for our work-items to cover a simulation domain region that extends over the two dimensions of space, in a shape that gets as close to square as possible.
But as we have seen before, we cannot simply do so by using workgroups that have a 2D shape within the GPU-defined work grid, because when we do that our subgroup spatial layout becomes undefined and we cannot efficiently tell at compile time which work-item within the subgroup is a left/right neighbor of the active work-item anymore.
However, there is a less intuitive way to get where we want, which is to ask for 1D work-groups from the GPU API, but then map them over a 2D spatial region by having each subgroup process a different line of the simulation domain at the GLSL level.
In other words, instead of spatially laying out the subgroups of a workgroup in such a way that their output region forms a contiguous line, as we did before…

…we will now stack their output regions on top of each other, without any horizontal shift:

Now, as the diagram above hints, if we stacked enough subgroups on top of each other like this, we could end up with the opposite problem, where our subgroups are very narrow in the horizontal direction. This in turn could lead to worse cached data reuse across subgroups in the horizontal direction and more wasted GPU scheduler work on the bottom edge of the simulation domain.
But this is unlikely to be a problem for real-world GPUs as of 2025 because…
- As the GPUinfo database told us earlier, most GPUs have a rather wide subgroup size. 32 work-items is the most common configuration as of 2025, while 64 work-items remains quite frequent too due to its use in older AMD GPU architectures.
- 80% of GPUs whose properties were reported to GPUinfo in the past 2 years have a maximal work-group size of 1024 work-items, with a few extending a bit beyond that, up to 2048 work-items. With a subgroup size of 32 work-items, that's respectively 32 and 64 subgroups, getting us close to the square-ish shape that is best for cache locality and execution efficiency.
…and if it becomes a problem for future GPUs, we will be able to address it with only slightly more complex spatial layout logic in our compute shaders:
- If subgroups end up getting narrower with respect to workgroups, then we can use a mixture of our old and new strategy to tile subgroups spatially in a 2D fashion, e.g. get 2 overlapping columns of subgroups instead of one single stack.
- If subgroups end up getting wider with respect to workgroups, then we can make a single subgroup process 2+ lines of output by using a more complicated shuffling and masking pattern that treats the subgroup as e.g. a 16x2 row-major rectangle of work-items (sketched below).9
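Here is a hedged sketch of what such a 16x2 mapping could look like for a 32-wide subgroup. We do not use this in the course's code: the SUBGROUP_COLS constant is invented, and the matching changes to the shuffling and masking logic are left out.
// Hypothetical 16x2 subgroup layout, assuming gl_SubgroupSize == 32
const uint SUBGROUP_COLS = 16;
// Position of this work-item within its subgroup's 16x2 rectangle
uvec2 subgroup_local_pos() {
    return uvec2(
        gl_SubgroupInvocationID % SUBGROUP_COLS,  // Column within the subgroup
        gl_SubgroupInvocationID / SUBGROUP_COLS   // Row within the subgroup (0 or 1)
    );
}
// Left/right neighbors remain one relative shuffle away, while the work-item
// on the other row of the subgroup is SUBGROUP_COLS lanes away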
Sticking with simple subgroup stacking logic for now, the GLSL compute shader
change is pretty straightforward as we only need to adjust the work_item_pos()
function…
// To be changed in exercises/src/grayscott/step.comp
// Position of the simulation dataset that corresponds to the central value
// that this work-item will eventually write data to, if it ends up writing
uvec2 work_item_pos() {
const uint first_x = DATA_START_POS.x - BORDER_SIZE;
const uint item_x = gl_WorkGroupID.x * subgroup_output_size()
+ gl_SubgroupInvocationID
+ first_x;
const uint item_y = gl_WorkGroupID.y * gl_NumSubgroups
+ gl_SubgroupID
+ DATA_START_POS.y;
return uvec2(item_x, item_y);
}
…and on the host side, we need to adjust the compute dispatch size so that it spawns more workgroups on the horizontal axis and fewer on the vertical axis:
// To be changed in exercises/src/grayscott/mod.rs
/// Record the commands needed to run a bunch of simulation iterations
fn schedule_simulation(
options: &RunnerOptions,
context: &Context,
pipelines: &Pipelines,
concentrations: &mut Concentrations,
cmdbuild: &mut CommandBufferBuilder,
) -> Result<()> {
// [ ... everything up to subgroups_per_workgroup is unchanged ... ]
let workgroups_per_row = options.num_cols.div_ceil(subgroup_output_size as usize) as u32;
let workgroups_per_col = options.num_rows.div_ceil(subgroups_per_workgroup as usize) as u32;
let simulate_workgroups = [workgroups_per_row, workgroups_per_col, 1];
// [ ... rest is unchanged ... ]
}
And with just these two changes, we already get a better chance that our compute pipeline will experience good cache locality for loads from all directions, without relying on our GPU to “accidentally” schedule workgroups with consecutive Y coordinates on the same compute unit.
TODO: Add and comment benchmark results
Shared memory, take 2
Because a workgroup is fully processed by a single GPU compute unit, making our workgroups process a 2D chunk of the simulation dataset has significantly increased the odds that our redundant memory loads for top/bottom data points will be correctly handled by the fastest layer of cache within said compute unit, instead of taking the slow trip to VRAM.
But we cannot yet guarantee that this will happen, because workgroups are processed in parallel and a single compute unit processes multiple workgroups concurrently, which means that unlucky GPU task scheduling could still degrade the efficiency of this automatic input data caching.
If needed, we can eliminate this last bit of reliance on lucky GPU scheduling by doing the following:
- Allocate some shared memory to make sure that the GPU reserves some cache real estate for our inputs, where they cannot be displaced by unrelated loads and stores, restricting the number of concurrent workgroups as much as necessary to make everything fit in cache.
- Make each work-item load its designated (U, V) pair into this cache.
- Use a workgroup barrier to wait for all input data to have been loaded into this manually managed cache, before we start reading from it.
- Read back (U, V) pairs of the top/bottom neighbors of the active work-item, if any, from the shared memory cache that we have just filled up. If we don't have a top/bottom neighbor in the workgroup, read top/bottom data from (possibly cached) main memory instead.
This logic is somewhat similar to what we did in the shared memory chapter, but with one important difference. In the shared memory chapter, we used to dedicate a whole row of work-items to load inputs at the top and bottom of the workgroup, in order to avoid introducing divergent “load from main memory” logic that could slow down the whole SIMD set of work-items…

…but now that our processing is subgroups-aware and our subgroup spatial layout is well controlled, we do not need to exclude that many work-items from later processing anymore.
Instead, we can simply branch and conditionally do the slow top/bottom data load when we are in the top/bottom subgroup. Since subgroups are executed independently by the GPU, doing so will not slow down the rest of the workgroup. Due to this, a workgroup of the same size can process a larger output region within the simulation domain:

In terms of code, we can handle all of this in GLSL alone, without host involvement. First of all, we define a shared memory cache similar to the one that we had before, but this time we only allow for a vertical source data offset at load time since subgroups will take care of horizontal data motion:
// To be added to exercises/src/grayscott/step.comp, before main()
// Shared memory storage for inter-subgroup data exchange
shared float uv_cache[2][WORKGROUP_SIZE];
// Save the (U, V) value that this work-item loaded so that the rest of the
// workgroup may later access it efficiently
void set_uv_cache(vec2 uv) {
uv_cache[U][gl_LocalInvocationID.x] = uv.x;
uv_cache[V][gl_LocalInvocationID.x] = uv.y;
}
// Get an (U, V) value that was previously saved by a neighboring work-item
// within this workgroup. Remember to use a barrier() first!
vec2 get_uv_cache(int y_offset) {
const int shared_pos = int(gl_LocalInvocationID.x) + y_offset * int(gl_SubgroupSize);
return vec2(
uv_cache[U][shared_pos],
uv_cache[V][shared_pos]
);
}
We then adjust the beginning of our shader entry point to load designated inputs into shared memory, as we did in the previous chapter. As before, we need to be careful not to let work-items exit too early in this version, as we will need all work-items from the workgroup in order to perform the workgroup barrier at the end.
// Simulation step entry point
void main() {
// Map work items into 2D dataset
const uvec2 pos = work_item_pos();
// Load and share our designated (U, V) value, if any
vec2 uv = vec2(0.0);
const bool is_valid_input = all(lessThan(pos, padded_end_pos()));
if (is_valid_input) {
uv = read(pos);
set_uv_cache(uv);
}
// Wait for the shared (U, V) cache to be ready, then discard work-items
// that are out of bounds for input loading work.
barrier();
if (!is_valid_input) return;
// [ ... more to come ... ]
We then refactor the loop over vertical neighbor coordinates so that it works using relative neighbor offsets, which are more consistent with our new logic, and then we load neighboring data from shared memory when possible and from main memory otherwise.
for (int y = -1; y <= 1; ++y) {
// Get the top/center/bottom value
vec2 stencil_uvs[3];
if (y == 0) {
// We loaded the center value ourselves
stencil_uvs[1] = uv;
} else {
// Check if the top/bottom value was read by another subgroup within
// the current workgroup and stored back into the shared mem cache
int target_subgroup = int(gl_SubgroupID) + y;
if (target_subgroup >= 0 && target_subgroup < gl_NumSubgroups) {
// If so, get it from shared memory
stencil_uvs[1] = get_uv_cache(y);
} else {
// If not, perform a normal memory load
stencil_uvs[1] = read(uvec2(
pos.x,
uint(int(pos.y) + y)
));
}
}
// [ ... more to come ... ]
}
Finally, due to the change of y indexing convention, we need to adjust the one
line of code that loads the stencil weight within the inner loop over horizontal
coordinates:
// Add associated contributions to the stencil
for (int x = 0; x < 3; ++x) {
const float weight = weights[x][y+1];
full_uv += weight * (stencil_uvs[x] - uv);
}
And with that, we will be using shared memory for manual caching again, and we are ready to check whether, this time, the manually managed shared memory manages to outperform the automatic caching that the device performs for us.
Exercise
Implement the above optimizations, and see how worthwhile they are on your device. Consider following the steps above one by one (1D workgroups, then 2D layout, then shared memory), running the microbenchmark each time to see what the impact of each optimization is: depending on how lucky/unlucky you get with GPU scheduling and caching, the outcomes may be quite different on your GPU than on the author’s!
-
Streaming Multiprocessors in NVidia jargon. ↩
-
In the interest of simplification we will ruthlessly ignore the existence of scalar GPU instructions. ↩
-
This is similar to hyperthreading on x86 CPUs, but much more important for GPU performance. The reason is that whenever there is a hardware tradeoff between execution latency and throughput, GPUs tends to favor throughput over latency. And as a result, GPU instructions tend to have much worse latency than CPU ones, especially where VRAM accesses get involved. ↩
-
Namely excellent support for gather/scatter memory access patterns and predicated conditional execution, where code that has aligned/contiguous memory access patterns and uniform logic is executed as efficiently as if these convenience/flexibility features were not there. ↩
-
Surprisingly recent, in fact, when you realize how long proprietary GPU APIs and extensions have been providing explicit SIMD instructions. For NVidia at least, shuffle support can be traced back to the Kepler microarchitecture from 2012. ↩
-
Vulkan specifies both relative and absolute shuffle operations. The former allow you to get data from the SIMD lane that is located N lanes before/after the active lane, while the latter allow you to get data from an arbitrary SIMD lane. Relative shuffles are more amenable to efficient hardware implementation, a better match for our use case, and supported on all hardware that supports shuffles at all according to the GPUinfo database, so we will go for that. ↩
- Technically we could support unidirectional communication inside of even smaller 2-element subgroups. But GPUinfo tells us that all hardware that has more than 1 subgroup element has at least 4 of them, so we don’t need to go through this trouble. ↩
- Normally, a square-ish 2D workgroup size ensures that a maximal amount of the input data points which are loaded by some work-items, and thus end up resident in the compute unit’s cache, are made available for use by neighboring work-items in the workgroup. With a 1D workgroup, this is only guaranteed for left/right neighbors, not top/bottom neighbors, so we would expect worse cache locality. But we should not forget that a GPU compute unit concurrently executes many workgroups. If the GPU’s scheduler happens to place workgroups associated with the same section of consecutive rows of the simulation domain on the same compute unit, then the same kind of cache locality that we had before by construction can happen again by luck. Relying on this kind of luck is not great, given that the workgroup schedulers of different GPU manufacturers are not guaranteed to all work in the same way, which is why we will want to find a trick to enforce good 2D cache locality again even when subgroup peculiarities force us into using 1D workgroups. ↩
- This strategy could also give us faster shuffles on modern AMD GPUs, where shuffles over 8-16 work-items get a hardware fast path compared to shuffles that cover an entire subgroup. But that comes at the expense of reduced execution efficiency (fewer work-items generate and write outputs per subgroup), and in the interest of encouraging GPU code portability, we would rather not put excessive emphasis on such hardware-specific optimization strategies in this course. ↩
Thread coarsening
TODO: Conclude on subgroups + shared memory with final benchmark
Overall, so far, our experiments with GPU shared memory have not been very conclusive. Sometimes a small benefit, sometimes a net loss, never a big improvement.
As discussed before, this could come from several factors, which we are unfortunately not able to disambiguate with vendor profiling tools yet:
- It might be the case that the scheduling and caching stars aligned just right on our first attempt, and that we got excellent compute unit caching efficiency from the beginning. If so, it is normal that replacing the compute unit’s automatic caching with manual shared memory caching does not help performance much, as we’re using a more complex GPU program just to get the same result in the end.
- Or it might be the case that we are hitting a bottleneck related to our use of shared memory, which may originate from an unavoidable limitation of the hardware shared memory implementation, or from the way we use shared memory in our code.
Now that we have covered subgroups, we are ready to elaborate on the “hardware limitation” part. As of 2025, most recent GPUs from AMD and NVidia use compute units that contain four 32-lane SIMD ALUs (or 128 SIMD lanes in total), but only 32 shared memory banks, each of which can load or store one 32-bit value per clock cycle.
Knowing these microarchitectural numbers, we can easily tell that in situations where all active workgroups put pressure on shared memory, the associated load/store instructions can only execute at a quarter of the rate of arithmetic operations (32 bank accesses per cycle vs 128 SIMD lanes). This could easily become a bottleneck in memory-intensive computations like ours.
And if this relatively low hardware load/store bandwidth turns out to be the factor that limits our performance, then the answer is to use a layer of the GPU memory hierarchy that is even faster than shared memory. The choice will be easy, since there is only one such layer that is user-accessible in portable GPU APIs, namely GPU registers.
Leveraging registers
GPU vector registers are private to a subgroup, and there is no way to share them across subgroups. In fact, the only interface that we GPU programmers have for allocating and using registers is shader local variables, which are logically private to the active work-item. In normal circumstances,1 each local 32-bit shader variable whose value will be reused later by the program maps onto one lane of a vector register owned by the subgroup that the active work-item belongs to.
Any data that is cached within GPU registers must therefore be reused by either the same work-item, or another work-item within the same subgroup. The previous chapter was actually our first example of the latter: in it, work-items were able to share data held inside their vector registers without going through shared memory, by leveraging subgroup operations. On typical GPU hardware, those operations are implemented in such a way that everything goes through GPU registers and shared memory storage banks do not get involved.
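For instance, the shuffle_from_left() and shuffle_from_right() helpers from the previous chapter can be thought of as thin wrappers around the relative shuffle operations of the GL_KHR_shader_subgroup_shuffle_relative GLSL extension. A possible implementation could look like the following sketch (the course’s actual definitions may differ in their details):
#extension GL_KHR_shader_subgroup_shuffle_relative : require
// Possible implementation of the neighbor-sharing helpers used in the
// previous chapter: fetch the value held by the previous/next SIMD lane of
// the active subgroup. The result is garbage for the first/last lane of the
// subgroup and must be discarded by the caller, as discussed previously.
vec2 shuffle_from_left(vec2 value) {
    return subgroupShuffleUp(value, 1);
}
vec2 shuffle_from_right(vec2 value) {
    return subgroupShuffleDown(value, 1);
}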
But subgroup operations are mostly useful for sharing data over a one-dimensional axis, and attempting to use them for two-dimensional data sharing would likely not work so well because we would need to reserve too many work-items for input loading purposes and thus lose too much compute throughput during the arithmetic computations.
To convince yourself of this, recall that right now, assuming the common case of 32-lane subgroups, we only need to reserve one work-item on the left and one on the right for input loading, which only costs us 2/32 = 6.25% of our SIMD arithmetic throughput during the main computation.

If we rearranged that same 32-lane subgroup into a 8x4 two-dimensional shape, then reserving one work-item on each side for input loading would only leave us with 2x6 = 12 work-items in the middle for the rest of the computation. In other words, we would have sacrificed 20/32 = 62.5% of our available compute throughput to our quest for faster data loading, which feels excessive…

…and as discussed before, we need these reserved input-loading work-items on each side for subgroup-based sharing to be worthwhile. If the work-items on the edge simply loaded missing neighbor data from main memory and proceeded with the rest of the computation, then the entire subgroup would become bottlenecked by the performance of loading these neighbors from main memory due to lockstep SIMD execution, nullifying most of the benefits of fast data sharing inside of the subgroup.
However, there is an alternative to using subgroups for efficient data sharing across the vertical axis. We can keep our subgroups focused on the processing of horizontal lines, but have each subgroup process multiple lines of input/output data instead of a single one. In this situation, if each individual subgroup keeps around a circular buffer of the last three lines of input that it has read (which will be stored in GPU vector registers), then for each new line of outputs after the first one, that subgroup will only be loading the next line of input from main GPU memory, while the other two lines of input will come from fast vector registers.
This may be a bit too much textual info at once, so let’s visualize it in pictures:

As summarized on the diagram above, our subgroup has a set of input memory locations forming a vertical stack, a set of output memory locations also forming a vertical stack but with two fewer lines, and an internal circular buffer that is composed of…
- An array of three (U, V) inputs per work-item (or triplets thereof if left/right neighbors are cached to avoid repeated shuffling) that will be used to store a rolling window of input data during subsequent processing.
- An integer that indicates where input will be loaded next.
As part of work-item initialization, we prepare the processing pipeline by loading the first two lines of input into the circular buffer, incrementing its next-location index accordingly:

Once this is done, we can produce the first output by loading another line of input into the circular buffer, then using all three circular buffer entries to compute that output…

…and the whole point of going through this more complex logic is that after doing that, for each new vector of inputs that we load, our subgroup can immediately produce another vector of outputs:

As you can see, given a large enough number of outputs per work-item, this pipeline converges towards an optimal asymptotic efficiency of loading one vector of inputs per vector of outputs being produced.
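To make this scheme a bit more concrete, here is a simplified GLSL sketch that reuses the read() helper from previous chapters. It is only an illustration, not the course’s actual implementation: the LINES_PER_SUBGROUP constant and the produce_output_lines() function are made-up placeholders, and bounds checking as well as left/right neighbor handling are omitted.
// Hypothetical sketch of the rolling-window logic described above
const uint LINES_PER_SUBGROUP = 8;  // Placeholder coarsening factor

void produce_output_lines(uvec2 first_input_pos) {
    // Circular buffer holding the last three lines of input read by this
    // work-item. Because LINES_PER_SUBGROUP is a compile-time constant, the
    // compiler can fully unroll the loop below, which turns every buffer
    // index into a constant and lets the array live in vector registers.
    vec2 uv_window[3];
    // Prime the pipeline by loading the first two lines of input
    uv_window[0] = read(first_input_pos);
    uv_window[1] = read(first_input_pos + uvec2(0, 1));
    uint next = 2;  // Buffer slot that the next line of input will go to
    // Steady state: each new line of input yields one new line of output
    for (uint line = 0; line < LINES_PER_SUBGROUP; ++line) {
        uv_window[next] = read(first_input_pos + uvec2(0, line + 2));
        next = (next + 1) % 3;
        // The oldest/middle/newest buffer entries are now the top/center/
        // bottom stencil inputs of the output at vertical offset line + 1
        const vec2 uv_top = uv_window[next];
        const vec2 uv_center = uv_window[(next + 1) % 3];
        const vec2 uv_bottom = uv_window[(next + 2) % 3];
        // ... compute and write that output as in the previous chapters,
        //     still using subgroup shuffles for left/right neighbors ...
    }
}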
And that is not the only benefit of doing things this way. Because we are doing more work per work-item, we will also be able to amortize the hardware costs associated with spawning a workgroup on the GPU, and the software costs of initializing each of its work-items on our side. For example…
- Since we know that a subgroup will be producing consecutive lines of output, we will only need to call the relatively complex work_item_pos() GLSL function once, and will be able to simply increment the Y coordinate of the resulting position vector for each subsequent input.
- Since the X coordinate targeted by the subgroup does not change, we will only need to check if that X coordinate is out of bounds once, and all subsequent calculations will simply need to check that the target Y output coordinate is still in bounds.
- If the GPU’s instruction set is not able to encode the matrix of stencil weights and other constants into immediate instruction operands, and must load them from some compute unit local constant cache with finite bandwidth, then after this change we will only need to load these constants once, and can keep them around in scalar GPU registers afterwards for faster access in all subsequent output computations.
The bad news
All of the above may sound pretty amazing, and at this point you may be wondering why I have not introduced this optimization earlier in this course. But in software performance engineering, there is rarely such a thing as a perfect optimization strategy that has only upsides and no downsides.
This particular optimization strategy is no exception, as its many upsides come at the expense of two very important downsides, which are the reasons why it is usually only tried as a last resort:
- Reduced concurrency: If each GPU work-item does more work, it means that we use fewer work-items to do the same amount of work, and thus we feed the GPU with fewer concurrent tasks. But GPUs need plenty of concurrent tasks to fill up their compute units and hide the latency of individual operations, and the more powerful they are, the worse this gets. So overall, this optimization may decrease our program’s ability to efficiently utilize larger, higher-end GPUs at equal domain size. And conversely, we will need to simulate larger simulation domains in order to achieve the same computational efficiency on a particular GPU.
- Register pressure: Everything mentioned so far relies on the GPU compiler keeping a lot more state in registers than before, without being able to discard this state immediately after use, because it will be reused when producing the next GPU output. This will result in “bigger” GPU workgroups with a larger register footprint, potentially making us hit the hardware limit where a GPU is not able to fill compute units with workgroups up to their maximal capacity anymore. Given fewer concurrent workgroups, compute units get worse at latency hiding, which means that we will need to be more careful about avoiding high-latency operations. If we take this too far, the GPU won’t even be able to fill up a compute unit with enough work to keep all of its SIMD ALUs active, and then performance will fall off a cliff.
Without concrete orders of magnitude, these two issues may sound abstract and perhaps too minor to worry about. So let’s get some numbers from the author’s Radeon RX 5600M GPU, which has been used to produce most of the benchmark data featured in this course.
Some concurrency numbers
This GPU implements the first-generation AMD RDNA architecture, which comes with reasonable first-party ISA and microarchitecture documentation, as well as a handy device comparison table from some great Wikipedia editors.
The Wikipedia data tells us that the RX 5600M has 36 compute units, which the microarchitecture documentation tells us each contain two 32-lane SIMD units that can each manage 20 concurrent workgroups for latency hiding. This means that an absolute minimum of 36x2x32 = 2304 work-items is required to use the GPU’s full parallel processing capabilities.
For optimal latency hiding, you will want 20 times that (46080 work-items). And for other reasons like load balancing and amortizing the fixed costs of initializing a compute dispatch, peak processing efficiency is typically not reached until a 10~100x larger number of work-items is used. This is where the “GPU needs millions of threads” meme comes from.
From this perspective, we can see that our initial compute dispatch size of 1920x1080 ~ 2 million work-items is not that big, and we should be careful about reducing it too much. And this is on a mid-range laptop GPU from several years ago. More recent and higher-end GPUs have even more parallelism, and will therefore suffer more from a reduction of work-item concurrency.
Some register pressure numbers
As for registers, elsewhere in the RDNA microarchitecture documentation, we find that each SIMD ALU has access to 1024 vector registers with 32-bit lanes. Given that this SIMD ALU manages up to 20 concurrent workgroups, this means that in a maximal-concurrency configuration, each workgroup may only use up to 51 32-bit registers of per-work-item state. And the actual register budget is even smaller than that, because some of these registers are used to store microarchitectural state such as the index of the current workgroup within the dispatch and that of the current subgroup within the workgroup.
To those who come from the x86 CPU programming world, where 16 SIMD registers have been the norm for a long time and 32 registers are a shiny recent development, 50 registers may sound plentiful. But x86 CPUs also enjoy much faster L1 caches. A modern x86 CPU that processes 2 vector FMAs per clock cycle can also process at least 2 L1 vector loads and 1 L1 vector store per clock cycle, which means that the ratio between arithmetic and caching capacity is a lot more balanced than on GPUs, where the fastest available memory operations execute at a quarter of the rate of FMAs.
And that difference in arithmetic/memory performance balance is why on x86 CPUs occasional spilling of SIMD registers to the stack is not considered a performance problem outside of the hottest inner loops, whereas on GPU much more effort is expended on getting programs that never need to spill register state to shared memory, let alone main memory.
Back to the algorithm above, if we choose to keep around the past two (U, V)
inputs, and for each input also cache the left and right neighbor of the active
data point to save on shuffle operations, then this optimization alone is going
to increase the vector register footprint of each work-item by 2x2x3 = 12
registers, which on its own eats 1/4 of our per work-item register budget at
maximal workgroup concurrency. And most of the other state caching optimizations
that we discussed before would require a few more registers each.
So as you can see, it is not that we can never do such optimizations, but we must be very careful about not running out of registers when we do so. If we run out of registers, then the GPU hardware will end up either reducing workgroup concurrency or spilling registers to main memory, and in either case our optimization is likely to hurt performance more than it helps.
Implementation
Before applying this optimization, we will revert the two previous ones that introduced a 2D workgroup layout and usage of shared memory, and instead go back to the 1D workgroups that we had at the beginning of the subgroup chapter. Here is why:
- Before, we had a clear “current input” to cache. Now we need to cache the first and last input from the input sequence. That would make us use twice the amount of cache/shared memory footprint and bandwidth, when those resources have not performed well so far.
- Shared memory and caches could be faster on other current and future GPUs, which would be an argument for keeping the previous optimization if it does not harm much. But if this new optimization works, the performance impact of shared memory will become smaller in both directions. The more inputs we process per work-item, the less of an impact caching the first/last input of each work-item will have.
- To make those first/last inputs quickly available to other GPU work-items, so that they can reuse them, we need to read them first. This means reading inputs out of order, and the prefetchers of GPU memory hierarchies are likely to get confused by this irregular access pattern and behave less well as a result.
This means that on the GLSL side, our work-item positioning logic will go back to the following…
// Change in exercises/src/grayscott/step.comp
// Maximal number of output data points per workgroup (see above)
uint workgroup_output_size() {
return gl_NumSubgroups * subgroup_output_size();
}
// Position of the simulation dataset that corresponds to the central value
// that this work-item will eventually write data to, if it ends up writing
uvec2 work_item_pos() {
const uint first_x = DATA_START_POS.x - BORDER_SIZE;
const uint item_x = gl_WorkGroupID.x * workgroup_output_size()
+ gl_SubgroupID * subgroup_output_size()
+ gl_SubgroupInvocationID
+ first_x;
const uint item_y = gl_WorkGroupID.y + DATA_START_POS.y;
return uvec2(item_x, item_y);
}
…and our diffusion gradient computation will go back to the following…
void main() {
// Map work items into 2D dataset, discard those outside input region
// (including padding region).
const uvec2 pos = work_item_pos();
if (any(greaterThanEqual(pos, padded_end_pos()))) {
return;
}
// Load central value
const vec2 uv = read(pos);
// Compute the diffusion gradient for U and V
const uvec2 top = pos - uvec2(0, 1);
const mat3 weights = stencil_weights();
vec2 full_uv = vec2(0.0);
for (int y = 0; y < 3; ++y) {
// Read the top/center/bottom value
vec2 stencil_uvs[3];
stencil_uvs[1] = read(top + uvec2(0, y));
// Get the left/right value from neighboring work-items
//
// This will be garbage data for the work-items on the left/right edge
// of the subgroup, but that's okay, we'll filter these out with an
// is_subgroup_output() condition later on.
//
// It can also be garbage on the right edge of the simulation domain
// where the right neighbor has been rendered inactive by the
// padded_end_pos() filter above, but if we have no right neighbor we
// are only loading data from the zero edge of the simulation dataset,
// not producing an output. So our garbage result will be masked out by
// the other pos < data_end_pos() condition later on.
stencil_uvs[0] = shuffle_from_left(stencil_uvs[1]);
stencil_uvs[2] = shuffle_from_right(stencil_uvs[1]);
// Add associated contributions to the stencil
for (int x = 0; x < 3; ++x) {
full_uv += weights[x][y] * (stencil_uvs[x] - uv);
}
}
// [ ... rest is unchanged ... ]
}
…while on the Rust side our dispatch size computation will go back to the following:
// Change in exercises/src/grayscott/mod.rs
fn schedule_simulation(
options: &RunnerOptions,
context: &Context,
pipelines: &Pipelines,
concentrations: &mut Concentrations,
cmdbuild: &mut CommandBufferBuilder,
) -> Result<()> {
// [ ... steps up to subgroups_per_workgroup computation are unchanged ... ]
let outputs_per_workgroup = subgroups_per_workgroup * subgroup_output_size;
let workgroups_per_row = options.num_cols.div_ceil(outputs_per_workgroup as usize) as u32;
let simulate_workgroups = [workgroups_per_row, options.num_rows as u32, 1];
// [ ... rest is unchanged ... ]
}
After doing this, we will add a new specialization constant to our compute pipelines.
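As a rough illustration, the hypothetical LINES_PER_SUBGROUP placeholder from the earlier sketch could become such a constant on the GLSL side (the name, constant_id and default value below are again placeholders, not the course’s actual code):
// Hypothetical sketch: number of output lines produced per subgroup, exposed
// as a specialization constant so that the coarsening factor can be tuned
// from the Rust side without recompiling the GLSL to SPIR-V
layout(constant_id = 4) const uint LINES_PER_SUBGROUP = 1;
On the Rust side, the Y dimension of the compute dispatch would then shrink accordingly, to roughly options.num_rows.div_ceil(lines_per_subgroup) workgroups instead of options.num_rows.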
TODO: Describe the code and try it at various sizes. Try other subgroup reconvergence models to see if they help. Analyze register pressure before/after using RGA.
- If a GLSL shader has so many local variables that they cannot all be kept in hardware registers, the GPU scheduler will first try to spawn fewer workgroups per compute unit. If even that does not suffice, then the GPU compiler may decide to do what a CPU compiler would do and spill some of these variables to local caches or main GPU memory. This is usually very bad for performance, so GPU programmers normally take care not to get into this situation. ↩
Kernel fission
TODO: Reword based on preceding chapters
In the previous chapters, we have been implementing code optimizations that make more parts of our simulation execute in parallel through a mixture of asynchronous execution and pipelining.
As a result, we went from a rather complex situation where our simulation speed was limited by various hardware performance characteristics depending on the configuration in which we executed it, to a simpler situation where our simulation is more and more often bottlenecked by the raw speed at which we perform simulation steps.
This is shown by the fact that in our reference configuration, where our simulation domain contains about 2 million pixels and we perform 32 simulation steps per generated image, the simulation speed that is measured by our microbenchmark does not change that much anymore as we go from a pure GPU compute scenario to a scenario where we additionally download data from the GPU to the CPU and post-process it on the CPU:
run_simulation/workgroup32x16/domain2048x1024/total512/image32/compute
time: [83.137 ms 83.266 ms 83.421 ms]
thrpt: [12.871 Gelem/s 12.895 Gelem/s 12.915 Gelem/s]
run_simulation/workgroup32x16/domain2048x1024/total512/image32/compute+download
time: [102.03 ms 102.49 ms 102.95 ms]
thrpt: [10.429 Gelem/s 10.477 Gelem/s 10.524 Gelem/s]
run_simulation/workgroup32x16/domain2048x1024/total512/image32/compute+download+sum
time: [102.98 ms 103.21 ms 103.36 ms]
thrpt: [10.389 Gelem/s 10.404 Gelem/s 10.427 Gelem/s]
To go faster, we will therefore need to experiment with ways to make our simulation steps faster, which is what the next optimization chapters of this course will focus on.
As a first step, we will try a classic GPU optimization strategy: when given the choice between more concurrent work and less redundant work, always try more concurrency first.
A concurrency opportunity
Recall that our GPU program for performing a simulation step looks like this:
void main() {
// Map work items into 2D central region, discard out-of-bounds work items
const uvec2 pos = uvec2(gl_GlobalInvocationID.xy) + DATA_START_POS;
if (any(greaterThanEqual(pos, data_end_pos()))) {
return;
}
// Load central value
const vec2 uv = read(pos);
// Compute the diffusion gradient for U and V
const uvec2 topleft = pos - uvec2(1);
const mat3 weights = stencil_weights();
vec2 full_uv = vec2(0.0);
for (int y = 0; y < 3; ++y) {
for (int x = 0; x < 3; ++x) {
const vec2 stencil_uv = read(topleft + uvec2(x, y));
full_uv += weights[x][y] * (stencil_uv - uv);
}
}
// Deduce the change in U and V concentration
const float u = uv.x;
const float v = uv.y;
const float uv_square = u * v * v;
const vec2 delta_uv = diffusion_rate() * full_uv + vec2(
FEED_RATE * (1.0 - u) - uv_square,
uv_square - (FEED_RATE + KILL_RATE) * v
);
write(pos, uv + delta_uv * DELTA_T);
}
If you look at it carefully, you will realize that it is almost two computations bundled into one:
- The initial full_uv diffusion gradient computation independently computes the diffusion gradient for species U and V, as two different lanes of a GLSL vec2 type.
- The final delta_uv computation is literally two different computations vectorized into one, with the exception of the uv_square product which is only computed once but used by the U and V components of delta_uv.
As it turns out, GPU hardware isn’t very good at this sort of explicit vectorization,1 and GLSL vector types like vec2 are more useful for ergonomics than for performance. Therefore, in case our computation doesn’t have enough internal concurrency to saturate the GPU’s compute units (which is likely to be the case on higher-end GPUs), we may benefit from splitting this computation into two, one for U and one for V, even though this will come at the expense of a bit of redundant work.
Indeed, there is a price to pay for the extra concurrency enabled by this
alternate strategy, which is that the computation for one chemical species will
need to load the central value of the other species and to perform a pair of
multiplications in order to compute the uv_square product.
Separating the U and V computations
In the common.comp GLSL source file, which contains code shared between our
compute shaders, we will supplement the existing read() and write()
functions with alternatives that only read and write a single species’
concentration…
// Read U or V from a particular input location, pos works as in read()
float read_one(uint species, uvec2 pos) {
const uint index = pos_to_index(pos);
return Input[species].data[index];
}
// Write U or V to a particular output location, pos works as in read()
void write_one(uint species, uvec2 pos, float value) {
const uint index = pos_to_index(pos);
Output[species].data[index] = value;
}
…and we will extract a variation of our former diffusion gradient computation that only performs the computation for a single chemical species:
// Diffusion gradient computation for a single species
float diffusion_gradient(uint species, uvec2 pos, float center_value) {
const uvec2 topleft = pos - uvec2(1);
const mat3 weights = stencil_weights();
float gradient = 0.0;
for (int y = 0; y < 3; ++y) {
for (int x = 0; x < 3; ++x) {
const float stencil = read_one(species, topleft + uvec2(x, y));
gradient += weights[x][y] * (stencil - center_value);
}
}
return gradient;
}
Given this infrastructure, we will be able to write a compute shader that only
computes the concentration of the U species, which we will save to a file called
step_u.comp…
#version 460
#include "common.comp"
// Simulation step entry point for the U species
void main() {
// Map work items into 2D central region, discard out-of-bounds work items
const uvec2 pos = uvec2(gl_GlobalInvocationID.xy) + DATA_START_POS;
if (any(greaterThanEqual(pos, data_end_pos()))) {
return;
}
// Load central value of U and V
const vec2 uv = read(pos);
const float u = uv.x;
const float v = uv.y;
// Compute the diffusion gradient for U
const float full_u = diffusion_gradient(U, pos, u);
// Deduce the change in U concentration
const float uv_square = u * v * v;
const float delta_u = DIFFUSION_RATE_U * full_u
- uv_square
+ FEED_RATE * (1.0 - u);
write_one(U, pos, u + delta_u * DELTA_T);
}
…and another compute shader that only computes the concentration of the V species, which we will save to a file called step_v.comp…
#version 460
#include "common.comp"
// Simulation step entry point for the V species
void main() {
// Map work items into 2D central region, discard out-of-bounds work items
const uvec2 pos = uvec2(gl_GlobalInvocationID.xy) + DATA_START_POS;
if (any(greaterThanEqual(pos, data_end_pos()))) {
return;
}
// Load central value of U and V
const vec2 uv = read(pos);
const float u = uv.x;
const float v = uv.y;
// Compute the diffusion gradient for V
const float full_v = diffusion_gradient(V, pos, v);
// Deduce the change in V concentration
const float uv_square = u * v * v;
const float delta_v = DIFFUSION_RATE_V * full_v
+ uv_square
- (FEED_RATE + KILL_RATE) * v;
write_one(V, pos, v + delta_v * DELTA_T);
}
…without suffering an unbearable amount of GLSL code duplication between these two compute shaders, though there will obviously be similarities.
Adapting the CPU code
To support this new style of computation, we will need to extend our
pipeline.rs module so that it builds two compute pipelines instead of one.
This can be done by adjusting our vulkano_shaders::shader macro call to cover
the two new step_u and step_v compute shaders…
/// Shader modules used for the compute pipelines
mod shader {
vulkano_shaders::shader! {
shaders: {
init: {
ty: "compute",
path: "src/grayscott/init.comp"
},
step_u: {
ty: "compute",
path: "src/grayscott/step_u.comp"
},
step_v: {
ty: "compute",
path: "src/grayscott/step_v.comp"
},
}
}
}
…and adjusting the Pipelines struct and its constructor in order to build
this new pair of compute pipelines, where it used to build a single step
compute pipeline before:
/// Initialization and simulation pipelines with common layout information
#[derive(Clone)]
pub struct Pipelines {
/// Compute pipeline used to initialize the concentration tables
pub init: Arc<ComputePipeline>,
/// Compute pipeline used to perform a simulation step for U
pub step_u: Arc<ComputePipeline>,
/// Compute pipeline used to perform a simulation step for V
pub step_v: Arc<ComputePipeline>,
    /// Pipeline layout shared by `init`, `step_u` and `step_v`
pub layout: Arc<PipelineLayout>,
}
//
impl Pipelines {
/// Set up all the compute pipelines
pub fn new(options: &RunnerOptions, context: &Context) -> Result<Self> {
// Common logic for setting up a compute pipeline stage
fn setup_stage_info(
options: &RunnerOptions,
context: &Context,
load: impl FnOnce(
Arc<Device>,
)
-> std::result::Result<Arc<ShaderModule>, Validated<VulkanError>>,
) -> Result<PipelineShaderStageCreateInfo> {
let module = load(context.device.clone())?;
let module = setup_shader_module(options, module)?;
Ok(setup_compute_stage(module))
}
// Set up the initialization and step pipeline stages, making sure that
// they share a common pipeline layout so that they can use the same
// resource descriptor sets later on
let init_stage_info = setup_stage_info(options, context, shader::load_init)?;
let step_u_stage_info = setup_stage_info(options, context, shader::load_step_u)?;
let step_v_stage_info = setup_stage_info(options, context, shader::load_step_v)?;
let layout = setup_pipeline_layout(
context.device.clone(),
[&init_stage_info, &step_u_stage_info, &step_v_stage_info],
)?;
// Finish setting up the initialization and stepping compute pipelines
let setup_compute_pipeline = |stage_info: PipelineShaderStageCreateInfo| {
let pipeline_info = ComputePipelineCreateInfo::stage_layout(stage_info, layout.clone());
ComputePipeline::new(
context.device.clone(),
Some(context.pipeline_cache()),
pipeline_info,
)
};
Ok(Self {
init: setup_compute_pipeline(init_stage_info)?,
step_u: setup_compute_pipeline(step_u_stage_info)?,
step_v: setup_compute_pipeline(step_v_stage_info)?,
layout,
})
}
}
Finally, we will modify the schedule_simulation() function so that it uses
this pair of compute pipelines instead of a single compute pipeline:
/// Record the commands needed to run a bunch of simulation iterations
fn schedule_simulation(
options: &RunnerOptions,
pipelines: &Pipelines,
concentrations: &mut Concentrations,
cmdbuild: &mut CommandBufferBuilder,
) -> Result<()> {
// [ ... compute dispatch size computation is unchanged ... ]
// Schedule the requested number of simulation steps
for _ in 0..options.steps_per_image {
concentrations.update(|inout_set| {
cmdbuild.bind_descriptor_sets(
PipelineBindPoint::Compute,
pipelines.layout.clone(),
INOUT_SET,
inout_set,
)?;
for pipeline in [&pipelines.step_u, &pipelines.step_v] {
cmdbuild.bind_pipeline_compute(pipeline.clone())?;
// SAFETY: GPU shader has been checked for absence of undefined behavior
// given a correct execution configuration, and this is one
unsafe {
cmdbuild.dispatch(simulate_workgroups)?;
}
}
Ok(())
})?;
}
Ok(())
}
Exercise
Implement this optimization and measure its impact on the GPU(s) that you have access to.
On the author’s AMD Radeon RX 5600M laptop GPU, the net impact of this optimization is negative (20-30% slowdown), which means that the extra concurrency we gain is not worth the extra computational cost that we pay for it. But on a higher-end GPU, the tradeoff may be different.
- Barring exceptions like half-precision floating point numbers and special-purpose “tensor cores”. ↩