Linux containers
Switching to the new source code
As you may remember, when setting up your container for the CPU course, you started by downloading and unpacking an archive which contains a source code directory called `exercises/`.
We will do mostly the same for this course, but the source code will obviously be different. Therefore, please rename your previous `exercises` directory to something else (or switch to a different parent directory), then follow the instructions below.
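For example, assuming that your shell's current directory is the parent of the old `exercises` directory, a rename could look like this (the new name is arbitrary):

```bash
# Keep the CPU course's sources around under a different name
mv exercises exercises-cpu
```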
Provided that the `curl` and `unzip` utilities are installed, you can download and unpack the source code in the current directory using the following sequence of Unix commands:
```bash
if [ -e exercises ]; then
    echo "ERROR: Please move or delete the existing 'exercises' subdirectory"
else
    curl -LO https://numerical-rust-gpu-96deb7.pages.in2p3.fr/setup/exercises.zip \
    && unzip exercises.zip \
    && rm exercises.zip
fi
```
Switching to the GPU image
During the CPU course, you used a container image whose name contains `numerical-rust-cpu`, such as `gitlab-registry.in2p3.fr/grasland/numerical-rust-cpu/rust_light:latest`. It is now time to switch to another version of this image that has GPU tooling built into it.
- If you used the image directly, that’s easy: just replace `cpu` with `gpu` in the image name and in all associated container execution commands that you use. In the above example, you would switch to `gitlab-registry.in2p3.fr/grasland/numerical-rust-gpu/rust_light:latest`.
- If you built a container image of your own on top of the course’s image, then you will have a bit more work to do, in the form of replaying your changes on top of the new image. Which shouldn’t be too hard either… if you used a proper `Dockerfile` instead of raw `docker commit` (see the sketch below).
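For reference, here is a minimal sketch of what such a `Dockerfile` could look like. The base image name is the one from the example above; the `apt-get` step (which assumes a Debian-style base image) and the copied configuration file are purely hypothetical stand-ins for whatever changes you had made:

```dockerfile
# Hypothetical rebuild of a customized image on top of the GPU course image
FROM gitlab-registry.in2p3.fr/grasland/numerical-rust-gpu/rust_light:latest

# Example customization: extra system packages (assumes a Debian-style image)
RUN apt-get update \
    && apt-get install -y --no-install-recommends htop \
    && rm -rf /var/lib/apt/lists/*

# Example customization: a personal configuration file
COPY my-config.toml /etc/my-config.toml
```

Such an image can then be rebuilt with e.g. `docker build -t my-rust-gpu .` and used in place of the course image in the commands that follow.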
But unfortunately, that’s not the end of it. Try to run `vulkaninfo --summary` inside of the resulting container, and you will likely find that some of your host GPUs are not visible inside of the container. If that’s the case, then I have bad news for you: you have some system-specific work to do if you want to be able to use your GPUs inside of the container.
Exposing host GPUs
Please click the tab below that best describes your host system for further guidance:
In the host setup section, we mentioned that NVidia’s Linux drivers use a monolithic design. Their GPU kernel driver and Vulkan implementation are packaged together in such a way that the Vulkan implementation is only guaranteed to work if paired with the exact GPU kernel driver from the same NVidia driver package version.
As it turns out, this design is not just unsatisfying from a software engineering best practices perspective. It also becomes an unending source of pain as soon as containers get involved.
A first problem is that NVidia’s GPU driver resides in the Linux kernel, while its Vulkan driver is implemented as a user-space library. Meanwhile, the whole idea of Linux containers is to keep the host’s kernel while replacing the user-space libraries and executables with those of a different Linux system. And unless the Linux distributions of the host and containerized systems are the same, the odds that they will use the exact same NVidia driver package version are low.
To work around this, many container runtimes provide an option called `--gpus` (Docker, Podman) or `--nv` (Apptainer, Singularity) that mounts the user-space components of the host system’s NVidia driver into the container.
This is pretty much the only way to get the NVidia GPU driver to work inside of a container, but it comes at a price: GPU programs inside of the container will be exposed to NVidia driver binaries that were not the ones that they were compiled and tested against, and which they may or may not be compatible with. In that sense, those container runtime options undermine the basic container promise of executing programs in a well-controlled environment.
To make matters worse, the NVidia driver package actually contains not one but two different Vulkan backends: one that is specialized towards X11 graphical environments, and another that works in Wayland and headless environments. As bad luck would have it, the backend selection logic gets confused by the hacks needed to get the NVidia driver to work inside of a Linux container, and wrongly selects the X11 backend. That backend won’t work here, as this course’s containers do not have even a semblance of an X11 graphics rendering stack, because they don’t need one.
That second issue can be fixed by setting an environment variable that overrides the NVidia Vulkan implementation’s default backend selection logic and selects the right one. But this comes at the expense of losing support for every other GPU on the system, including the `llvmpipe` GPU emulator. As this is a high-performance computing course, and NVidia GPUs tend to be more powerful than any other GPU featured in the same system, we will consider this an acceptable tradeoff.
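To give an idea of the mechanism at play (directory paths below are typical Linux defaults and may differ on your system): the Vulkan loader normally discovers one JSON manifest per installed driver, and the `VK_ICD_FILENAMES` environment variable restricts it to an explicit list of manifests, which is why every other driver disappears.

```bash
# Typical ICD manifest locations scanned by the Vulkan loader
# (exact paths vary across Linux distributions)
ls /usr/share/vulkan/icd.d/ /etc/vulkan/icd.d/ 2>/dev/null

# Restricting the loader to a single manifest hides every other driver,
# including the llvmpipe GPU emulator
export VK_ICD_FILENAMES=/usr/share/glvnd/egl_vendor.d/10_nvidia.json
vulkaninfo --summary
```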
Putting it all together, adding the following command-line options to your `docker`/`podman`/`apptainer`/`singularity run` commands should allow you to use your host’s NVidia GPUs from inside the resulting container:

```bash
--gpus=all --env VK_ICD_FILENAMES=/usr/share/glvnd/egl_vendor.d/10_nvidia.json
```
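For instance, assuming Docker and an interactive shell-based workflow, a full invocation could look like the following sketch. Only the `--gpus`/`--env` options and the image name come from this page; the other flags (interactive session, removal on exit, work directory mount and its in-container path) are placeholders to adapt to whatever invocation you used during the CPU course:

```bash
# Sketch only: adapt mounts and other flags to your own CPU-course setup
docker run --rm -it \
    --gpus=all \
    --env VK_ICD_FILENAMES=/usr/share/glvnd/egl_vendor.d/10_nvidia.json \
    --volume "$(pwd)/exercises:/work/exercises" \
    gitlab-registry.in2p3.fr/grasland/numerical-rust-gpu/rust_light:latest
```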
New command-line arguments and container image name aside, the procedure for starting up a container will be mostly identical to the one used for the CPU course. So you will want to go back to the appropriate section of the CPU course’s container setup instructions and follow the steps for your container and system configuration again.
Once that is done, please run `vulkaninfo --summary` in a shell within the container and check that the Vulkan device list matches what you get on the host, driver version details aside.
Testing your setup
Your Rust development environment should now be ready for this course’s practical work. I strongly advise testing it by running the following script:
```bash
curl -LO https://gitlab.in2p3.fr/grasland/numerical-rust-gpu/-/archive/solution/numerical-rust-gpu-solution.zip \
&& unzip numerical-rust-gpu-solution.zip \
&& rm numerical-rust-gpu-solution.zip \
&& cd numerical-rust-gpu-solution/exercises \
&& echo "------" \
&& cargo run --release --bin info -- -p \
&& echo "------" \
&& cargo run --release --bin square -- -p \
&& cd ../.. \
&& rm -rf numerical-rust-gpu-solution
```
It performs the following actions, whose outcome should be manually checked:
- Run a Rust program that should produce the same device list as `vulkaninfo --summary`. This tells you that any device that is correctly detected by a C Vulkan program is also correctly detected by a Rust Vulkan program, as one would expect.
- Run another program that uses a simple heuristic to pick the Vulkan device that should be most performant, then uses that device to square an array of floating-point numbers, then checks the results. You should make sure that the device selection made by this program is sensible and that its final result check passed.
- If everything went well, the script will clean up after itself by deleting all previously created files.