Linux containers
Switching to the new source code
As you may remember, when setting up your container for the CPU course, you started by downloading and unpacking an archive which contains a source code directory called `exercises/`.
We will do mostly the same for this course, but the source code will obviously be different. Therefore, please rename your previous `exercises` directory to something else (or switch to a different parent directory), then follow the instructions below.
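For example, assuming that your shell's current directory is the parent of the old `exercises` directory, a rename could look like this (the new name is arbitrary):

```bash
# Keep the CPU course's sources around under a different name
mv exercises exercises-cpu
```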
Provided that the `curl` and `unzip` utilities are installed, you can download and unpack the source code in the current directory using the following sequence of Unix commands:
```bash
if [ -e exercises ]; then
    echo "ERROR: Please move or delete the existing 'exercises' subdirectory"
else
    curl -LO https://numerical-rust-gpu-96deb7.pages.in2p3.fr/setup/exercises.zip \
    && unzip exercises.zip \
    && rm exercises.zip
fi
```
Switching to the GPU image
During the CPU course, you used a container image whose name contains `numerical-rust-cpu`, such as `gitlab-registry.in2p3.fr/grasland/numerical-rust-cpu/rust_light:latest`. It is now time to switch to another version of this image that has GPU tooling built into it.
- If you used the image directly, that’s easy: just replace `cpu` with `gpu` in the image name and in all associated container execution commands that you use. In the above example, you would switch to `gitlab-registry.in2p3.fr/grasland/numerical-rust-gpu/rust_light:latest`.
- If you built a container image of your own on top of the course’s image, then you will have a bit more work to do, in the form of replaying your changes on top of the new image. Which shouldn’t be too hard either… if you used a proper `Dockerfile` instead of raw `docker commit` (see the sketch below).
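For reference, here is a minimal sketch of what such a `Dockerfile` could look like. The base image name is the one from the example above; the `apt-get` step (which assumes a Debian-style base image) and the copied configuration file are purely hypothetical stand-ins for whatever changes you had made:

```dockerfile
# Hypothetical rebuild of a customized image on top of the GPU course image
FROM gitlab-registry.in2p3.fr/grasland/numerical-rust-gpu/rust_light:latest

# Example customization: extra system packages (assumes a Debian-style image)
RUN apt-get update \
    && apt-get install -y --no-install-recommends htop \
    && rm -rf /var/lib/apt/lists/*

# Example customization: a personal configuration file
COPY my-config.toml /etc/my-config.toml
```

Such an image can then be rebuilt with e.g. `docker build -t my-rust-gpu .` and used in place of the course image in the commands that follow.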
But unfortunately, that’s not the end of it. Try to run `vulkaninfo --summary` inside of the resulting container, and you will likely find that some of your host GPUs are not visible inside of the container. If that’s the case, then I have bad news for you: you have some system-specific work to do if you want to be able to use your GPUs inside of the container.
Exposing host GPUs
Please click the tab below that best describes your host system for further guidance:
In the host setup section, we mentioned that NVidia’s Linux drivers use a monolithic design. Their GPU kernel driver and Vulkan implementation are packaged together in such a way that the Vulkan implementation is only guaranteed to work if paired with the exact GPU kernel driver from the same NVidia driver package version.
As it turns out, this design is not just unsatisfying from a software engineering best practices perspective. It also becomes an unending source of pain as soon as containers get involved.
A first problem is that NVidia’s GPU driver resides in the Linux kernel, while its Vulkan driver is implemented as a user-space library. Meanwhile, the whole idea of Linux containers is to keep the host’s kernel while replacing the user-space libraries and executables with those of a different Linux system. And unless the Linux distributions of the host and containerized systems are the same, the odds that they will use the exact same NVidia driver package version are low.
To work around this, many container runtimes provide an option called `--gpus` (Docker, Podman) or `--nv` (Apptainer, Singularity) that mounts the user-space components of the host system’s NVidia driver into the container.
This is pretty much the only way to get the NVidia GPU driver to work inside of a container, but it comes at a price: GPU programs inside of the container will be exposed to NVidia driver binaries that were not the ones that they were compiled and tested against, and which they may or may not be compatible with. In that sense, those container runtime options undermine the basic container promise of executing programs in a well-controlled environment.
To make matters worse, the NVidia driver package actually contains not one but two different Vulkan backends: one that is specialized towards X11 graphical environments, and another that works in Wayland and headless environments. As bad luck would have it, the backend selection logic gets confused by the hacks needed to get the NVidia driver to work inside of a Linux container, and wrongly selects the X11 backend. That backend won’t work here, as this course’s containers do not have even a semblance of an X11 graphics rendering stack, because they don’t need one.
That second issue can be fixed by setting an environment variable that overrides the NVidia Vulkan implementation’s default backend selection logic and selects the right one. But this comes at the expense of losing support for every other GPU on the system, including the `llvmpipe` GPU emulator. As this is a high-performance computing course, and NVidia GPUs tend to be more powerful than any other GPU featured in the same system, we will consider this an acceptable tradeoff.
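To give an idea of the mechanism at play (directory paths below are typical Linux defaults and may differ on your system): the Vulkan loader normally discovers one JSON manifest per installed driver, and the `VK_ICD_FILENAMES` environment variable restricts it to an explicit list of manifests, which is why every other driver disappears.

```bash
# Typical ICD manifest locations scanned by the Vulkan loader
# (exact paths vary across Linux distributions)
ls /usr/share/vulkan/icd.d/ /etc/vulkan/icd.d/ 2>/dev/null

# Restricting the loader to a single manifest hides every other driver,
# including the llvmpipe GPU emulator
export VK_ICD_FILENAMES=/usr/share/glvnd/egl_vendor.d/10_nvidia.json
vulkaninfo --summary
```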
Putting it all together, adding the following command-line options to your `docker`/`podman`/`apptainer`/`singularity run` commands should allow you to use your host’s NVidia GPUs from inside the resulting container:

```bash
--gpus=all --env VK_ICD_FILENAMES=/usr/share/glvnd/egl_vendor.d/10_nvidia.json
```
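For instance, assuming Docker and an interactive shell-based workflow, a full invocation could look like the following sketch. Only the `--gpus`/`--env` options and the image name come from this page; the other flags (interactive session, removal on exit, work directory mount and its in-container path) are placeholders to adapt to whatever invocation you used during the CPU course:

```bash
# Sketch only: adapt mounts and other flags to your own CPU-course setup
docker run --rm -it \
    --gpus=all \
    --env VK_ICD_FILENAMES=/usr/share/glvnd/egl_vendor.d/10_nvidia.json \
    --volume "$(pwd)/exercises:/work/exercises" \
    gitlab-registry.in2p3.fr/grasland/numerical-rust-gpu/rust_light:latest
```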
New command-line arguments and container image name aside, the procedure for starting up a container will be mostly identical to the one used for the CPU course. So you will want to go back to the appropriate section of the CPU course’s container setup instructions and follow the steps for your container and system configuration again.
Once that is done, please run `vulkaninfo --summary` in a shell within the container and check that the Vulkan device list matches what you get on the host, driver version details aside.
Testing your setup
Your Rust development environment should now be ready for this course’s practical work. I strongly advise testing it by running the following script:
```bash
curl -LO https://gitlab.in2p3.fr/grasland/numerical-rust-gpu/-/archive/solution/numerical-rust-gpu-solution.zip \
&& unzip numerical-rust-gpu-solution.zip \
&& rm numerical-rust-gpu-solution.zip \
&& cd numerical-rust-gpu-solution/exercises \
&& echo "------" \
&& cargo run --release --bin info -- -p \
&& echo "------" \
&& cargo run --release --bin square -- -p \
&& cd ../.. \
&& rm -rf numerical-rust-gpu-solution
```
It performs the following actions, whose outcome should be manually checked:
- Run a Rust program that should produce the same device list as `vulkaninfo --summary`. This tells you that any device that is correctly detected by a C Vulkan program is also correctly detected by a Rust Vulkan program, as one would expect.
- Run another program that uses a simple heuristic to pick the Vulkan device that should be most performant, then uses that device to square an array of floating-point numbers, then checks the results. You should make sure that the device selection made by this program is sensible and that its final result check passed.
- If everything went well, the script will clean up after itself by deleting all previously created files.