Context
In the previous chapter, we went through the process of loading the system’s Vulkan library, querying its properties, and setting up an API instance, from which you can query the set of “physical”1 Vulkan devices available on your system.
After choosing one or more2 of these devices, the next thing we will want to do is set them up, so that we can start sending API commands to them. In this chapter, we will show how this device setup is performed, then cover a bit of extra infrastructure that you will also usually want in vulkano-based programs, namely object allocators and pipeline caches.
Together, the resulting objects will form a minimal vulkano API context that is quite general-purpose: it can easily be extracted into a common library, shared between many apps, and later extended with additional tuning knobs if you ever need more configurability.
Device selection
As you may have seen while going through the exercise at the end of the previous chapter, it is common for a system to expose multiple physical Vulkan devices.
We could aim for maximal system utilization and try to use all devices at the same time, but such multi-device computations are surprisingly hard to get right.3 In this introductory course, we will thus favor the simpler strategy of selecting and using a single Vulkan device.
This, however, raises the question of which device we should pick:
- We could just pick the first device that comes in Vulkan’s device list, which is effectively what OpenGL programs do. But the device list is ordered arbitrarily, so we may face issues like using a slow integrated GPU on “hybrid graphics” laptops that have a fast dedicated GPU available.
- We could ask the user which device should be used. But prompting that on every run would get annoying quickly. And making it a mandatory CLI argument would violate the basic UX principle that programs should do something sensible in their default configuration.
- We could try to pick a “best” device using some heuristics. But since this is an introductory course we don’t want to spend too much time on fine-tuning the associated logic, so we’ll go for a basic strategy that is likely to pick the wrong device on some systems.
To balance these pros and cons, we will use a mixture of strategies #2 and #3 above:
- Through an optional CLI argument, we will let users explicitly pick a device in Vulkan’s device list, using the numbering exposed by the info utility, when they feel so inclined.
- When this CLI argument is not specified, we will rank devices by device type (discrete GPU, integrated GPU, CPU emulation…) and pick a device of the type that we expect to be most performant. This is enough to resolve simple4 multi-device ambiguities, such as picking between a discrete and integrated GPU or between a GPU and an emulation thereof.
This device selection strategy can be easily implemented using Rust’s iterator methods. Notice that strings can be turned into errors for simple error handling.
use crate::Result;
use clap::Args;
use std::sync::Arc;
use vulkano::{
device::physical::{PhysicalDevice, PhysicalDeviceType},
instance::Instance,
};
/// CLI parameters that guide device selection
#[derive(Debug, Args)]
pub struct DeviceOptions {
/// Index of the Vulkan device that should be used
///
/// You can learn what each device index corresponds to using
/// the provided "info" program or the standard "vulkaninfo" utility.
#[arg(short, long)]
pub device_index: Option<usize>,
}
/// Pick a physical device
fn select_physical_device(
instance: &Arc<Instance>,
options: &DeviceOptions,
quiet: bool,
) -> Result<Arc<PhysicalDevice>> {
let mut devices = instance.enumerate_physical_devices()?;
if let Some(index) = options.device_index {
// If the user asked for a specific device, look it up
devices
.nth(index)
.inspect(|device| {
if !quiet {
eprintln!(
"Selected requested device {:?}",
device.properties().device_name
)
}
})
.ok_or_else(|| format!("There is no Vulkan device with index {index}").into())
} else {
// Otherwise, choose a device according to its device type
devices
.min_by_key(|dev| match dev.properties().device_type {
// Discrete GPUs are expected to be fastest
PhysicalDeviceType::DiscreteGpu => 0,
// Virtual GPUs are hopefully discrete GPUs exposed
// to a VM via PCIe passthrough, which is reasonably cheap
PhysicalDeviceType::VirtualGpu => 1,
// Integrated GPUs are usually much slower than discrete ones
PhysicalDeviceType::IntegratedGpu => 2,
// CPU emulation of GPUs is not known for being efficient...
PhysicalDeviceType::Cpu => 3,
// ...but it's better than other types we know nothing about
PhysicalDeviceType::Other => 4,
_ => 5,
})
.inspect(|device| {
if !quiet {
eprintln!("Auto-selected device {:?}", device.properties().device_name)
}
})
.ok_or_else(|| "No Vulkan device available".into())
}
}
Notice the quiet boolean parameter, which suppresses console printouts about the GPU device in use. This will come in handy when we benchmark context building at the end of the chapter.
Device and queue setup
Once we have selected a PhysicalDevice, we must set it up before we can use it. There are similarities between this process and that of building an Instance from a VulkanLibrary: in both cases, after discovering what our system could do, we must specify what it should do.
One important difference, however, is that the device setup process produces more than just a Device object, which is used in a wide range of circumstances from compiling GPU programs to allocating GPU resources. It also produces a set of Queue objects, which we will later use to submit commands for asynchronous execution.
These asynchronous commands are very important because they implement the tasks that a well-optimized Vulkan program will spend most of its GPU time doing. For example, they can be used to transfer data between CPU and GPU memory, or to execute GPU code.
We’ll give this command scheduling process the full attention it deserves in a subsequent chapter, but at this point, the main thing you need to know is that a typical GPU comes with not one, but several hardware units capable of receiving commands from the CPU and scheduling them for execution on the GPU. These command scheduling units have the following characteristics:
- They operate in parallel, but the underlying hardware resources on which submitted work eventually executes are shared between them.
- They process commands in a mostly FIFO fashion, and are thus called queues in
the Vulkan specification. But they do not fully match programmer intuition
about queues, because they also have a limited and hardware-dependent ability
to run some commands in parallel.
- For example, if a GPU program does not fully utilize available execution resources and the next command schedules execution of another GPU program, the two programs may end up running concurrently.
- Due to hardware limitations, you will often need to submit commands to several queues concurrently in order to fully utilize the GPU’s resources.
- Some queues may be specialized in executing specific kinds of commands (e.g. data transfer commands) and unable to execute other kinds of commands.
Vulkan exposes this hardware feature in the form of queue families whose basic properties can be queried from a PhysicalDevice. Each queue family represents a group of hardware queues. At device initialization time, we must request the creation of one or more logical queues and specify which hardware queues they should map to.
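To get a feel for what your own GPU exposes, you can iterate over these properties yourself. The following sketch is purely illustrative (the helper is not part of the provided codebase), and prints each family's queue count and capability flags, in the spirit of what the info utility from the previous chapter reports:
/// Print each queue family's queue count and capability flags
/// (illustrative helper, not part of the provided codebase)
fn print_queue_families(device: &PhysicalDevice) {
    for (index, family) in device.queue_family_properties().iter().enumerate() {
        println!(
            "Queue family {index}: {} queue(s) with flags {:?}",
            family.queue_count, family.queue_flags
        );
    }
}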
Unfortunately, the Vulkan API provides very little information about queue families, and it will often take a round trip through the manufacturer's documentation to get a better understanding of what the various queue families represent in hardware and how multiple hardware queues should be used.
However, our introductory number-squaring program is so simple that it does not benefit that much from multiple Vulkan queues anyway. Therefore, in this first part of the course, we can take the shortcut of allocating a single queue that maps into the first queue family that supports compute operations (which, per the Vulkan specification, implies support for data transfer operations).
use vulkano::device::QueueFlags;
/// Pick the first queue family that supports compute operations
///
/// While the Vulkan specification does not mandate that such a queue family
/// exists, it does mandate that if any family supports graphics operations,
/// then at least one family must support compute operations. And a Vulkan
/// device that supports no graphics operation would be very much unexpected...
fn queue_family_index(device: &PhysicalDevice) -> usize {
device
.queue_family_properties()
.iter()
.position(|family| family.queue_flags.contains(QueueFlags::COMPUTE))
.expect("Device does not support compute (or graphics)")
}
Knowing this queue family index, setting up a device with a single queue from this family becomes rather straightforward:
use vulkano::device::{Device, DeviceCreateInfo, Queue, QueueCreateInfo};
/// Set up a device with a single command queue that can schedule computations
/// and memory transfer operations.
fn setup_device(device: Arc<PhysicalDevice>) -> Result<(Arc<Device>, Arc<Queue>)> {
let queue_family_index = queue_family_index(&device) as u32;
let (device, mut queues) = Device::new(
device,
DeviceCreateInfo {
queue_create_infos: vec![QueueCreateInfo {
queue_family_index,
..Default::default()
}],
..Default::default()
},
)?;
let queue = queues
.next()
.expect("We asked for one queue, we should get one");
Ok((device, queue))
}
As when creating an instance before, this is a place where we could enable optional Vulkan API extensions supported by the physical device. But in the case of devices, these extensions are supplemented by a related concept called features, which represent optional Vulkan API functionality that our device may or may not support. We will see a sketch of how both can be enabled after the following list.
As you may guess, the nuance between these two concepts is subtle:
- Features do not need to come from extensions; they may exist even in the core
Vulkan specification. They model optional functionality that a device may or
may not support, or that an application may or may not want to enable.
- An example of the former is the ability to perform atomic operations on floating-point data inside GPU programs. Hardware support for these operations varies widely.
- An example of the latter is the ability to make accesses to memory resources bounds-checked in order to reduce avenues for undefined behavior. This is important for e.g. web browsers that execute untrusted GPU code from web pages, but comes at a performance cost that performance-sensitive apps may want to avoid.
- Extensions may define features of their own, even though the mere act of enabling an extension is already arguably an opt-in for optional functionality, when the functionality of interest can be further broken down into several closely related sub-parts.
- For example, the former VK_KHR_8bit_storage extension (now part of Vulkan 1.2 core), which specified the ability for GPU code to manipulate 8-bit integers, provided 3 separate feature flags to represent the ability to manipulate 8-bit integers from 3 different kinds of GPU memory resources (storage buffers, uniform buffers, and push constants).
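As a hedged sketch, here is roughly what enabling an extension and a feature could look like inside our setup_device() function. The khr_8bit_storage extension and shader_float64 feature are arbitrary examples that our number-squaring program does not actually need, and the DeviceFeatures struct is named Features in older vulkano releases:
use vulkano::device::{DeviceExtensions, DeviceFeatures};

// Check what the physical device supports before asking for it
if !(device.supported_extensions().khr_8bit_storage
    && device.supported_features().shader_float64)
{
    return Err("Device does not support the desired extension/feature".into());
}

// Then opt into the desired subset at device creation time
let create_info = DeviceCreateInfo {
    enabled_extensions: DeviceExtensions {
        khr_8bit_storage: true,
        ..DeviceExtensions::empty()
    },
    enabled_features: DeviceFeatures {
        shader_float64: true,
        ..DeviceFeatures::empty()
    },
    // ...plus the queue_create_infos that we set up above
    ..Default::default()
};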
Pipeline cache
In programming languages that favor ahead-of-time compilation, like Rust and C/C++, compilers know a fair bit about the CPU ISA that the program is destined to run on, enough to emit machine code that the target CPUs can process directly. This allows pure-CPU Rust programs to execute at top speed almost instantly, without the slow starts that plague programming languages which prefer to postpone compilation work to runtime (just-in-time compilation), like Julia, Java and C#.
GPU programs, however, cannot enjoy this luxury when hardware portability is desired, because the set of GPU architectures that even a single CPU architecture can host is very large and GPU instruction sets are not designed with backwards compatibility in mind.5 As a result, just-in-time compilation is the dominant paradigm in the GPU world, and slow startup is a common issue in even slightly complex GPU programs.
Over time, various strategies have been implemented to mitigate this issue:
- Following the lead of Java and C#, GPU programming APIs have gradually replaced C-based GPU programming languages with pre-compiled intermediate representations like SPIR-V, which are closer to machine code and can be turned more quickly into an optimized binary for the target GPU hardware. This also had the desirable side-effect of improving the reliability of GPU drivers, which have a notoriously hard time correctly compiling high-level languages.
- GPU drivers have tried to avoid compilation entirely after the first program run via caching techniques, which let them reuse previously compiled binaries if the input program has not changed. Unfortunately, detecting if a program has changed can be a surprisingly hard problem in the presence of external dependencies like those brought by the C #include directive. And it is unwise to push such fun cache invalidation challenges onto GPU driver developers, who are not known for their attention to software quality. Furthermore, making this caching process implicit also prevents GPU applications from supplementing the just-in-time compilation process with pre-compiled binaries for common system configurations, so that programs can run fast right from the first run in some best-case scenarios.
Acknowledging the issues of the overly implicit binary caching approaches of its predecessors,6 Vulkan enforces a more explicit caching model in which applications are in direct control of the cache that holds previously compiled GPU programs. They can therefore easily flush the cache when a fresh compilation is desired, or save it to files and share it across machines as needed.
The provided code library contains a PersistentPipelineCache struct that leverages this functionality to cache previously compiled GPU code across program runs, by saving the pipeline cache into a standard OS location such as the XDG ~/.cache directory on Linux. These standard locations are easily looked up in a cross-platform manner using the directories crate. As vulkano’s PipelineCache API is rather basic and easy to use, this code is mostly about file manipulation and not very interesting from a Vulkan teaching perspective, so we will not describe it here. Please look it up in the provided example codebase if interested, and ask any question that arises!
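For the curious, here is a rough sketch of what such an abstraction can look like. This is a simplified illustration rather than the actual implementation from the codebase: the application name, file name and error handling are made up, and the real code may differ in details (for instance by saving from a Drop implementation):
use directories::ProjectDirs;
use std::{fs, path::PathBuf};
use vulkano::pipeline::cache::{PipelineCache, PipelineCacheCreateInfo};

/// Simplified persistent pipeline cache (illustrative sketch)
pub struct PersistentPipelineCache {
    pub cache: Arc<PipelineCache>,
    path: PathBuf,
}

impl PersistentPipelineCache {
    /// Set up a pipeline cache, reusing previously saved data if available
    pub fn new(device: Arc<Device>) -> Result<Self> {
        // Standard per-user cache location ("my-app" is a placeholder name)
        let dirs = ProjectDirs::from("", "", "my-app")
            .ok_or("Could not determine the user's cache directory")?;
        let path = dirs.cache_dir().join("pipeline_cache.bin");
        let initial_data = fs::read(&path).unwrap_or_default();
        // SAFETY: we trust that this file contains valid data previously
        // produced by get_data() below (a real implementation may want to be
        // more careful about corrupted or tampered files)
        let cache = unsafe {
            PipelineCache::new(
                device,
                PipelineCacheCreateInfo {
                    initial_data,
                    ..Default::default()
                },
            )?
        };
        Ok(Self { cache, path })
    }

    /// Save the cache to disk, typically on application shutdown
    pub fn save(&self) -> Result<()> {
        fs::create_dir_all(self.path.parent().expect("cache path has a parent"))?;
        fs::write(&self.path, self.cache.get_data()?)?;
        Ok(())
    }
}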
Allocators
Ever since the existence of absolute zero temperature was demonstrated by statistical physics, top minds in cryogenics have devoted enormous resources to getting increasingly close to it, to the point where humanity can nowadays reliably cool atom clouds down to millionths of a degree above absolute zero. But awe-inspiring as it may be, this technological prowess pales in comparison to how close GPU driver memory allocators have always been to absolute zero performance.
The performance of GPU driver memory allocators is so incredibly bad that most seasoned GPU programmers avoid calling the GPU API’s memory allocator at all costs. They do so through techniques like application-side sub-allocation and automatic allocation reuse, which would be relatively advanced by CPU programming standards.7 Acknowledging this, vulkano supports and encourages the use of application-side memory allocators throughout its high-level API.
Vulkan differentiates three categories of memory objects that are allocated using completely different APIs, likely because they may map onto different memories in some GPUs. This unsurprisingly maps into three vulkano memory allocator objects that must be set up independently (and can be independently replaced with alternate implementations if needed):
- The StandardMemoryAllocator is used to allocate large and relatively long-lived memory resources like buffers and images. These are likely to be what first comes to your mind when thinking about GPU memory allocations.
- The StandardDescriptorSetAllocator is used to allocate descriptor sets, which are groups of the above memory resources. Resources are grouped like this so that you can attach them to GPU programs using bulk operations, instead of having to do it on a fine-grained basis, which was a common performance bottleneck of older GPU APIs.
- The StandardCommandBufferAllocator can be used to allocate command buffers, which are small short-lived objects that are created every time you submit commands to the GPU. As you can imagine, this allocator is at a higher risk of becoming a performance bottleneck than the others, which is why Vulkan allows you to amortize its overhead by submitting commands in bulk, as we will see in a subsequent chapter.
Since the default configuration is fine for our purposes, setting up these allocators is rather straightforward. There is just one API curiosity that must be taken care of, namely that unlike every other object constructor in vulkano’s API, the constructors of memory allocators do not automatically wrap them in atomically reference-counted Arc pointers. This must be done before they can be used with vulkano’s high-level safe API, so you will need to do this on your side:
use vulkano::{
command_buffer::allocator::StandardCommandBufferAllocatorCreateInfo,
descriptor_set::allocator::StandardDescriptorSetAllocatorCreateInfo,
};
// A few type aliases that will let us more easily switch to another memory
// allocator implementation if we ever need to
pub type MemoryAllocator = vulkano::memory::allocator::StandardMemoryAllocator;
pub type CommandBufferAllocator =
vulkano::command_buffer::allocator::StandardCommandBufferAllocator;
pub type DescriptorSetAllocator =
vulkano::descriptor_set::allocator::StandardDescriptorSetAllocator;
/// Set up all memory allocators required by the high-level `vulkano` API
fn setup_allocators(
device: Arc<Device>,
) -> (
Arc<MemoryAllocator>,
Arc<DescriptorSetAllocator>,
Arc<CommandBufferAllocator>,
) {
let malloc = Arc::new(MemoryAllocator::new_default(device.clone()));
let dalloc = Arc::new(DescriptorSetAllocator::new(
device.clone(),
StandardDescriptorSetAllocatorCreateInfo::default(),
));
let calloc = Arc::new(CommandBufferAllocator::new(
device,
StandardCommandBufferAllocatorCreateInfo::default(),
));
(malloc, dalloc, calloc)
}
Putting it all together
With that, we reach the end of the Vulkan application setup that is rather problem-agnostic and could easily be shared across many applications, given the possible addition of a few extra configuration hooks (e.g. a way to enable Vulkan extensions if our apps use them).
Let’s recap the vulkano objects that we have set up so far and will need later in this course:
- A Device is the initialized version of a PhysicalDevice. It is involved in most API operations that optimized programs are not expected to spend a lot of time doing, like setting up compute pipelines or allocating memory resources. To keep this introductory course simple, we will only use a single (user- or heuristically-selected) device.
- At device setup time, we also request the creation of one or more Queues. These will be used to submit GPU commands that may take a while to execute, and they remain frequently used after the initial application setup stage. Use of multiple queues can help performance, but is a bit of a hardware-specific black art, so we will not discuss it much.
- To avoid recompiling GPU code on each application startup, it is good practice to set up a PipelineCache and make sure that its contents are saved on application shutdown and reloaded on application startup. We provide a simple PersistentPipelineCache abstraction that handles this in a manner that honors OS-specific cache storage recommendations.
- Because GPU driver allocators are incredibly slow, supplementing them with an application-side allocator that calls into them as rarely as possible is necessary for optimal performance. We will need one for GPU memory resources, one for descriptor sets (i.e. sets of memory resources), and one for command buffers. For this course’s purpose, the default allocators provided by vulkano will do this job just fine without any special settings tweaks.
- And finally, we must keep around the DebugUtilsMessenger that we have set up in the previous chapter, which ensures that any diagnostics message emitted by the Vulkan implementation will still pop up in our terminal for easy debugging.
To maximally streamline the common setup process, we will group all these objects into a single Context struct whose constructor takes care of all the setup details seen so far for us:
/// CLI parameters for setting up a full `Context`
#[derive(Debug, Args)]
pub struct ContextOptions {
/// Instance configuration parameters
#[command(flatten)]
pub instance: InstanceOptions,
/// Device selection parameters
#[command(flatten)]
pub device: DeviceOptions,
}
/// Basic Vulkan setup that all our example programs will share
pub struct Context {
pub device: Arc<Device>,
pub queue: Arc<Queue>,
pipeline_cache: PersistentPipelineCache,
pub mem_allocator: Arc<MemoryAllocator>,
pub desc_allocator: Arc<DescriptorSetAllocator>,
pub comm_allocator: Arc<CommandBufferAllocator>,
_messenger: Option<DebugUtilsMessenger>,
}
//
impl Context {
/// Set up a `Context`
pub fn new(options: &ContextOptions, quiet: bool) -> Result<Self> {
let library = VulkanLibrary::new()?;
let mut logging_instance = LoggingInstance::new(library, &options.instance)?;
let physical_device =
select_physical_device(&logging_instance.instance, &options.device, quiet)?;
let (device, queue) = setup_device(physical_device)?;
let pipeline_cache = PersistentPipelineCache::new(device.clone())?;
let (mem_allocator, desc_allocator, comm_allocator) = setup_allocators(device.clone());
let _messenger = logging_instance.messenger.take();
Ok(Self {
device,
queue,
pipeline_cache,
mem_allocator,
desc_allocator,
comm_allocator,
_messenger,
})
}
/// Get a handle to the pipeline cache
pub fn pipeline_cache(&self) -> Arc<PipelineCache> {
self.pipeline_cache.cache.clone()
}
}
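As a hypothetical usage sketch (the real square binary’s main function may be organized differently), a program could then build its Vulkan context like this:
use clap::Parser;

/// Top-level CLI options (illustrative; the real binary may define more)
#[derive(Debug, Parser)]
struct Options {
    #[command(flatten)]
    context: ContextOptions,
}

fn main() -> Result<()> {
    let options = Options::parse();
    let context = Context::new(&options.context, false)?;
    println!(
        "Context ready on device {:?}",
        context.device.physical_device().properties().device_name
    );
    Ok(())
}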
Exercise
For now, the square binary does nothing but set up a basic Vulkan context as described above. Run a debug build of it with the following command…
cargo run --bin square
…and make sure that it executes without errors. A few warnings from the validation layers are expected. Some were discussed in the previous chapter, while most of the new ones warn you that the GPU-assisted validation layer has force-enabled a few optional Vulkan features that we do not need, because its implementation does need them.
Once this is done, take a moment to look at the definition of the Context struct above, and make sure you have a basic understanding of what its components are doing or will later be useful for. Do not hesitate to quickly review the previous chapters and the vulkano documentation as necessary.
If you are curious and relatively ahead of the group in terms of progress, consider also checking out the constructors of the various vulkano objects involved in order to learn more about the many optional features and configuration tunables that we could have used, but chose not to.
1. Vulkan physical devices may sadly not map into a physical piece of hardware in your computer. For example, Linux users will often see the llvmpipe GPU emulator in their physical device list. The reason why Vulkan calls them physical devices anyway is that some API naming trick was needed in order to distinguish these uninitialized devices that can just be queried for properties, from the initialized device objects that we will spend most of our time using later on.
2. Part of the reason why Vulkan makes device selection explicit, instead of arbitrarily picking one device by default like most GPU APIs do, is that it makes multi-GPU workflows easier. Since you are always specifying which device you are using as a parameter to your Vulkan commands, refactoring a program that uses a single GPU to use multiple ones is easier when using Vulkan. This is great because single-device programs are easier to write and test and therefore best for initial prototyping.
3. Among other things, multi-GPU programs may require load balancing between devices of unequal performance capabilities, more complex profiling and debugging workflows, a careful balance between the goals of using all available computing power and avoiding slow cross-device communication… and these are just the most obvious issues. More advanced concerns include the inefficiency of using a CPU-based GPU emulation compared to an optimized CPU implementation, and thermal throttling issues that arise when firing up multiple devices that share a common heatsink, like a CPU and its integrated GPU.
4. One example of a system environment where this simple strategy is not good enough would be a worker node in an HPC center running an older version of the Slurm scheduler. These nodes typically contain a number of nearly-identical GPUs that only differ by PCI bus address and UUID. Older versions of Slurm would expose all GPUs to your program, but tell it which GPUs were allocated to your job using an environment variable whose name and syntax are specific to the underlying GPU vendor. Vendor-specific compute runtimes like NVidia CUDA and AMD ROCm would then parse these environment variables and adjust their implicit device selection strategy accordingly. As you can imagine, implementing this sort of vendor-specific hackery does not amuse the Vulkan programmer, but thankfully newer versions of Slurm have finally learned how to hide unallocated GPUs using cgroups.
5. Even binary format compatibility is not guaranteed, so a GPU driver update can be all it takes to break binary compatibility with previously compiled GPU programs.
6. To be fair, an attempt was made in previous GPU APIs like OpenGL and OpenCL to allow programmers to export and manage pre-compiled GPU modules and programs. But it was later discovered that this feature had to be supplemented with extra compilation and caching on the GPU driver side, which defeated its purpose. Indeed, the most optimized version of a GPU program could depend on some specifics of how memory resources were bound to it, and in legacy APIs this was not known until resource binding time, which would typically occur after unsuspecting developers had already exported their GPU binaries. This is why the notion of graphics and compute pipelines, which we will cover soon, was introduced into Vulkan.
7. Largely because any self-respecting libc memory allocator implementation already features these optimizations. Which means that it is only in relatively niche use cases that programmers will benefit from re-implementing these optimizations themselves, without also coming to the realization that they are doing a lot more memory allocations than they should and could achieve a much greater speedup by rethinking their memory management strategy entirely.