Execution
It has been quite a long journey, but we are now reaching the end of our Vulkan tour!
In the last chapter, we supplemented the GPU compute pipeline that we set up previously with a set of memory resources. We can use those to initialize our dataset on the CPU side, move it to the fastest kind of GPU memory available, bind that fast memory to our compute pipeline, and bring results back to the CPU side once the computation is over.
There is just one missing piece before we turn these various components into a fully functional application. But it is quite a big one. How do we actually ask the GPU to perform time-consuming work, like copying data or executing compute pipelines? And how do we know when that work is over? In other words, how do we submit and synchronize with GPU work?
In this chapter, we will finally answer that question, which will allow us to put everything together into a basic GPU application, complete with automated testing and performance benchmarks.
Vulkan execution primer
Problem statement
Because they are independent pieces of hardware, CPUs and GPUs have a natural ability to work concurrently. As the GPU is processing some work, nothing prevents the CPU from doing other work on its side. This can be useful even in “pure GPU” computations, where the CPU work serves no other purpose than to collect and save GPU results or prepare the execution of more GPU work.
Competent GPU programmers know this and will leverage this concurrent execution capability whenever it can help the application at hand, which is why synchronous GPU APIs that force the CPU to wait for the GPU when it doesn’t need to are a performance crime. In the world of GPU programming, any GPU command that can take a nontrivial amount of time to process, and is expected to be used regularly throughout an application’s lifetime, should be asynchronous.
But experience with Vulkan’s predecessor OpenGL, whose implementors already understood the importance of asynchronism, revealed that there is a lot more to GPU command execution performance than making every major API command asynchronous:
- Sending commands from the CPU to the GPU comes at a significant cost. If the API does not allow applications to amortize this cost by sending commands in batches, then GPU drivers will have to batch them on their own, resulting in unpredictable delays between the moment where applications call API commands and the moment where they start executing.
- Scheduling GPU work also involves some CPU work, which can accumulate into significant overhead in applications that need lots of short-running GPU commands. Distributing this CPU work across multiple CPU threads could give applications more headroom before it becomes a performance bottleneck… if OpenGL’s global state machine design did not get in the way.
- Assuming several CPU threads do get involved, funneling all commands through a single command submission interface can easily become a bottleneck. So GPU hardware provides multiple submission interfaces, but the only way to use them with zero thread contention is for GPU APIs to expose them in such a way that each CPU thread can get one.
- Once we accept the idea of multiple submission interfaces, it is not that much of a stretch to introduce specialized submission interfaces for commands that have a good chance to execute in parallel with other ones. This way, GPU hardware and drivers do not need to look far ahead in the command stream to find such commands, then prove that executing them earlier than expected won’t break the application’s sequential command execution logic.
- Speaking of sequential command execution, the promise made by older GPU APIs that GPU commands will execute one after another without observable overlap is fundamentally at odds with how modern pipelined and cache-incoherent GPU hardware works. Maintaining this illusion requires GPU drivers to automatically inject many pipeline and memory barriers, causing reduced hardware utilization and some CPU overhead. It would be great if applications could control this mechanism to selectively allow the kinds of command execution overlap and cache incoherence that do not affect their algorithm’s correctness.
- But there is more to GPU work synchronization than command pipelining control. Sometimes, CPU threads do need to wait for GPU commands to be done executing some work. And now that we have multiple channels for GPU work submission, we also need to think about inter-command dependencies across those channels, or even across multiple GPUs. OpenGL provided few tools to do this besides the glFinish() sledgehammer, which waits for all previously submitted work to complete, thus creating a humongous GPU pipeline bubble while imposing a lot more waiting than necessary on the CPU. For any application of even mild complexity, finer-grained synchronization would be highly desirable.
What Vulkan provides
Building upon decades of OpenGL application and driver experience, Vulkan set out to devise a more modern GPU command submission and synchronization model that resolves all of the above problems, at the expense of a large increase in conceptual complexity:
- Commands are not directly submitted to the GPU driver, but first collected in batches called command buffers (solving problem #1). Unlike in OpenGL, recording a command buffer does not involve modifying any kind of global GPU/driver state, so threads can easily record command buffers in parallel (solving problem #2).
- Multiple hardware command submission channels are exposed in the API via queues (solving problem #3), which are grouped into queue families to express specialization towards particular kinds of work that is more likely to execute in parallel (solving problem #4).
- GPUs may overlap command execution in any way they like by default, without enforcing a consistent view of GPU memory across concurrent commands. Applications can restrict this freedom whenever necessary by inserting pipeline and memory barriers between two commands within a command buffer (solving problem #5).
- While Vulkan still has a device-wide “wait for idle” operation, which is supplemented by a threading-friendly queue-local version, it is strongly discouraged to use such synchronization for any other purpose than debugging. Finer-grained synchronization primitives are used instead for everyday work (solving problem #6):
- Fences let CPU code wait for a specific batch of previously submitted commands. They are specialized for everyday “wait for CPU-accessible buffers to be filled before reading them” scenarios, and should be the most efficient tool for these use cases.
- Events allow GPU commands within a queue to await a signal that can be sent by a previous command within the same queue (as a more flexible but more expensive alternative to the barriers discussed above) or by host code.
- Semaphores provide maximal flexibility at the expense of maximal overhead. They can be signaled by the host or by a GPU command batch, and can be awaited by the host or by another GPU command batch. They are the only Vulkan synchronization primitive that allows work to synchronize across GPU queues without CPU intervention.
The vulkano layer
As you can imagine, all this new API complexity can take a while to master and has been a common source of application correctness and performance bugs. This proved especially true in the area of command pipelining, where Vulkan’s “allow arbitrary overlap and incoherence by default” strategy has proven to be a remarkably powerful source of application developer confusion.
Other modern performance-oriented GPU APIs like Apple Metal and WebGPU have thus refused to follow this particular path. Their design rationale was that even though the latency increase caused by forbidding observable command overlap cannot be fully avoided, most of the associated throughput loss can be recovered by letting enough commands execute in parallel across device queues, which should be good enough for sufficiently complex applications.
But that is forgetting a bit quickly that Vulkan is about choice. When the GPU API provides you with the most performant but least ergonomic way to do something, nothing prevents you from building a higher-level layer on top of it that improves ergonomics at the expense of some performance loss. Whereas if you start from a higher-level API, making it lower-level to improve performance at the expense of ergonomics can be impossible. This is why good layered abstraction design with high-quality low-level layers matters, and in this scheme Vulkan was designed to be the ultimate low-level layer, not necessarily a high-level layer that all applications should use directly.
In the case of Vulkan programming in Rust, we have vulkano for this purpose, and in this area like others it delivers as expected:
- As a default choice, the high-level AutoCommandBufferBuilder layer implements a simple Metal-like command queuing model. It should provide good enough performance for typical numerical computing applications, with much improved ergonomics over raw Vulkan.
- If you ever face a performance problem that originates from the resulting lack of GPU command overlap, or from the overhead of CPU-side state tracking (which automatic barrier insertion entails), that is not the end of the world. All you will need to do is to reach for the lower-level RecordingCommandBuffer unsafe layer, and locally face the full complexity of the Vulkan command pipelining model in the areas of your application that need it for performance. The rest of the application can remain largely untouched.
For the purpose of this course, we will not need lower-level control than what the high-level safe vulkano layer provides, so the remainder of this chapter will exclusively use that layer.
Command buffer
As mentioned above, Vulkan requires any nontrivial and potentially recurring GPU work to be packaged up into a command buffer before it can be submitted to the GPU for execution. In our first number-squaring example, the part of the work that qualifies as command buffer worthy is…
- Copying CPU-generated inputs to the fastest available kind of GPU memory
- Binding the compute pipeline so that future execution (dispatch) commands refer to it
- Binding the fast GPU buffer descriptor set so that the compute pipeline uses it
- Executing (dispatching) the compute pipeline with a suitable number of workgroups
- Copying the output back to CPU-accessible memory
…and using vulkano, we can build a command buffer that does this as follows:
use vulkano::{
command_buffer::{
auto::{AutoCommandBufferBuilder, PrimaryAutoCommandBuffer},
CommandBufferUsage, CopyBufferInfo,
},
pipeline::PipelineBindPoint,
};
/// Build a command buffer that does all the GPU work
pub fn build_command_buffer(
context: &Context,
cpu_input: Subbuffer<[f32]>,
gpu_data: Subbuffer<[f32]>,
gpu_pipeline: Pipeline,
gpu_data_desc: Arc<DescriptorSet>,
pipeline_options: &PipelineOptions,
cpu_output: Subbuffer<[f32]>,
) -> Result<Arc<PrimaryAutoCommandBuffer>> {
// Set up a primary command buffer
let mut builder = AutoCommandBufferBuilder::primary(
context.comm_allocator.clone(),
context.queue.queue_family_index(),
CommandBufferUsage::OneTimeSubmit,
)?;
// Copy CPU inputs to the GPU side
builder.copy_buffer(CopyBufferInfo::buffers(cpu_input, gpu_data.clone()))?;
// Bind the compute pipeline for future dispatches
builder.bind_pipeline_compute(gpu_pipeline.compute)?;
// Bind memory to the compute pipeline
builder.bind_descriptor_sets(
PipelineBindPoint::Compute,
gpu_pipeline.layout,
DATA_SET,
gpu_data_desc,
)?;
// Execute the compute pipeline with an appropriate number of work groups
let num_work_groups = cpu_output
.len()
.div_ceil(pipeline_options.workgroup_size.get() as u64);
// SAFETY: GPU shader has been checked for absence of undefined behavior
// given a correct execution configuration, and this is one
unsafe {
builder.dispatch([num_work_groups as u32, 1, 1])?;
}
// Retrieve outputs back to the CPU side
builder.copy_buffer(CopyBufferInfo::buffers(gpu_data, cpu_output))?;
// Finalize the command buffer object
Ok(builder.build()?)
}
As usual, a few things should be pointed out about this code:
- That’s a lot of function parameters! This comes from the fact that this function asks the GPU to do many different things. Our example code is written that way because it allows us to introduce Vulkan concepts in a more logical order, but real-world Vulkan apps would benefit from spreading the command recording process across more functions that each take an &mut AutoCommandBufferBuilder as a parameter (see the sketch after this list).
- Generally speaking, functions should favor &mut AutoCommandBufferBuilder over building command buffers internally until you are ready to submit work to the GPU. This allows you to pack your GPU work into as few command buffers as possible, which may improve command execution efficiency.1
- We are building a primary command buffer, which can be directly submitted to the GPU. This is in contrast with secondary command buffers, which can be inserted into primary command buffers. The latter can be used to avoid repeatedly recording recurring commands,2 and it also combines really well with a Vulkan graphics rendering feature called render passes that falls outside of the scope of this compute-focused introductory course.
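To make the above recommendation concrete, here is a minimal sketch of what such a helper function could look like. It is not part of the exercise code: the record_upload name is hypothetical, and the exact generic parameters of AutoCommandBufferBuilder may differ across vulkano versions, but the types and methods are the same as in build_command_buffer above:
use vulkano::command_buffer::{
    auto::{AutoCommandBufferBuilder, PrimaryAutoCommandBuffer},
    CopyBufferInfo,
};
/// Hypothetical helper that only records the CPU-to-GPU upload step into an
/// existing command buffer builder, leaving finalization and submission to
/// the caller so that more commands can be packed into the same buffer
pub fn record_upload(
    builder: &mut AutoCommandBufferBuilder<PrimaryAutoCommandBuffer>,
    cpu_input: Subbuffer<[f32]>,
    gpu_data: Subbuffer<[f32]>,
) -> Result<()> {
    builder.copy_buffer(CopyBufferInfo::buffers(cpu_input, gpu_data))?;
    Ok(())
}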
Execution
The high-level vulkano API generally tries to stay close to the underlying Vulkan C API, using identical concepts and naming. Deviations from raw Vulkan must be motivated by the desire to be memory-, type- and thread-safe by default, in line with Rust’s design goals. This is good, as it makes it easier to take documentation about the Vulkan C API and apply it to vulkano-based programs.
However, there is one area where vulkano’s high-level API strays rather far from the concepts of its Vulkan backend, and that is command execution and synchronization. This makes sense, because that part of the vulkano API needs to guard against safety hazards related to GPU hardware asynchronously reading from and writing to CPU-managed objects, which is quite difficult.3
Instead of closely matching the Vulkan functions used for command buffer submission (like vkQueueSubmit()) and synchronization (like vkWaitForFences()), the high-level vulkano API for command submission and synchronization therefore currently4 works as follows:
- Prepare to send a PrimaryAutoCommandBuffer to a Queue using its execute() method. This produces a CommandBufferExecFuture, which is a special kind of GpuFuture. GPU future objects model events that will eventually occur, in this case the moment where commands within this particular command buffer will be done executing.
- It is very important to understand that at this point, the command buffer has not been sent to the GPU. Indeed, vulkano must expose the fact that Vulkan lets us send multiple command buffers to the GPU with a single vkQueueSubmit() operation in order to keep performance competitive against direct use of the Vulkan C API.
- Other GPU future objects are available, representing things like the signal that our system is ready to render a new frame in real-time graphics.
- Chain as many of these futures as desired using methods like then_execute() and join(), which respectively represent sequential and concurrent execution. This will produce a GpuFuture object that represents an arbitrarily complex graph of GPU tasks that may or may not be linked by sequential execution dependencies (see the sketch after this list).
- Indicate points where Vulkan synchronization objects (fences and semaphores) should be signaled using GPU future methods like then_signal_fence() and then_signal_semaphore().
- Submit all previously scheduled work to the GPU whenever desired using the flush() method. This does not destroy the associated GPU future object, so that you can keep scheduling more work after the work that has just started executing.
- One will often want to do this at points of the asynchronous task graph that are observable through signaling of a Vulkan synchronization object, which is why vulkano provides then_signal_fence_and_flush() and then_signal_semaphore_and_flush() API shortcuts for this use case.
- When the CPU momentarily runs out of work to submit to the GPU, await completion of some previous work using the wait() method of the particular GPU future type that is returned by the then_signal_fence() operation. Because making a CPU wait for GPU work for an unbounded amount of time is bad, polling and timeout options are also available here.
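As an illustration of this chaining vocabulary, here is a minimal sketch of how two pre-recorded command buffers could be scheduled one after the other before a single submission. The run_two_and_wait function and its commands1/commands2 parameters are hypothetical additions for illustration purposes; the methods are the ones listed above:
use vulkano::{
    command_buffer::{PrimaryAutoCommandBuffer, PrimaryCommandBufferAbstract},
    sync::future::GpuFuture,
};
/// Hypothetical example: run two command buffers in sequence, then wait
pub fn run_two_and_wait(
    context: &Context,
    commands1: Arc<PrimaryAutoCommandBuffer>,
    commands2: Arc<PrimaryAutoCommandBuffer>,
) -> Result<()> {
    commands1
        .execute(context.queue.clone())?                  // schedule the first batch
        .then_execute(context.queue.clone(), commands2)?  // then the second one after it
        .then_signal_fence_and_flush()?                   // submit everything, with a fence at the end
        .wait(None)?;                                     // block the CPU until the fence signals
    Ok(())
}
Independent work would instead be combined with join(), yielding a future that completes once both branches are done.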
In the context of our number-squaring example, we can use this API as follows:
use vulkano::{
command_buffer::PrimaryCommandBufferAbstract,
sync::future::GpuFuture,
};
/// Synchronously execute the previously prepared command buffer
pub fn run_and_wait(context: &Context, commands: Arc<PrimaryAutoCommandBuffer>) -> Result<()> {
commands
.execute(context.queue.clone())?
.then_signal_fence_and_flush()?
.wait(None)?;
Ok(())
}
Notice that we need to bring a few traits into scope in order to use the execute() method (whose implementation is shared across all kinds of primary command buffers using the PrimaryCommandBufferAbstract trait) and the GpuFuture trait’s methods (whose implementation is also shared across all kinds of GPU futures).
Exercises
Final executable
Now that we have all parts of the computation ready, we can put them all together into a complete program by adding all of the above functions to exercises/src/square.rs, then rewriting the exercises/src/bin/square.rs binary’s source into the following:
use clap::Parser;
use grayscott_exercises::{
context::{Context, ContextOptions},
square::{self, InputOptions, Pipeline, PipelineOptions},
Result,
};
/// This program generates and squares an array of numbers
#[derive(Parser, Debug)]
#[command(version, author)]
struct Options {
/// Vulkan context configuration
#[command(flatten)]
context: ContextOptions,
/// Compute pipeline configuration
#[command(flatten)]
pipeline: PipelineOptions,
/// Input data configuration
#[command(flatten)]
input: InputOptions,
}
fn main() -> Result<()> {
// Decode CLI arguments
let options = Options::parse();
// Set up a generic Vulkan context
let context = Context::new(&options.context)?;
// Set up a compute pipeline
let pipeline = Pipeline::new(&context, &options.pipeline)?;
// Set up memory resources
let cpu_input = square::setup_cpu_input(&context, &options.input)?;
let gpu_data = square::setup_gpu_data(&context, &options.input)?;
let gpu_data_desc = square::setup_descriptor_set(
&context,
&pipeline,
gpu_data.clone(),
)?;
let cpu_output = square::setup_cpu_output(&context, &options.input)?;
// Build a command buffer
let commands = square::build_command_buffer(
&context,
cpu_input.clone(),
gpu_data,
pipeline,
gpu_data_desc,
&options.pipeline,
cpu_output.clone(),
)?;
// Synchronously execute the command buffer
square::run_and_wait(&context, commands)?;
// Check computation results
let cpu_input = cpu_input.read()?;
let cpu_output = cpu_output.read()?;
assert!((cpu_input.iter())
.zip(cpu_output.iter())
.all(|(input, output)| *output == input.powi(2)));
println!("All numbers have been squared correctly!");
Ok(())
}
Notice the use of the Subbuffer::read() method at the end, which is needed in order to allow vulkano to check for absence of data races between the CPU and the GPU.
Execute this binary in debug mode, then in release mode, while measuring execution times:
cargo build --bin square >/dev/null \
&& time cargo run --bin square \
&& cargo build --release --bin square >/dev/null \
&& time cargo run --release --bin square
Do you understand the outcome? If not, think about what you are measuring for a bit to see if you can figure it out for yourself, before moving to the next part.
Benchmarks
While timing complete programs like square can be nice for early performance exploration, as soon as you start getting into the realm of performance optimization it is good to figure out which part of the program is most critical to performance and measure it separately.
A benchmark has been set up to this end in exercises/benches/square.rs. Once you are done with the above exercise, it should compile and run. Try to run it with cargo bench --bench square, and use it to determine…
- Which parts of the process are slowest on the various devices exposed by your system.
- How their performance scales with the various tunable parameters.
Note that the cargo bench command also accepts a regular expression argument that can be used to only run selected benchmarks. It can be used like this:
cargo bench --bench square -- "(input1000|workgroup64)/"
These benchmarks are pretty long-running so…
- If you are running them on a laptop, plug in the power adapter
- If you are running them on an HPC center, run them on a worker node
- Do not hesitate to stop them ahead of time once you have seen what you want to see, then re-run them in a more restricted configuration.
One thing you can also do is to modify this program so that build_command_buffer schedules multiple compute shader dispatches instead of one (i.e. instead of only squaring numbers, you elevate them to the power of 2, 4, 8, 16…). Note that this does not require you to repeatedly re-bind the compute pipeline and data descriptor set.
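As a rough sketch of that modification, assuming a hypothetical num_dispatches parameter were added to build_command_buffer, the dispatch step of that function could become a loop while the pipeline and descriptor set bindings above it stay untouched:
// Execute the compute pipeline several times in a row: each dispatch squares
// the numbers again, so num_dispatches passes raise them to the power of
// 2^num_dispatches (num_dispatches is a hypothetical new parameter)
for _ in 0..num_dispatches {
    // SAFETY: same execution configuration as the single-dispatch version
    unsafe {
        builder.dispatch([num_work_groups as u32, 1, 1])?;
    }
}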
Modify the program to measure performance at various numbers of compute shader dispatches. What do you observe, and what does this tell you about the most likely GPU-side performance bottleneck(s) of this number-squaring program?
Optimizations
This last exercise may take you a significant amount of time and should only be worked on if you finished well ahead of the expected end of the course.
Get back your list of CPU-GPU data handling strategies from the exercise of the last chapter, implement one of them (preferably one of those that are easy to implement), and use the benchmark to measure how it affects execution performance.
- Ignoring for a moment the question of how expensive it is to create and build a command buffer (cheap but not free, as usual), some Vulkan implementations are known to insert pipeline barriers between consecutive command buffers, most likely in an attempt to simplify the implementation of synchronization primitives that operate at command buffer boundaries, like fences. ↩
- …but reusing command buffers like this comes at the cost of losing some GPU driver optimizations. It is therefore recommended that applications first attempt to resolve command recording bottlenecks by spreading the command recording load across multiple CPU threads, before falling back to secondary command buffers if that is not enough. ↩
- Rust’s compile-time safety proofs build on the fact that the compiler has full knowledge of the program’s execution and data flow, which is not true in the presence of asynchronous GPU work execution. This problem is also encountered when designing CPU threading APIs, where safety is commonly restored by introducing synchronization points that wait for all tasks to finish executing and release their data references (a design known as fork-join or structured concurrency). But as previously discussed, such synchronization points are unwelcome in GPU programming, which is why vulkano has instead gone for a run-time safety tracking mechanism that was largely custom-built for its use case. ↩
- The GpuFuture abstraction has a number of known flaws, and there is work ongoing in the vulkano-taskgraph crate with the aim of eventually replacing it. But that work is not ready for prime time yet. ↩