Asynchronous compute
So far, the main loop of our simulation has looked like this:
// Produce the requested amount of concentration tables
for _ in 0..options.runner.num_output_steps {
    // [ ... record GPU command buffer ... ]

    // Submit the work to the GPU and wait for it to execute
    cmdbuf
        .build()?
        .execute(context.queue.clone())?
        .then_signal_fence_and_flush()?
        .wait(None)?;

    // [ ... process GPU results ... ]

    // Set up the next command buffer
    cmdbuf = new_cmdbuf()?;
}
This logic is somewhat unsatisfactory because it forces the CPU and GPU to work in lockstep:
- At first, the CPU is preparing GPU commands while the GPU is waiting for commands.
- Then the CPU submits GPU commands and waits for them to execute, so the GPU is working while the CPU is waiting for the GPU to be done.
- At the end the CPU processes GPU results while the GPU is waiting for commands.
In other words, we expect this sort of execution timeline…
…where “C” stands for command buffer recording, “Execution” stands for execution of GPU work (simulation + results downloads) and “Results” stands for CPU-side results processing.
Command buffer recording has been purposely abbreviated in this diagram, as it is expected to be faster than the other two steps, whose relative performance will depend on runtime configuration (relative speed of the GPU and storage device, number of simulation steps per output image…).
In this chapter, we will see how we can make more of these steps happen in parallel.
Command recording
Theory
Because command recording is fast with respect to command execution, allowing them to happen in parallel is not going to save much time. But it will be an occasion to introduce some principles on a simplified problem, and we will later apply the same principles on a broader scale to achieve the more ambitious goal of overlapping GPU computations with CPU result processing.
First of all, by looking back at the main simulation loop above, it should be clear to you that it is not possible for the recording of the first command buffer to happen in parallel with its execution. We cannot execute a command buffer that is still in the process of being recorded.
What we can do, however, is change the logic of our main simulation loop after this first command buffer has been submitted to the GPU:
- Instead of immediately waiting for the GPU to finish the freshly submitted work, we can start preparing a second command buffer on the CPU side while the GPU is busy working on the first command buffer that we just sent.
- Once that second command buffer is ready, we have nothing else to do on the CPU side (for now), so we will just finish executing our first main loop iteration as before: await GPU execution, then process results.
- By the time we reach the second main loop iteration, we will be able to reap the benefits of our optimization by having a second command buffer that can be submitted right away.
- And then the cycle will repeat: we will prepare the command buffer for the third main loop iteration while the GPU work associated with the second main loop iteration is executing, then we will wait, process the second GPU result, and so on.
This sort of ahead-of-time command buffer preparation will result in parallel CPU/GPU execution through pipelining, a general-purpose optimization technique that can improve execution speed at the expense of some duplicated resource allocation and reduced code clarity.
In this particular case, the resource that is being duplicated is the command buffer. Previously, we only had one command buffer in flight at any point in time. Now we intermittently have two of them: one being recorded while another one is executing.
And in exchange for this resource duplication, we expect to get a new execution timeline…
…where command buffer recording and GPU work execution can run in parallel as long as the simulation produces at least two images, resulting in a small performance gain.
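Before diving into the real implementation, it may help to see this control flow in schematic form. The following sketch is plain Rust where Commands, record(), submit(), wait_for_gpu() and process_results() are hypothetical placeholders standing in for the simulation's real operations, not actual API from this codebase:
// Schematic sketch only: all names below are placeholders, not the
// simulation's real API.
struct Commands;

fn record() -> Commands {
    Commands
}

fn submit(_commands: Commands) {
    // Hand the recorded work over to the GPU (placeholder)
}

fn wait_for_gpu() {
    // Block until the submitted work has completed (placeholder)
}

fn process_results() {
    // CPU-side result processing (placeholder)
}

fn main() {
    let num_outputs = 3;

    // Prime the pipeline: one command buffer must exist before the loop starts
    let mut commands = record();

    for _ in 0..num_outputs {
        // Submit the previously recorded commands to the GPU
        submit(commands);

        // Overlap: record the next command buffer while the GPU is busy
        commands = record();

        // Only now wait for the GPU, then process its output
        wait_for_gpu();
        process_results();
    }

    // Like the first pipelined version discussed below, this sketch records
    // one extra command buffer on the last iteration that is never submitted.
}
Notice that one command buffer must be recorded before the loop starts, and that this naive version records one extra command buffer on the last iteration; both points will come up again in the implementation below.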
Implementation
In terms of code, we will first add a type alias for our favorite kind of
command buffer builder, as these will start popping up more often in our code
and the vulkano
type for that has an uncomfortably long name…
use vulkano::command_buffer::{AutoCommandBufferBuilder, PrimaryAutoCommandBuffer};

/// Convenience type alias for primary command buffer builders
type CommandBufferBuilder = AutoCommandBufferBuilder<PrimaryAutoCommandBuffer>;
…then we will extract the part of our code that appends simulation commands and builds the command buffer into an anonymous function, declared before the start of the main simulation loop, for reasons that will soon become clear:
use std::sync::Arc;

// Record simulation steps and user extras into a command buffer, then build it
let mut record_simulation_and_build =
    |cmdbuild: &mut CommandBufferBuilder| -> Result<Arc<PrimaryAutoCommandBuffer>> {
        // Schedule some simulation steps
        schedule_simulation(options, &pipelines, cmdbuild, &mut concentrations)?;

        // Schedule any other user-requested work after the simulation
        schedule_after_simulation(concentrations.current(), cmdbuild)?;

        // Extract the old command buffer builder, replacing it with a blank one
        let old_cmdbuild = std::mem::replace(cmdbuild, new_cmdbuf()?);

        // Build the command buffer
        Ok(old_cmdbuild.build()?)
    };
One thing worth pointing out here is the use of the std::mem::replace()
function. This standard library utility is used to work around a current
limitation of the Rust compiler’s static ownership and borrowing analysis, which
prevents it from accepting the following valid alternative:
let new_cmdbuild = new_cmdbuf()?;
let old_cmdbuild = *cmdbuild;
*cmdbuild = new_cmdbuild;
Click here for more details
The problem that the compiler is trying to protect us from is that because the
record_simulation_and_build()
function receives only a mutable reference to
the command buffer builder, it is not allowed to move the builder away, unless
it somehow replaces it with a different builder before the function returns.
This has to be the case, as otherwise we could get use-after-move Undefined Behavior…
// On the caller side
// Set up a command buffer builder
let mut cmdbuild = new_cmdbuf()?;
// Lend builder reference to record_simulation_and_build
record_simulation_and_build(&mut cmdbuild)?;
// Should still be able to use a builder after lending a reference to it, so
// this should not be use-after-move undefined behavior:
let cmdbuf = cmdbuild.build()?;
…but at present, the associated static analysis is overly conservative, and only lets us move a value away if it is immediately replaced within the same program instruction. Several utilities from std::mem let us work around this compiler limitation; std::mem::replace() is one of them.
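If you have not met std::mem::replace() before, here is a minimal, simulation-independent example of the pattern used above: moving a value out from behind a &mut reference by swapping a replacement in as part of the same operation (the drain_log() function and its Vec payload are made up for the example):
fn drain_log(log: &mut Vec<String>) -> Vec<String> {
    // `let drained = *log;` would be rejected: it would move the Vec out of
    // borrowed memory. Swapping a fresh Vec in at the same time is accepted.
    std::mem::replace(log, Vec::new())
}

fn main() {
    let mut log = vec!["first event".to_string(), "second event".to_string()];
    let drained = drain_log(&mut log);
    assert_eq!(drained.len(), 2);
    assert!(log.is_empty()); // `log` remains usable: it now holds the fresh Vec
}
For types that implement Default, std::mem::take() provides the same service with less typing; it does not apply to our command buffer builder because creating a blank builder goes through the fallible new_cmdbuf() helper rather than a Default implementation.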
Now that we have this new logic, we are going to manipulate a mixture of command
buffer builders and command buffers in main()
, so our former convention of
naming command buffer builders cmdbuf
is becoming confusing there. We
therefore rename the local variable to cmdbuild
…
let mut cmdbuild = new_cmdbuf()?;
…make sure a first command buffer gets built before the main simulation loop starts, as it will now assume the presence of a previously built command buffer…
// Set up the first command buffer
let mut cmdbuf = record_simulation_and_build(&mut cmdbuild)?;
…and we are finally ready to adapt the main simulation loop to our new pipelined logic:
// Produce the requested amount of concentration tables
for _ in 0..options.runner.num_output_steps {
    // Submit the command buffer to the GPU
    let future = cmdbuf
        .execute(context.queue.clone())?
        .then_signal_fence_and_flush()?;

    // Prepare the next command buffer
    cmdbuf = record_simulation_and_build(&mut cmdbuild)?;

    // Wait for the GPU to be done
    future.wait(None)?;

    // Perform operations after the GPU is done
    after_gpu_wait()?;
}
Are we done yet? Not quite! Our new pipelined simulation logic is not perfectly equivalent to the old one, because it builds one extra useless command buffer at the end, which will have a notable negative performance impact on simulation runs that produce few images.
In real-world code, this will rarely matter. But what if it does for your particular use case? Click here to find out what code changes are needed to eliminate this unnecessary work.
Unfortunately, another Rust compiler static analysis limitation will get in the way here, and will force us to logically break the invariant that a command buffer is present at the start of each loop iteration, by making the cmdbuf object optional.
// Set up the first command buffer, if any
let num_output_steps = options.runner.num_output_steps;
let mut cmdbuf = Some(record_simulation_and_build(&mut cmdbuild)?);
We will then name our main loop’s counter so we can detect the last iteration…
for output_step in 0..num_output_steps {
/* ... */
}
…modify the command buffer submission logic so that it assumes that a command
buffer is present and take()
s it from the original Option
, which moves it
away and replaces it with None
…
let future = cmdbuf
    .take()
    .expect("If this iteration executes, a command buffer should be present")
    .execute(context.queue.clone())?
    .then_signal_fence_and_flush()?;
…and finally modify the command buffer building logic so that it does not run
on the last simulation loop iteration, leaving the Option
at None
.
// Prepare the next command buffer, if any
if output_step != num_output_steps - 1 {
    cmdbuf = Some(record_simulation_and_build(&mut cmdbuild)?);
}
The reason why we need to go through this dance with Option
is that if we
simply attempted to do this in the obvious Option
-free manner…
// Prepare the next command buffer, if any
if output_step != num_output_steps - 1 {
    cmdbuf = record_simulation_and_build(&mut cmdbuild)?;
}
…the Rust compiler's static analysis will, as of 2025, not yet understand that it is fine for cmdbuf to be left in an invalid moved-from state when output_step == num_output_steps - 1, because that condition will only be true upon reaching the last iteration of the loop, and cmdbuf will not be used after the loop ends. The compiler will therefore wrongly complain that a moved-from cmdbuf may be used on the (nonexistent) next loop iteration.
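Here is a minimal, simulation-independent reproduction of that limitation, together with the Option::take() workaround. All names in this snippet are made up for the example:
fn main() {
    let n = 3;

    // Rejected: the compiler cannot prove that `value` is re-initialized
    // before the next iteration, because the reassignment is conditional.
    //
    // let mut value = String::from("step 0");
    // for i in 0..n {
    //     let consumed = value; // moved here...
    //     println!("{consumed}");
    //     if i != n - 1 {
    //         value = format!("step {}", i + 1); // ...but only maybe replaced
    //     }
    // }

    // Accepted: Option::take() leaves a valid `None` behind, so `value` is
    // never in a moved-from state as far as the compiler is concerned.
    let mut value = Some(String::from("step 0"));
    for i in 0..n {
        let consumed = value
            .take()
            .expect("a value should be present at the start of every iteration");
        println!("{consumed}");
        if i != n - 1 {
            value = Some(format!("step {}", i + 1));
        }
    }
}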
Conclusion
After this coding interlude, we are ready to reach some preliminary conclusions:
- Pipelining is an optimization that can be applied when a computation has two steps A and B that execute on different hardware, where step A produces an output that step B consumes.
- Pipelining allows you to run steps A and B in parallel, at the expense of…
- Needing to juggle multiple copies of the output of step A (which will typically come at the expense of a higher application memory footprint).
- Having a more complex initialization procedure before your main loop, in order to bring your pipeline to the fully initialized state that your main loop expects.
- Needing even more complex logic to avoid unnecessary work at the end of the main loop, if you have real-world use cases where the number of main loop iterations is small enough that this extra work has non-negligible overhead.
As for performance benefits, you have been warned at the beginning of this section that command buffer recording is only pipelined here because it gives you an easier introduction to this optimization, and not because the author considers it to be worthwhile in this simulation.
And indeed, even if we wave our magic microbenchmarking wand and make GPU-to-CPU transfers and storage I/O disappear, the asymptotic performance benefit on larger problems will be small:
run_simulation/16x16workgroup/2048x1024domain/32outputs/32steps/compute
                        time:   [194.64 ms 195.80 ms 197.04 ms]
                        thrpt:  [10.899 Gelem/s 10.968 Gelem/s 11.033 Gelem/s]
                 change:
                        time:   [-4.9268% -4.0704% -3.2867%] (p = 0.00 < 0.05)
                        thrpt:  [+3.3984% +4.2432% +5.1821%]
                        Performance has improved.
Results processing
Theory
Encouraged by the modest but tangible performance improvements that pipelined command buffer recording brought, you may now try to achieve full CPU/GPU execution pipelining, in which GPU-side work execution and CPU-side results processing can overlap…
…but your first attempt will likely end with a puzzling compile-time or
run-time error, which you will stare at blankly for a few minutes of
incomprehension, feeling betrayed by your software infrastructure, before you
figure it out and thank rustc
or vulkano
for saving you from yourself.
Indeed, there is a trap with this form of pipelining, and one that is easy to fall into: if you are not careful, you are likely to end up reading a simulation result on the CPU side while the GPU is overwriting it with a newer result at the same time. This is the textbook example of a kind of undefined behavior known as a data race.
To avoid this data race, we will need to add double buffering to our CPU-side
VBuffer
abstraction1, so that our CPU code can read result N at the same
time as our GPU code is busy producing result N+1 and transferring it to the CPU
side. And the logic behind our main simulation loop is going to become a little
bit more complicated again, because now we need to…
- Make sure that by the time we enter the main simulation loop, a result is already available or in the process of being produced. Indeed, the clearest way to write pipelined code is to write each iteration of our main loop under the assumption that the pipeline is already operating at full capacity, taking any required initialization step to get there before the looping begins.
- Rethink our CPU-GPU synchronization strategy so that the CPU code waits for a GPU result to be available before processing it, but does not start processing a result before having scheduled the production of the next result, as sketched below.
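While the actual code is left for the rest of this chapter, the intended control flow can already be sketched in simplified form. In this sketch, Downloads, submit_simulation(), wait_for_gpu() and process() are hypothetical placeholders rather than the simulation's real API, and the GPU work itself is elided:
// Schematic sketch only: all names below are placeholders, not the
// simulation's real API.
struct Downloads {
    front: Vec<f32>, // Result currently being read by the CPU
    back: Vec<f32>,  // Result currently being produced by the GPU
}

impl Downloads {
    fn swap(&mut self) {
        std::mem::swap(&mut self.front, &mut self.back);
    }
}

fn submit_simulation(_target: &mut Vec<f32>) {
    // Schedule simulation steps + download of the result into `_target`
}

fn wait_for_gpu() {
    // Block until the last submission has completed
}

fn process(_result: &[f32]) {
    // CPU-side result processing (e.g. writing an output image)
}

fn main() {
    let num_outputs = 3;
    let mut downloads = Downloads {
        front: vec![0.0; 16],
        back: vec![0.0; 16],
    };

    // Prime the pipeline: start producing the first result before the loop
    submit_simulation(&mut downloads.back);

    for step in 0..num_outputs {
        // Wait for the result that the GPU is currently producing...
        wait_for_gpu();

        // ...then make it the front buffer, freeing the other one for the GPU
        downloads.swap();

        // Schedule production of the next result (if any) *before* processing,
        // so that GPU execution overlaps with the CPU-side processing below
        if step != num_outputs - 1 {
            submit_simulation(&mut downloads.back);
        }

        // Process the previous result while the GPU works on the next one
        process(&downloads.front);
    }
}
Note how the double-buffered downloads are the duplicated resource of this new pipelining stage, as announced by the footnote above.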
TODO: Explain code, comment benchmarks, exercise
1. Another example of how pipelining requires duplication of (in this case memory) resources.