Asynchronous compute

So far, the main loop of our simulation has looked like this:

for _ in 0..options.num_output_images {
    // Prepare a GPU command buffer that produces the next output
    let cmdbuf = runner.schedule_next_output()?;

    // Submit the work to the GPU and wait for it to execute
    cmdbuf
        .execute(context.queue.clone())?
        .then_signal_fence_and_flush()?
        .wait(None)?;

    // Process the simulation output, if enabled
    runner.process_output()?;
}

This logic is somewhat unsatisfactory because it forces the CPU and GPU to work in lockstep:

  1. At first, the CPU is preparing GPU commands while the GPU is waiting for commands.
  2. Then the CPU submits GPU commands and waits for them to execute, so the GPU is working while the CPU is waiting for the GPU to be done.
  3. At the end, the CPU processes GPU results while the GPU is waiting for commands.

In other words, we expect this sort of execution timeline…

Timeline of lockstep CPU/GPU execution

…where “C” stands for command buffer recording, “Execution” stands for execution of GPU work (simulation + results downloads) and “Results” stands for CPU-side results processing.

Command buffer recording has been purposely abbreviated in this diagram as it is expected to be faster than the other two steps, whose relative performance will depend on runtime configuration (relative speed of the GPU and storage device, number of simulation steps per output image…).

In this chapter, we will see how we can make more of these steps happen in parallel.

Command recording

Theory

Because command recording is fast with respect to command execution, allowing the two to happen in parallel is not going to save much time. But it will be an occasion to introduce some optimization techniques that we will later apply on a larger scale to achieve the more ambitious goal of overlapping GPU operations with CPU result processing.

First of all, by looking back at the main simulation loop above, it should be clear to you that it is not possible for the recording of the first command buffer to happen in parallel with its execution. We cannot execute a command buffer that is still in the process of being recorded.

What we can do, however, is change the logic of our main simulation loop after this first command buffer has been submitted to the GPU:

  • Instead of immediately waiting for the GPU to finish the freshly submitted work, we can start preparing a second command buffer on the CPU side while the GPU is busy working on the first command buffer that we just sent.
  • Once that second command buffer is ready, we have nothing else to do on the CPU side (for now), so we will just finish executing our first main loop iteration as before: await GPU execution, then process results.
  • By the time we reach the second main loop iteration, we will be able to reap the benefits of our optimization by having a second command buffer that can be submitted right away.
  • And then the cycle will repeat: we will prepare the command buffer for the third main loop iteration while the GPU work associated with the second main loop iteration is executing, then we will wait, process the second GPU result, and so on.

This sort of ahead-of-time command buffer preparation will result in parallel CPU/GPU execution through pipelining, a general-purpose optimization technique that can improve execution speed at the expense of some duplicated resource allocation and reduced code clarity.

In this particular case, the resource that is being duplicated is the command buffer. Before, we only had one command buffer in flight at any point in time. Now we intermittently have two of them, one that is being recorded while another one is executing.

And in exchange for this resource duplication, we expect to get a new execution timeline…

Timeline of CPU/GPU execution with async command recording

…where command buffer recording and GPU work execution can run in parallel as long as the simulation produces at least two images, resulting in a small performance gain.

Implementation

The design of the SimulationRunner, introduced in the previous chapter, allows us to implement this pipelining optimization through a small modification of our run_simulation() main loop:

// Prepare the first command buffer
let mut cmdbuf = Some(runner.schedule_next_output()?);

// Produce the requested amount of concentration tables
let num_output_images = options.num_output_images;
for image_idx in 0..num_output_images {
    // Submit the command buffer to the GPU and prepare to wait for it
    let future = cmdbuf
        .take()
        .expect("if this iteration executes, a command buffer should be present")
        .execute(context.queue.clone())?
        .then_signal_fence_and_flush()?;

    // Prepare the next command buffer, if any
    if image_idx != num_output_images - 1 {
        cmdbuf = Some(runner.schedule_next_output()?);
    }

    // Wait for the GPU to be done
    future.wait(None)?;

    // Process the simulation output, if enabled
    runner.process_output()?;
}

How does this work?

  • Before we begin the main simulation loop, we initialize the simulation pipeline by building a first command buffer, which we do not submit to the GPU right away.
  • On each simulation loop iteration, we submit a previously prepared command buffer to the GPU. But in contrast with our previous logic, we do not wait for it right away. Instead, we prepare the next command buffer (if any) while the GPU is executing work.
  • Submitting a command buffer to the GPU moves it away, and the static analysis within the Rust compiler that detects use-after-move is unfortunately too simple to understand that we are always going to put another command buffer in its place before the next loop iteration, if there is a next loop iteration. We must therefore play a little trick with the Option type in order to convince the compiler that our code is correct:
    • At first, we wrap our initial command buffer into Some(), thus turning what was an Arc<PrimaryAutoCommandBuffer> into an Option<Arc<PrimaryAutoCommandBuffer>>.
    • When the time comes to submit a command buffer to the GPU, we use the take() method of the Option type to retrieve our command buffer, leaving a None in its place.
    • We then use expect() on the resulting Option to assert that we know it previously contained a command buffer, rather than a None.
    • Finally, when we know that there is going to be a next loop iteration, we prepare the associated command buffer and put it back in the cmdbuf option variable.
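
If the compiler's objection feels abstract, here is a minimal standalone sketch of the same trick, using a hypothetical NotCopy type that has nothing to do with the simulation code. Moving a bare value out of a variable inside a loop would be rejected with a use-of-moved-value error, because the compiler cannot prove that the variable is refilled before the next iteration reads it; wrapping the value in an Option sidesteps the issue, since take() only moves the contents out and leaves the variable itself initialized (as None):

struct NotCopy;
fn consume(_: NotCopy) {}

fn option_juggling(n: usize) {
    // A bare `let mut value = NotCopy;` moved out by `consume(value)` would
    // trigger error[E0382] on the next iteration. The Option works around it.
    let mut slot = Some(NotCopy);
    for i in 0..n {
        let item = slot
            .take()
            .expect("refilled on every iteration but the last");
        if i != n - 1 {
            slot = Some(NotCopy);
        }
        consume(item);
    }
}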

While this trick may sound expensive from a performance perspective, you must understand that…

  • The Rust compiler’s LLVM backend is a bit more clever than its use-after-move detector, and therefore likely to figure this out and optimize out the Option checks.
  • Even if LLVM does not manage, it is quite unlikely that the overhead of checking a boolean flag (testing if an Option is Some) will have any meaningful performance impact compared to the surrounding overhead of scheduling GPU work.

Conclusion

After this coding interlude, we are ready to reach some preliminary conclusions:

  • Pipelining is an optimization that can be applied when a computation has two steps A and B that execute on different hardware, and step A produces an output that step B consumes.
  • Pipelining allows you to run steps A and B in parallel, at the expense of…
    • Needing to juggle with multiple copies of the output of step A (which will typically come at the expense of a higher application memory footprint).
    • Having a more complex initialization procedure before your main loop, in order to bring your pipeline to the fully initialized state that your main loop expects.
    • Needing some extra logic to avoid unnecessary work at the end of the main loop, if you have real-world use cases where the number of main loop iterations is small enough that this extra work has measurable overhead.
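
These bullet points describe the pattern in the abstract, for any two steps A and B that run on different hardware. As a purely illustrative sketch (the produce() and consume() functions below are hypothetical, and a second CPU thread stands in for the GPU), the same loop structure looks like this:

use std::thread;

/// Hypothetical step A: produce some data (stand-in for the GPU work)
fn produce(i: usize) -> Vec<u64> {
    (0..1_000_000u64).map(|x| x * i as u64).collect()
}

/// Hypothetical step B: consume that data (stand-in for results processing)
fn consume(data: Vec<u64>) -> u64 {
    data.into_iter().sum()
}

/// Pipelined loop: a prologue fills the pipeline, each iteration overlaps the
/// production of output N+1 with the consumption of output N, and the last
/// iteration skips the now-useless production step
fn pipelined(num_outputs: usize) -> u64 {
    let mut total = 0;
    // Prologue: start producing the first output before the loop begins
    let mut pending = Some(thread::spawn(|| produce(0)));
    for i in 0..num_outputs {
        // Wait for the previously scheduled output...
        let data = pending
            .take()
            .expect("set by the prologue or the previous iteration")
            .join()
            .expect("producer thread should not panic");
        // ...schedule the next one, unless this is the last iteration...
        if i != num_outputs - 1 {
            pending = Some(thread::spawn(move || produce(i + 1)));
        }
        // ...then consume the previous output while the next one is produced
        total += consume(data);
    }
    total
}

In our simulation, submitting a command buffer to the GPU plays the role of thread::spawn(), and waiting on the fence plays the role of join().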

As for performance benefits, you have been warned at the beginning of this section that command buffer recording is only pipelined here because it provides an easy introduction to the technique, not because the author considers it to be a worthwhile optimization on its own.

And indeed, even in a best-case microbenchmarking scenario, the asymptotic performance benefit will be below our performance measurements’ sensitivity threshold…

run_simulation/workgroup16x16/domain2048x1024/total512/image1/compute
                        time:   [136.74 ms 137.80 ms 138.91 ms]
                        thrpt:  [7.7300 Gelem/s 7.7921 Gelem/s 7.8527 Gelem/s]
                 change:
                        time:   [-3.2317% -1.4858% +0.3899%] (p = 0.15 > 0.05)
                        thrpt:  [-0.3884% +1.5082% +3.3396%]
                        No change in performance detected.

Still, this pipelining optimization does not hurt much either, and it serves as a gateway to more advanced ones. So we will keep it for now.

Results processing

Theory

Encouraged by this first taste of pipelining, you may now try to achieve full CPU/GPU execution pipelining, in which GPU-side work execution and CPU-side results processing can overlap…

Timeline of CPU/GPU execution with async results processing

…but your first attempt will likely end with a puzzling compile-time or run-time error, which you will stare at blankly for a few minutes of incomprehension, before you figure it out and thank rustc or vulkano for saving you from yourself.

Indeed, there is a trap with this form of pipelining, and one that is easy to fall into: if you are not careful, you are likely to end up accessing a simulation result on the CPU side while the GPU is overwriting it with a newer result. This is the textbook example of a variety of undefined behavior known as a data race.

To avoid this data race, we will need to add double buffering to our CPU-side VBuffer abstraction1, so that our CPU code can read result N at the same time as our GPU code is busy producing result N+1 and transferring it to the CPU side. And the logic behind our main simulation loop is going to become a bit more complicated again, as we now need to…

  • Make sure that by the time we enter the main simulation loop, a result is already available or in the process of being produced. Indeed, the clearest way to write pipelined code is to write each iteration of our main loop under the assumption that the pipeline is already operating at full capacity, taking any required initialization step to get there before the looping begins.
  • Rethink our CPU-GPU synchronization strategy so that the CPU code waits for a GPU result to be available before processing it, but does not start processing a result before having scheduled the production of the next result.

Double-buffered VBuffer

As mentioned above, we are going to need some double-buffering inside of VBuffer, which we are therefore going to rename to VBuffers. By now, you should be familiar with the basics of this pattern: we duplicate data storage and add a data member whose purpose is to track the respective role of each of our two buffers…

/// CPU-accessible double buffer used to download the V species' concentration
pub struct VBuffers {
    /// Buffers in which GPU data will be downloaded
    buffers: [Subbuffer<[Float]>; 2],

    /// Truth that the second buffer of the `buffers` array should be used
    /// for the next GPU-to-CPU download
    current_is_1: bool,

    /// Number of columns in the 2D concentration table, including zero padding
    padded_cols: usize,
}
//
impl VBuffers {
    /// Set up `VBuffers`
    pub fn new(options: &RunnerOptions, context: &Context) -> Result<Self> {
        use vulkano::memory::allocator::MemoryTypeFilter as MTFilter;
        let padded_rows = padded(options.num_rows);
        let padded_cols = padded(options.num_cols);
        let new_buffer = || {
            Buffer::new_slice(
                context.mem_allocator.clone(),
                BufferCreateInfo {
                    usage: BufferUsage::TRANSFER_DST,
                    ..Default::default()
                },
                AllocationCreateInfo {
                    memory_type_filter: MTFilter::PREFER_HOST | MTFilter::HOST_RANDOM_ACCESS,
                    ..Default::default()
                },
                (padded_rows * padded_cols) as DeviceSize,
            )
        };
        Ok(Self {
            buffers: [new_buffer()?, new_buffer()?],
            padded_cols,
            current_is_1: false,
        })
    }

    /// Buffer where data has been downloaded two
    /// [`schedule_download_and_flip()`] calls ago, and where data will be
    /// downloaded again on the next [`schedule_download_and_flip()`] call.
    ///
    /// [`schedule_download_and_flip()`]: Self::schedule_download_and_flip
    fn current_buffer(&self) -> &Subbuffer<[Float]> {
        &self.buffers[self.current_is_1 as usize]
    }

    // [ ... more methods coming up ... ]
}

…and then we need to think about when we should alternate between our two buffers. As a reminder, back when it contained a single buffer, the VBuffer type exposed two methods:

  • schedule_download(), which prepared a GPU command whose purpose was to transfer the current GPU V concentration input to the VBuffer.
  • process(), which was called after the previous command was done executing, and took care of executing CPU post-processing work.

Between the points where these two methods are called, a VBuffer could not be used, because it was in the process of being overwritten by the GPU. If we transpose this to our new double-buffered design, that sounds like a good point to flip the roles of our two buffers: while the GPU is busy writing to one of the internal buffers, we can do something else with the other buffer:

impl VBuffers {
    // [ ... ]

    /// Schedule a download of some [`Concentrations`]' current V input into
    /// [`current_buffer()`](Self::current_buffer), then switch to the other
    /// internal buffer.
    ///
    /// Intended usage is...
    ///
    /// - Schedule two simulation updates and output downloads,
    ///   keeping around the associated futures F1 and F2
    /// - Wait for the first update+download future F1
    /// - Process results on the CPU using [`process()`](Self::process)
    /// - Schedule the next update+download, yielding a future F3
    /// - Wait for F2, process results, schedule F4, etc
    pub fn schedule_download_and_flip(
        &mut self,
        source: &Concentrations,
        cmdbuild: &mut CommandBufferBuilder,
    ) -> Result<()> {
        cmdbuild.copy_buffer(CopyBufferInfo::buffers(
            source.current_inout().input_v.clone(),
            self.current_buffer().clone(),
        ))?;
        self.current_is_1 = !self.current_is_1;
        Ok(())
    }

    // [ ... ]
}

As the doc comment indicates, we can use the schedule_download_and_flip() method as follows…

  1. Schedule a first and second GPU update in short succession
    • Both buffers are now in use by the GPU
    • current_buffer() points to the buffer V1 that is undergoing the first GPU update and will be ready for CPU processing first.
  2. Wait for the first GPU update to finish.
    • Current buffer V1 is now ready for CPU readout.
    • Buffer V2 is still being processed by the GPU.
  3. Perform CPU-side processing on the current buffer V1.
  4. We don’t need V1 anymore after this is done, so schedule a third GPU update.
    • current_buffer() now points to buffer V2, which is still undergoing the second GPU update we scheduled.
    • Buffer V1 is now being processed by the third GPU update.
  5. Wait for the second GPU update to finish.
    • Current buffer V2 is now ready for CPU readout.
    • Buffer V1 is still being processed by the third GPU update.
  6. We are now back at step 3 but with the roles of V1 and V2 reversed. Repeat steps 3 to 5, reversing the roles of V1 and V2 each time, until all desired outputs have been produced.

…which means that the logic of the process() function will basically not change, aside from the fact that it will now use current_buffer() instead of the former single internal buffer:

impl VBuffers {
    // [ ... ]

    /// Process the latest download of the V species' concentrations
    ///
    /// See [`schedule_download_and_flip()`] for intended usage.
    ///
    /// [`schedule_download_and_flip()`]: Self::schedule_download_and_flip
    pub fn process(&self, callback: impl FnOnce(ArrayView2<Float>) -> Result<()>) -> Result<()> {
        // Access the underlying dataset as a 1D slice
        let read_guard = self.current_buffer().read()?;

        // Create an ArrayView2 that covers the whole data, padding included
        let padded_cols = self.padded_cols;
        let padded_elements = read_guard.len();
        assert_eq!(padded_elements % padded_cols, 0);
        let padded_rows = padded_elements / padded_cols;
        let padded_view = ArrayView::from_shape([padded_rows, padded_cols], &read_guard)?;

        // Extract the central region of padded_view, excluding padding
        let data_view = padded_view.slice(s!(
            PADDING_PER_SIDE..(padded_rows - PADDING_PER_SIDE),
            PADDING_PER_SIDE..(padded_cols - PADDING_PER_SIDE),
        ));

        // We are now ready to run the user callback
        callback(data_view)
    }
}
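
As a usage illustration (the inspect_latest() function below is hypothetical, not part of the simulation code), the callback receives a padding-free 2D view of the downloaded concentrations, so ndarray's usual iterators and reductions work on it directly:

/// Hypothetical usage sketch: inspect the latest downloaded result, assuming
/// that the associated GPU download has already finished executing
fn inspect_latest(v_buffers: &VBuffers) -> Result<()> {
    v_buffers.process(|v| {
        let max_v = v.iter().copied().fold(Float::MIN, Float::max);
        println!("{}x{} table, max V concentration = {max_v}", v.nrows(), v.ncols());
        Ok(())
    })
}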

Simulation runner changes

Within the implementation of SimulationRunner, our OutputHandler struct, which governs simulation output downloads, must now be updated to contain VBuffers instead of a VBuffer…

/// State associated with output downloads and post-processing, if enabled
struct OutputHandler<ProcessV> {
    /// CPU-accessible location to which GPU outputs should be downloaded
    v_buffers: VBuffers,

    /// User-defined post-processing logic for this CPU data
    process_v: ProcessV,
}

…and the SimulationRunner::new() constructor must be adjusted accordingly:

impl<'run_simulation, ProcessV> SimulationRunner<'run_simulation, ProcessV>
where
    ProcessV: FnMut(ArrayView2<Float>) -> Result<()>,
{
    /// Set up the simulation
    fn new(
        options: &'run_simulation RunnerOptions,
        context: &'run_simulation Context,
        process_v: Option<ProcessV>,
    ) -> Result<Self> {
        // [ ... ]

        // Set up the logic for post-processing V concentration, if enabled
        let output_handler = if let Some(process_v) = process_v {
            Some(OutputHandler {
                v_buffers: VBuffers::new(options, context)?,
                process_v,
            })
        } else {
            None
        };

        // [ ... ]
    }

    // [ ... ]
}

But that is the easy part. The slightly harder part is that in order to achieve the desired degree of pipelining, we should revise the responsibility of the schedule_next_output() method so that instead of simply building a command buffer, it also submits it to the GPU for eager execution:

impl<'run_simulation, ProcessV> SimulationRunner<'run_simulation, ProcessV>
where
    ProcessV: FnMut(ArrayView2<Float>) -> Result<()>,
{
    // [ ... ]

    /// Submit a GPU job that will produce the next simulation output
    fn schedule_next_output(&mut self) -> Result<FenceSignalFuture<impl GpuFuture + 'static>> {
        // Schedule a number of simulation steps
        schedule_simulation(
            self.options,
            &self.pipelines,
            &mut self.concentrations,
            &mut self.cmdbuild,
        )?;

        // Schedule a download of the resulting V concentration, if enabled
        if let Some(handler) = &mut self.output_handler {
            handler
                .v_buffers
                .schedule_download_and_flip(&self.concentrations, &mut self.cmdbuild)?;
        }

        // Extract the old command buffer builder, replacing it with a blank one
        let old_cmdbuild =
            std::mem::replace(&mut self.cmdbuild, command_buffer_builder(self.context)?);

        // Build the command buffer and submit it to the GPU
        let future = old_cmdbuild
            .build()?
            .execute(self.context.queue.clone())?
            .then_signal_fence_and_flush()?;
        Ok(future)
    }

    // [ ... ]
}

…and being nice people, we should also warn users of the process_output() method that although its implementation has not changed much, its usage contract has become more complicated:

impl<'run_simulation, ProcessV> SimulationRunner<'run_simulation, ProcessV>
where
    ProcessV: FnMut(ArrayView2<Float>) -> Result<()>,
{
    // [ ... ]

    /// Process the simulation output, if enabled
    ///
    /// This method is meant to be used in the following way:
    ///
    /// - Initialize the simulation pipeline by submitting two simulation jobs
    ///   using [`schedule_next_output()`](Self::schedule_next_output)
    /// - Wait for the first simulation job to finish executing
    /// - Call this method to process the output of the first job
    /// - Submit a third simulation job
    /// - Wait for the second simulation job to finish executing
    /// - Call this method to process the output of the second job
    /// - ...and so on, until all simulation outputs have been processed...
    fn process_output(&mut self) -> Result<()> {
        if let Some(handler) = &mut self.output_handler {
            handler.v_buffers.process(&mut handler.process_v)?;
        }
        Ok(())
    }
}

Once this is done, migrating run_simulation() to the new pipelined logic becomes straightforward:

/// Simulation runner, with a user-specified output processing function
pub fn run_simulation<ProcessV: FnMut(ArrayView2<Float>) -> Result<()>>(
    options: &RunnerOptions,
    context: &Context,
    process_v: Option<ProcessV>,
) -> Result<()> {
    // Set up the simulation
    let mut runner = SimulationRunner::new(options, context, process_v)?;

    // Schedule the first simulation update
    let mut current_future = Some(runner.schedule_next_output()?);

    // Produce the requested amount of concentration tables
    let num_output_images = options.num_output_images;
    for image_idx in 0..num_output_images {
        // Schedule the next simulation update, if any
        let next_future = (image_idx < num_output_images - 1)
            .then(|| runner.schedule_next_output())
            .transpose()?;

        // Wait for the GPU to be done with the previous update
        current_future
            .expect("if this loop iteration executes, a future should be present")
            .wait(None)?;
        current_future = next_future;

        // Process the simulation output, if enabled
        runner.process_output()?;
    }
    Ok(())
}

As previously hinted, this is quite similar to the previous logic, where we used to prepare a new command buffer while the previous one was executing, except that we now take it further and submit a full GPU job while another full GPU job is already executing.

And as before, some tricks must be played with Option in order to convince the Rust compiler's borrow checker that even though we do not schedule a new computation on the last loop iteration, our Option of FenceSignalFuture will always contain a future whenever we need one (or the program will panic at runtime), and therefore use-after-move of GPU futures cannot happen.
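
From the caller's perspective, nothing changes. As a purely hypothetical call-site sketch (the RunnerOptions and Context values are assumed to be set up as in the previous chapters), the pipelined runner is invoked exactly like the old one:

// Run the pipelined simulation, recording the total V concentration of each
// output image via a user-provided post-processing closure
let mut totals = Vec::with_capacity(options.num_output_images);
run_simulation(
    &options,
    &context,
    Some(|v: ArrayView2<Float>| -> Result<()> {
        totals.push(v.iter().copied().sum::<Float>());
        Ok(())
    }),
)?;
println!("Recorded {} output totals", totals.len());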

Exercise

Implement the pipelining techniques described above, then adapt the simulation executable and benchmark, and check the impact on performance.

You should observe a noticeable performance improvement when running the main simulation binary over a fast storage device like /dev/shm, as the previous version of the simulation spent too much time waiting for the GPU to keep such a fast storage device busy.

Microbenchmarks will see more modest performance improvements, except for compute+download+sum, which improves greatly in scenarios that produce many concentration tables, because its CPU post-processing can now run in parallel with GPU work.


  1. Another example of how pipelining requires duplicating resources (in this case, memory).