Execution
It has been quite a long journey, but we are now reaching the end of our Vulkan tour!
In the last chapter, we supplemented the GPU compute pipeline that we set up previously with a set of memory resources. We can use those to initialize our dataset on the CPU side, move it to the fastest kind of GPU memory available, bind that fast memory to our compute pipeline, and bring results back to the CPU side once the computation is over.
There is just one missing piece before we turn these various components into a fully functional application. But it is quite a big one. How do we actually ask the GPU to perform time-consuming work, like copying data or executing compute pipelines? And how do we know when that work is over? In other words, how do we submit and synchronize with GPU work?
In this chapter, we will finally answer that question, which will allow us to put everything together into a basic GPU application, complete with automated testing and performance benchmarks.
Vulkan execution primer
Problem statement
Because they are independent pieces of hardware, CPUs and GPUs have a natural ability to work concurrently. As the GPU is processing some work, nothing prevents the CPU from doing other work on its side. This can be useful even in “pure GPU” computations, where the CPU work serves no other purpose than to collect and save GPU results or prepare the execution of more GPU work.
Competent GPU programmers know this and will leverage this concurrent execution capability whenever it can help the application at hand, which is why synchronous GPU APIs that force the CPU to wait for the GPU when it doesn’t need to are a performance crime. In the world of GPU programming, any GPU command that can take a nontrivial amount of time to process, and is expected to be used regularly throughout an application’s lifetime, should be asynchronous.
But experience with Vulkan’s predecessor OpenGL, whose implementors already understood the importance of asynchronism, revealed that there is a lot more to GPU command execution performance than making every major API command asynchronous:
- Sending commands from the CPU to the GPU comes at a significant cost. If the API does not allow applications to amortize this cost by sending commands in batches, then GPU drivers will have to batch them on their own, resulting in unpredictable delays between the moment where applications call API commands and the moment where they start executing.
- Scheduling GPU work also involves some CPU work, which can accumulate into significant overhead in applications that need lots of short-running GPU commands. Distributing this CPU work across multiple CPU threads could give applications more headroom before it becomes a performance bottleneck… if OpenGL’s global state machine design did not get in the way.
- Assuming several CPU threads do get involved, funneling all commands through a single command submission interface can easily become a bottleneck. So GPU hardware provides multiple submission interfaces, but the only way to use them with zero thread contention is for GPU APIs to expose them in such a way that each CPU thread can get one.
- Once we accept the idea of multiple submission interfaces, it is not that much of a stretch to introduce specialized submission interfaces for commands that have a good chance to execute in parallel with other ones. This way, GPU hardware and drivers do not need to look far ahead in the command stream to find such commands, then prove that executing them earlier than expected won’t break the application’s sequential command execution logic.
- Speaking of sequential command execution, the promise made by older GPU APIs that GPU commands will execute one after another without observable overlap is fundamentally at odds with how modern pipelined and cache-incoherent GPU hardware works. Maintaining this illusion requires GPU drivers to automatically inject many pipeline and memory barriers, causing reduced hardware utilization and some CPU overhead. It would be great if applications could control this mechanism to selectively allow the kinds of command execution overlap and cache incoherence that do not affect their algorithm’s correctness.
- But there is more to GPU work synchronization than command pipelining control. Sometimes, CPU threads do need to wait for GPU commands to be done executing some work. And now that we have multiple channels for GPU work submission, we also need to think about inter-command dependencies across those channels, or even across multiple GPUs. OpenGL provided few tools to do this besides the glFinish() sledgehammer, which waits for all previously submitted work to complete, thus creating a humongous GPU pipeline bubble while imposing a lot more waiting than necessary on the CPU. For any application of even mild complexity, finer-grained synchronization would be highly desirable.
What Vulkan provides
Building upon decades of OpenGL application and driver experience, Vulkan set out to devise a more modern GPU command submission and synchronization model that resolves all of the above problems, at the expense of a large increase in conceptual complexity:
- Commands are not directly submitted to the GPU driver, but first collected in batches called command buffers (solving problem #1). Unlike in OpenGL, recording a command buffer does not involve modifying any kind of global GPU/driver state, so threads can easily record command buffers in parallel (solving problem #2).
- Multiple hardware command submission channels are exposed in the API via queues (solving problem #3), which are grouped into queue families to express specialization towards particular kinds of work that is more likely to execute in parallel (solving problem #4).
- GPUs may overlap command execution in any way they like by default, without enforcing a consistent view of GPU memory across concurrent commands. Applications can restrict this freedom whenever necessary by inserting pipeline and memory barriers between two commands within a command buffer (solving problem #5).
- While Vulkan still has a device-wide “wait for idle” operation, which is supplemented by a threading-friendly queue-local version, it is strongly discouraged to use such synchronization for any other purpose than debugging. Finer-grained synchronization primitives are used instead for everyday work (solving problem #6):
- Fences let CPU code wait for a specific batch of previously submitted commands. They are specialized for everyday “wait for CPU-accessible buffers to be filled before reading them” scenarios, and should be the most efficient tool for these use cases.
- Events allow GPU commands within a queue to await a signal that can be sent by a previous command within the same queue (as a more flexible but more expensive alternative to the barriers discussed above) or by host code.
- Semaphores provide maximal flexibility at the expense of maximal overhead. They can be signaled by the host or by a GPU command batch, and can be awaited by the host or by another GPU command batch. They are the only Vulkan synchronization primitive that allows work to synchronize across GPU queues without CPU intervention.
The vulkano layer
As you can imagine, all this new API complexity can take a while to master and has been a common source of application correctness and performance bugs. This proved especially true in the area of command pipelining, where Vulkan’s “allow arbitrary overlap and incoherence by default” strategy has proven to be a remarkably powerful source of application developer confusion.
Other modern performance-oriented GPU APIs like Apple Metal and WebGPU have thus refused to follow this particular path. Their design rationale was that even though the latency increase caused by forbidding observable command overlap cannot be fully avoided, most of the associated throughput loss can be recovered by letting enough commands execute in parallel across device queues, which should be good enough for sufficiently complex applications.
But that is forgetting a bit quickly that Vulkan is about choice. When the GPU API provides you with the most performant but least ergonomic way to do something, nothing prevents you from building a higher-level layer on top of it that improves ergonomics at the expense of some performance loss. Whereas if you start from a higher-level API, making it lower-level to improve performance at the expense of ergonomics can be impossible. This is why good layered abstraction design with high-quality low-level layers matters, and in this scheme Vulkan was designed to be the ultimate low-level layer, not necessarily a high-level layer that all applications should use directly.
In the case of Vulkan programming in Rust, we have vulkano for this purpose, and in this area like others it delivers as expected:
- As a default choice, the high-level AutoCommandBufferBuilder layer implements a simple Metal-like command queuing model. It should provide good enough performance for typical numerical computing applications, with much improved ergonomics over raw Vulkan.
- If you ever face a performance problem that originates from the resulting lack of GPU command overlap, or from the overhead of CPU-side state tracking (which automatic barrier insertion entails), that is not the end of the world. All you will need to do is to reach for the lower-level RecordingCommandBuffer unsafe layer, and locally face the full complexity of the Vulkan command pipelining model in the areas of your application that need it for performance. The rest of the application can remain largely untouched.
For the purpose of this course, we will not need lower-level control than what the high-level safe vulkano layer provides, so the remainder of this chapter will exclusively use that layer.
Command buffer
As mentioned above, Vulkan requires any nontrivial and potentially recurring GPU work to be packaged up into a command buffer before it can be submitted to the GPU for execution. In our first number-squaring example, the part of the work that qualifies as command buffer worthy is…
- Copying CPU-generated inputs to the fastest available kind of GPU memory
- Binding the compute pipeline so that future execution (dispatch) commands refer to it
- Binding the fast GPU buffer descriptor set so that the compute pipeline uses it
- Executing (dispatching) the compute pipeline with a suitable number of workgroups
- Copying the output back to CPU-accessible memory
…and using vulkano, we can build a command buffer that does this as follows:
use vulkano::{
command_buffer::{
auto::{AutoCommandBufferBuilder, PrimaryAutoCommandBuffer},
CommandBufferUsage, CopyBufferInfo,
},
pipeline::PipelineBindPoint,
};
/// Build a command buffer that does all the GPU work
pub fn build_command_buffer(
context: &Context,
cpu_input: Subbuffer<[f32]>,
gpu_data: Subbuffer<[f32]>,
gpu_pipeline: Pipeline,
gpu_data_desc: Arc<DescriptorSet>,
pipeline_options: &PipelineOptions,
cpu_output: Subbuffer<[f32]>,
) -> Result<Arc<PrimaryAutoCommandBuffer>> {
// Set up a primary command buffer
let mut builder = AutoCommandBufferBuilder::primary(
context.comm_allocator.clone(),
context.queue.queue_family_index(),
CommandBufferUsage::OneTimeSubmit,
)?;
// Copy CPU inputs to the GPU side
builder.copy_buffer(CopyBufferInfo::buffers(cpu_input, gpu_data.clone()))?;
// Bind the compute pipeline for future dispatches
builder.bind_pipeline_compute(gpu_pipeline.compute)?;
// Bind memory to the compute pipeline
builder.bind_descriptor_sets(
PipelineBindPoint::Compute,
gpu_pipeline.layout,
DATA_SET,
gpu_data_desc,
)?;
// Execute the compute pipeline with an appropriate number of work groups
let num_work_groups = cpu_output
.len()
.div_ceil(pipeline_options.workgroup_size.get() as u64);
// SAFETY: GPU shader has been checked for absence of undefined behavior
// given a correct execution configuration, and this is one
unsafe {
builder.dispatch([num_work_groups as u32, 1, 1])?;
}
// Retrieve outputs back to the CPU side
builder.copy_buffer(CopyBufferInfo::buffers(gpu_data, cpu_output))?;
// Finalize the command buffer object
Ok(builder.build()?)
}
As usual, a few things should be pointed out about this code:
- That’s a lot of function parameters! This comes from the fact that this function asks the GPU to do many different things. Our example code is written that way because it allows us to introduce Vulkan concepts in a more logical order, but real-world Vulkan apps would benefit from spreading the command recording process across more functions that each take an &mut AutoCommandBufferBuilder as a parameter (see the sketch after this list).
- Generally speaking, functions should favor &mut AutoCommandBufferBuilder over building command buffers internally until you are ready to submit work to the GPU. This allows you to pack your GPU work into as few command buffers as possible, which may improve command execution efficiency.1
- We are building a primary command buffer, which can be directly submitted to the GPU. This is in contrast with secondary command buffers, which can be inserted into primary command buffers. The latter can be used to avoid repeatedly recording recurring commands,2 and it also combines really well with a Vulkan graphics rendering feature called render passes that falls outside of the scope of this compute-focused introductory course.
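To make the above recommendation concrete, here is a minimal sketch of what such a helper function could look like. It is not part of the exercise code: the record_upload name is hypothetical, and the exact generic parameters of AutoCommandBufferBuilder may differ across vulkano versions, but the types and methods are the same as in build_command_buffer above:
use vulkano::command_buffer::{
    auto::{AutoCommandBufferBuilder, PrimaryAutoCommandBuffer},
    CopyBufferInfo,
};
/// Hypothetical helper that only records the CPU-to-GPU upload step into an
/// existing command buffer builder, leaving finalization and submission to
/// the caller so that more commands can be packed into the same buffer
pub fn record_upload(
    builder: &mut AutoCommandBufferBuilder<PrimaryAutoCommandBuffer>,
    cpu_input: Subbuffer<[f32]>,
    gpu_data: Subbuffer<[f32]>,
) -> Result<()> {
    builder.copy_buffer(CopyBufferInfo::buffers(cpu_input, gpu_data))?;
    Ok(())
}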
Execution
The high-level vulkano API generally tries to stay close to the underlying Vulkan C API, using identical concepts and naming. Deviations from raw Vulkan must be motivated by the desire to be memory-, type- and thread-safe by default, in line with Rust’s design goals. This is good, as it makes it easier to take documentation about the Vulkan C API and apply it to vulkano-based programs.
However, there is one area where vulkano’s high-level API strays rather far from the concepts of its Vulkan backend, and that is command execution and synchronization. This makes sense, because that part of the vulkano API needs to guard against safety hazards related to GPU hardware asynchronously reading from and writing to CPU-managed objects, which is quite difficult.3
Instead of closely matching the Vulkan functions used for command buffer submission (like vkQueueSubmit()) and synchronization (like vkWaitForFences()), the high-level vulkano API for command submission and synchronization therefore currently4 works as follows:
- Prepare to send a PrimaryAutoCommandBuffer to a Queue using its execute() method. This produces a CommandBufferExecFuture, which is a special kind of GpuFuture. GPU future objects model events that will eventually occur, in this case the moment where commands within this particular command buffer will be done executing.
- It is very important to understand that at this point, the command buffer has not been sent to the GPU. Indeed, vulkano must expose the fact that Vulkan lets us send multiple command buffers to the GPU with a single vkQueueSubmit() operation in order to keep performance competitive against direct use of the Vulkan C API.
- Other GPU future objects are available, representing things like the signal that our system is ready to render a new frame in real-time graphics.
- Chain as many of these futures as desired using methods like then_execute() and join(), which respectively represent sequential and concurrent execution. This will produce a GpuFuture object that represents an arbitrarily complex graph of GPU tasks that may or may not be linked by sequential execution dependencies (see the sketch after this list).
- Indicate points where Vulkan synchronization objects (fences and semaphores) should be signaled using GPU future methods like then_signal_fence() and then_signal_semaphore().
- Submit all previously scheduled work to the GPU whenever desired using the flush() method. This does not destroy the associated GPU future object, so that you can keep scheduling more work after the work that has just started executing.
- One will often want to do this at points of the asynchronous task graph that are observable through signaling of a Vulkan synchronization object, which is why vulkano provides then_signal_fence_and_flush() and then_signal_semaphore_and_flush() API shortcuts for this use case.
- When the CPU momentarily runs out of work to submit to the GPU, await completion of some previous work using the wait() method of the particular GPU future type that is returned by the then_signal_fence() operation. Because making a CPU wait for GPU work for an unbounded amount of time is bad, polling and timeout options are also available here.
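As an illustration of this chaining vocabulary, here is a minimal sketch of how two pre-recorded command buffers could be scheduled one after the other before a single submission. The run_two_and_wait function and its commands1/commands2 parameters are hypothetical additions for illustration purposes; the methods are the ones listed above:
use vulkano::{
    command_buffer::{PrimaryAutoCommandBuffer, PrimaryCommandBufferAbstract},
    sync::future::GpuFuture,
};
/// Hypothetical example: run two command buffers in sequence, then wait
pub fn run_two_and_wait(
    context: &Context,
    commands1: Arc<PrimaryAutoCommandBuffer>,
    commands2: Arc<PrimaryAutoCommandBuffer>,
) -> Result<()> {
    commands1
        .execute(context.queue.clone())?                  // schedule the first batch
        .then_execute(context.queue.clone(), commands2)?  // then the second one after it
        .then_signal_fence_and_flush()?                   // submit everything, with a fence at the end
        .wait(None)?;                                     // block the CPU until the fence signals
    Ok(())
}
Independent work would instead be combined with join(), yielding a future that completes once both branches are done.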
In the context of our number-squaring example, we can use this API as follows:
use vulkano::{
command_buffer::PrimaryCommandBufferAbstract,
sync::future::GpuFuture,
};
/// Synchronously execute the previously prepared command buffer
pub fn run_and_wait(context: &Context, commands: Arc<PrimaryAutoCommandBuffer>) -> Result<()> {
commands
.execute(context.queue.clone())?
.then_signal_fence_and_flush()?
.wait(None)?;
Ok(())
}
Notice that we need to bring a few traits into scope in order to use the execute() method (whose implementation is shared across all kinds of primary command buffers using the PrimaryCommandBufferAbstract trait) and the GpuFuture trait’s methods (whose implementation is also shared across all kinds of GPU futures).
Exercises
Final executable
Now that we have all parts of the computation ready, we can put them all together into a complete program by adding all of the above functions to exercises/src/square.rs, then rewriting the exercises/src/bin/square.rs binary’s source into the following:
use clap::Parser;
use grayscott_exercises::{
context::{Context, ContextOptions},
square::{self, InputOptions, Pipeline, PipelineOptions},
Result,
};
/// This program generates and squares an array of numbers
#[derive(Parser, Debug)]
#[command(version, author)]
struct Options {
/// Vulkan context configuration
#[command(flatten)]
context: ContextOptions,
/// Compute pipeline configuration
#[command(flatten)]
pipeline: PipelineOptions,
/// Input data configuration
#[command(flatten)]
input: InputOptions,
}
fn main() -> Result<()> {
// Decode CLI arguments
let options = Options::parse();
// Set up a generic Vulkan context
let context = Context::new(&options.context)?;
// Set up a compute pipeline
let pipeline = Pipeline::new(&context, &options.pipeline)?;
// Set up memory resources
let cpu_input = square::setup_cpu_input(&context, &options.input)?;
let gpu_data = square::setup_gpu_data(&context, &options.input)?;
let gpu_data_desc = square::setup_descriptor_set(
&context,
&pipeline,
gpu_data.clone(),
)?;
let cpu_output = square::setup_cpu_output(&context, &options.input)?;
// Build a command buffer
let commands = square::build_command_buffer(
&context,
cpu_input.clone(),
gpu_data,
pipeline,
gpu_data_desc,
&options.pipeline,
cpu_output.clone(),
)?;
// Synchronously execute the command buffer
square::run_and_wait(&context, commands)?;
// Check computation results
let cpu_input = cpu_input.read()?;
let cpu_output = cpu_output.read()?;
assert!((cpu_input.iter())
.zip(cpu_output.iter())
.all(|(input, output)| *output == input.powi(2)));
println!("All numbers have been squared correctly!");
Ok(())
}
Notice the use of the Subbuffer::read() method at the end, which is needed in order to allow vulkano to check for absence of data races between the CPU and the GPU.
Execute this binary in debug mode, then in release mode, while measuring execution times:
cargo build --bin square >/dev/null \
&& time cargo run --bin square \
&& cargo build --release --bin square >/dev/null \
&& time cargo run --release --bin square
Do you understand the outcome? If not, think about what you are measuring for a bit to see if you can figure it out for yourself, before moving to the next part.
Benchmarks
While timing complete programs like square can be nice for early performance exploration, as soon as you start getting into the realm of performance optimization it is good to figure out which part of the program is most critical to performance and measure it separately.
A benchmark has been set up to this end in exercises/benches/square.rs. Once you are done with the above exercise, it should compile and run. Try to run it with cargo bench --bench square, and use it to determine…
- Which parts of the process are slowest on the various devices exposed by your system.
- How their performance scales with the various tunable parameters.
Note that the cargo bench command also accepts a regular expression argument that can be used to only run selected benchmarks. It can be used like this:
cargo bench --bench square -- "(input1000|workgroup64)/"
These benchmarks are pretty long-running so…
- If you are running them on a laptop, plug in the power adapter
- If you are running them on an HPC center, run them on a worker node
- Do not hesitate to stop them ahead of time once you have seen what you want to see, then re-run them in a more restricted configuration.
One thing you can also do is to modify this program so that build_command_buffer schedules multiple compute shader dispatches instead of one (i.e. instead of only squaring numbers, you elevate them to the power of 2, 4, 8, 16…). Note that this does not require you to repeatedly re-bind the compute pipeline and data descriptor set.
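As a rough sketch of that modification, assuming a hypothetical num_dispatches parameter were added to build_command_buffer, the dispatch step of that function could become a loop while the pipeline and descriptor set bindings above it stay untouched:
// Execute the compute pipeline several times in a row: each dispatch squares
// the numbers again, so num_dispatches passes raise them to the power of
// 2^num_dispatches (num_dispatches is a hypothetical new parameter)
for _ in 0..num_dispatches {
    // SAFETY: same execution configuration as the single-dispatch version
    unsafe {
        builder.dispatch([num_work_groups as u32, 1, 1])?;
    }
}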
Modify the program to measure performance at various numbers of compute shader dispatches. What do you observe, and what does this tell you about the most likely GPU-side performance bottleneck(s) of this number-squaring program?
Optimizations
This last exercise may take you a significant amount of time and should only be worked on if you finished well ahead of the expected end of the course.
Get back your list of CPU-GPU data handling strategies from the exercise of the last chapter, implement one of them (preferably one of those that are easy to implement), and use the benchmark to measure how it affects execution performance.
- Ignoring for a moment the question of how expensive it is to create and build a command buffer (cheap but not free, as usual), some Vulkan implementations are known to insert pipeline barriers between consecutive command buffers, most likely in an attempt to simplify the implementation of synchronization primitives that operate at command buffer boundaries, like fences. ↩
- …but reusing command buffers like this comes at the cost of losing some GPU driver optimizations. It is therefore recommended that applications first attempt to resolve command recording bottlenecks by spreading the command recording load across multiple CPU threads, before falling back to secondary command buffers if that is not enough. ↩
- Rust’s compile-time safety proofs build on the fact that the compiler has full knowledge of the program’s execution and data flow, which is not true in the presence of asynchronous GPU work execution. This problem is also encountered when designing CPU threading APIs, where safety is commonly restored by introducing synchronization points that wait for all tasks to finish executing and release their data references (a design known as fork-join or structured concurrency). But as previously discussed, such synchronization points are unwelcome in GPU programming, which is why vulkano has instead gone for a run-time safety tracking mechanism that was largely custom-built for its use case. ↩
- The GpuFuture abstraction has a number of known flaws, and there is work ongoing in the vulkano-taskgraph crate with the aim of eventually replacing it. But that work is not ready for prime time yet. ↩