Pipeline

Now that we have set up some generic Vulkan infrastructure, we are ready to start working on our specific problem, namely squaring an array of floating-point numbers. The first step in this journey will be to set up a ComputePipeline, which is a GPU program that performs the squaring operation on some partially unspecified memory location. But as you will see in this chapter, this process involves a surprisingly large number of steps.

Choosing a language

In GPU APIs, GPU-side code is traditionally written using a domain-specific programming language. Each major GPU API would provide its own language, so for a long time the top players in the portable GPU API space were GLSL for OpenGL, OpenCL C for OpenCL and HLSL for Direct3D.

More recently these old-timers have been joined by MSL for Metal and WGSL for WebGPU. But most importantly Khronos APIs have moved away from ingesting GPU programs written in a specific programmer-facing language, and are instead defined in terms of an assembly-like intermediate compiler representation called SPIR-V. This has several benefits:

  • GPU drivers become simpler, as they no longer need to bundle a full compiler for a C-like language. This gives application developers faster GPU code compilation and fewer driver bugs.
  • GPU programs go through a first round of compilation during the application-building process, which provides opportunities for faster compile-time error reporting (before the application starts) and more reliable driver-agnostic program optimizations.
  • Interoperability between GPU APIs becomes easier because translating each GPU DSL to SPIR-V is easier than translating from one DSL to another.
  • Introducing new GPU programming languages like Slang, or adapting CPU-oriented programming languages like Rust for GPU programming, becomes easier.

The last point raises the question: should this course keep using the traditional GLSL programming language from Khronos, embrace a more modern GPU programming language like Slang, or leverage the rust-gpu project to get rid of the cross-language interface and be able to write all code in Rust? For this edition, we chose to keep using GLSL for a few reasons:

  • Vulkan is specified in terms of SPIR-V, but given SPIR-V’s assembly-like nature, writing this course’s code examples directly in SPIR-V would not be a pedagogically sensible option.
  • Because the Khronos Group maintains all of the Vulkan, SPIR-V and GLSL specifications, they are quick to extend GLSL with any feature that gets added to SPIR-V. This means that any new GPU programming feature that gets added to Vulkan and SPIR-V will be usable from GLSL first, before it gets added to any other GPU programming language.
  • Large amounts of existing Vulkan code and training material are written in terms of GLSL programs. So if you need help with your Vulkan code, you are more likely to find it with GLSL than with any other language that compiles to SPIR-V.
  • GLSL is a rather easy language to learn. Being purpose-built for GPU programming, it also naturally integrates several GPU hardware features and limitations that would feel quite out of place in a general-purpose CPU programming language like Rust or C/C++.1
  • As rust-gpu specifically is neither very mature nor well-documented, integrating it into a vulkano-based application involves jumping through a few hoops. In contrast, GLSL enjoys good first-party documentation and direct integration into vulkano (via vulkano-shaders) that make it particularly easy to use in vulkano-based applications.

Number-squaring shader

Like OpenGL before it, Vulkan supports two different styles of GPU programs or pipelines:

  • Graphics pipelines are designed for traditional 3D graphics applications. These typically2 render textured triangles to a bitmap target (like a screen) through a complex multi-stage pipeline, where some stages are customizable via user-defined code, and others are implemented using specialized hardware. Each of the user-defined hooks is commonly called a shader (vertex shader, tessellation shader, fragment shader…), likely because the eventual output is a shade of color. These pipelines have been a staple of GPU APIs since the early 2000s.
  • Compute pipelines were introduced much later, in the early 2010s, following the availability of increasingly general-purpose GPU hardware on which triangle-rendering became a special case rather than a core hardware feature. They greatly simplify the aforementioned multi-stage pipeline into a single compute shader stage, which is more appropriate for computations that do not naturally bend into a triangle-rendering shape.

Because it is focused on general-purpose numerical computations, this course will exclusively discuss compute pipelines and shaders. Our first number-squaring program will therefore be implemented as a GLSL compute shader.

Unlike most other programming languages, GLSL makes language version requirements something that is directly specified by the program, rather than indirectly requested through e.g. compiler flags. Our GLSL program thus starts by stating which GLSL specification revision it is written against:

#version 460

We then specify how our GPU code will interface with CPU-side code. This is a danger zone: any change to this part will often need to be accompanied by matching changes to the CPU-side code.

First of all, we specify how our GPU program will exchange data with the outside world. CPU-GPU interfaces are specified using GLSL interface blocks, and the particular kind of interface block that we are using here is called a shader storage block.

// Shader storage block used to feed in the input/output data
layout(set = 0, binding = 0) buffer DataBuffer {
  float data[];
} Data;

Let’s break down this unfamiliar GLSL syntax:

  • With the buffer keyword, we tell the GLSL compiler that before we run this program, we are going to attach a buffer to it. Buffers represent blocks of GPU-accessible memory with a user-defined data layout, and are one of the two basic kinds of Vulkan memory resources (the other being images, which model texturing units).
  • To bind a buffer to this shader storage block, we will need to refer to it using some identifier on the host side. In GLSL, integer identifiers are specified inside of a layout() clause for this purpose. Vulkan uses hierarchical identifiers composed of two numbers, a set number and a relative binding identifier within that set. This allows resources to be bound at the granularity of entire sets, thus amortizing the overhead of binding operations, which were a common performance bottleneck in GPU APIs before the Vulkan days.
  • Interface blocks must have a block name (here DataBuffer), which in the case of compute pipelines3 is only used by host-side tooling such as error messages and debugging tools. They may also have an instance name (here Data), which is used to scope the inner data members; without it, the members of the storage block would be exposed at global scope.
  • Finally, in a pair of curly braces between the block name and the instance name, a set of data members is defined, with a syntax similar to that of a C struct declaration.4 As in C, the last member can be a dynamically sized array, which is how we express that our buffer contains just a simple array of single-precision floating-point numbers.

After specifying our shader’s input/output data configuration, we then specify its execution configuration by setting a default workgroup size and a specialization constant that can be used to change the workgroup size from the CPU side.

// 1D shader workgroups default to 64 work items, this can be reconfigured
layout(local_size_x = 64) in;
layout(local_size_x_id = 0) in;

Again, this warrants some explanations:

  • As we will later see, a compute shader executes as a one- to three-dimensional grid of workgroups, each of which contains an identically sized chunk of work items, which are sequential tasks that are relatively independent from each other.5
  • The size of workgroups represents the granularity at which work items are distributed across the GPU’s compute units and awaited for completion. In more advanced GPU programs, it also controls the granularity at which work items may easily synchronize with each other. Because this parameter affects many aspects of compute shader execution, shaders will often execute most efficiently at a certain hardware- and workload-dependent workgroup size that is hard to predict ahead of time and best tuned through empirical benchmarking.
  • Because of this, and because the correctness of a compute shader depends on how many work items are spawned while the Vulkan API for executing compute pipelines specifies how many workgroups are spawned, it is best if the workgroup size is controlled from the CPU side and easily tunable for any particular hardware, rather than hardcoded in GLSL.
  • Therefore, although GLSL only mandates that a default workgroup size be specified in the shader via the layout(local_size_x = ...) in; syntax, we additionally use the layout(local_size_x_id = ...) in; syntax to define a specialization constant6 associated with the workgroup size. We will later use it to change the workgroup size from CPU code.
  • We only need to specify the first workgroup dimension (x) in this one-dimensional computation; GLSL will automatically infer the remaining dimensions to be equal to 1.
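To make the workgroup/work-item relationship more concrete, here is a minimal CPU-side sketch of how a work item's global 1D index emerges from this configuration. The function name is our own invention for illustration; the formula itself mirrors standard GLSL semantics for gl_GlobalInvocationID.

```rust
/// CPU-side emulation of how `gl_GlobalInvocationID.x` is derived on the GPU
/// (hypothetical helper, not part of the course code): the global index is
/// the workgroup ID times the workgroup size, plus the work item's position
/// within its workgroup.
fn global_invocation_id_x(workgroup_id: u32, local_size_x: u32, local_id_x: u32) -> u32 {
    workgroup_id * local_size_x + local_id_x
}

fn main() {
    // With the default local_size_x = 64, work item 3 of workgroup 2
    // processes array element 2 * 64 + 3 = 131
    assert_eq!(global_invocation_id_x(2, 64, 3), 131);
}
```

This also shows why the workgroup size is part of the shader's correctness contract: changing it changes which array element each work item touches.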

Finally, after specifying how the shader interfaces with the CPU, we can write the entry point that performs the expected number squaring:

void main() {
  uint index = gl_GlobalInvocationID.x;
  if (index < Data.data.length()) {
    Data.data[index] *= Data.data[index];
  }
}

Here again, although this code snippet is short, it has several aspects worth highlighting:

  • Following an old tradition, GLSL mandates that our entry point be called main(), take no parameters, and return no output value.
  • Inside of the entry point, we specify the work that each of the work items is meant to do, treating that work item as a sequential task running in parallel with all other work items. In our case, each work item squares a single floating-point number7 from the data array that we have previously declared as part of the Data shader storage block.
  • To know which work item we are dealing with, we use the gl_GlobalInvocationID built-in variable from GLSL. This provides the 3D coordinate of the active work item within the overall 3D grid of all work items. Here our problem is inherently one-dimensional, so we treat that 3D grid as a 1D grid by setting its second and third dimensions to 1. Thus we only care about the first dimension (x coordinate) of the gl_GlobalInvocationID integer vector.
  • In an ideal world, we would like to execute this compute shader in a configuration that has exactly one work item per floating-point number in the Data.data array, and that would be the end of it. But in the real world, we cannot do so, because we can only spawn a number of work items that is a multiple of our workgroup size, and said workgroup size must be large enough (typically a multiple of 64)8 for us to achieve good execution efficiency. Thus we will need to spawn a few extra work items and cut them out from the computation using the kind of bounds checking featured in the code above.
  • Although designed to look like C, GLSL is rather different from C from a semantics point of view. In particular it has no pointers, few implicit conversions, and even dynamically sized arrays have a known size that can be queried using a .length() built-in method.
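The arithmetic behind the last two points can be sketched in a few lines of Rust (the helper name is our own, not part of the course code): since we can only spawn whole workgroups, the workgroup count is a ceiling division, and any excess work items in the last workgroup are masked out by the shader's bounds check.

```rust
/// Number of workgroups needed to cover `data_len` work items when each
/// workgroup holds `workgroup_size` of them (hypothetical helper name).
/// Excess work items in the last workgroup fail the shader's
/// `index < Data.data.length()` check and do nothing.
fn workgroup_count(data_len: u32, workgroup_size: u32) -> u32 {
    data_len.div_ceil(workgroup_size)
}

fn main() {
    // 1000 floats with 64-item workgroups: 16 workgroups = 1024 work items,
    // of which the last 24 are filtered out by the bounds check
    assert_eq!(workgroup_count(1000, 64), 16);
    // An exact multiple of the workgroup size needs no excess work items
    assert_eq!(workgroup_count(1024, 64), 16);
}
```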

Putting it all together, our float-squaring compute shader looks like this:

#version 460

// Shader storage block used to feed in the input/output data
layout(set = 0, binding = 0) buffer DataBuffer {
  float data[];
} Data;

// 1D shader workgroups default to 64 work items, this can be reconfigured
layout(local_size_x = 64) in;
layout(local_size_x_id = 0) in;

void main() {
  uint index = gl_GlobalInvocationID.x;
  if (index < Data.data.length()) {
    Data.data[index] *= Data.data[index];
  }
}
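Before moving on, it can help to restate what this shader computes in plain sequential terms. The following Rust sketch (our own helper, not part of the course's provided code) is the CPU equivalent, and can later double as a reference when checking results computed on the GPU:

```rust
/// Sequential CPU equivalent of the number-squaring compute shader
/// (hypothetical helper, not part of the provided course code):
/// square each element of the slice in place
fn square_in_place(data: &mut [f32]) {
    for x in data.iter_mut() {
        *x *= *x;
    }
}

fn main() {
    let mut data = [1.5f32, -2.0, 3.0];
    square_in_place(&mut data);
    // These squares are exactly representable in f32
    assert_eq!(data, [2.25, 4.0, 9.0]);
}
```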

We will now proceed to save it into a source file. The .comp file extension is advised for easy compatibility with text editor plugins like GLSL linters, so we propose saving it at location exercises/src/square.comp. Once this is done, we will be able to proceed with the next step, which is to load this GPU code into our Rust program and start building a compute pipeline out of it.

SPIR-V interface

As mentioned earlier, Vulkan cannot directly use GLSL shaders. They must first be compiled into an assembly-like intermediate representation called SPIR-V. This entails the use of tools like shaderc during application compilation, which slightly complicates the application’s build process.

Because we use the higher-level vulkano Vulkan bindings, however, we have access to its optional vulkano-shaders component, which makes usage of GLSL shaders a fair bit easier. To be more specific, vulkano-shaders currently provides the following basic functionality:

  • Download and build an internal copy of the shaderc GLSL to SPIR-V compiler if it is not installed on the host system.
  • Use shaderc at compilation time to translate the application’s GLSL shaders to SPIR-V, then bundle the resulting SPIR-V into the application executable, automatically updating it whenever the original GLSL source code changes.
  • Generate a load() function to turn the raw SPIR-V binary into a higher-level ShaderModule Vulkan object. If the shader uses optional GLSL/SPIR-V features, this function will also check that the target device supports them along the way.
  • Translate struct definitions from the GLSL code into Rust structs with identical member names and memory layout. In the case of GLSL linear algebra types like mat3, the translation can be customized to use types from various popular Rust linear algebra libraries.

To use this package, we must first add vulkano-shaders as a dependency. This has already been done for you in the provided source code:

# You do not need to type in this command, it has been done for you
cargo add --features "shaderc-debug" vulkano-shaders

Notice that we enable the optional shaderc-debug feature, which ensures that our shaders are compiled with debug information. This is useful when running any kind of GPU debugging or profiling tool on our programs, so you would normally only want to disable this feature in production builds.

After this is done, you can create a new Rust code module dedicated to our new compute pipeline. To this end, you can first declare a new module in the toplevel exercises/src/lib.rs source file…

pub mod square;

…then create the associated exercises/src/square.rs source file with the following content:

//! Number-squaring compute pipeline

/// Compute shader used for number squaring
mod shader {
    vulkano_shaders::shader! {
        ty: "compute",
        path: "src/square.comp"
    }
}

Unfortunately, vulkano-shaders lacks a feature for automatically generating Rust-side constants matching the integer identifiers in the shader’s interface. As a fallback, it is a good idea to define Rust constants that mirror them. This is not great, as we will have to update those constants anytime we change the matching GLSL code, but it is better than guessing the meaning of hardcoded integer identifiers scattered throughout our Rust codebase whenever the GLSL code changes.

/// Descriptor set that is used to bind a data buffer to the shader
pub const DATA_SET: u32 = 0;

/// Binding within `DATA_SET` that is used for the data buffer
pub const DATA_BINDING: u32 = 0;

/// Specialization constant that is used to adjust the workgroup size
const WORKGROUP_SIZE: u32 = 0;

Specializing the code

As mentioned earlier, it is generally wiser to specify the compute shader’s workgroup size from CPU code. This can be done using the SPIR-V specialization constant mechanism, and we have set up a suitable specialization constant on the GLSL side to allow this.

Given this previous preparation, we can now add a suitable CLI parameter to our program and use it to specialize our SPIR-V shader to the workgroup size that we would like. Here is a way to do so, taking some extra care to detect GLSL/Rust code desynchronization along the way:

use crate::Result;
use clap::Args;
use std::{num::NonZeroU32, sync::Arc};
use vulkano::{
    device::Device,
    shader::{SpecializationConstant, SpecializedShaderModule},
};

/// CLI parameters that guide pipeline creation
#[derive(Debug, Args)]
pub struct PipelineOptions {
    /// 1D workgroup size
    ///
    /// Vulkan guarantees support of any workgroup size from 1 to 1024, but
    /// a multiple of 64 is best for real-world hardware.
    #[arg(short, long, default_value = "64")]
    pub workgroup_size: NonZeroU32,
}

/// Set up a specialized shader module with a certain workgroup size
fn setup_shader_module(
    device: Arc<Device>,
    options: &PipelineOptions,
) -> Result<Arc<SpecializedShaderModule>> {
    // Build a shader module from our SPIR-V code, checking device support
    let module = shader::load(device)?;

    // Check default specialization constant values match expectations
    //
    // This allows us to detect some situations in which the GLSL interface has
    // changed without a matching CPU code update, which can otherwise result in
    // weird application bugs.
    let mut constants = module.specialization_constants().clone();
    assert_eq!(
        constants.len(),
        1,
        "Only expected one specialization constant"
    );
    let workgroup_size = constants
        .get_mut(&WORKGROUP_SIZE)
        .expect("There should be a workgroup size specialization constant");
    assert!(
        matches!(workgroup_size, SpecializationConstant::U32(_)),
        "Workgroup size specialization constant should be a GLSL uint = u32 in Rust",
    );

    // Specify the shader workgroup size
    *workgroup_size = SpecializationConstant::U32(options.workgroup_size.get());

    // Specialize the shader module accordingly
    Ok(module.specialize(constants)?)
}

Entry point and pipeline stage

The word shader unfortunately has a slightly overloaded meaning in the GPU programming community. GPU programmers like this course’s author commonly use it to refer to the implementation of a particular stage of a graphics or compute pipeline, which corresponds to a single GLSL/SPIR-V entry point. But the Vulkan specification actually uses “shader” to mean a GPU compilation unit that is allowed to contain multiple entry points.

This generalized definition is particularly useful when implementing graphics pipelines, which have multiple stages. It means you can implement all pipeline stages inside of a single GLSL source file and have them easily share common type, constant and interface definitions, without performing redundant compilation of shared GLSL interface blocks.

On the compute side of things, however, compute pipelines only have a single stage, so compute shader modules with multiple entry points do not exist in GLSL. Yet SPIR-V still allows graphics shaders and non-GLSL compute shaders to have multiple entry points, and thus we will need one more step to locate the single entry point from our GLSL shader:

let entry_point = module
    .single_entry_point()
    .expect("GLSL compute shader should have a single entry point");

This entry point can then be turned into a pipeline stage by specifying, as you may have guessed, optional pipeline stage configuration parameters.

At the time of writing, the only thing we can configure at this stage is the subgroup size, which is effectively the SIMD granularity with which the device processes work items during compute shader execution. This is only configurable on a few devices (Intel GPUs being the main example that comes to mind), and even when it is manually configurable the default automatic configuration usually does a good job, so we will stick with this default here.

use vulkano::pipeline::PipelineShaderStageCreateInfo;

/// Set up a compute stage from a previously specialized shader module
fn setup_compute_stage(module: Arc<SpecializedShaderModule>) -> PipelineShaderStageCreateInfo {
    let entry_point = module
        .single_entry_point()
        .expect("GLSL compute shader should have a single entry point");
    PipelineShaderStageCreateInfo::new(entry_point)
}

Pipeline layout

We are reaching the end of the compute pipeline building process, and there is only one major step left to take care of, namely configuring the pipeline layout.

To understand what this step is about, we need to know that a Vulkan compute pipeline combines two things that used to be separate in earlier GPU APIs, namely a set of GPU instructions (entry point) and some long-lived metadata that tells the GPU compiler ahead of time how memory resources are going to be bound to this GPU program (pipeline layout).

The latter notion is newly exposed in Vulkan, as in earlier GPU APIs this information used to be inferred by the GPU driver from the actual resource-binding pattern used by the application. This meant the GPU driver could end up having to recompile GPU code while the application was running if the resource-binding was not what the driver expected at compile time. In a graphical rendering context this would result in short application freezes, known as stutter, as rendering would momentarily stop while the GPU driver was waiting for shader recompilation to finish.

Such stutter is generally speaking unwelcome in real-time rendering but it was particularly problematic for some applications that Vulkan was designed to handle, such as Virtual Reality (VR) where it can induce motion sickness. Pipeline layouts were thus introduced as a way for the application to specify the required metadata ahead of time, so that the GPU driver can compile the compute shader correctly at initialization time without any need for later recompilation.

Sadly, there is a price to pay for this more precise control over when GPU programs get compiled: we now need to repeat, at compute pipeline compilation time, information that is available elsewhere in GLSL or Rust code, and this duplicate could fall out of sync with the rest of the program. This is why vulkano provides a quick way to configure a pipeline layout with sensible default settings inferred from the SPIR-V code’s interface blocks:

use vulkano::pipeline::layout::PipelineDescriptorSetLayoutCreateInfo;

let layout_info =
    PipelineDescriptorSetLayoutCreateInfo::from_stages([&stage_info]);

In this introductory course, we will not need to deviate from this automatically generated configuration, because most of the non-default pipeline layout settings are targeted at Vulkan programs that have a resource-binding performance bottleneck, which will not be our case.

However, what we can do is to introspect the resulting auto-configuration to quickly make sure that the GLSL interface is actually what our CPU-side Rust code expects:

use vulkano::descriptor_set::layout::DescriptorType;

assert_eq!(
    layout_info.set_layouts.len(),
    1,
    "This program should only use a single descriptor set"
);
let set_info = &layout_info.set_layouts[DATA_SET as usize];
assert_eq!(
    set_info.bindings.len(),
    1,
    "The only descriptor set should only contain a single binding"
);
let binding_info = set_info
    .bindings
    .get(&DATA_BINDING)
    .expect("The only binding should be at the expected index");
assert_eq!(
    binding_info.descriptor_type,
    DescriptorType::StorageBuffer,
    "The only binding should be a storage buffer binding"
);
assert_eq!(
    binding_info.descriptor_count, 1,
    "The only binding should only contain a single descriptor"
);
assert!(
    layout_info.push_constant_ranges.is_empty(),
    "This program shouldn't be using push constants"
);

As before, this is not necessary for our program to work but it increases its error-reporting capabilities in the face of desynchronization between the CPU-GPU interfaces declared on the CPU and GPU sides. Given how mind-boggling and hard-to-debug the symptoms of such desynchronization can otherwise be, a bit of defensive programming doesn’t hurt here.

Finally, once our paranoia is satisfied, we can proceed to build the compute pipeline layout:

use vulkano::pipeline::layout::PipelineLayout;

let layout_info = layout_info.into_pipeline_layout_create_info(device.clone())?;
let layout = PipelineLayout::new(device, layout_info)?;

Putting it all together, we get the following pipeline layout setup process:

use vulkano::{
    descriptor_set::layout::DescriptorType,
    pipeline::layout::{PipelineDescriptorSetLayoutCreateInfo, PipelineLayout},
};

/// Set up the compute pipeline layout
fn setup_pipeline_layout(
    device: Arc<Device>,
    stage_info: &PipelineShaderStageCreateInfo,
) -> Result<Arc<PipelineLayout>> {
    // Auto-generate a sensible pipeline layout config
    let layout_info = PipelineDescriptorSetLayoutCreateInfo::from_stages([stage_info]);

    // Check that the pipeline layout meets our expectation
    //
    // Otherwise, the GLSL interface was likely changed without updating the
    // corresponding CPU code, and we just avoided rather unpleasant debugging.
    assert_eq!(
        layout_info.set_layouts.len(),
        1,
        "This program should only use a single descriptor set"
    );
    let set_info = &layout_info.set_layouts[DATA_SET as usize];
    assert_eq!(
        set_info.bindings.len(),
        1,
        "The only descriptor set should only contain a single binding"
    );
    let binding_info = set_info
        .bindings
        .get(&DATA_BINDING)
        .expect("The only binding should be at the expected index");
    assert_eq!(
        binding_info.descriptor_type,
        DescriptorType::StorageBuffer,
        "The only binding should be a storage buffer binding"
    );
    assert_eq!(
        binding_info.descriptor_count, 1,
        "The only binding should only contain a single descriptor"
    );
    assert!(
        layout_info.push_constant_ranges.is_empty(),
        "This program shouldn't be using push constants"
    );

    // Finish building the pipeline layout
    let layout_info = layout_info.into_pipeline_layout_create_info(device.clone())?;
    let layout = PipelineLayout::new(device, layout_info)?;
    Ok(layout)
}

Compute pipeline

We have finally reached the end of this chapter, and all the pieces of our compute pipeline are now ready. A drop of glue code is all it will take to make them work together:

use crate::context::Context;
use vulkano::pipeline::compute::{ComputePipeline, ComputePipelineCreateInfo};

/// Number-squaring compute pipeline with associated layout information
#[derive(Clone)]
pub struct Pipeline {
    compute: Arc<ComputePipeline>,
    layout: Arc<PipelineLayout>,
}
//
impl Pipeline {
    /// Set up a number-squaring pipeline
    pub fn new(context: &Context, options: &PipelineOptions) -> Result<Self> {
        let shader_module = setup_shader_module(context.device.clone(), options)?;
        let stage_info = setup_compute_stage(shader_module);
        let layout = setup_pipeline_layout(context.device.clone(), &stage_info)?;
        let pipeline_info = ComputePipelineCreateInfo::stage_layout(stage_info, layout.clone());
        let compute = ComputePipeline::new(
            context.device.clone(),
            Some(context.pipeline_cache()),
            pipeline_info,
        )?;
        Ok(Self { compute, layout })
    }
}

Notice that a struct-based setup is used so that the pipeline layout information is kept around after pipeline creation. We will need it again later, in order to bind resources to this pipeline.

As you may guess from the sight of the ComputePipelineCreateInfo struct, a few bits of Vulkan configurability that we do not need have been swept under the metaphorical rug here. The new settings available in this struct allow us to…

  • Build our compute pipelines without optimization. This is useful when aiming for faster application startup at the expense of runtime performance (e.g. debug builds) or when investigating a bug that might be affected by GPU compiler optimizations.
  • Mark a compute pipeline as a derivative of another, which might9 enable faster builds when building sets of compute pipelines that are closely related to each other, for example ones that only differ by specialization constant values.

Wrap-up

If you followed through all of this, congratulations! You now know about all the basic steps involved in the process of building a Vulkan compute pipeline. To summarize, you must…

  • Pick a shading language:
    • GLSL for optimal community and tooling support.
    • Other languages that compile to SPIR-V (Slang, rust-gpu…).
  • Write a shader, paying close attention to the CPU-GPU interface:
    • Memory resources: buffers, images, …
    • Workgroup size control & other specialization constants.
    • Handling of out-of-bounds work items.
  • Get a SPIR-V shader module into your code:
    • Compile the shader into SPIR-V.
    • Load it into the program at compile- or run-time.
    • Build a device-specific shader module.
    • vulkano-shaders can do all these for you + check device requirements.
    • We advise naming your interface IDs for clarity and maintainability.
  • Turn that SPIR-V shader into a pipeline stage:
    • Apply specialization constants (can check CPU/GPU interface consistency).
    • Select the shader entry point that we are going to use.
    • Configure the pipeline stage.
  • Configure the pipeline’s layout:
    • Consider PipelineDescriptorSetLayoutCreateInfo for simple programs.
    • You can apply MANY binding performance optimizations here, which we will not cover.
    • You can also check CPU/GPU binding interface consistency here.
  • Build the compute pipeline.

In the next chapter, we will set up some memory resources, which we will later bind to this pipeline.

Exercise

The Vulkan compute pipeline setup process has many parts to it, and as a result this chapter contained a lot of information. It is advisable to quickly review it and the matching vulkano documentation, making sure you have a decent understanding of what’s going on before proceeding with the rest of the course.

After doing so, please fill in the square.comp and square.rs files and modify the lib.rs file using the instructions provided at the end of each part of this chapter.

Then modify the bin/simulate.rs program so that it allows specifying a workgroup size and creating a compute pipeline, and finally give it a test run to make sure that the resulting program passes all runtime checks in addition to compile-time ones.


  1. The C++ Language Extensions and C++ Language Support chapters of the CUDA programming guide should give you a quick taste of the challenges that are involved when taking a programming language that was designed for CPU programming and adapting it for GPU programming through a mixture of extensions and restrictions. And this is the most advanced attempt at such language adaptation, building on decades of effort from the richest GPU company in the world, and enjoying the luxury of only needing to support a single GPU vendor. As you can imagine, language adaptation projects that aim for cross-vendor portability with a more modest development team will have a harder time getting there.

  2. Ignoring a few emerging variations of the traditional graphics pipeline like raytracing and mesh shading, which may become the norm in the future if (1) all hardware in common use ends up supporting them and (2) they become undisputedly superior to the current vertex/tessellation/fragment graphics pipeline standard for all common rendering use cases.

  3. Graphics pipelines use block names to match interface blocks between pipeline stages, so that the shader associated with one pipeline stage can send data to the shader associated with the next pipeline stage. But compute pipelines only have a single stage, so this feature does not apply to them.

  4. The buffer memory layout rules used by Vulkan are actually a little different from those of struct members in C, which means that matching struct definitions on the CPU side must be written with care. But this is not an issue with our current buffer, which contains just an array of floats (laid out as in C), and we will later see that vulkano-shaders makes such CPU/GPU data layout matching easy anyway.

  5. CUDA practitioners may also know work items as threads and workgroups as blocks. Generally speaking, the realm of GPU computing is very talented at coming up with many different names for the same thing, resulting in a confusing terminology mess. You may find the author’s inter-API glossary handy.

  6. Specialization constants are a powerful Vulkan feature, which leverages the fact that GPU programs are compiled just-in-time in order to let you modify some compilation constants within the SPIR-V program before it is compiled into a device binary. This allows you to set parameters that must be known at compile time (e.g. stack-allocated array sizes, workgroup sizes) from the CPU side, as well as to tell the GPU compiler about application parameters that are known to the CPU at startup time (e.g. CLI parameters) so the GPU compiler can specialize the output binary for these specific parameter values.

  7. This may sound like an overly small amount of work for a GPU work item, and to be fair it probably is. However, we must keep in mind that modern high-end GPUs contain many compute units that run in parallel, each of which executes SIMD instructions in a superscalar manner and leverages simultaneous multithreading for latency hiding. As a result, millions of work items are often required to fully utilize GPU resources. This is why the standard recommendation to GPU programmers, which we follow here, is to start by spawning as many work items as possible, and only later experiment with alternate configurations that spawn fewer work items that each handle a larger amount of work (which must be done with care due to GPU architecture details that we have no time to get into). Because GPU workgroup schedulers are very fast, it is expected that this optimization will only provide modest benefits in real-world GPU programs where each work item does more work than a single floating-point multiplication.

  8. This magical 64 factor comes from the fact that GPU workgroups hide SIMD and superscalar execution behind a common abstraction. If our workgroup size is not a multiple of the hardware SIMD width, then some SIMD lanes will be partially unused, resulting in reduced execution efficiency. The hardware SIMD width is almost always a power of two, and the largest SIMD width that is in common use on GPU hardware is the 64-wide wavefronts used by many AMD GPU architectures (GCN and older).

  9. Given the number of pessimistic comments in the documentation of prominent GPU vendors, this Vulkan feature looks like it was designed with very specific implementations in mind which are not the most common ones. Unless some vendor’s documentation explicitly told you to use it, it is probably in your best interest to ignore it.