Resources

Following up on the work of the previous chapter, we now have a GPU compute pipeline that can be used to square an array of numbers. Before we can use it, however, we will need a second important ingredient, namely an array of numbers that can be bound to this pipeline.

In this chapter, we will see how such an array can be allocated, initialized, and bundled into a descriptor set that can in turn be bound to our compute pipeline. Along the way, we will also start covering how data can be exchanged between the CPU and the GPU, though our treatment of this topic will not be complete until the next chapter.

Vulkan memory primer

Barring (important) exceptions discussed in the memory profiling course, the standard CPU programming infrastructure is good at providing the illusion that your system contains only one kind of RAM, which you can allocate with malloc() and release with free().

But Vulkan is about programming GPUs, which make different tradeoffs than CPUs in the interest of cramming more number-crunching power per square centimeter of silicon. One of them is that real-world GPU hardware can access different types of memory, which must be carefully used together to achieve optimal performance. Here are some examples:

  • High-performance GPUs typically have dedicated RAM, called Video RAM or VRAM, that is separate from the main system RAM. VRAM usually has ~10x higher bandwidth than system RAM, at the expense of a larger access latency and coarser data transfer granularity.1
  • To speed up CPU-GPU data exchanges, some chunks of system RAM may be GPU-accessible, and some chunks of VRAM may be CPU-accessible. Such memory accesses must typically go through the PCI-express bus, which makes them very slow.2 But for single-use data, in-place accesses can be faster than CPU-GPU data transfer commands. And such memory may also be a faster source/destination when data transfer commands do get involved.
  • More advanced algorithms benefit from cache coherence guarantees. These guarantees are expensive to provide in a CPU/GPU distributed memory setup, and they are therefore not normally provided by default. Instead, such memory must be explicitly requested, usually at the expense of reduced performance for normal memory accesses.
  • Integrated GPUs that reside on the same package as a CPU make very different tradeoffs than the typical setup described above. Sometimes they only see a single memory type corresponding to system RAM, sometimes a chunk of system RAM is reserved for the GPU to reduce CPU-GPU communication. Usually these GPUs enjoy faster CPU-GPU communication at the expense of reduced GPU performance.

While some of those properties emerge from the use of physically distinct hardware, others originate from memory controller configuration choices that can be dynamically made on a per-allocation basis. Vulkan acknowledges this by exposing two related sets of physical device metadata, namely memory types and memory heaps:

  • A memory heap represents a pool of GPU-accessible memory out of which storage blocks can be allocated. It has a few intrinsic properties exposed as memory heap flags, and can host allocations of one or more memory types.
  • A memory type is a particular memory allocation configuration that a memory heap supports. It has a number of properties that affect possible usage patterns and access performance, some of which are exposed to Vulkan applications via memory property flags.

In vulkano, memory types and heaps can be queried using the memory_properties() method of the PhysicalDevice struct. This course’s basic info utility will display some of this information at device detail level 2 and above, while the standard vulkaninfo utility will display all of it at the expense of much more verbose output.
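As a quick illustration of what such a query could look like, here is a minimal sketch in which print_memory_info is a hypothetical helper (the vulkano method and struct fields are real):

use vulkano::device::physical::PhysicalDevice;

/// Hypothetical helper: dump every memory heap and type of a physical device
fn print_memory_info(device: &PhysicalDevice) {
    let memory = device.memory_properties();
    for (idx, heap) in memory.memory_heaps.iter().enumerate() {
        println!("heap {idx}: {} bytes, flags {:?}", heap.size, heap.flags);
    }
    for (idx, ty) in memory.memory_types.iter().enumerate() {
        println!("type {idx}: heap {}, flags {:?}", ty.heap_index, ty.property_flags);
    }
}

Let’s now look at an abridged version of vulkaninfo’s output for the GPU of the author’s primary work computer: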

vulkaninfo
[ ... lots of verbose noise ... ]

Device Properties and Extensions:
=================================
GPU0:
VkPhysicalDeviceProperties:
---------------------------
        apiVersion        = 1.4.311 (4210999)
        driverVersion     = 25.1.3 (104861699)
        vendorID          = 0x1002
        deviceID          = 0x6981
        deviceType        = PHYSICAL_DEVICE_TYPE_DISCRETE_GPU
        deviceName        = AMD Radeon Pro WX 3200 Series (RADV POLARIS12)
        pipelineCacheUUID = a7ef6108-0550-e213-559b-1bf8cda454df

[ ... more verbose noise ... ]

VkPhysicalDeviceMemoryProperties:
=================================
memoryHeaps: count = 2
        memoryHeaps[0]:
                size   = 33607798784 (0x7d32e5000) (31.30 GiB)
                budget = 33388290048 (0x7c618e000) (31.10 GiB)
                usage  = 0 (0x00000000) (0.00 B)
                flags:
                        None
        memoryHeaps[1]:
                size   = 4294967296 (0x100000000) (4.00 GiB)
                budget = 2420228096 (0x9041c000) (2.25 GiB)
                usage  = 0 (0x00000000) (0.00 B)
                flags: count = 1
                        MEMORY_HEAP_DEVICE_LOCAL_BIT
memoryTypes: count = 7
        memoryTypes[0]:
                heapIndex     = 1
                propertyFlags = 0x0001: count = 1
                        MEMORY_PROPERTY_DEVICE_LOCAL_BIT
                usable for:
                        IMAGE_TILING_OPTIMAL:
                                color images
                                FORMAT_D16_UNORM
                                FORMAT_D32_SFLOAT
                                FORMAT_S8_UINT
                                FORMAT_D16_UNORM_S8_UINT
                                FORMAT_D32_SFLOAT_S8_UINT
                        IMAGE_TILING_LINEAR:
                                color images
        memoryTypes[1]:
                heapIndex     = 1
                propertyFlags = 0x0001: count = 1
                        MEMORY_PROPERTY_DEVICE_LOCAL_BIT
                usable for:
                        IMAGE_TILING_OPTIMAL:
                                None
                        IMAGE_TILING_LINEAR:
                                None
        memoryTypes[2]:
                heapIndex     = 0
                propertyFlags = 0x0006: count = 2
                        MEMORY_PROPERTY_HOST_VISIBLE_BIT
                        MEMORY_PROPERTY_HOST_COHERENT_BIT
                usable for:
                        IMAGE_TILING_OPTIMAL:
                                color images
                                FORMAT_D16_UNORM
                                FORMAT_D32_SFLOAT
                                FORMAT_S8_UINT
                                FORMAT_D16_UNORM_S8_UINT
                                FORMAT_D32_SFLOAT_S8_UINT
                        IMAGE_TILING_LINEAR:
                                color images
        memoryTypes[3]:
                heapIndex     = 1
                propertyFlags = 0x0007: count = 3
                        MEMORY_PROPERTY_DEVICE_LOCAL_BIT
                        MEMORY_PROPERTY_HOST_VISIBLE_BIT
                        MEMORY_PROPERTY_HOST_COHERENT_BIT
                usable for:
                        IMAGE_TILING_OPTIMAL:
                                color images
                                FORMAT_D16_UNORM
                                FORMAT_D32_SFLOAT
                                FORMAT_S8_UINT
                                FORMAT_D16_UNORM_S8_UINT
                                FORMAT_D32_SFLOAT_S8_UINT
                        IMAGE_TILING_LINEAR:
                                color images
        memoryTypes[4]:
                heapIndex     = 1
                propertyFlags = 0x0007: count = 3
                        MEMORY_PROPERTY_DEVICE_LOCAL_BIT
                        MEMORY_PROPERTY_HOST_VISIBLE_BIT
                        MEMORY_PROPERTY_HOST_COHERENT_BIT
                usable for:
                        IMAGE_TILING_OPTIMAL:
                                None
                        IMAGE_TILING_LINEAR:
                                None
        memoryTypes[5]:
                heapIndex     = 0
                propertyFlags = 0x000e: count = 3
                        MEMORY_PROPERTY_HOST_VISIBLE_BIT
                        MEMORY_PROPERTY_HOST_COHERENT_BIT
                        MEMORY_PROPERTY_HOST_CACHED_BIT
                usable for:
                        IMAGE_TILING_OPTIMAL:
                                color images
                                FORMAT_D16_UNORM
                                FORMAT_D32_SFLOAT
                                FORMAT_S8_UINT
                                FORMAT_D16_UNORM_S8_UINT
                                FORMAT_D32_SFLOAT_S8_UINT
                        IMAGE_TILING_LINEAR:
                                color images
        memoryTypes[6]:
                heapIndex     = 0
                propertyFlags = 0x000e: count = 3
                        MEMORY_PROPERTY_HOST_VISIBLE_BIT
                        MEMORY_PROPERTY_HOST_COHERENT_BIT
                        MEMORY_PROPERTY_HOST_CACHED_BIT
                usable for:
                        IMAGE_TILING_OPTIMAL:
                                None
                        IMAGE_TILING_LINEAR:
                                None

[ ... even more verbose noise, other GPUs ... ]

As you can see, this AMD Radeon Pro WX 3200 GPU can access memory that is allocated from two memory heaps, which together support seven memory types:

  • The first memory heap corresponds to half of the available system RAM, and represents its GPU-accessible subset. It supports three memory types that are all visible from the CPU (HOST_VISIBLE) and coherent with CPU caches (HOST_COHERENT). The latter means, among other things, that when the CPU writes to these memory regions, the change will eventually become GPU-visible without using any special command.
    • Memory type 2 is not CPU-cached. This means that on the CPU side, only sequential writes will perform well, but in exchange better CPU-to-GPU data transfer performance may be observed.
    • Memory type 5 is CPU-cached, which improves CPU read and random access performance at the risk of increasing the performance penalty for GPU accesses.
    • Memory type 6 is similar to memory type 5, but unlike the other two types it cannot be used for image allocations. Images are opaque memory objects used to leverage the GPU’s texturing units, which are beyond the scope of this introductory course.3
  • The second memory heap corresponds to the GPU’s dedicated VRAM, and comes with a DEVICE_LOCAL flag that indicates that it should be faster to access from the GPU. It supports four memory types that cover all possible combinations of two boolean properties: “visible from the host/CPU” and “usable for images”.
    • Memory type 0 is not host-visible and can be used for images.
    • Memory type 1 is not host-visible and cannot be used for images.
    • Memory type 3 is host-visible, host-coherent, and can be used for images.
    • Memory type 4 is host-visible, host-coherent, and cannot be used for images.

You may be surprised by the way memory types are numbered, jumping from one memory heap to another. This ordering is unlikely to have been picked at random. Indeed, Vulkan requires that memory types be ordered by expected access performance, allowing applications to pick a good type with a simple “iterate over memory types and return the first one that fits the intended purpose” search loop, sketched below. It is likely that this is part of4 what’s happening here.
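As a minimal sketch of that search loop (find_first_memory_type is a hypothetical helper, not part of vulkano), a first-fit lookup could look like this:

use vulkano::{device::physical::PhysicalDevice, memory::MemoryPropertyFlags};

/// Hypothetical helper: index of the first memory type whose property flags
/// contain all of the requested flags, if any
fn find_first_memory_type(
    device: &PhysicalDevice,
    required: MemoryPropertyFlags,
) -> Option<u32> {
    device
        .memory_properties()
        .memory_types
        .iter()
        .position(|ty| ty.property_flags.contains(required))
        .map(|idx| idx as u32)
}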

In any case, now that we’ve gone through Vulkan memory heaps and types, let us start thinking about how our application might use them.

GPU data setup

Strategy

Our number-squaring program expects some initial data as input. Because this is a toy example, we could pick a simple input pattern that is easy to generate on the GPU (e.g. all-zero bytes).

But this would be a special-purpose optimization, as many real-world inputs can only come from the CPU side (think about e.g. inputs that are read from files). In the interest of covering the most general-purpose techniques, we will thus discuss how to get CPU-generated inputs into a GPU pipeline.

Depending on which Vulkan memory types are available, we may have up to three ways to perform this CPU-to-GPU data transfer:

  1. Allocate a block of memory that is device-local and host-visible. Directly write to it on the CPU side, then directly read from it on the GPU side.
  2. Allocate a block of memory that is NOT device-local but is host-visible. Use it as in #1.
  3. Allocate a block of memory that is device-local and another block of memory that is host-visible. Write to the host-visible block on the CPU side, then use a Vulkan command to copy its content to the device-local block, then read from the device-local block on the GPU side.

How do these options compare?

  • The Vulkan specification guarantees that a host-visible and a device-local memory type will be available, but does not guarantee that they will be the same memory type. Therefore options #2 and #3 are guaranteed to be available, but option #1 may not be available.
  • Accessing CPU memory from the GPU as in option #2 may only be faster than copying it as in #3 if the data is only used once, or the GPU code only uses a subset of it. Thus this method only makes sense for GPU compute pipelines that have specific properties.
  • Given the above, although allocating two blocks of memory and copying data from one to the other as in #3 increases the program’s memory footprint and code complexity, it can be seen as the most general-purpose approach. Methods #1 and #2 can be more efficient in specific situations, and should thus be explored as possible optimizations when the memory copy of method #3 becomes a performance bottleneck.

We will therefore mainly focus on the copying-based method during this course.
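To make this more concrete, here is one possible mapping of these three options onto vulkano’s MemoryTypeFilter constants, which we will meet again shortly. This is a hedged sketch, not code from this course:

use vulkano::memory::allocator::MemoryTypeFilter;

/// Hypothetical mapping of the three CPU-to-GPU transfer options above onto
/// vulkano memory type filters
fn transfer_option_filters() -> [MemoryTypeFilter; 4] {
    // #1: a single device-local AND host-visible block, written directly
    let direct = MemoryTypeFilter::PREFER_DEVICE | MemoryTypeFilter::HOST_SEQUENTIAL_WRITE;
    // #2: a single host-visible block that need not be device-local
    let host_only = MemoryTypeFilter::PREFER_HOST | MemoryTypeFilter::HOST_SEQUENTIAL_WRITE;
    // #3: a device-local block that the CPU never touches...
    let gpu_side = MemoryTypeFilter::PREFER_DEVICE;
    // ...plus a host-visible staging block that the CPU writes to
    let staging = MemoryTypeFilter::PREFER_HOST | MemoryTypeFilter::HOST_SEQUENTIAL_WRITE;
    [direct, host_only, gpu_side, staging]
}

Note that PREFER_DEVICE and PREFER_HOST express preferences rather than hard requirements, so the allocator may fall back to other memory types when the preferred ones are unavailable.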

CPU buffer

We mentioned earlier that buffers are the core Vulkan abstraction for allocating and using memory blocks with a user-controlled data layout. But that was a bit of a logical shortcut. Several different Vulkan entities may actually get involved here:

  • Vulkan lets us allocate blocks of GPU-visible memory, aka device memory.
  • Vulkan lets us create buffer objects, to which device memory can be bound. They supplement their backing memory with some metadata. Among other things this metadata tells the Vulkan implementation how we intend to use the memory, enabling some optimizations.
  • When manipulating images, we may also use buffer views, which are basically buffers full of image-like pixels with some extra metadata that describes the underlying pixel format.

As we have opted not to cover images in this course, we will not discuss buffer views further. But that still leaves us with the matter of allocating device memory and buffers with consistent properties (e.g. not backing a 1 MiB buffer with 4 KiB of device memory), and of making sure that a buffer never outlives the device memory that backs it.

The vulkano API resolves these memory-safety issues by re-exposing the above Vulkan concepts through a stack of abstractions with slightly different naming:

  • RawBuffers exactly match Vulkan buffers and do not own their backing device memory. They are not meant to be used in everyday code, but rather to support advanced optimizations where the higher-level API does not fit. Using them requires unsafe operations.
  • A Buffer combines a RawBuffer with some backing device memory, making sure that the two cannot go out of sync in a manner that results in memory safety issues. It is the first memory-safe layer of the vulkano abstraction stack that can be used without unsafe.
  • A Subbuffer represents a subset of a Buffer defined by an offset and a size. It models the fact that most buffer-based Vulkan APIs also accept offset and range information, and again makes sure that this extra metadata is consistent with the underlying buffer object and device memory allocation. This is the type that we will handle most often when working with buffers in vulkano.
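For example, vulkano lets us carve smaller Subbuffers out of a larger one without any new allocation. Here is a minimal sketch, in which split_in_half is a hypothetical helper while len() and split_at() are real Subbuffer methods:

use vulkano::buffer::Subbuffer;

/// Hypothetical helper: split a buffer of numbers into two non-overlapping
/// halves, e.g. to dedicate a single allocation to two different purposes
fn split_in_half(buffer: Subbuffer<[f32]>) -> (Subbuffer<[f32]>, Subbuffer<[f32]>) {
    let mid = buffer.len() / 2;
    buffer.split_at(mid)
}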

By combining these abstractions with the rand crate for random number generation, we can create a CPU-visible buffer full of randomly generated numbers in the following manner:

use clap::Args;
use rand::{distr::Uniform, prelude::*};
use std::num::NonZeroUsize;
use vulkano::{
    buffer::{Buffer, BufferCreateInfo, BufferUsage, Subbuffer},
    memory::allocator::{AllocationCreateInfo, MemoryTypeFilter},
};

/// CLI parameters that guide input generation
#[derive(Debug, Args)]
pub struct InputOptions {
    /// Number of numbers to be squared
    #[arg(short, long, default_value = "1000")]
    pub len: NonZeroUsize,

    /// Smallest possible input value
    #[arg(long, default_value_t = 0.5)]
    pub min: f32,

    /// Largest possible input value
    #[arg(long, default_value_t = 2.0)]
    pub max: f32,
}

/// Set up a CPU-side input buffer with some random initial values
pub fn setup_cpu_input(context: &Context, options: &InputOptions) -> Result<Subbuffer<[f32]>> {
    // Configure the Vulkan buffer object
    let create_info = BufferCreateInfo {
        usage: BufferUsage::TRANSFER_SRC,
        ..Default::default()
    };

    // Configure the device memory allocation
    let allocation_info = AllocationCreateInfo {
        memory_type_filter: MemoryTypeFilter::PREFER_HOST | MemoryTypeFilter::HOST_SEQUENTIAL_WRITE,
        ..Default::default()
    };

    // Set up random input generation
    let mut rng = rand::rng();
    let range = Uniform::new(options.min, options.max)?;
    let numbers_iter = std::iter::repeat_with(|| range.sample(&mut rng)).take(options.len.get());

    // Put it all together by creating the vulkano Subbuffer
    let subbuffer = Buffer::from_iter(
        context.mem_allocator.clone(),
        create_info,
        allocation_info,
        numbers_iter,
    )?;
    Ok(subbuffer)
}

The main things that we specify here are that…

  • The buffer must be usable as the source of a Vulkan data transfer command.
  • The buffer should be allocated on the CPU side for optimal CPU memory access speed, in a way that is suitable for efficient sequential writes (i.e. uncached memory is fine).

But as you may imagine after having been exposed to Vulkan APIs for a while, there are many other things that we could potentially configure here:

  • On the BufferCreateInfo side, which controls creation of the Vulkan buffer object…
  • On the AllocationCreateInfo side, which controls allocation of device memory…
    • We may specify which Vulkan memory types should be used for the backing storage through a mixture of “must”, “should” and “should not” constraints.
    • We may hint the allocator towards or away from using dedicated device memory allocations, as opposed to sub-allocating from previously allocated device memory blocks, as illustrated by the sketch below.
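As an illustration of these two knobs, here is a hedged sketch (not code from this course) of an allocation configuration that prefers device-local, host-writable memory and insists on a dedicated device memory allocation:

use vulkano::memory::allocator::{
    AllocationCreateInfo, MemoryAllocatePreference, MemoryTypeFilter,
};

/// Hypothetical configuration: prefer device-local memory that the host can
/// write sequentially, and always give it a dedicated allocation
fn dedicated_upload_allocation() -> AllocationCreateInfo {
    AllocationCreateInfo {
        memory_type_filter: MemoryTypeFilter::PREFER_DEVICE
            | MemoryTypeFilter::HOST_SEQUENTIAL_WRITE,
        allocate_preference: MemoryAllocatePreference::AlwaysAllocate,
        ..Default::default()
    }
}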

GPU buffer

Our input data is now stored in a memory region that the GPU can access, but likely with suboptimal efficiency. The next step in our copy-based strategy will therefore be to allocate another buffer of matching characteristics from the fastest available device memory type. After that we may use a Vulkan copy command to copy our inputs from the slow “CPU side” to the fast “GPU side”.

Allocating the memory is not very interesting in and of itself, as we will just use a different Buffer constructor that lets us allocate an uninitialized memory block:

/// Set up an uninitialized GPU-side data buffer
pub fn setup_gpu_data(context: &Context, options: &InputOptions) -> Result<Subbuffer<[f32]>> {
    let usage = BufferUsage::TRANSFER_DST | BufferUsage::STORAGE_BUFFER | BufferUsage::TRANSFER_SRC;
    let subbuffer = Buffer::new_slice(
        context.mem_allocator.clone(),
        BufferCreateInfo {
            usage,
            ..Default::default()
        },
        AllocationCreateInfo::default(),
        options.len.get() as u64,
    )?;
    Ok(subbuffer)
}

The only thing worth noting here is that we are using buffer usage flags that anticipate the need to later bind this buffer to our compute pipeline (STORAGE_BUFFER) and get the computations’ outputs back into a CPU-accessible buffer at the end using another copy command (TRANSFER_SRC).

As you will see in the next chapter, however, the actual copying will be a bit more interesting.

Descriptor set

After a copy from the CPU side to the GPU side has been carried out (a process that we will not explain yet because it involves concepts covered in the next chapter), the GPU data buffer will contain a copy of our input data. We will then want to bind this data buffer to our compute pipeline, before we can execute this pipeline to square the inner numbers.

However, because Vulkan is not OpenGL, we cannot directly bind a data buffer to a compute pipeline. Instead, we will first need to build a descriptor set for this purpose.

We briefly mentioned descriptor sets in the previous chapter. To recall their purpose, they are Vulkan’s attempt to eliminate a performance bottleneck that plagued earlier GPU APIs, where memory resources used to be bound to compute and graphics pipelines one by one just before scheduling pipeline execution. These numerous resource binding API calls often ended up becoming an application performance bottleneck,5 so Vulkan improved upon them in two ways:

  • The binding mechanism is batched, so that an arbitrarily large amount of resources (up to ~millions on typical hardware) can be bound to a GPU pipeline with a single API call.
  • Applications can prepare resource bindings in advance during their initialization stage, so that actual binding calls later perform as little work as possible.

The product of these improvements is the descriptor set, which is a set of resources that is ready to be bound to a particular compute pipeline.6 And as usual, vulkano makes them rather easy to build and safely use compared to raw Vulkan:

use std::sync::Arc;
use vulkano::descriptor_set::{DescriptorSet, WriteDescriptorSet};

/// Set up a descriptor set for binding the GPU buffer to the compute pipeline
pub fn setup_descriptor_set(
    context: &Context,
    pipeline: &Pipeline,
    buffer: Subbuffer<[f32]>,
) -> Result<Arc<DescriptorSet>> {
    // Configure which pipeline descriptor set this will bind to
    let set_layout = pipeline.layout.set_layouts()[DATA_SET as usize].clone();

    // Configure what resources will attach to the various bindings
    // that this descriptor set is composed of
    let descriptor_writes = [WriteDescriptorSet::buffer(DATA_BINDING, buffer)];

    // Set up the descriptor set accordingly
    let descriptor_set = DescriptorSet::new(
        context.desc_allocator.clone(),
        set_layout,
        descriptor_writes,
        [],
    )?;
    Ok(descriptor_set)
}

As you may guess by now, the empty array that is passed as a fourth parameter to the DescriptorSet::new() constructor gives us access to a Vulkan API feature that we will not use here. That feature lets us efficiently copy resource bindings from one descriptor set to another, which improves efficiency and ergonomics in situations where one needs to build descriptor sets that share some content but differ in other ways.7

Another vulkano-supported notion that we will not cover further in this course is that of variable descriptor set bindings. This maps to a SPIR-V/GLSL feature that enables descriptor sets to have a number of bindings that is not defined at shader compilation time. That way, GPU programs can access an array of resources whose length varies from one execution to another.

Output buffer

After some number squaring has been carried out (which, again, will be the topic of the next chapter), we could go on and perform more computations on the GPU side, without ever getting any data back to the CPU side until the very end (or never, if the end result is a real-time visualization).

This is good because CPU-GPU data transfers are relatively slow and can easily become a performance bottleneck. But here our goal is to keep our first program example simple, so we will just get data back to the CPU side right away.

For this purpose, we will set up a dedicated output buffer on the CPU side:

/// Set up an uninitialized CPU-side output buffer
pub fn setup_cpu_output(context: &Context, options: &InputOptions) -> Result<Subbuffer<[f32]>> {
    let create_info = BufferCreateInfo {
        usage: BufferUsage::TRANSFER_DST,
        ..Default::default()
    };
    let allocation_info = AllocationCreateInfo {
        memory_type_filter: MemoryTypeFilter::PREFER_HOST | MemoryTypeFilter::HOST_RANDOM_ACCESS,
        ..Default::default()
    };
    let subbuffer = Buffer::new_slice(
        context.mem_allocator.clone(),
        create_info,
        allocation_info,
        options.len.get() as u64,
    )?;
    Ok(subbuffer)
}

However, this may leave you wondering why we are not reusing the CPU buffer that we have set up earlier for input initialization. With a few changes to our BufferCreateInfo and AllocationCreateInfo, we could set up a buffer that is suitable for both purposes, but there is an underlying tradeoff. Let’s look into the pros and cons of each approach:

  • Using separate input and output buffers consumes twice the amount of GPU-accessible system memory compared to using only one buffer.
  • Using separate input and output buffers lets us set fewer BufferUsage flags on each buffer, which may enable the implementation to perform more optimizations.
  • Using separate input and output buffers lets us leverage uncached host memory on the input side (corresponding to vulkano’s MemoryTypeFilter::HOST_SEQUENTIAL_WRITE allocation configuration), which may enable faster data transfers from the CPU to the GPU.
  • And perhaps most importantly, using separate input and output buffers lets us check result correctness at the end, which is important in any kind of course material :)

Overall, we could have done it both ways (and you can experiment with the other way as an exercise). But in the real world, the choice between these two approaches will depend on your performance priorities (data transfer speed vs memory utilization) and on the benefits that you actually measure from the theoretically superior dual-buffer configuration on your target hardware.
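For reference, here is a hedged sketch of what the single-buffer alternative could look like. The setup_cpu_io name is hypothetical, but the types and constructors are the same ones that the functions above use:

/// Hypothetical combined CPU-side buffer, usable both as the source of the
/// input upload and as the destination of the output download
pub fn setup_cpu_io(context: &Context, options: &InputOptions) -> Result<Subbuffer<[f32]>> {
    let create_info = BufferCreateInfo {
        // Must now sit on both ends of a data transfer command
        usage: BufferUsage::TRANSFER_SRC | BufferUsage::TRANSFER_DST,
        ..Default::default()
    };
    let allocation_info = AllocationCreateInfo {
        // CPU-cached memory, since the CPU will read the results back
        memory_type_filter: MemoryTypeFilter::PREFER_HOST | MemoryTypeFilter::HOST_RANDOM_ACCESS,
        ..Default::default()
    };
    let subbuffer = Buffer::new_slice(
        context.mem_allocator.clone(),
        create_info,
        allocation_info,
        options.len.get() as u64,
    )?;
    Ok(subbuffer)
}

Note that this version has to settle for cached host memory, since the output readback benefits from it, which is one of the tradeoffs listed above.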

In any case, the actual copy operation used to get data from the GPU buffer to this buffer will be covered in the next chapter, because as mentioned above copy commands use Vulkan command submission concepts that we have not introduced yet.

Conclusion

In this chapter, we have explored how Vulkan memory management works under the hood, and what vulkano does to make it easier on the Rust side. In particular, we have introduced the various ways we can get data in and out of the GPU. And we have seen how GPU-accessible buffers can be packaged into descriptor sets for the purpose of later binding them to a compute pipeline.

This paves the way for the last chapter, where we will finally put everything together into a working number-squaring computation. The main missing piece that we will cover there is the Vulkan command submission and synchronization model, which will allow us to perform CPU-GPU data copies, bind resources and compute pipelines, execute said pipelines, and wait for GPU work to complete.

Exercises

As you have seen in this chapter, the topics of Vulkan resource management and command scheduling are heavily intertwined, and any useful Vulkan-based program will feature a combination of both. The code presented in this chapter should thus be considered a work in progress, and it is not advisable to try executing and modifying it at this point. We have not yet introduced the right tools to make sure it works and assess its performance characteristics.

What you can already do, however, is copy the functions that have been presented throughout this chapter into the exercises/src/square.rs code module, and add some InputOptions to that module’s Options struct so that you are ready to pass in the right CLI arguments later.

Then stop and think. Vulkan is about choice: there is never only one way to do something. What other ways would you have to get data in and out of the GPU? How would they compare? And how would they affect the resource allocation code that is presented in this chapter?

As a hint to check how far along you are, a skim through this chapter should already give you 4 ways to initialize GPU buffers, 4 ways to exploit the results of a GPU computation, and 2 ways to set up CPU staging buffers in configurations where copies to and from the GPU are required.

Of course, going through this thought experiment will not give you an exhaustive list of all possible ways to perform these operations (which would include specialized tools like buffer clearing commands and system-specific extensions like RDMA). But it should provide you with good coverage of the general-purpose approaches that are available on most Vulkan-supported systems.


  1. This is partly the result of using a different memory technology, GDDR or HBM instead of standard DDR, and partly the result of GPUs having non-replaceable VRAM. The latter means that RAM chips can be soldered extremely close to compute chips and enjoy extra bandwidth by virtue of using a larger number of shorter electrical connection wires. Several CPU models use a variation of this setup (Apple Mx, Intel Sapphire Rapids, …), but so far the idea of a general-purpose computer having its RAM capacity set in stone for its entire lifetime has not proved very popular.

  2. The “express” in PCI-express is relative to older versions of the PCI bus. This common CPU-GPU interconnect is unfortunately very low-bandwidth and high-latency by memory bus standards.

  3. A former version of this course used to leverage images because they make the GPU side of 2D/3D calculations nicer and enable new opportunities for hardware acceleration. But it was later discovered that this limited use of GPU texturing units is so much overkill that on many common GPUs it results in a net performance penalty compared to careful use of GPU buffers. Given that the use of images also adds a fair bit of complexity to the CPU-side setup code, this edition of the course decided to remove all uses of images in the joint interest of performance and CPU code simplicity.

  4. In this particular case, there is likely more to this story because the way AMD chose to enumerate their VRAM memory types means that no application allocation should ever end up using memory types 1 and 4. Indeed, these memory types can be used for buffers and not images, but they are respectively ordered after the memory types 0 and 3 that can be used for both buffers and images, and do not differ from types 1 and 4 in any other Vulkan-visible way. Buffer allocations using the “first memory type that fits” approach should thus end up using memory types 0 and 3 always. One possibility is that as with the “duplicate” queue families that we discussed before, there might be another property that distinguishes these two memory types, which cannot be queried from Vulkan but can be learned about by exploring manufacturer documentation. But at the time of writing, there is sadly no time for such an investigation, so we will leave this mystery for another day.

  5. Single-resource binding calls may seem reasonable at first glance, and are certainly good enough for the typical numerical computing application that only binds a couple of buffers per long-running compute pipeline execution. But the real-time 3D rendering workloads that Vulkan was designed for operate on tight real-time budgets (given a 60 Hz monitor, a new video frame must be rendered every 16.7 ms), may require thousands to millions of resource bindings, and involve complex multipass algorithms that may require resource rebinding between passes. For such applications, it is easy to see how even a small per-binding cost in the microsecond range can balloon into an unacceptable amount of API overhead.

  6. To be precise, descriptor sets can be bound to any pipeline that has the same descriptor set layout. Advanced Vulkan users can leverage this nuance by sharing descriptor set layouts or even entire pipeline layouts across several compute and graphics pipelines. This allows them to amortize the API overheads of pipeline layout setup, but most importantly reduces the need to later set up and bind redundant descriptor sets when the same resources are bound to several related compute and graphics pipelines.

  7. If the pipeline executions that share some bindings run in succession, a more efficient alternative to this strategy is to extract the shared subset of the original descriptor set into a different descriptor set. This way, you can keep the descriptor set that corresponds to the common subset of bindings bound, and only rebind the descriptor sets that correspond to bindings that do change.