Data & I/O

As before, after setting up our GPU compute pipelines, we will want to set up some data buffers that we can bind to those pipelines.

This process will be quite a bit simpler than before because we will not repeat the introduction to Vulkan memory management and will be using GPU-side initialization. So we will use the resulting savings in character budget to…

  • Show what it takes to integrate GPU data into our existing CPU simulation skeleton.
  • Follow the suggestion made in the number-squaring chapter to avoid having a single build_command_buffer() god-function that does all command buffer building.
  • Adjust our HDF5 I/O logic so that we do not need to download U concentration from the GPU.

GPU dataset

New code organization

Vulkan descriptor sets group resource bindings together so that many resources can be bound in a single call; the fewer descriptor sets an application uses, the lower its resource binding overhead. In the context of our Gray-Scott simulation, the lowest count that we can easily¹ achieve is two descriptor sets:

  • One that uses two buffers (let’s call them U1 and V1) as inputs and two other buffers (let’s call them U2 and V2) as outputs.
  • Another that uses the same buffers, but flips the roles of input and output buffers. Using the above notation, U2 and V2 become the inputs, while U1 and V1 become the outputs.

Given this descriptor set usage scheme, our command buffers will alternately bind these two descriptor sets, executing the simulation compute pipeline after each descriptor set binding call. This will roughly replicate the double buffering pattern that we used on the CPU.
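In vulkano terms, the recorded commands will follow a pattern along these lines (a rough sketch; names like pipelines.simulate, forward_set, reverse_set, workgroups and num_steps are placeholders for illustration, and the real code will be introduced step by step below and in the next chapter):

cmdbuild.bind_pipeline_compute(pipelines.simulate.clone())?;
for step in 0..num_steps {
    // Even steps read (U1, V1) and write (U2, V2), odd steps do the opposite
    let set = if step % 2 == 0 {
        forward_set.clone()
    } else {
        reverse_set.clone()
    };
    cmdbuild.bind_descriptor_sets(
        PipelineBindPoint::Compute,
        pipelines.layout.clone(),
        INOUT_SET,
        set,
    )?;
    // SAFETY: correct execution configuration, as discussed later on
    unsafe {
        cmdbuild.dispatch(workgroups)?;
    }
}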

To get there, however, we will need to redesign our inner data abstractions a bit with respect to what we used to have on the CPU side. Indeed, back in the CPU course, we used to have the following separation of concerns in our code:

  • One struct called UV would represent a pair of tables of identical size and related contents, one representing the chemical concentration of species U and one representing the chemical concentration of species V.
  • Another struct called Concentrations would represent a pair of UV structs and implement the double buffering logic for alternately using one of these UV structs to store inputs, and the other to store outputs.

But now that we have descriptor sets that combine inputs and outputs, this program decomposition scheme doesn’t work anymore, which is why we will have to switch to a different one:

  • One struct called InOut will contain and manage all vulkano objects associated with one (U, V) input pair and one (U, V) output pair.
  • struct Concentrations will remain around, but be repurposed to manipulate pairs of InOut rather than pairs of UV. And users of its update() function will now only be exposed to a single DescriptorSet, instead of being exposed to a pair of UVs as in the CPU code.

Introducing InOut

Our new InOut data structure is going to look like this:

use std::sync::Arc;
use vulkano::{buffer::subbuffer::Subbuffer, descriptor_set::DescriptorSet};

/// Set of GPU inputs and outputs
struct InOut {
    /// Descriptor set used by GPU compute pipelines
    descriptor_set: Arc<DescriptorSet>,

    /// Input buffer for the V species, used during GPU-to-CPU data transfers
    input_v: Subbuffer<[Float]>,
}

As the comments point out, we are going to keep both a full input/output descriptor set and a V input buffer around, because they are useful for different tasks:

  • Compute pipeline execution commands operate over descriptor sets.
  • Buffer-to-buffer data transfer commands operate over the underlying Subbuffer objects.
  • Because descriptor sets are a very general-purpose abstraction, going from a descriptor set to the underlying buffer objects is a rather cumbersome process.
  • And because Subbuffer is just a reference-counted pointer, it does not cost much performance to skip that cumbersome process by keeping a V buffer reference around.

Notice that we do not keep around the Subbuffer associated with the U species’ concentration, because we do not actually need it. We will get back to this.


For now, let us look at how an InOut is constructed:

use super::{
    options::RunnerOptions,
    pipeline::{IN, INOUT_SET, OUT},
};
use crate::{context::Context, Result};
use vulkano::{
    buffer::{Buffer, BufferCreateInfo, BufferUsage},
    descriptor_set::WriteDescriptorSet,
    memory::allocator::AllocationCreateInfo,
    pipeline::layout::PipelineLayout,
    DeviceSize,
};

/// Number of padding elements per side of the simulation domain
const PADDING_PER_SIDE: usize = 1;

/// Compute the padded version of a simulation dataset dimension (rows/cols)
fn padded(dimension: usize) -> usize {
    dimension + 2 * PADDING_PER_SIDE
}
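
/// Hypothetical helper for illustration, not part of the chapter's code:
/// 1D index of un-padded domain position (row, col) within the padded
/// storage of a domain that has `num_cols` data columns
fn padded_index(row: usize, col: usize, num_cols: usize) -> usize {
    (row + PADDING_PER_SIDE) * padded(num_cols) + (col + PADDING_PER_SIDE)
}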

impl InOut {
    /// Allocate a set of 4 buffers that can be used to store either U or V
    /// species concentrations, and can serve as inputs or outputs
    fn allocate_buffers(
        options: &RunnerOptions,
        context: &Context,
    ) -> Result<[Subbuffer<[Float]>; 4]> {
        use BufferUsage as BU;
        let padded_rows = padded(options.num_rows);
        let padded_cols = padded(options.num_cols);
        let new_buffer = || {
            Buffer::new_slice(
                context.mem_allocator.clone(),
                BufferCreateInfo {
                    usage: BU::STORAGE_BUFFER | BU::TRANSFER_DST | BU::TRANSFER_SRC,
                    ..Default::default()
                },
                AllocationCreateInfo::default(),
                (padded_rows * padded_cols) as DeviceSize,
            )
        };
        Ok([new_buffer()?, new_buffer()?, new_buffer()?, new_buffer()?])
    }

    /// Set up an `InOut` configuration by assigning roles to the 4 buffers that
    /// [`allocate_buffers()`](Self::allocate_buffers) previously allocated
    fn new(
        context: &Context,
        layout: &PipelineLayout,
        input_u: Subbuffer<[Float]>,
        input_v: Subbuffer<[Float]>,
        output_u: Subbuffer<[Float]>,
        output_v: Subbuffer<[Float]>,
    ) -> Result<Self> {
        // Determine how the descriptor set will bind to the compute pipeline
        let set_layout = layout.set_layouts()[INOUT_SET as usize].clone();

        // Configure what resources will attach to the various bindings
        // that the descriptor set is composed of
        let descriptor_writes = [
            WriteDescriptorSet::buffer_array(IN, 0, [input_u.clone(), input_v.clone()]),
            WriteDescriptorSet::buffer_array(OUT, 0, [output_u.clone(), output_v.clone()]),
        ];

        // Set up the descriptor set according to the above configuration
        let descriptor_set = DescriptorSet::new(
            context.desc_allocator.clone(),
            set_layout,
            descriptor_writes,
            [],
        )?;

        // Also keep track of the V input buffer, and we're done
        Ok(Self {
            descriptor_set,
            input_v,
        })
    }
}

The general idea here is that because our two InOuts will refer to the same buffers, we cannot allocate the buffers internally, inside of the InOut::new() constructor. Instead, we will need to allocate them in the Concentrations code that builds the InOuts, then pass the same buffers twice in a different order to build the two different InOuts.

From an abstraction design perspective, it is not ideal that the caller needs to know the right order in which buffers should be passed. But sadly, this cannot be cleanly fixed at the InOut layer, so we will fix it at the Concentrations layer instead.

Updating Concentrations

In the CPU simulation, the Concentrations struct used to…

  • Contain a pair of UV values and a boolean that clarified their input/output role
  • Offload most initialization work to the lower UV layer
  • Expose an update() method whose user callback received both an immutable input (&UV) and a mutable output (&mut UV)

For the GPU simulation, as discussed earlier, we will switch to a different architecture:

  • Concentrations will now contain InOuts instead of UVs
  • InOut initialization will now be handled by the Concentrations layer, as it is the one that has easy access to the output buffers of each InOut
  • Initialization will now be asynchronous, as it entails some Vulkan commands that must be enqueued inside of a command buffer
  • The update() method will only receive a single DescriptorSet, as this contains all info needed to read inputs and write outputs

The switch to InOut is straightforward enough, and probably not worth discussing…

/// Double-buffered chemical species concentration storage
pub struct Concentrations {
    /// Compute pipeline input/output configurations
    ///
    /// If we denote `(U1, V1, U2, V2)` the underlying storage buffers...
    /// - The first "forward" configuration uses `(U1, V1)` as inputs and
    ///   `(U2, V2)` as outputs.
    /// - The second "reverse" configuration uses `(U2, V2)` as inputs and
    ///   `(U1, V1)` as outputs.
    ///
    /// By alternating between these two configurations, we can take as many
    /// simulation steps as we need to, always using the output of the
    /// simulation step N as the input of simulation step N+1.
    inout_sets: [InOut; 2],

    /// Truth that the second "reverse" input/output configuration is active
    reversed: bool,
}

…however the constructor change will be quite a bit more substantial:

use super::{
    pipeline::Pipelines,
    CommandBufferBuilder,
};
use std::num::NonZeroU32;
use vulkano::pipeline::PipelineBindPoint;

impl Concentrations {
    /// Set up GPU data storage and schedule GPU buffer initialization
    ///
    /// GPU buffers will only be initialized after the command buffer associated
    /// with `cmdbuild` has been built and submitted for execution. Any work
    /// that depends on their initial value must be scheduled afterwards.
    pub fn create_and_schedule_init(
        options: &RunnerOptions,
        context: &Context,
        pipelines: &Pipelines,
        cmdbuild: &mut CommandBufferBuilder,
    ) -> Result<Self> {
        // Allocate all GPU storage buffers used by the simulation
        let [u1, v1, u2, v2] = InOut::allocate_buffers(options, context)?;

        // Set up input/output configurations
        let inout1 = InOut::new(
            context,
            &pipelines.layout,
            u1.clone(),
            v1.clone(),
            u2.clone(),
            v2.clone(),
        )?;
        let inout2 = InOut::new(context, &pipelines.layout, u2.clone(), v2.clone(), u1, v1)?;

        // Schedule initialization using the second descriptor set. The output
        // buffers of this descriptor set are the input buffers of the first
        // descriptor set, which will be used first.
        cmdbuild.bind_pipeline_compute(pipelines.init.clone())?;
        cmdbuild.bind_descriptor_sets(
            PipelineBindPoint::Compute,
            pipelines.layout.clone(),
            INOUT_SET,
            inout2.descriptor_set.clone(),
        )?;
        let num_workgroups = |domain_size: usize, workgroup_size: NonZeroU32| {
            padded(domain_size).div_ceil(workgroup_size.get() as usize) as u32
        };
        let padded_workgroups = [
            num_workgroups(options.num_cols, options.pipeline.workgroup_cols),
            num_workgroups(options.num_rows, options.pipeline.workgroup_rows),
            1,
        ];
        // SAFETY: GPU shader has been checked for absence of undefined behavior
        //         given a correct execution configuration, and this is one
        unsafe {
            cmdbuild.dispatch(padded_workgroups)?;
        }

        // Schedule zero-initialization of the edges of the first output.
        //
        // Only the edges need to be initialized. The values at the center of
        // the dataset do not matter, as these buffers will serve as simulation
        // outputs at least once (which will initialize their central values)
        // before they serve as a simulation input.
        //
        // Here we initialize the entire buffer to zero, as the Vulkan
        // implementation is likely to special-case this buffer-zeroing
        // operation with a high-performance implementation.
        cmdbuild.fill_buffer(u2.reinterpret(), 0)?;
        cmdbuild.fill_buffer(v2.reinterpret(), 0)?;

        // Once the command buffer is executed, everything will be ready
        Ok(Self {
            inout_sets: [inout1, inout2],
            reversed: false,
        })
    }

    // [ ... more methods coming up ... ]
}

As you can see, it is now the Concentrations constructor that is responsible for allocating the storage buffers underlying the InOut structs and for assigning input and output roles to them.

The initialization process also becomes a bit more complex:

  • The true simulation input is initialized using the init compute pipeline introduced earlier.
  • The other set of (U, V) buffers must also be initialized, because we use a zero-padding scheme to handle simulation domain edges. We do this by filling these buffers entirely with zeroes, which is overkill but likely to be well-optimized by the Vulkan implementation.
  • Both of these operations are asynchronous Vulkan commands, so using Concentrations becomes a bit more complex: it now builds a command buffer that must be submitted to the GPU and executed. We acknowledge this by switching from the standard new() constructor name to the more elaborate create_and_schedule_init(), which highlights what the user of this function needs to do.

When it comes to accessors, shape() will be dropped as it cannot be easily provided by our 1D GPU storage without keeping otherwise unnecessary metadata around. But the current() accessor will trivially be migrated to the new logic. And for reasons that will become clear later on, it can also become a private implementation detail of the underlying data module.

impl Concentrations {
    // [ ... ]

    /// Current input/output configuration
    fn current_inout(&self) -> &InOut {
        &self.inout_sets[self.reversed as usize]
    }

    // [ ... ]
}

The update() operation is also easily migrated to the new logic discussed above, as it is largely a simplification of its former implementation:

impl Concentrations {
    // [ ... ]

    /// Run a simulation step
    ///
    /// The `step` callback will be provided with the descriptor set that should
    /// be used for the next simulation step. If you need to carry out multiple
    /// simulation steps, you should call `update()` once per simulation step.
    pub fn update(&mut self, step: impl FnOnce(Arc<DescriptorSet>) -> Result<()>) -> Result<()> {
        step(self.current_inout().descriptor_set.clone())?;
        self.reversed = !self.reversed;
        Ok(())
    }
}

There is just one new thing that we will need for GPU computing, namely the ability to report errors from the GPU API calls performed inside of the callback. This is handled by making the inner step callback return a Result<()>.
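For illustration, here is roughly what a caller will be able to do with this API (a hedged sketch, assuming that cmdbuild, pipelines and workgroups are in scope as in the constructor above, and that the simulation pipeline has already been bound; the real calling code will be written in the next chapter):

concentrations.update(|descriptor_set| {
    cmdbuild.bind_descriptor_sets(
        PipelineBindPoint::Compute,
        pipelines.layout.clone(),
        INOUT_SET,
        descriptor_set,
    )?;
    // SAFETY: same execution configuration argument as in
    //         create_and_schedule_init()
    unsafe {
        cmdbuild.dispatch(workgroups)?;
    }
    Ok(())
})?;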

Output retrieval & storage

While InOut and Concentrations are enough for the purpose of setting up the simulation and running simulation steps, we are going to need one more thing for the purpose of retrieving GPU output on the CPU side: namely, a Vulkan buffer that the CPU can access.

We could adapt the old UV struct for this purpose, but if you pay attention to how the simulation output is actually used, you will notice that the io module only writes the V species’ concentration to the HDF5 file. And while passing an entire UV struct to this module was fine when direct data access was possible, it becomes wasteful if we need to perform an expensive GPU-to-CPU transfer of the full (U, V) dataset only to use the V part later on.

Therefore, our new VBuffer abstraction will focus on retrieval of the V species’ concentration only.

The construction code is quite similar to that of InOut::allocate_buffers() (and in fact could be deduplicated from it in a more production-grade codebase). The only things that changed are the BufferUsage and AllocationCreateInfo, which have been adjusted to make this buffer fit for downloading data to the CPU:

/// CPU-accessible storage buffer used to download the V species' concentration
pub struct VBuffer {
    /// Buffer in which GPU data will be downloaded
    buffer: Subbuffer<[Float]>,

    /// Number of columns in the 2D concentration table, including zero padding
    padded_cols: usize,
}

impl VBuffer {
    /// Set up a `VBuffer`
    pub fn new(options: &RunnerOptions, context: &Context) -> Result<Self> {
        use vulkano::memory::allocator::MemoryTypeFilter as MTFilter;
        let padded_rows = padded(options.num_rows);
        let padded_cols = padded(options.num_cols);
        let buffer = Buffer::new_slice(
            context.mem_allocator.clone(),
            BufferCreateInfo {
                usage: BufferUsage::TRANSFER_DST,
                ..Default::default()
            },
            AllocationCreateInfo {
                memory_type_filter: MTFilter::PREFER_HOST | MTFilter::HOST_RANDOM_ACCESS,
                ..Default::default()
            },
            (padded_rows * padded_cols) as DeviceSize,
        )?;
        Ok(Self {
            buffer,
            padded_cols,
        })
    }

    // [ ... more methods coming up ... ]
}

After that, we can add a method to schedule a GPU-to-CPU data transfer…

use vulkano::command_buffer::CopyBufferInfo;

impl VBuffer {
    // [ ... ]

    /// Schedule a download of some [`Concentrations`]' current V input into
    /// the internal CPU-accessible buffer of this `VBuffer`
    ///
    /// The GPU-to-CPU download will only begin after the command buffer
    /// associated with `cmdbuild` has been built and submitted to the GPU for
    /// execution. You must wait for the associated GPU work to complete before
    /// processing the output with the [`process()`](Self::process) method.
    pub fn schedule_download(
        &mut self,
        source: &Concentrations,
        cmdbuild: &mut CommandBufferBuilder,
    ) -> Result<()> {
        cmdbuild.copy_buffer(CopyBufferInfo::buffers(
            source.current_inout().input_v.clone(),
            self.buffer.clone(),
        ))?;
        Ok(())
    }

    // [ ... ]
}

…and there is just one last piece to take care of, which is to provide a way to access the inner data after the download is complete. This will require a bit more work than you may expect.

To set the stage, let’s point out that we are trying to set up communication between two Rust libraries with the following API designs.

  • To avoid data races between the CPU and the GPU, vulkano enforces an RAII design where accesses to a Subbuffer must go through the Subbuffer::read() method. This method returns a BufferReadGuard that borrows from the underlying Subbuffer and lets vulkano know at destruction time that it is not being accessed by the CPU anymore. Under the hood, locks and checks are then used to achieve thread safety.
  • We start from this BufferReadGuard, which borrows memory from the underlying Subbuffer storage like a standard Rust slice of type &[Float] could borrow from a Vec<Float>. And we want to add 2D layout information in order to turn it into an ndarray::ArrayView2<Float>, which is what the HDF5 binding that we are using ultimately expects.

Now, because the VBuffer type that we are building is logically a 2D array, it would be good API design from our side to refrain from exposing the underlying 1D vulkano dataset in the VBuffer API and instead only provide users with the ArrayView2 that they need for HDF5 I/O and other operations. While we are at it, we would also rather not expose the zero padding elements to the user, as they won’t be part of the final HDF5 file and are arguably an implementation detail of our current Gray-Scott simulation implementation.

We can get all of those good things, as it turns out, but the simplest way for us to get there² will be a somewhat weird callback-based interface:

use ndarray::prelude::*;

impl VBuffer {
    // [ ... ]

    /// Process the latest download of the V species' concentrations
    ///
    /// Before calling this method, you will want to [schedule a
    /// download](Self::schedule_download), submit the resulting command buffer,
    /// and await its completion.
    ///
    /// The provided V species concentration table will only contain active
    /// elements, excluding zero-padding elements on the edge.
    pub fn process(&self, callback: impl FnOnce(ArrayView2<Float>) -> Result<()>) -> Result<()> {
        // Access the underlying dataset as a 1D slice
        let read_guard = self.buffer.read()?;

        // Create an ArrayView2 that covers the whole data, padding included
        let padded_cols = self.padded_cols;
        let padded_elements = read_guard.len();
        assert_eq!(padded_elements % padded_cols, 0);
        let padded_rows = padded_elements / padded_cols;
        let padded_view = ArrayView::from_shape([padded_rows, padded_cols], &read_guard)?;

        // Extract the central region of padded_view, excluding padding
        let data_view = padded_view.slice(s!(
            PADDING_PER_SIDE..(padded_rows - PADDING_PER_SIDE),
            PADDING_PER_SIDE..(padded_cols - PADDING_PER_SIDE),
        ));

        // We are now ready to run the user callback
        callback(data_view)
    }
}

The general idea here is that a user who wants to read the contents of the buffer will pass us a function (typically a lambda) that takes the current contents of the buffer, as an un-padded ArrayView2, and returns a Result<()> indicating whether the operation succeeded.

On our side, we will then do everything needed to set up the two-dimensional array view, pass it to the user-specified callback function, and return the result.
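Putting the pieces together, a typical retrieval sequence would look like the following sketch. Command buffer submission is only covered in the next chapter, so submit_and_wait() is a hypothetical placeholder here, and in the real simulation the callback will forward the data to the HDF5 I/O layer discussed next:

v_buffer.schedule_download(&concentrations, &mut cmdbuild)?;
submit_and_wait(cmdbuild)?; // Hypothetical: build, submit, await completion
v_buffer.process(|v| {
    // `v` is an un-padded ArrayView2<Float> covering the simulation domain
    println!("V at domain center: {}", v[[v.nrows() / 2, v.ncols() / 2]]);
    Ok(())
})?;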

HDF5 I/O refactor

As mentioned earlier, one last thing that should change with respect to our former CPU code is that our HDF5 I/O module should be clearer about what data it actually needs.

Indeed, at present, HDF5Writer::write() demands a full set of (U, V) data, of which it only uses the V concentration. This was fine from a CPU programming perspective, where we do not pay for exposing unused data access opportunities. But from a GPU programming perspective, it means downloading U concentration data that the HDF5 I/O module is not going to use.

We will fix this by making the HDF5Writer more explicit about what it needs, having it take the V species concentration only.

// In exercises/src/grayscott/io.rs

use ndarray::ArrayView2;

impl HDF5Writer {
    // [ ... ]

    /// Write a new V species concentration table to the file
    pub fn write(&mut self, v: ArrayView2<Float>) -> hdf5::Result<()> {
        // FIXME: Workaround for an HDF5 binding limitation
        let v = v.to_owned();
        self.dataset.write_slice(&v, (self.position, .., ..))?;
        self.position += 1;
        Ok(())
    }

    // [ ... ]
}

Notice the FIXME above. Apparently, the Rust HDF5 binding that we are using does not yet handle ArrayView2s whose rows are not contiguous in memory, which means that we must create a contiguous copy of v before the binding will write it to a file.

From the author’s understanding of the HDF5 C API, it can handle this, so this is a Rust binding limitation that should eventually be fixed. But until that happens, making an owned contiguous copy should be a reasonably efficient workaround, as in-RAM copies are much faster than writes to typical storage devices.
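As a side note, if the unconditional copy ever proved costly, ndarray’s as_standard_layout() might provide a slightly cheaper variant of the same workaround, since it only copies when the data is not already contiguous. This is an untested sketch rather than a recommendation:

// Untested variant of the workaround: only copy when `v` is not already
// in standard (contiguous, row-major) memory layout
let v = v.as_standard_layout();
self.dataset.write_slice(&v, (self.position, .., ..))?;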

As an alternative, we could also modify our GPU-to-CPU copy logic so that it does not copy the padding zero elements, saving a bit of CPU-GPU interconnect bandwidth along the way. However, this would require us to stop using standard Vulkan copy commands and use custom shaders for this purpose instead, which would in turn cause two issues:

  • Performance may be worse, because the standard Vulkan copy command should have been well-optimized by the GPU vendor. Our shader would need to be optimized similarly for all GPU devices on which we want to perform well, which is a lot of work.
  • We would very likely lose the ability to overlap GPU-to-CPU copies with computations, which we are not using yet but may want to use later as an optimization.

As always, tradeoffs are the name of the game in engineering… but as you will see later, this particular tradeoff is going to disappear once we introduce other optimizations anyway.

Exercise

In the data module of the Gray-Scott reaction simulation (exercises/src/grayscott/data.rs), replace the UV and Concentrations structs with the InOut, Concentrations and VBuffer types introduced in this chapter.

After that is done, proceed to modify the io module of the simulation so that it works with borrowed V concentration data only, as discussed above.

You will find that the simulation does not compile at this point. This is expected, because the run_simulation() and update() functions of the simulation library have not been updated yet, and the CommandBufferBuilder type alias has not been defined yet either. We will fix that in the next chapter; for now, just make sure that there is no compilation error originating from a mistake in data.rs or io.rs.


  1. That is, without losing the benefits of GLSL’s readonly and writeonly qualifiers or introducing new Vulkan concepts like push constants.

  2. It is possible to write a callback-free read() method that returns an object that behaves like an ArrayView2, but implementing it efficiently (without recreating the ArrayView2 on every access) involves building a type that is self-referential in the eyes of the Rust compiler’s lifetime analysis, which means that some dirty unsafe tricks would be required.