Data & I/O

As before, after setting up our GPU compute pipelines, we will want to set up some data buffers that we can bind to those pipelines.

This process will be quite a bit simpler than before because we will not repeat the introduction to Vulkan memory management and will be using GPU-side initialization. So we will use the resulting savings in character budget to…

  • Show what it takes to integrate GPU data into our existing CPU simulation skeleton.
  • Follow the suggestion made in the number-squaring chapter to avoid having a single build_command_buffer() god-function that does all command buffer building.
  • Adjust our HDF5 I/O logic so that we do not need to download U concentration from the GPU.

GPU dataset

New code organization

Vulkan descriptor sets group resource bindings together so that many resources can be bound in a single call; the fewer descriptor sets an application uses, the lower its resource binding overhead. In the context of our Gray-Scott simulation, the lowest count that we can easily¹ achieve is two descriptor sets:

  • One that uses two buffers (let’s call them U1 and V1) as inputs and two other buffers (let’s call them U2 and V2) as outputs.
  • Another that uses the same buffers, but flips the roles of input and output buffers. Using the above notation, U2 and V2 become the inputs, while U1 and V1 become the outputs.

Given this descriptor set usage scheme, our command buffers will alternately bind these two descriptor sets, executing the simulation compute pipeline after each descriptor set binding call. This will roughly replicate the double buffering pattern that we used on the CPU.
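In vulkano terms, the recorded commands will follow a pattern along these lines (a rough sketch; names like pipelines.simulate, forward_set, reverse_set, workgroups and num_steps are placeholders for illustration, and the real code will be introduced step by step below and in the next chapter):

cmdbuild.bind_pipeline_compute(pipelines.simulate.clone())?;
for step in 0..num_steps {
    // Even steps read (U1, V1) and write (U2, V2), odd steps do the opposite
    let set = if step % 2 == 0 {
        forward_set.clone()
    } else {
        reverse_set.clone()
    };
    cmdbuild.bind_descriptor_sets(
        PipelineBindPoint::Compute,
        pipelines.layout.clone(),
        INOUT_SET,
        set,
    )?;
    // SAFETY: correct execution configuration, as discussed later on
    unsafe {
        cmdbuild.dispatch(workgroups)?;
    }
}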

To get there, however, we will need to redesign our inner data abstractions a bit with respect to what we used to have on the CPU side. Indeed, back in the CPU course, we used to have the following separation of concerns in our code:

  • One struct called UV would represent a pair of tables of identical size and related contents, one representing the chemical concentration of species U and one representing the chemical concentration of species V.
  • Another struct called Concentrations would represent a pair of UV structs and implement the double buffering logic for alternately using one of these UV structs to store inputs, and the other to store outputs.

But now that we have descriptor sets that combine inputs and outputs, this program decomposition scheme doesn’t work anymore, which is why we will have to switch to a different one:

  • One struct called InOut will contain and manage all vulkano objects associated with one (U, V) input pair and one (U, V) output pair.
  • struct Concentrations will remain around, but be repurposed to manipulate pairs of InOut rather than pairs of UV. And users of its update() function will now only be exposed to a single DescriptorSet, instead of being exposed to a pair of UVs as in the CPU code.

Introducing InOut

Our new InOut data structure is going to look like this:

use std::sync::Arc;
use vulkano::{buffer::subbuffer::Subbuffer, descriptor_set::DescriptorSet};

/// Set of GPU inputs and outputs
struct InOut {
    /// Descriptor set used by GPU compute pipelines
    descriptor_set: Arc<DescriptorSet>,

    /// Input buffer for the V species, used during GPU-to-CPU data transfers
    input_v: Subbuffer<[Float]>,
}

As the comments point out, we are going to keep both a full input/output descriptor set and a V input buffer around, because they are useful for different tasks:

  • Compute pipeline execution commands operate over descriptor sets.
  • Buffer-to-buffer data transfer commands operate over the underlying Subbuffer objects.
  • Because descriptor sets are a very general-purpose abstraction, going from a descriptor set to the underlying buffer objects is a rather cumbersome process.
  • And because Subbuffer is just a reference-counted pointer, it does not cost much performance to skip that cumbersome process by keeping a V buffer reference around.

Notice that we do not keep around the Subbuffer associated with the U species’ concentration, because we do not actually need it. We will get back to this.


For now, let us look at how an InOut is constructed:

use super::{
    options::RunnerOptions,
    pipeline::{IN, INOUT_SET, OUT},
};
use crate::{context::Context, Result};
use vulkano::{
    buffer::{Buffer, BufferCreateInfo, BufferUsage},
    descriptor_set::WriteDescriptorSet,
    memory::allocator::AllocationCreateInfo,
    pipeline::layout::PipelineLayout,
    DeviceSize,
};

/// Number of padding elements per side of the simulation domain
const PADDING_PER_SIDE: usize = 1;

/// Compute the padded version of a simulation dataset dimension (rows/cols)
fn padded(dimension: usize) -> usize {
    dimension + 2 * PADDING_PER_SIDE
}
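
/// Hypothetical helper for illustration, not part of the chapter's code:
/// 1D index of un-padded domain position (row, col) within the padded
/// storage of a domain that has `num_cols` data columns
fn padded_index(row: usize, col: usize, num_cols: usize) -> usize {
    (row + PADDING_PER_SIDE) * padded(num_cols) + (col + PADDING_PER_SIDE)
}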

impl InOut {
    /// Allocate a set of 4 buffers that can be used to store either U or V
    /// species concentrations, and can serve as inputs or outputs
    fn allocate_buffers(
        options: &RunnerOptions,
        context: &Context,
    ) -> Result<[Subbuffer<[Float]>; 4]> {
        use BufferUsage as BU;
        let padded_rows = padded(options.num_rows);
        let padded_cols = padded(options.num_cols);
        let new_buffer = || {
            Buffer::new_slice(
                context.mem_allocator.clone(),
                BufferCreateInfo {
                    usage: BU::STORAGE_BUFFER | BU::TRANSFER_DST | BU::TRANSFER_SRC,
                    ..Default::default()
                },
                AllocationCreateInfo::default(),
                (padded_rows * padded_cols) as DeviceSize,
            )
        };
        Ok([new_buffer()?, new_buffer()?, new_buffer()?, new_buffer()?])
    }

    /// Set up an `InOut` configuration by assigning roles to the 4 buffers that
    /// [`allocate_buffers()`](Self::allocate_buffers) previously allocated
    fn new(
        context: &Context,
        layout: &PipelineLayout,
        input_u: Subbuffer<[Float]>,
        input_v: Subbuffer<[Float]>,
        output_u: Subbuffer<[Float]>,
        output_v: Subbuffer<[Float]>,
    ) -> Result<Self> {
        // Determine how the descriptor set will bind to the compute pipeline
        let set_layout = layout.set_layouts()[INOUT_SET as usize].clone();

        // Configure what resources will attach to the various bindings
        // that the descriptor set is composed of
        let descriptor_writes = [
            WriteDescriptorSet::buffer_array(IN, 0, [input_u.clone(), input_v.clone()]),
            WriteDescriptorSet::buffer_array(OUT, 0, [output_u.clone(), output_v.clone()]),
        ];

        // Set up the descriptor set according to the above configuration
        let descriptor_set = DescriptorSet::new(
            context.desc_allocator.clone(),
            set_layout,
            descriptor_writes,
            [],
        )?;

        // Also keep track of the V input buffer, and we're done
        Ok(Self {
            descriptor_set,
            input_v,
        })
    }
}

The general idea here is that because our two InOuts will refer to the same buffers, we cannot allocate the buffers internally, inside of the InOut::new() constructor. Instead, we will need to allocate them in the Concentrations code that builds the InOuts, then pass the same buffers twice in a different order to build the two different InOuts.

From an abstraction design perspective, it is not ideal that the caller needs to know the right order in which buffers should be passed. But sadly, this cannot be cleanly fixed at the InOut layer, so we will fix it at the Concentrations layer instead.

Updating Concentrations

In the CPU simulation, the Concentrations struct used to…

  • Contain a pair of UV values and a boolean that clarified their input/output role
  • Offload most initialization work to the lower UV layer
  • Expose an update() method whose user callback received both an immutable input (&UV) and a mutable output (&mut UV)

For the GPU simulation, as discussed earlier, we will switch to a different architecture:

  • Concentrations will now contain InOuts instead of UVs
  • InOut initialization will now be handled by the Concentrations layer, as it is the one that has easy access to the output buffers of each InOut
  • Initialization will now be asynchronous, as it entails some Vulkan commands that must be enqueued inside of a command buffer
  • The update() method will only receive a single DescriptorSet, as this contains all info needed to read inputs and write outputs

The switch to InOut is straightforward enough, and probably not worth discussing…

/// Double-buffered chemical species concentration storage
pub struct Concentrations {
    /// Compute pipeline input/output configurations
    ///
    /// If we denote `(U1, V1, U2, V2)` the underlying storage buffers...
    /// - The first "forward" configuration uses `(U1, V1)` as inputs and
    ///   `(U2, V2)` as outputs.
    /// - The second "reverse" configuration uses `(U2, V2)` as inputs and
    ///   `(U1, V1)` as outputs.
    ///
    /// By alternating between these two configurations, we can take as many
    /// simulation steps as we need to, always using the output of the
    /// simulation step N as the input of simulation step N+1.
    inout_sets: [InOut; 2],

    /// Truth that the second "reverse" input/output configuration is active
    reversed: bool,
}

…however the constructor change will be quite a bit more substantial:

use super::{
    pipeline::Pipelines,
    CommandBufferBuilder,
};
use std::num::NonZeroU32;
use vulkano::pipeline::PipelineBindPoint;

impl Concentrations {
    /// Set up GPU data storage and schedule GPU buffer initialization
    ///
    /// GPU buffers will only be initialized after the command buffer associated
    /// with `cmdbuild` has been built and submitted for execution. Any work
    /// that depends on their initial value must be scheduled afterwards.
    pub fn create_and_schedule_init(
        options: &RunnerOptions,
        context: &Context,
        pipelines: &Pipelines,
        cmdbuild: &mut CommandBufferBuilder,
    ) -> Result<Self> {
        // Allocate all GPU storage buffers used by the simulation
        let [u1, v1, u2, v2] = InOut::allocate_buffers(options, context)?;

        // Set up input/output configurations
        let inout1 = InOut::new(
            context,
            &pipelines.layout,
            u1.clone(),
            v1.clone(),
            u2.clone(),
            v2.clone(),
        )?;
        let inout2 = InOut::new(context, &pipelines.layout, u2.clone(), v2.clone(), u1, v1)?;

        // Schedule initialization using the second descriptor set. The output
        // buffers of this descriptor set are the input buffers of the first
        // descriptor set, which will be used first.
        cmdbuild.bind_pipeline_compute(pipelines.init.clone())?;
        cmdbuild.bind_descriptor_sets(
            PipelineBindPoint::Compute,
            pipelines.layout.clone(),
            INOUT_SET,
            inout2.descriptor_set.clone(),
        )?;
        let num_workgroups = |domain_size: usize, workgroup_size: NonZeroU32| {
            padded(domain_size).div_ceil(workgroup_size.get() as usize) as u32
        };
        let padded_workgroups = [
            num_workgroups(options.num_cols, options.pipeline.workgroup_cols),
            num_workgroups(options.num_rows, options.pipeline.workgroup_rows),
            1,
        ];
        // SAFETY: GPU shader has been checked for absence of undefined behavior
        //         given a correct execution configuration, and this is one
        unsafe {
            cmdbuild.dispatch(padded_workgroups)?;
        }

        // Schedule zero-initialization of the edges of the first output.
        //
        // Only the edges need to be initialized. The values at the center of
        // the dataset do not matter, as these buffers will serve as simulation
        // outputs at least once (which will initialize their central values)
        // before they serve as a simulation input.
        //
        // Here we initialize the entire buffer to zero, as the Vulkan
        // implementation is likely to special-case this buffer-zeroing
        // operation with a high-performance implementation.
        cmdbuild.fill_buffer(u2.reinterpret(), 0)?;
        cmdbuild.fill_buffer(v2.reinterpret(), 0)?;

        // Once the command buffer is executed, everything will be ready
        Ok(Self {
            inout_sets: [inout1, inout2],
            reversed: false,
        })
    }

    // [ ... more methods coming up ... ]
}

As you can see, it is now the Concentrations constructor that is responsible for allocating the storage buffers underlying the InOut structs and for assigning input and output roles to them.

The initialization process also becomes a bit more complex:

  • The true simulation input is initialized using the init compute pipeline introduced earlier.
  • The other set of (U, V) buffers must also be initialized, because we use a zero-padding scheme to handle simulation domain edges. We do this by filling these buffers entirely with zeroes, which is overkill but likely to be well-optimized by the Vulkan implementation.
  • Both of these operations are asynchronous Vulkan commands, so using Concentrations becomes a bit more complex: it now builds a command buffer that must be submitted to the GPU and executed. We acknowledge this by switching from the standard new() constructor name to the more elaborate create_and_schedule_init(), which highlights what the user of this function needs to do.

When it comes to accessors, shape() will be dropped as it cannot be easily provided by our 1D GPU storage without keeping otherwise unnecessary metadata around. But the current() accessor will trivially be migrated to the new logic. And for reasons that will become clear later on, it can also become a private implementation detail of the underlying data module.

impl Concentrations {
    // [ ... ]

    /// Current input/output configuration
    fn current_inout(&self) -> &InOut {
        &self.inout_sets[self.reversed as usize]
    }

    // [ ... ]
}

The update() operation is also easily migrated to the new logic discussed above, as it is largely a simplification of its former implementation:

impl Concentrations {
    // [ ... ]

    /// Run a simulation step
    ///
    /// The `step` callback will be provided with the descriptor set that should
    /// be used for the next simulation step. If you need to carry out multiple
    /// simulation steps, you should call `update()` once per simulation step.
    pub fn update(&mut self, step: impl FnOnce(Arc<DescriptorSet>) -> Result<()>) -> Result<()> {
        step(self.current_inout().descriptor_set.clone())?;
        self.reversed = !self.reversed;
        Ok(())
    }
}

There is just one new thing that we will need for GPU computing, namely the ability to report errors from the GPU API calls performed inside of the callback. This is handled by making the inner step callback return a Result<()>.
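For illustration, here is roughly what a caller will be able to do with this API (a hedged sketch, assuming that cmdbuild, pipelines and workgroups are in scope as in the constructor above, and that the simulation pipeline has already been bound; the real calling code will be written in the next chapter):

concentrations.update(|descriptor_set| {
    cmdbuild.bind_descriptor_sets(
        PipelineBindPoint::Compute,
        pipelines.layout.clone(),
        INOUT_SET,
        descriptor_set,
    )?;
    // SAFETY: same execution configuration argument as in
    //         create_and_schedule_init()
    unsafe {
        cmdbuild.dispatch(workgroups)?;
    }
    Ok(())
})?;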

Output retrieval & storage

While InOut and Concentrations are enough for the purpose of setting up the simulation and running simulation steps, we are going to need one more thing for the purpose of retrieving GPU output on the CPU side: namely, a Vulkan buffer that the CPU can access.

We could adapt the old UV struct for this purpose, but if you pay attention to how the simulation output is actually used, you will notice that the io module only writes the V species’ concentration to the HDF5 file. And while passing an entire UV struct to this module was fine when direct data access was possible, it becomes wasteful if we need to perform an expensive GPU-to-CPU transfer of the full (U, V) dataset only to use the V part later on.

Therefore, our new VBuffer abstraction will focus on retrieval of the V species’ concentration only.

The construction code is quite similar to that of InOut::allocate_buffers() (and in fact could be deduplicated from it in a more production-grade codebase). The only things that changed are the BufferUsage and AllocationCreateInfo, which have been adjusted to make this buffer fit for downloading data to the CPU:

/// CPU-accessible storage buffer used to download the V species' concentration
pub struct VBuffer {
    /// Buffer in which GPU data will be downloaded
    buffer: Subbuffer<[Float]>,

    /// Number of columns in the 2D concentration table, including zero padding
    padded_cols: usize,
}

impl VBuffer {
    /// Set up a `VBuffer`
    pub fn new(options: &RunnerOptions, context: &Context) -> Result<Self> {
        use vulkano::memory::allocator::MemoryTypeFilter as MTFilter;
        let padded_rows = padded(options.num_rows);
        let padded_cols = padded(options.num_cols);
        let buffer = Buffer::new_slice(
            context.mem_allocator.clone(),
            BufferCreateInfo {
                usage: BufferUsage::TRANSFER_DST,
                ..Default::default()
            },
            AllocationCreateInfo {
                memory_type_filter: MTFilter::PREFER_HOST | MTFilter::HOST_RANDOM_ACCESS,
                ..Default::default()
            },
            (padded_rows * padded_cols) as DeviceSize,
        )?;
        Ok(Self {
            buffer,
            padded_cols,
        })
    }

    // [ ... more methods coming up ... ]
}

After that, we can add a method to schedule a GPU-to-CPU data transfer…

use vulkano::command_buffer::CopyBufferInfo;

impl VBuffer {
    // [ ... ]

    /// Schedule a download of some [`Concentrations`]' current V input into
    /// the internal CPU-accessible buffer of this `VBuffer`
    ///
    /// The GPU-to-CPU download will only begin after the command buffer
    /// associated with `cmdbuild` has been built and submitted to the GPU for
    /// execution. You must wait for the associated GPU work to complete before
    /// processing the output with the [`process()`](Self::process) method.
    pub fn schedule_download(
        &mut self,
        source: &Concentrations,
        cmdbuild: &mut CommandBufferBuilder,
    ) -> Result<()> {
        cmdbuild.copy_buffer(CopyBufferInfo::buffers(
            source.current_inout().input_v.clone(),
            self.buffer.clone(),
        ))?;
        Ok(())
    }

    // [ ... ]
}

…and there is just one last piece to take care of, which is to provide a way to access the inner data after the download is complete. This will require a bit more work than you may expect.

To set the stage, let’s point out that we are trying to set up communication between two Rust libraries with the following API designs.

  • To avoid data races between the CPU and the GPU, vulkano enforces an RAII design where accesses to a Subbuffer must go through the Subbuffer::read() method. This method returns a BufferReadGuard that borrows from the underlying Subbuffer and lets vulkano know at destruction time that it is not being accessed by the CPU anymore. Under the hood, locks and checks are then used to achieve thread safety.
  • We start from this BufferReadGuard, which borrows memory from the underlying Subbuffer storage like a standard Rust slice of type &[Float] could borrow from a Vec<Float>. And we want to add 2D layout information in order to turn it into an ndarray::ArrayView2<Float>, which is what the HDF5 binding that we are using ultimately expects.

Now, because the VBuffer type that we are building is logically a 2D array, it would be good API design from our side to refrain from exposing the underlying 1D vulkano dataset in the VBuffer API and instead only provide users with the ArrayView2 that they need for HDF5 I/O and other operations. While we are at it, we would also rather not expose the zero padding elements to the user, as they won’t be part of the final HDF5 file and are arguably an implementation detail of our current Gray-Scott simulation implementation.

We can get all of those good things, as it turns out, but the simplest way for us to get there² will be a somewhat weird callback-based interface:

use ndarray::prelude::*;

impl VBuffer {
    // [ ... ]

    /// Process the latest download of the V species' concentrations
    ///
    /// Before calling this method, you will want to [schedule a
    /// download](Self::schedule_download), submit the resulting command buffer,
    /// and await its completion.
    ///
    /// The provided V species concentration table will only contain active
    /// elements, excluding zero-padding elements on the edge.
    pub fn process(&self, callback: impl FnOnce(ArrayView2<Float>) -> Result<()>) -> Result<()> {
        // Access the underlying dataset as a 1D slice
        let read_guard = self.buffer.read()?;

        // Create an ArrayView2 that covers the whole data, padding included
        let padded_cols = self.padded_cols;
        let padded_elements = read_guard.len();
        assert_eq!(padded_elements % padded_cols, 0);
        let padded_rows = padded_elements / padded_cols;
        let padded_view = ArrayView::from_shape([padded_rows, padded_cols], &read_guard)?;

        // Extract the central region of padded_view, excluding padding
        let data_view = padded_view.slice(s!(
            PADDING_PER_SIDE..(padded_rows - PADDING_PER_SIDE),
            PADDING_PER_SIDE..(padded_cols - PADDING_PER_SIDE),
        ));

        // We are now ready to run the user callback
        callback(data_view)
    }
}

The general idea here is that a user who wants to read the contents of the buffer will pass us a function (typically a lambda) that takes the current contents of the buffer, as an un-padded ArrayView2, and returns a Result<()> indicating whether the operation succeeded.

On our side, we will then do everything needed to set up the two-dimensional array view, pass it to the user-specified callback function, and return the result.
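Putting the pieces together, a typical retrieval sequence would look like the following sketch. Command buffer submission is only covered in the next chapter, so submit_and_wait() is a hypothetical placeholder here, and in the real simulation the callback will forward the data to the HDF5 I/O layer discussed next:

v_buffer.schedule_download(&concentrations, &mut cmdbuild)?;
submit_and_wait(cmdbuild)?; // Hypothetical: build, submit, await completion
v_buffer.process(|v| {
    // `v` is an un-padded ArrayView2<Float> covering the simulation domain
    println!("V at domain center: {}", v[[v.nrows() / 2, v.ncols() / 2]]);
    Ok(())
})?;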

HDF5 I/O refactor

As mentioned earlier, one last thing that should change with respect to our former CPU code is that our HDF5 I/O module should be clearer about what data it actually needs.

Indeed, at present, HDF5Writer::write() demands a full set of (U, V) data, of which it only uses the V concentration. This was fine from a CPU programming perspective, where we do not pay for exposing unused data access opportunities. But from a GPU programming perspective, it means downloading U concentration data that the HDF5 I/O module is not going to use.

We will fix this by making the HDF5Writer more explicit about what it needs, having it take the V species concentration only.

// In exercises/src/grayscott/io.rs

use ndarray::ArrayView2;

impl HDF5Writer {
    // [ ... ]

    /// Write a new V species concentration table to the file
    pub fn write(&mut self, v: ArrayView2<Float>) -> hdf5::Result<()> {
        // FIXME: Workaround for an HDF5 binding limitation
        let v = v.to_owned();
        self.dataset.write_slice(&v, (self.position, .., ..))?;
        self.position += 1;
        Ok(())
    }

    // [ ... ]
}

Notice the FIXME above. Apparently, the Rust HDF5 binding that we are using does not yet handle ArrayView2s whose rows are not contiguous in memory, which means that we must create a contiguous copy of v before the binding will write it to a file.

From the author’s understanding of the HDF5 C API, it can handle this, so this is a Rust binding limitation that should eventually be fixed. But until that happens, making an owned contiguous copy should be a reasonably efficient workaround, as in-RAM copies are much faster than writes to typical storage devices.
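As a side note, if the unconditional copy ever proved costly, ndarray’s as_standard_layout() might provide a slightly cheaper variant of the same workaround, since it only copies when the data is not already contiguous. This is an untested sketch rather than a recommendation:

// Untested variant of the workaround: only copy when `v` is not already
// in standard (contiguous, row-major) memory layout
let v = v.as_standard_layout();
self.dataset.write_slice(&v, (self.position, .., ..))?;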

As an alternative, we could also modify our GPU-to-CPU copy logic so that it does not copy the padding zero elements, saving a bit of CPU-GPU interconnect bandwidth along the way. However, this would require us to stop using standard Vulkan copy commands and use custom shaders for this purpose instead, which would in turn cause two issues:

  • Performance may be worse, because the standard Vulkan copy command should have been well-optimized by the GPU vendor. Our shader would need to be optimized similarly for all GPU devices on which we want to perform well, which is a lot of work.
  • We would very likely lose the ability to overlap GPU-to-CPU copies with computations, which we are not using yet but may want to use later as an optimization.

As always, tradeoffs are the name of the game in engineering… but as you will see later, this particular tradeoff is going to disappear once we introduce other optimizations anyway.

Exercise

In the data module of the Gray-Scott reaction simulation (exercises/src/grayscott/data.rs), replace the UV and Concentrations structs with the InOut, Concentrations and VBuffer types introduced in this chapter.

After that is done, proceed to modify the io module of the simulation so that it works with borrowed V concentration data only, as discussed above.

You will find that the simulation does not compile at this point. This is expected, because the run_simulation() and update() functions of the simulation library have not been updated yet, and the CommandBufferBuilder type alias has not been defined yet either. We will fix that in the next chapter; for now, just make sure that there is no compilation error originating from a mistake in data.rs or io.rs.


  1. That is, without losing the benefits of GLSL’s readonly and writeonly qualifiers or introducing new Vulkan concepts like push constants.

  2. It is possible to write a callback-free read() method that returns an object that behaves like an ArrayView2, but implementing it efficiently (without recreating the ArrayView2 on every access) involves building a type that is self-referential in the eyes of the Rust compiler’s lifetime analysis, which means that some dirty unsafe tricks would be required.