Data & I/O
As before, after setting up our GPU compute pipelines, we will want to set up some data buffers that we can bind to those pipelines.
This process will be quite a bit simpler than before because we will not repeat the introduction to Vulkan memory management and will be using GPU-side initialization. So we will use the resulting savings in character budget to…
- Show what it takes to integrate GPU data into our existing CPU simulation skeleton.
- Follow the suggestion made in the number-squaring chapter to avoid having a
single
build_command_buffer()god-function that does all command buffer building. - Adjust our HDF5 I/O logic so that we do not need to download U concentration from the GPU.
GPU dataset
New code organization
The point of Vulkan descriptor sets is to allow your application to use as few of them as possible in order to reduce resource binding overhead. In the context of our Gray-Scott simulation, the lowest we can easily1 achieve is to have two descriptor sets.
- One that uses two buffers (let’s call them U1 and V1) as inputs and two other buffers (let’s call them U2 and V2) as outputs.
- Another that uses the same buffers, but flips the roles of input and output buffers. Using the above notation, U2 and U2 become the inputs, while U1 and U1 become the outputs.
Given this descriptor set usage scheme, our command buffers will alternatively bind these two descriptor sets, executing the simulation compute pipeline after each descriptor set binding call, and this will roughly replicate the double buffering pattern that we used on the CPU.
To get there, however, we will need to redesign our inner data abstractions a bit with respect to what we used to have on the CPU side. Indeed, back in the CPU course, we used to have the following separation of concerns in our code:
- One
structcalledUVwould represent a pair of tables of identical size and related contents, one representing the chemical concentration of speciesUand one representing the chemical concentration of speciesV. - Another
structcalledConcentrationswould represent a pair ofUVstructs and implement the double buffering logic for alternatively using one of theseUVstructs to store inputs, and the other to store outputs.
But now that we have descriptor sets that combine inputs and outputs, this program decomposition scheme doesn’t work anymore. Which is why we will have to switch to a different scheme:
- One
structcalledInOutwill contain and manage allvulkanoobjects associated to one(U, V)input pair and one(U, V)output pair. struct Concentrationswill remain around, but be repurposed to manipulate pairs ofInOutrather than pairs ofUV. And users of itsupdate()function will now only be exposed to a singleDescriptorSet, instead of being exposed to a pair ofUVs as in the CPU code.
Introducing InOut
Our new InOut data structure is going to look like this:
use std::sync::Arc;
use vulkano::{buffer::subbuffer::Subbuffer, descriptor_set::DescriptorSet};
/// Set of GPU inputs and outputs
struct InOut {
/// Descriptor set used by GPU compute pipelines
descriptor_set: Arc<DescriptorSet>,
/// Input buffer for the V species, used during GPU-to-CPU data transfers
input_v: Subbuffer<[Float]>,
}
As the comments point out, we are going to keep both a full input/output
descriptor set and a V input buffer around, because they are useful for
different tasks:
- Compute pipeline execution commands operate over descriptor sets
- Buffer-to-buffer data transfer commands operate over the underlying
Subbufferobjects - Because descriptor sets are a very general-purpose abstraction, going from a descriptor set to the underlying buffer objects is a rather cumbersome process.
- And because
Subbufferis jute a reference-counted pointer, it does not cost much performance to skip that cumbersome process by keeping around a V buffer reference.
Notice that we do not keep around the Subbuffer associated with the U species’
concentration, because we do not actually need it. We will get back to this.
For now, let us look at how an InOut is constructed:
use super::{
options::RunnerOptions,
pipeline::{IN, INOUT_SET, OUT},
};
use crate::{context::Context, Result};
use vulkano::{
buffer::{Buffer, BufferCreateInfo, BufferUsage},
descriptor_set::WriteDescriptorSet,
memory::allocator::AllocationCreateInfo,
pipeline::layout::PipelineLayout,
DeviceSize,
};
/// Number of padding elements per side of the simulation domain
const PADDING_PER_SIDE: usize = 1;
/// Compute the padded version of a simulation dataset dimension (rows/cols)
fn padded(dimension: usize) -> usize {
dimension + 2 * PADDING_PER_SIDE
}
impl InOut {
/// Allocate a set of 4 buffers that can be used to store either U and V
/// species concentrations, and can serve as inputs or outputs
fn allocate_buffers(
options: &RunnerOptions,
context: &Context,
) -> Result<[Subbuffer<[Float]>; 4]> {
use BufferUsage as BU;
let padded_rows = padded(options.num_rows);
let padded_cols = padded(options.num_cols);
let new_buffer = || {
Buffer::new_slice(
context.mem_allocator.clone(),
BufferCreateInfo {
usage: BU::STORAGE_BUFFER | BU::TRANSFER_DST | BU::TRANSFER_SRC,
..Default::default()
},
AllocationCreateInfo::default(),
(padded_rows * padded_cols) as DeviceSize,
)
};
Ok([new_buffer()?, new_buffer()?, new_buffer()?, new_buffer()?])
}
/// Set up an `InOut` configuration by assigning roles to the 4 buffers that
/// [`allocate_buffers()`](Self::allocate_buffers) previously allocated
fn new(
context: &Context,
layout: &PipelineLayout,
input_u: Subbuffer<[Float]>,
input_v: Subbuffer<[Float]>,
output_u: Subbuffer<[Float]>,
output_v: Subbuffer<[Float]>,
) -> Result<Self> {
// Determine how the descriptor set will bind to the compute pipeline
let set_layout = layout.set_layouts()[INOUT_SET as usize].clone();
// Configure what resources will attach to the various bindings
// that the descriptor set is composed of
let descriptor_writes = [
WriteDescriptorSet::buffer_array(IN, 0, [input_u.clone(), input_v.clone()]),
WriteDescriptorSet::buffer_array(OUT, 0, [output_u.clone(), output_v.clone()]),
];
// Set up the descriptor set according to the above configuration
let descriptor_set = DescriptorSet::new(
context.desc_allocator.clone(),
set_layout,
descriptor_writes,
[],
)?;
// Also keep track of the V input buffer, and we're done
Ok(Self {
descriptor_set,
input_v,
})
}
}
The general idea here is that because our two InOuts will refer to the same
buffers, we cannot allocate the buffers internally inside of the InOut::new()
constructor. Instead we will need to allocate buffers inside of the code from
Concentrations that builds InOuts, then use the same buffers twice in a
different order to build the two different InOuts.
It is not so nice from an abstraction design point of view that the caller needs
to know about such a thing as the right order in which buffers should be passed.
But sadly this cannot be cleanly fixed at the InOut layer, so we will fix it
at the Concentrations layer instead.
Updating Concentrations
In the CPU simulation, the Concentrations struct used to…
- Contain a pair of
UVvalues and a boolean that clarified their input/output role - Offload most initialization work to the lower
UVlayer - Expose an
update()method whose user callback received both an immutable input (&UV) and a mutable output (&mut UV)
For the GPU simulation, as discussed earlier, we will switch to a different architecture:
Concentrationswill now containInOuts instead ofUVsInOutinitialization will now be handled by theConcentrationslayer, as it is the one that has easy access to the output buffers of eachInOut- Initialization will now be asynchronous, as it entails some Vulkan commands that must be enqueued inside of a command buffer
- The
update()method will only receive a singleDescriptorSet, as this contains all info needed to read inputs and write outputs
The switch to InOut is straightforward enough, and probably not worth
discussing…
/// Double-buffered chemical species concentration storage
pub struct Concentrations {
/// Compute pipeline input/output configurations
///
/// If we denote `(U1, V1, U2, V2)` the underlying storage buffers...
/// - The first "forward" configuration uses `(U1, V1)` as inputs and
/// `(U2, V2)` as outputs.
/// - The second "reverse" configuration uses `(U2, V2)` as inputs and
/// `(U1, V1)` as outputs.
///
/// By alternating between these two configurations, we can take as many
/// simulation steps as we need to, always using the output of the
/// simulation step N as the input of simulation step N+1.
inout_sets: [InOut; 2],
/// Truth that the second "revese" input/output configuration is active
reversed: bool,
}
…however the constructor change will be quite a bit more substantial:
use super::{
pipeline::Pipelines,
CommandBufferBuilder,
};
use std::num::NonZeroU32;
use vulkano::pipeline::PipelineBindPoint;
impl Concentrations {
/// Set up GPU data storage and schedule GPU buffer initialization
///
/// GPU buffers will only be initialized after the command buffer associated
/// with `cmdbuild` has been built and submitted for execution. Any work
/// that depends on their initial value must be scheduled afterwards.
pub fn create_and_schedule_init(
options: &RunnerOptions,
context: &Context,
pipelines: &Pipelines,
cmdbuild: &mut CommandBufferBuilder,
) -> Result<Self> {
// Allocate all GPU storage buffers used by the simulation
let [u1, v1, u2, v2] = InOut::allocate_buffers(options, context)?;
// Set up input/output configurations
let inout1 = InOut::new(
context,
&pipelines.layout,
u1.clone(),
v1.clone(),
u2.clone(),
v2.clone(),
)?;
let inout2 = InOut::new(context, &pipelines.layout, u2.clone(), v2.clone(), u1, v1)?;
// Schedule initialization using the second descriptor set. The output
// buffers of this descriptor set are the input buffers of the first
// descriptor set, which will be used first.
cmdbuild.bind_pipeline_compute(pipelines.init.clone())?;
cmdbuild.bind_descriptor_sets(
PipelineBindPoint::Compute,
pipelines.layout.clone(),
INOUT_SET,
inout2.descriptor_set.clone(),
)?;
let num_workgroups = |domain_size: usize, workgroup_size: NonZeroU32| {
padded(domain_size).div_ceil(workgroup_size.get() as usize) as u32
};
let padded_workgroups = [
num_workgroups(options.num_cols, options.pipeline.workgroup_cols),
num_workgroups(options.num_rows, options.pipeline.workgroup_rows),
1,
];
// SAFETY: GPU shader has been checked for absence of undefined behavior
// given a correct execution configuration, and this is one
unsafe {
cmdbuild.dispatch(padded_workgroups)?;
}
// Schedule zero-initialization of the edges of the first output.
//
// Only the edges need to be initialized. The values at the center of
// the dataset do not matter, as these buffers will serve as simulation
// outputs at least once (which will initialize their central values)
// before they serve as a simulation input.
//
// Here we initialize the entire buffer to zero, as the Vulkan
// implementation is likely to special-case this buffer-zeroing
// operation with a high-performance implementation.
cmdbuild.fill_buffer(u2.reinterpret(), 0)?;
cmdbuild.fill_buffer(v2.reinterpret(), 0)?;
// Once the command buffer is executed, everything will be ready
Ok(Self {
inout_sets: [inout1, inout2],
reversed: false,
})
}
// [ ... more methods coming up ... ]
}
As you can see, it is now the Concentrations constructor that is responsible
for allocating storage for the underlying InOut structs and assigning input
and output roles to it.
The initialization process also becomes a bit more complex:
- The true simulation input is initialized using the
initcompute pipeline introduced earlier. - The other set of
(U, V)buffers must now be initialized because we now use a zero-padding scheme to handle simulation domain edges. We do this by filling these buffers with zeroes, which is overkill but likely to be optimized in hardware. - Both of these operations are asynchronous Vulkan commands, so usage of the
Concentrationsis now a bit more complex as it now builds a command buffer that must be submitted to the GPU and executed. We acknowledge this by switching from the standardnew()constructor naming to a more complexcreate_and_schedule_init()name that highlights what the user of this function needs to do.
When it comes to accessors, shape() will be dropped as it cannot be easily
provided by our 1D GPU storage without keeping otherwise unnecessary metadata
around. But the current() accessor will trivially be migrated to the new
logic. And for reasons that will become clear later on, it can also become a
private implementation detail of the underlying data module.
impl Concentrations {
// [ ... ]
/// Current input/output configuration
fn current_inout(&self) -> &InOut {
&self.inout_sets[self.reversed as usize]
}
// [ ... ]
}
On its side, the update() operation will be easily migrated to the new logic
discussed above as well, as it is largely a simplification with respect to its
former implementation:
impl Concentrations {
// [ ... ]
/// Run a simulation step
///
/// The `step` callback will be provided with the descriptor set that should
/// be used for the next simulation step. If you need to carry out multiple
/// simulation steps, you should call `update()` once per simulation step.
pub fn update(&mut self, step: impl FnOnce(Arc<DescriptorSet>) -> Result<()>) -> Result<()> {
step(self.current_inout().descriptor_set.clone())?;
self.reversed = !self.reversed;
Ok(())
}
}
There is just one new thing that we will need for GPU computing, which is the
ability to report errors from GPU programs. This is handled by making the inner
step callback return a Result<()>.
Output retrieval & storage
While InOut and Concentrations are enough for the purpose of setting up the
simulation and running simulation steps, we are going to need one more thing for
the purpose of retrieving GPU output on the CPU side. Namely a Vulkan buffer
that the CPU can access.
We could adapt the old UV struct for this purpose, but if you pay attention to how
the simulation output is actually used, you will notice that the io module
only writes the V species’ concentration to the HDF5 file. And while passing
an entire UV struct to this module anyway was fine when direct data access was
possible, it is becoming wasteful if we now need to perform an expensive
GPU-to-CPU transfer of the full (U, V) dataset only to use the V part
exclusively later on.
Therefore, our new VBuffer abstraction will focus on retrieval of the V species’
concentration only.
The construction code is quite similar to the one seen before in
InOut::allocate_buffers() (and in fact could be deduplicated with respect to
it in a more production-grade codebase). The only thing that changed is that the
BufferUsage
and
AllocationCreateInfo
have been adjusted to make this buffer fit for the purpose of downloading data
to the CPU:
/// CPU-accessible storage buffer used to download the V species' concentration
pub struct VBuffer {
/// Buffer in which GPU data will be downloaded
buffer: Subbuffer<[Float]>,
/// Number of columns in the 2D concentration table, including zero padding
padded_cols: usize,
}
//
impl VBuffer {
/// Set up a `VBuffer`
pub fn new(options: &RunnerOptions, context: &Context) -> Result<Self> {
use vulkano::memory::allocator::MemoryTypeFilter as MTFilter;
let padded_rows = padded(options.num_rows);
let padded_cols = padded(options.num_cols);
let buffer = Buffer::new_slice(
context.mem_allocator.clone(),
BufferCreateInfo {
usage: BufferUsage::TRANSFER_DST,
..Default::default()
},
AllocationCreateInfo {
memory_type_filter: MTFilter::PREFER_HOST | MTFilter::HOST_RANDOM_ACCESS,
..Default::default()
},
(padded_rows * padded_cols) as DeviceSize,
)?;
Ok(Self {
buffer,
padded_cols,
})
}
// [ ... more methods coming up ... ]
}
After that, we can add a method to schedule a GPU-to-CPU data transfer…
use vulkano::command_buffer::CopyBufferInfo;
impl VBuffer {
// [ ... ]
/// Schedule a download of some [`Concentrations`]' current V input into
/// the internal CPU-accessible buffer of this `VBuffer`
///
/// The GPU-to-CPU download will only begin after the command buffer
/// associated with `cmdbuild` has been built and submitted to the GPU for
/// execution. You must wait for the associated GPU work to complete before
/// processing the output with the [`process()`](Self::process) method.
pub fn schedule_download(
&mut self,
source: &Concentrations,
cmdbuild: &mut CommandBufferBuilder,
) -> Result<()> {
cmdbuild.copy_buffer(CopyBufferInfo::buffers(
source.current_inout().input_v.clone(),
self.buffer.clone(),
))?;
Ok(())
}
// [ ... ]
}
…and there is just one last piece to take care of, which is to provide a way to access the inner data after the download is complete. Which will require a bit more work than you may expect.
To set the stage, let’s point out that we are trying to set up some communication between two Rust libraries with the following API design.
- To avoid data races between the CPU and the GPU,
vulkanoenforces an RAII design where accesses to aSubbuffermust go through theSubbuffer::read()method. This method returns aBufferReadGuardthat borrows from the underlyingSubbufferand letsvulkanoknow at destruction time that it is not being accessed by the CPU anymore. Under the hood, locks and checks are then used to achieve thread safety. - We start from this
BufferReadGuard, which borrows memory from the underlyingSubbufferstorage like a standard Rust slice of type&[Float]could borrow from aVec<Float>. And we want to add 2D layout information in order to turn it into anndarray::ArrayView2<Float>, which is what the HDF5 binding that we are using ultimately expects.
Now, because the VBuffer type that we are building is logically a 2D array, it
would be good API design from our side to refrain from exposing the underlying
1D vulkano dataset in the VBuffer API and instead only provide users with the
ArrayView2 that they need for HDF5 I/O and other operations. While we are at
it, we would also rather not expose the zero padding elements to the user, as
they won’t be part of the final HDF5 file and are arguably an implementation
detail of our current Gray-Scott simulation implementation.
We can get all of those good things, as it turns out, but the simplest way for us to get there2 will be a somewhat weird callback-based interface:
use ndarray::prelude::*;
impl VBuffer {
// [ ... ]
/// Process the latest download of the V species' concentrations
///
/// Before calling this method, you will want to [schedule a
/// download](Self::schedule_download), submit the resulting command buffer,
/// and await its completion.
///
/// The provided V species concentration table will only contain active
/// elements, excluding zero-padding elements on the edge.
pub fn process(&self, callback: impl FnOnce(ArrayView2<Float>) -> Result<()>) -> Result<()> {
// Access the underlying dataset as a 1D slice
let read_guard = self.buffer.read()?;
// Create an ArrayView2 that covers the whole data, padding included
let padded_cols = self.padded_cols;
let padded_elements = read_guard.len();
assert_eq!(padded_elements % padded_cols, 0);
let padded_rows = padded_elements / padded_cols;
let padded_view = ArrayView::from_shape([padded_rows, padded_cols], &read_guard)?;
// Extract the central region of padded_view, excluding padding
let data_view = padded_view.slice(s!(
PADDING_PER_SIDE..(padded_rows - PADDING_PER_SIDE),
PADDING_PER_SIDE..(padded_cols - PADDING_PER_SIDE),
));
// We are now ready to run the user callback
callback(data_view)
}
}
The general idea here is that a user who wants to read the contents of the
buffer will pass us a function (typically a lambda) that takes the current
contents of the buffer (as an un-padded ArrayView2) and returns a
Result<()> that tells if the operation is successful.
On our side, we will then do everything needed to set up the two-dimensional array view, pass it to the user-specified callback function, and return the result.
HDF5 I/O refactor
As mentioned earlier, one last thing that should change with respect to our former CPU code is that we want our HDF5 I/O module to be clearer about what it wants.
Indeed, at present time, HDF5Writer::write() demands a full set of (U, V)
data of which it only uses the V concentration data. This was fine from a CPU
programming perspective where we don’t pay for exposing unused data access
opportunities. But from a GPU programming perspective it means downloading U
concentration data that the HDF5 I/O module is not going to use.
We will fix this by making the HDF5Writer more explicit about what it wants,
and having it take the V species concentration only instead.
// In exercises/src/grayscott/io.rs
use ndarray::ArrayView2;
impl HDF5Writer {
// [ ... ]
/// Write a new V species concentration table to the file
pub fn write(&mut self, v: ArrayView2<Float>) -> hdf5::Result<()> {
// FIXME: Workaround for an HDF5 binding limitation
let v = v.to_owned();
self.dataset.write_slice(&v, (self.position, .., ..))?;
self.position += 1;
Ok(())
}
// [ ... ]
}
Notice the FIXME above. Apparently, the Rust HDF5 binding we are using does not
yet handle ArrayView2s whose rows are not contiguous in memory, which means
that we must create a contiguous copy of v before it accepts to write it to a
file.
From the author’s understanding of the HDF5 C API, it can handle this, and this is a Rust binding limitation that should be fixed. But until a fix happens, making an owned contiguous copy should be a reasonably efficient workaround, as for typical storage devices in-RAM copies are much faster than writing data to the target storage device.
As an alternative, we could also modify our GPU-to-CPU copy logic so that it does not copy the padding zero elements, saving a bit of CPU-GPU interconnect bandwidth along the way. However this will require us to stop using standard Vulkan copy commands and use custom shaders for this purpose instead, which may in turn cause two issues:
- Performance may be worse, because the standard Vulkan copy command should have been well-optimized by the GPU vendor. Our shader would need to be optimized similarly for all GPU devices on which we want to perform well, which is a lot of work.
- We will very likely lose the ability to overlap GPU-to-CPU copies with computations, which we are not using yet but may want to use later as an optimization.
As always, tradeoffs are the name of the game in engineering… but as you will see later, this particular tradeoff is going to disappear once we introduce other optimizations anyway.
Exercise
In the data module of the Gray-Scott reaction simulation
(exercises/src/grayscott/data.rs), replace the UV and Concentrations
structs with the InOut, Concentrations and VBuffer types introduced in
this chapter.
After that is done, proceed to modify the io module of the simulation so that it works with borrowed V concentration data only, as discussed above.
You will find that the simulation does not compile at this point. This is
expected because the run_simulation() and update() function of the
simulation library have not been updated yet, and the CommandBufferBuilder
type alias has not been defined yet either. We will fix that in the next
chapter, for now just make sure that there is no compilation error originating
from a mistake in data.rs or io.rs.
-
Without losing the benefits of GLSL’s
readonlyandwriteonlyqualifiers and introducing new Vulkan concepts like push constants, that is. ↩ -
It is possible to write a callback-free
read()method that returns an object that behaves like anArrayView2, but implementing it efficiently (without recreating theArrayView2on every access) involves building a type that is self-referential in the eyes of the Rust’s compiler lifetime analysis. Which means that some dirtyunsafetricks will be required. ↩