# Data & I/O
As before, after setting up our GPU compute pipelines, we will want to set up some data buffers that we can bind to those pipelines.
This process will be quite a bit simpler than before because we will not repeat the introduction to Vulkan memory management and will be using GPU-side initialization. So we will use the resulting savings in character budget to…
- Show what it takes to integrate GPU data into our existing CPU simulation skeleton.
- Follow the suggestion made in the previous execution chapter to break down the `build_command_buffer()` god-function that we used to have into multiple functions that each add a smaller amount of work to a command buffer.
- Adjust our HDF5 I/O logic so that we no longer retrieve U concentration data from the GPU that we do not actually use.
## GPU dataset
### New code organization
The point of Vulkan descriptor sets is to let your application bind resources to pipelines in bulk, using as few binding calls as possible in order to reduce resource binding overhead. In the context of our Gray-Scott simulation, the lowest we can easily[^1] achieve is to have two descriptor sets.
- One that uses two buffers (let’s call them U1 and V1) as inputs and two other buffers (let’s call them U2 and V2) as outputs.
- Another that uses the same buffers, but flips the roles of input and output buffers. Using the above notation, U2 and V2 become the inputs, while U1 and V1 become the outputs.
Given this descriptor set usage scheme, our command buffers will alternately bind these two descriptor sets, executing the simulation compute pipeline after each descriptor set binding call. This will roughly replicate the double buffering pattern that we used on the CPU.
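To make this more concrete, here is a rough sketch of the command buffer recording pattern that we are aiming for. It is expressed in terms of abstractions that will only be introduced later in this chapter, and the `pipelines.main` field designating the simulation compute pipeline is a hypothetical name:

```rust
// Sketch only: record `num_steps` simulation steps into a command buffer,
// alternating between the two descriptor sets described above
cmdbuf.bind_pipeline_compute(pipelines.main.clone())?;
for step in 0..num_steps {
    cmdbuf.bind_descriptor_sets(
        PipelineBindPoint::Compute,
        pipelines.layout.clone(),
        DATA_SET,
        descriptor_sets[step % 2].clone(),
    )?;
    // SAFETY: same precondition as the real dispatches later in this chapter
    unsafe {
        cmdbuf.dispatch(workgroups)?;
    }
}
```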
To get there, however, we will need to redesign our inner data abstractions a bit with respect to what we used to have on the CPU side. Indeed, back in the CPU course, we used to have the following separation of concerns in our code:
- One `struct` called `UV` would represent a pair of tables of identical size and related contents, one representing the chemical concentration of species `U` and one representing the chemical concentration of species `V`.
- Another `struct` called `Concentrations` would represent a pair of `UV` structs and implement the double buffering logic for alternately using one of these `UV` structs to store inputs, and the other to store outputs.
But now that we have descriptor sets that combine inputs and outputs, this program decomposition scheme doesn’t work anymore. Which is why we will have to switch to a different scheme:
- One `struct` called `UVSet` will contain and manage all `vulkano` objects associated with one `(U, V)` input pair and one `(U, V)` output pair.
- `struct Concentrations` will remain around, but be repurposed to manipulate pairs of `UVSet`s rather than pairs of `UV`s. And users of its `update()` function will now only be exposed to a single `UVSet`, instead of being exposed to a pair of `UV`s.
### Introducing `UVSet`
Our new `UVSet` data structure is going to look like this:

```rust
use std::sync::Arc;
use vulkano::{buffer::subbuffer::Subbuffer, descriptor_set::DescriptorSet};
// Throughout this module, we will model (U, V) pairs as arrays of two values
// of identical type with the following indexing convention
const U: usize = 0;
const V: usize = 1;
/// GPU-side input and output (U, V) concentration table pairs.
pub struct UVSet {
/// Descriptor set used by GPU compute pipelines
pub descriptor_set: Arc<DescriptorSet>,
    /// Input buffers from `descriptor_set`, used during GPU-to-CPU data transfers
    input_uv: [Subbuffer<[Float]>; 2],
}
```
As the comments point out, we are going to keep both the descriptor set and the `(U, V)` buffer pair around, because they are useful for different tasks:

- Compute pipeline execution commands operate over descriptor sets.
- Buffer-to-buffer data transfer commands operate over the underlying `Subbuffer` objects.
- Because descriptor sets are a very general-purpose abstraction, going from a descriptor set to the underlying buffer objects is a rather cumbersome process.
- And because `Subbuffer` is just a reference-counted pointer, it does not cost us much to skip that cumbersome process by keeping around a direct reference to the underlying buffer.
Now, let us look at how a `UVSet` is actually set up:

```rust
use super::{
options::Options,
pipeline::{DATA_SET, IN, OUT},
};
use crate::{context::Context, Result};
use vulkano::{
buffer::{Buffer, BufferCreateInfo, BufferUsage},
descriptor_set::WriteDescriptorSet,
memory::allocator::AllocationCreateInfo,
pipeline::layout::PipelineLayout,
DeviceSize,
};
/// Number of padding elements per side of the simulation domain
const PADDING_PER_SIDE: usize = 1;
impl UVSet {
/// Allocate a set of 4 buffers that can be used to store either U and V
/// species and can serve as an input or output.
fn allocate_buffers(options: &Options, context: &Context) -> Result<Box<[Subbuffer<[Float]>]>> {
let padded_rows = options.runner.num_rows + 2 * PADDING_PER_SIDE;
let padded_cols = options.runner.num_cols + 2 * PADDING_PER_SIDE;
let buffers = std::iter::repeat_with(|| {
Buffer::new_slice(
context.mem_allocator.clone(),
BufferCreateInfo {
                    usage: BufferUsage::STORAGE_BUFFER
                        | BufferUsage::TRANSFER_DST
                        | BufferUsage::TRANSFER_SRC,
..Default::default()
},
AllocationCreateInfo::default(),
(padded_rows * padded_cols) as DeviceSize,
)
})
.take(4)
.collect::<std::result::Result<Box<[_]>, _>>()?;
Ok(buffers)
}
    /// Set up a `UVSet` by assigning roles to the 4 buffers that
/// `allocate_buffers()` previously allocated.
fn new(
context: &Context,
layout: &PipelineLayout,
in_u: Subbuffer<[Float]>,
in_v: Subbuffer<[Float]>,
out_u: Subbuffer<[Float]>,
out_v: Subbuffer<[Float]>,
) -> Result<Self> {
// Configure which pipeline descriptor set this will bind to
let set_layout = layout.set_layouts()[DATA_SET as usize].clone();
// Configure what resources will attach to the various bindings
// that this descriptor set is composed of
let descriptor_writes = [
WriteDescriptorSet::buffer_array(IN, 0, [in_u.clone(), in_v.clone()]),
WriteDescriptorSet::buffer_array(OUT, 0, [out_u.clone(), out_v.clone()]),
];
// Set up the descriptor set accordingly
let descriptor_set = DescriptorSet::new(
context.desc_allocator.clone(),
set_layout,
descriptor_writes,
[],
)?;
// Put it all together
Ok(Self {
descriptor_set,
input_uv: [in_u, in_v],
})
}
}
```
The general idea here is that because our two `UVSet`s will refer to the same buffers, we cannot allocate the buffers internally inside of the `UVSet::new()` constructor. Instead, we will need to allocate buffers inside of the `Concentrations` code that builds `UVSet`s, then use the same buffers twice in a different order to build the two different `UVSet`s.
Obviously, it is not so nice from an abstraction design point of view that the caller needs to know about such a thing as the right order in which buffers should be passed. But sadly this cannot be cleanly fixed at the `UVSet` layer, so we will fix it at the `Concentrations` layer instead.
### Updating `Concentrations`
In the CPU simulation, the `Concentrations` struct used to…

- …contain a pair of `UV` values and a boolean that clarified their input/output role,
- …offload most initialization work to the lower `UV` layer,
- …and expose an `update()` method whose user callback received both an immutable input (`&UV`) and a mutable output (`&mut UV`).
For the GPU simulation, we will change this as follows:

- `Concentrations` will now contain `UVSet`s instead of `UV`s.
- `UVSet` initialization will now be handled by the `Concentrations` layer, as it is the one that has easy access to the output buffers of each `UVSet`.
- The `update()` method will only receive a single `&UVSet`, as this contains all the info needed to read inputs and write outputs.
The switch to `UVSet` is straightforward enough, and probably not worth discussing…

```rust
pub struct Concentrations {
sets: [UVSet; 2],
src_is_1: bool,
}
```
…however, the constructor change will obviously be quite a bit more substantial:

```rust
use super::pipeline::Pipelines;
use vulkano::{
command_buffer::auto::{AutoCommandBufferBuilder, PrimaryAutoCommandBuffer},
pipeline::PipelineBindPoint,
};
impl Concentrations {
/// Set up the GPU simulation state and schedule GPU buffer initialization
pub fn create_and_schedule_init(
options: &Options,
context: &Context,
pipelines: &Pipelines,
cmdbuf: &mut AutoCommandBufferBuilder<PrimaryAutoCommandBuffer>,
) -> Result<Self> {
// Allocate all GPU buffers
let [u1, v1, u2, v2] = &UVSet::allocate_buffers(options, context)?[..] else {
panic!("Unexpected number of data buffers")
};
// Set up the associated UV sets
let set1 = UVSet::new(
context,
&pipelines.layout,
u1.clone(),
v1.clone(),
u2.clone(),
v2.clone(),
)?;
let set2 = UVSet::new(
context,
&pipelines.layout,
u2.clone(),
v2.clone(),
u1.clone(),
v1.clone(),
)?;
        // Schedule the initialization of the second set's output, which is
        // the first set's input, and therefore the overall simulation input.
cmdbuf.bind_pipeline_compute(pipelines.init.clone())?;
cmdbuf.bind_descriptor_sets(
PipelineBindPoint::Compute,
pipelines.layout.clone(),
DATA_SET,
set2.descriptor_set.clone(),
)?;
let padded_workgroups = [
(options.runner.num_cols + 2 * PADDING_PER_SIDE)
.div_ceil(options.pipeline.workgroup_cols.get() as usize) as u32,
(options.runner.num_rows + 2 * PADDING_PER_SIDE)
.div_ceil(options.pipeline.workgroup_rows.get() as usize) as u32,
1,
];
// SAFETY: GPU shader has been checked for absence of undefined behavior
// given a correct execution configuration, and this is one
unsafe {
cmdbuf.dispatch(padded_workgroups)?;
}
        // Schedule the zero-initialization of the edges of the first set's
        // output. Its center will be overwritten by the first simulation step,
        // so it can have any value we like, and may thus be zeroed as well.
cmdbuf.fill_buffer(u2.clone().reinterpret(), 0)?;
cmdbuf.fill_buffer(v2.clone().reinterpret(), 0)?;
// Once cmdbuf is done initializing, we will be done
Ok(Self {
sets: [set1, set2],
src_is_1: false,
})
}
// [ ... more methods coming up ... ]
}
```
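Although the details of simulation scheduling belong to the next chapter, here is a hedged sketch of how this constructor could be called. The `cmd_allocator` and `queue` members of `Context` are assumed names (the exact accessors depend on how your `Context` struct is set up), and builder signatures vary a little across `vulkano` versions:

```rust
use vulkano::{
    command_buffer::{
        auto::AutoCommandBufferBuilder, CommandBufferUsage, PrimaryCommandBufferAbstract,
    },
    sync::GpuFuture,
};

// Record the initialization commands into a one-shot command buffer...
let mut cmdbuf = AutoCommandBufferBuilder::primary(
    context.cmd_allocator.clone(), // Assumed command buffer allocator accessor
    context.queue.queue_family_index(), // Assumed compute queue accessor
    CommandBufferUsage::OneTimeSubmit,
)?;
let concentrations =
    Concentrations::create_and_schedule_init(&options, &context, &pipelines, &mut cmdbuf)?;

// ...then submit it and wait for the GPU-side initialization to complete
cmdbuf
    .build()?
    .execute(context.queue.clone())?
    .then_signal_fence_and_flush()?
    .wait(None)?;
```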
The `shape()` accessor will be dropped, as it cannot easily be provided by our GPU storage without keeping otherwise unnecessary metadata around, but the `current()` accessor will trivially be migrated to the new logic…

```rust
impl Concentrations {
// [ ... ]
/// Read out the current species concentrations
pub fn current(&self) -> &UVSet {
&self.sets[self.src_is_1 as usize]
}
// [ ... ]
}
```
…and the `update()` operation will easily be migrated to the new logic discussed above as well, as it is largely a simplification with respect to its former implementation. There is just one new thing that we will need for GPU computing, namely the ability for the user-provided step callback to report errors, since scheduling GPU work is a fallible process.

```rust
impl Concentrations {
// [ ... ]
/// Run a simulation step
pub fn update(&mut self, step: impl FnOnce(&UVSet) -> Result<()>) -> Result<()> {
step(self.current())?;
self.src_is_1 = !self.src_is_1;
Ok(())
}
}
```
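For illustration, here is a sketch of what a caller of `update()` might look like, reusing the binding and dispatch pattern shown in the constructor above. It assumes that the simulation compute pipeline has already been bound via `bind_pipeline_compute()`, and that `workgroups` holds a suitable dispatch size:

```rust
// Schedule one simulation step over the current input/output buffer pair.
// update() then flips the double buffer, so the next invocation of this
// code will automatically bind the other descriptor set.
concentrations.update(|uv_set| {
    cmdbuf.bind_descriptor_sets(
        PipelineBindPoint::Compute,
        pipelines.layout.clone(),
        DATA_SET,
        uv_set.descriptor_set.clone(),
    )?;
    // SAFETY: same precondition as the initialization dispatch above
    unsafe {
        cmdbuf.dispatch(workgroups)?;
    }
    Ok(())
})?;
```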
## Output retrieval & storage
While `UVSet` and `Concentrations` are enough for the purpose of setting up the simulation and running simulation steps, we are going to need one more thing for the purpose of retrieving GPU output on the CPU side, namely a Vulkan buffer that the CPU can access.
We could reuse the `UV` struct for this purpose, but if you pay attention to how the simulation output is actually used, you will notice that the `io` module only writes the `V` species' concentration to the HDF5 file. And while passing an entire `UV` struct to this module anyway was fine when direct data access was possible, it becomes wasteful if we now need to perform an expensive GPU-to-CPU transfer of the full `(U, V)` dataset only to use the `V` half of it later on.
Therefore, our new `VBuffer` abstraction will focus on retrieval of the V species' concentration only, until a new use case comes up someday where the U species' concentration becomes useful too.
The construction code is quite similar to what we saw before in `UVSet::allocate_buffers()` (and in fact, in a more production-grade codebase, the two should probably be deduplicated). The only thing that changed is that the `BufferUsage` and `AllocationCreateInfo` have been adjusted to make this buffer fit for the purpose of downloading data to the CPU:

```rust
/// CPU-accessible download buffer containing the V species' concentration
pub struct VBuffer {
buffer: Subbuffer<[Float]>,
padded_cols: usize,
}
impl VBuffer {
/// Set up a `VBuffer`
pub fn new(options: &Options, context: &Context) -> Result<Self> {
let padded_rows = options.runner.num_rows + 2 * PADDING_PER_SIDE;
let padded_cols = options.runner.num_cols + 2 * PADDING_PER_SIDE;
use vulkano::memory::allocator::MemoryTypeFilter as MTFilter;
let buffer = Buffer::new_slice(
context.mem_allocator.clone(),
BufferCreateInfo {
usage: BufferUsage::TRANSFER_DST,
..Default::default()
},
AllocationCreateInfo {
memory_type_filter: MTFilter::PREFER_HOST | MTFilter::HOST_RANDOM_ACCESS,
..Default::default()
},
(padded_rows * padded_cols) as DeviceSize,
)?;
Ok(Self {
buffer,
padded_cols,
})
}
// [ ... more methods coming up ... ]
}
```
After that, we can add a method to schedule a GPU-to-CPU data transfer…

```rust
use vulkano::command_buffer::CopyBufferInfo;
impl VBuffer {
// [ ... ]
    /// Schedule an update of this buffer from a `UVSet`'s current input
pub fn schedule_update(
&mut self,
cmdbuf: &mut AutoCommandBufferBuilder<PrimaryAutoCommandBuffer>,
source: &UVSet,
) -> Result<()> {
cmdbuf.copy_buffer(CopyBufferInfo::buffers(
source.input_uv[V].clone(),
self.buffer.clone(),
))?;
Ok(())
}
// [ ... ]
}
```
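In the overall simulation schedule, this method would typically be invoked right after the last simulation step of an output period, targeting the `UVSet` that `Concentrations::current()` designates at that point. A sketch:

```rust
// After the last update() of an output period, current() designates the
// UVSet whose input buffers hold the latest simulation results
v_buffer.schedule_update(&mut cmdbuf, concentrations.current())?;
// The command buffer can then be submitted, and once the GPU is done with
// it, the VBuffer contents can be read from the CPU side (see below)
```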
…and there is just one last piece to take care of, which is to provide a way to access the inner data. This will require a bit more work than you may expect.
To set the stage, let's point out that we are trying to set up some communication between two Rust libraries with the following API designs:

- To achieve memory safety in the presence of a risk of data races between the CPU and the GPU, `vulkano` enforces an RAII design where accesses to a `Subbuffer` must go through the `Subbuffer::read()` method. This method returns a `BufferReadGuard` that borrows from the underlying `Subbuffer` and lets `vulkano` know at destruction time that the buffer is not being accessed by the CPU anymore (see the sketch after this list). Under the hood, locking and checks are then used to achieve safety.
- Starting from this `BufferReadGuard`, which borrows memory from the underlying `Subbuffer` storage like a standard Rust slice of type `&[Float]` could borrow from a `Vec<Float>`, we want to add 2D layout information to it in order to turn it into an `ndarray::ArrayView2<Float>`, which is what the HDF5 binding that we are using ultimately expects.
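As announced above, here is a minimal sketch of the `vulkano` side of this pattern, assuming a `Subbuffer<[Float]>` named `buffer`:

```rust
// Acquire CPU read access; this fails if the GPU may still be using the
// buffer, or if the buffer is not CPU-accessible
let read_guard = buffer.read()?;
// The guard dereferences to an ordinary slice...
let contents: &[Float] = &read_guard;
assert!(!contents.is_empty());
// ...and it notifies vulkano that the CPU read is over when dropped
```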
Now, because the `VBuffer` type that we are building is logically a 2D array, it would be good API design on our side to refrain from exposing the underlying 1D dataset in the `VBuffer` API, and to instead only provide users with the `ArrayView2` that they need for HDF5 I/O and other operations. While we are at it, we would also rather not expose the zero padding elements to the user, as they won't be part of the final HDF5 file and are arguably an implementation detail of our current Gray-Scott simulation implementation.
We can get all of those good things, as it turns out, but the simplest way for us to get there[^2] will be a somewhat weird callback-based interface:

```rust
use ndarray::prelude::*;
impl VBuffer {
// [ ... ]
/// Access the inner V species concentration as a 2D array without padding
///
/// Before calling this method, you will generally want to schedule an
/// update, submit the resulting command buffer, and await its completion.
pub fn read_and_process<R>(&self, callback: impl FnOnce(ArrayView2<Float>) -> R) -> Result<R> {
// Access the underlying dataset as a 1D slice
let read_guard = self.buffer.read()?;
// Create an ArrayView2 that covers the whole data, padding included
let padded_cols = self.padded_cols;
let padded_elements = read_guard.len();
assert_eq!(padded_elements % padded_cols, 0);
let padded_rows = padded_elements / padded_cols;
let padded_view = ArrayView::from_shape([padded_rows, padded_cols], &read_guard)?;
// Extract the central region of padded_view, excluding padding
let data_view = padded_view.slice(s!(
PADDING_PER_SIDE..(padded_rows - PADDING_PER_SIDE),
PADDING_PER_SIDE..(padded_cols - PADDING_PER_SIDE),
));
// We're now ready to run the user callback
Ok(callback(data_view))
}
}
```
The general idea here is that a user who wants to read the contents of the buffer will pass us a function (typically a lambda) that takes the current contents of the buffer (as an un-padded `ArrayView2`) and returns a result of an arbitrary type `R` that we do not care about. On our side, we will then proceed to do everything needed to set up the two-dimensional array view, call the user-specified function, and return its result.
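As a usage example, here is how writing the current V concentration to an HDF5 file could look with the `HDF5Writer::write()` method that we are about to introduce. Note the double `?`: the callback's own `hdf5::Result` is nested inside `read_and_process()`'s `Result`, and we assume here that the ambient error type can absorb both:

```rust
// The GPU-to-CPU download must be complete at this point
// (schedule_update() + command buffer submission + wait)
v_buffer.read_and_process(|v| writer.write(v))??;
```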
## HDF5 I/O refactor
As mentioned earlier, one last thing that should change with respect to our former CPU code is that we want our HDF5 I/O module to be clearer about what it wants.
Indeed, at present time, `HDF5Writer::write()` demands a full set of `(U, V)` data, of which it only uses the `V` concentration data. This was fine from a CPU programming perspective, where we don't pay for exposing unused data access opportunities, but from a GPU programming perspective it means downloading `U` concentration data that the HDF5 I/O module is not going to use.
We will fix this by making the `HDF5Writer` more explicit about what it wants, and having it take the `V` species' concentration only instead.

```rust
// In exercises/src/grayscott/io.rs
use ndarray::ArrayView2;
impl HDF5Writer {
// [ ... ]
/// Write a new V species concentration table to the file
pub fn write(&mut self, v: ArrayView2<Float>) -> hdf5::Result<()> {
// FIXME: Workaround for an HDF5 binding limitation
let v = v.to_owned();
self.dataset.write_slice(&v, (self.position, .., ..))?;
self.position += 1;
Ok(())
}
// [ ... ]
}
```
Notice the FIXME above. Apparently, the Rust HDF5 binding that we are using does not yet handle `ArrayView2`s whose rows are not contiguous in memory, which means that we must create a contiguous copy of `v` before it will agree to write it to a file.
From the author's memory of the HDF5 C API, it can handle non-contiguous input, so this limitation is specific to the Rust bindings and should eventually be fixed there. Until a fix happens, however, making an owned contiguous copy should be a reasonably efficient workaround, as for typical storage devices, in-RAM copies are much faster than writing data to the target storage device.
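As a marginal refinement of this workaround, ndarray's `as_standard_layout()` method could be used instead of `to_owned()`: it returns a copy-on-write array that only performs the copy when the input is not already contiguous. With our padded views the input is currently never contiguous, so this is only a sketch of a possible future-proofing:

```rust
// Copies `v` only when it is not contiguous, borrows it otherwise
let v = v.as_standard_layout();
self.dataset.write_slice(&v, (self.position, .., ..))?;
```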
As an alternative, we could also modify our GPU-to-CPU copy logic so that it does not copy the padding zero elements, saving a bit of CPU-GPU interconnect bandwidth along the way. However, this would require us to stop using standard Vulkan copy commands and to use custom shaders for this purpose instead, which may in turn cause two issues:
- Performance may be worse, because the standard Vulkan copy command should have been well-optimized by the GPU vendor. Our shader would need to be optimized similarly.
- We would very likely lose the ability to overlap GPU-to-CPU copies with computations, which we are not using yet but may want to use later as an optimization.
As always, tradeoffs are the name of the game in engineering…
## Exercise
In the `data` module of the Gray-Scott reaction simulation (`exercises/src/grayscott/data.rs`), replace the `UV` and `Concentrations` structs with the `UVSet`, `Concentrations` and `VBuffer` types introduced in this chapter.
After that is done, proceed to modify the `io` module of the simulation so that it works with borrowed `V` concentration data only, as discussed above.
You will find that the simulation does not compile at this point. This is expected, because the `run_simulation()` and `update()` functions of the simulation library have not been updated yet. We will fix that in the next chapter; for now, just make sure that there is no compilation error originating from a mistake in `data.rs` or `io.rs`.
[^1]: Without losing the benefits of GLSL's `readonly` and `writeonly` qualifiers or introducing new Vulkan concepts like push constants, that is.

[^2]: It is possible to write a callback-free `read()` method that returns an object that behaves like an `ArrayView2`, but implementing it efficiently (without recreating the `ArrayView2` on every access) involves building a type that is self-referential in the eyes of the Rust compiler's lifetime analysis, which means that some dirty `unsafe` tricks would be required.