Integration
After a long journey, we have once again reached the last mile: we almost have a complete Gray-Scott reaction simulation. In this chapter, we will walk that last mile and get everything working again, on the GPU this time.
Simulation commands
In the previous chapter, we increased the separation of concerns across the simulation codebase, so that a single function is no longer responsible for all command buffer manipulation work.
Thanks to this work, we can have a simulation scheduling function that is a lot
less complex than the build_command_buffer()
function we used to have in our
number-squaring program:
use self::{
    data::Concentrations,
    options::Options,
    pipeline::{Pipelines, DATA_SET},
};
use crate::Result;
use vulkano::{
    command_buffer::auto::{AutoCommandBufferBuilder, PrimaryAutoCommandBuffer},
    pipeline::PipelineBindPoint,
};

/// Record the commands needed to run a bunch of simulation iterations
pub fn schedule_simulation(
    options: &Options,
    pipelines: &Pipelines,
    cmdbuf: &mut AutoCommandBufferBuilder<PrimaryAutoCommandBuffer>,
    concentrations: &mut Concentrations,
) -> Result<()> {
    // Determine the appropriate number of workgroups (dispatch size) for the simulation
    let simulate_workgroups = [
        options
            .runner
            .num_cols
            .div_ceil(options.pipeline.workgroup_cols.get() as usize) as u32,
        options
            .runner
            .num_rows
            .div_ceil(options.pipeline.workgroup_rows.get() as usize) as u32,
        1,
    ];

    // Schedule the requested number of simulation steps
    cmdbuf.bind_pipeline_compute(pipelines.main.clone())?;
    for _ in 0..options.runner.compute_steps_per_output_step {
        concentrations.update(|uvset| {
            cmdbuf.bind_descriptor_sets(
                PipelineBindPoint::Compute,
                pipelines.layout.clone(),
                DATA_SET,
                uvset.descriptor_set.clone(),
            )?;
            // SAFETY: GPU shader has been checked for absence of undefined behavior
            //         given a correct execution configuration, and this is one
            unsafe {
                cmdbuf.dispatch(simulate_workgroups)?;
            }
            Ok(())
        })?;
    }
    Ok(())
}
There are a few things worth pointing out here:
- Unlike our former build_command_buffer() function, this function does not build its own command buffer, but only adds extra commands to an existing, caller-allocated command buffer. This will allow us to handle data initialization more elegantly later.
- We are computing the compute pipeline dispatch size on each run of this function, which depending on compiler optimizations may or may not result in redundant work. However, this overhead should be so small compared to everything else in this function that we do not expect the inefficiency to matter. We will check this when the time comes to profile our program's CPU utilization, and could hoist the computation out of the function if needed (see the sketch after this list).
- We are enqueuing an unbounded number of commands into our command buffer here, and the GPU will not start executing work until we are done building and submitting the associated command buffer. As we will later see in this course's optimization section, this can become a problem in unusual execution configurations where thousands of simulation steps occur between each generated image. The way to fix this problem will be discussed in the corresponding course chapter, after taking care of higher-priority optimizations.
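On that second point, if profiling ever showed that recomputing the dispatch size matters, it could be computed once on the caller's side and reused across calls. Here is a minimal, hypothetical sketch of such a helper, assuming the same module context and imports as the code above (the dispatch_size name is made up for illustration, and the numbers in comments are only arithmetic examples):

/// Hypothetical helper that computes the dispatch size once, so that callers
/// could cache it instead of recomputing it on every schedule_simulation() call
fn dispatch_size(options: &Options) -> [u32; 3] {
    [
        // e.g. 1920 columns with 32-wide workgroups -> 60 workgroups
        options
            .runner
            .num_cols
            .div_ceil(options.pipeline.workgroup_cols.get() as usize) as u32,
        // e.g. 1080 rows with 8-tall workgroups -> 135 workgroups
        options
            .runner
            .num_rows
            .div_ceil(options.pipeline.workgroup_rows.get() as usize) as u32,
        1,
    ]
}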
Simulation runner
With our last utility function written down, it is time to tackle the heart of the matter and adapt our formerly CPU-centric run_simulation() utility so that it can run the GPU computation.
And while we are at it, we will also fix another design issue of our former CPU code, namely that we needed to duplicate a lot of logic between our simulation binary and our microbenchmark.
As this logic is getting more complicated in our GPU version, this is becoming a
more pressing problem. So we will fix it, at the expense of reducing our
benchmark’s level of detail, by generalizing run_simulation()
so that it is
useful for microbenchmarking in addition to regular simulation:
use self::data::UVSet;
use crate::context::Context;
use vulkano::{
    command_buffer::{CommandBufferUsage, PrimaryCommandBufferAbstract},
    sync::GpuFuture,
};

/// Simulation runner, with user-configurable hooks to...
///
/// - Schedule extra work in the command buffer where the simulation steps are
///   being recorded, knowing the final simulation state.
/// - Perform extra work after the GPU is done executing work.
pub fn run_simulation(
    options: &Options,
    context: &Context,
    mut schedule_after_simulation: impl FnMut(
        &UVSet,
        &mut AutoCommandBufferBuilder<PrimaryAutoCommandBuffer>,
    ) -> Result<()>,
    mut after_gpu_wait: impl FnMut() -> Result<()>,
) -> Result<()> {
    // Set up the GPU compute pipelines
    let pipelines = Pipelines::new(context, options)?;

    // Set up the initial command buffer
    let new_cmdbuf = || {
        AutoCommandBufferBuilder::primary(
            context.comm_allocator.clone(),
            context.queue.queue_family_index(),
            CommandBufferUsage::OneTimeSubmit,
        )
    };
    let mut cmdbuf = new_cmdbuf()?;

    // Set up the concentrations storage and schedule initialization
    let mut concentrations =
        Concentrations::create_and_schedule_init(options, context, &pipelines, &mut cmdbuf)?;

    // Produce the requested number of concentration tables
    for _ in 0..options.runner.num_output_steps {
        // Schedule some simulation steps
        schedule_simulation(options, &pipelines, &mut cmdbuf, &mut concentrations)?;

        // Schedule any other user-requested work after the simulation
        schedule_after_simulation(concentrations.current(), &mut cmdbuf)?;

        // Submit the work to the GPU and wait for it to execute
        cmdbuf
            .build()?
            .execute(context.queue.clone())?
            .then_signal_fence_and_flush()?
            .wait(None)?;

        // Perform operations after the GPU is done
        after_gpu_wait()?;

        // Set up the next command buffer
        cmdbuf = new_cmdbuf()?;
    }
    Ok(())
}
Here is a summary of changes with respect to the previous version of run_simulation():
- HDF5 I/O concerns are not handled by run_simulation() anymore. This concern is now offloaded to the caller, which can handle it using a pair of new user-defined hooks (see footnote 1); a minimal usage sketch follows this list:
  - schedule_after_simulation() is called after the simulation engine is done filling up a command buffer with simulation pipeline executions. It lets the caller add GPU commands to e.g. download the final simulation state from the GPU to the CPU.
  - after_gpu_wait() is called after waiting for the GPU to be done. It lets the caller e.g. read the final CPU copy of the concentration of V and save it to disk.
- The update() hook is gone, as in GPU programs there are fewer opportunities than in CPU programs to optimize the simulation update logic without modifying other aspects of the simulation like the data management, so this configurability does not pull its weight.
- The caller is now expected to pass in a pre-initialized GPU context.
- The result type is more general than before (where it used to be HDF5-specific) to account for the new possibility of GPU API errors.
- GPU compute pipelines and command buffers must now be set up, and command buffers must also be submitted to the GPU and awaited.
- The simulation runner does not manage individual simulation steps anymore, as on a GPU this would have unbearable synchronization costs. Instead, simulation steps are executed as batches of size compute_steps_per_output_step.
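To make this new contract more concrete, here is a minimal sketch of a caller that does not need any I/O, which is roughly what a microbenchmark needs. It assumes the same module context and imports as the code above, and the run_without_io name is made up for illustration:

/// Hypothetical caller that runs the simulation without any I/O: both hooks
/// are no-ops, so each loop iteration just submits a batch of simulation
/// steps and waits for the GPU to finish executing them.
fn run_without_io(options: &Options, context: &Context) -> Result<()> {
    run_simulation(
        options,
        context,
        // Nothing extra to schedule after the simulation steps...
        |_uv, _cmdbuf| Ok(()),
        // ...and nothing to do on the CPU side after the GPU is done
        || Ok(()),
    )
}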
And with that, we are done with the parts of the simulation logic that are
shared between the main binary and the microbenchmark, so you can basically
replace the entire contents of exercises/src/grayscott/mod.rs
with the above
two functions.
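If it helps to visualize the result, here is a rough, hypothetical outline of what exercises/src/grayscott/mod.rs ends up containing. The submodule names are inferred from the use statements in the code above; visibility and other details may differ in your copy of the exercises:

// Hypothetical outline of exercises/src/grayscott/mod.rs after this change
pub mod data;     // Concentrations, UVSet, VBuffer, ...
pub mod io;       // HDF5Writer
pub mod options;  // Options
pub mod pipeline; // Pipelines, DATA_SET

// ...followed by the schedule_simulation() and run_simulation() functions
// exactly as written earlier in this chapter.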
Main simulation binary
Now that we have altered the API contract of run_simulation(), we also need to rewrite much of the main simulation binary accordingly. The part up to Vulkan context setup remains the same, but then we need to do this:
use grayscott_exercises::grayscott::{data::VBuffer, io::HDF5Writer};
use std::cell::RefCell;

fn main() -> Result<()> {
    // [ ... parse CLI options, set up progress bar & Vulkan context ... ]

    // Set up the CPU buffer for concentrations download
    let vbuffer = RefCell::new(VBuffer::new(&options, &context)?);

    // Set up the HDF5 file output
    let mut hdf5 = HDF5Writer::create(
        &options.runner.file_name,
        [options.runner.num_rows, options.runner.num_cols],
        options.runner.num_output_steps,
    )?;

    // Run the simulation
    grayscott::run_simulation(
        &options,
        &context,
        |uv, cmdbuf| {
            // Schedule a download of the final simulation state
            vbuffer.borrow_mut().schedule_update(uv, cmdbuf)
        },
        || {
            // Write down the current simulation output
            vbuffer.borrow().read_and_process(|v| hdf5.write(v))??;

            // Record that progress has been made
            progress.inc(options.runner.compute_steps_per_output_step as u64);
            Ok(())
        },
    )?;

    // Close the HDF5 file with proper error handling
    hdf5.close()?;

    // Declare the computation finished
    progress.finish();
    Ok(())
}
What is new here?
- We need to set up a CPU buffer to download our GPU data into. And because we are using a run_simulation() design with two hooks that both use this vbuffer (see footnote 1), the Rust compiler's static borrow checking cannot accept it, and we need to switch to dynamically checked borrows (RefCell) to work around it. A toy example of this pattern follows this list.
- Because HDF5 I/O is now the responsibility of the simulate binary, we take care of it here.
- We leverage the two hooks provided by run_simulation() for their intended purpose: to download GPU results to the CPU, save them to the HDF5 file, and record that progress has been made in our progress bar.
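To illustrate that borrow checking issue in isolation, here is a toy example (not part of the course code) of the pattern that run_simulation() forces upon us, namely two closures that both need access to the same buffer, one of them mutably:

use std::cell::RefCell;

// A stand-in for run_simulation(): it takes two closures that may be called
// several times, so both of them must capture whatever state they need.
fn takes_two_closures(mut fill: impl FnMut(), mut read: impl FnMut()) {
    fill();
    read();
}

fn main() {
    // With a plain `let mut buffer = Vec::new();`, the first closure would
    // capture `&mut buffer` and the second `&buffer`, and the compiler would
    // reject the overlapping borrows. Wrapping the buffer in a RefCell lets
    // both closures capture a shared `&RefCell`, moving the exclusive-access
    // check from compile time to run time.
    let buffer = RefCell::new(Vec::<f32>::new());
    takes_two_closures(
        || buffer.borrow_mut().push(4.2),                  // Mutable access
        || println!("{} elements", buffer.borrow().len()), // Shared access
    );
}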
Exercises
Integrate the above code into the main simulation binary (exercises/src/bin/simulate.rs), then…

- Do a simulation test run (cargo run --release -- -n100)
- Use mkdir -p pics && data-to-pics -o pics to convert the output data into PNG images
- Use your favorite image viewer to check that the resulting images look about right
Beyond that, the simulate benchmark (exercises/benches/simulate.rs) has been pre-written for you in order to exercise the final simulation engine in various configurations. Check out the code to get a general idea of how it works, then run it for a while (cargo bench --bench simulate) and see how the various tunable parameters affect performance.
Do not forget that you can also pass in a regular expression argument (as in e.g. cargo bench --bench simulate -- '2048x.*32steps.*compute$') in order to only benchmark specific configurations.