Integration

After a long journey, we are once again reaching the last mile where we almost have a complete Gray-Scott reaction simulation. In this chapter, we will proceed to walk this last mile and get everything working again, on GPU this time.

Command buffer building

As we go through the optimization chapters, we will want to build command buffers for submitting work to the GPU in a growing number of places. Because our needs are rather simple, we will always build them in the same way, which is worth extracting into a utility function and type alias:

use crate::{context::Context, Result};
use vulkano::command_buffer::{
    AutoCommandBufferBuilder, CommandBufferUsage, PrimaryAutoCommandBuffer
};

/// Convenience type alias for primary command buffer builders
type CommandBufferBuilder = AutoCommandBufferBuilder<PrimaryAutoCommandBuffer>;

/// Set up a new command buffer builder
fn command_buffer_builder(context: &Context) -> Result<CommandBufferBuilder> {
    let cmdbuild = CommandBufferBuilder::primary(
        context.comm_allocator.clone(),
        context.queue.queue_family_index(),
        CommandBufferUsage::OneTimeSubmit,
    )?;
    Ok(cmdbuild)
}

The configuration encoded in this utility function sets up command buffers that are…

  • Recorded using vulkano’s high-level and safe AutoCommandBufferBuilder API (hidden behind the CommandBufferBuilder type alias), which takes care of injecting safety-critical pipeline and memory barriers between commands for us.
  • Primary command buffers, which can be submitted to the GPU as-is. This is in contrast to secondary buffers, which must be added to a primary buffer before submission.
  • Private to a particular Vulkan queue family. This is a straightforward decision right now as we only use a single Vulkan queue, but we will need to revisit it later on.
  • Meant for one-time use only. Per GPU vendor documentation, this is the recommended default for programs that are not bottlenecked by command buffer recording, which is our case. Reusable command buffers only save command buffer recording time at the expense of increasing GPU driver overhead and/or reducing command buffer execution efficiency on the GPU side, and that is not the right tradeoff for us.
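
For concreteness, here is a minimal usage sketch of this helper that records and submits an empty command buffer, then waits for its execution. It assumes a context: &Context like the one used throughout this chapter, and mirrors the submission pattern that we will encounter again at the end of this chapter:

use vulkano::{command_buffer::PrimaryCommandBufferAbstract, sync::GpuFuture};

// Set up a builder, record commands into it (none here), then build the buffer
let cmdbuild = command_buffer_builder(context)?;
let cmdbuf = cmdbuild.build()?;

// Submit the command buffer to our queue and wait for it to finish executing
cmdbuf
    .execute(context.queue.clone())?
    .then_signal_fence_and_flush()?
    .wait(None)?;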

Simulation commands

In the previous chapter, we worked to improve separation of concerns across the simulation codebase, so that no single function is responsible for all command buffer manipulation work.

Thanks to this work, we can have a simulation scheduling function that is conceptually simpler than the build_command_buffer() function we used to have in our number-squaring program:

use self::{
    data::Concentrations,
    options::RunnerOptions,
    pipeline::{Pipelines, INOUT_SET},
};
use std::num::NonZeroU32;
use vulkano::pipeline::PipelineBindPoint;

/// Record the commands needed to run a bunch of simulation iterations
fn schedule_simulation(
    options: &RunnerOptions,
    pipelines: &Pipelines,
    concentrations: &mut Concentrations,
    cmdbuild: &mut CommandBufferBuilder,
) -> Result<()> {
    // Determine the appropriate dispatch size for the simulation
    let dispatch_size = |domain_size: usize, workgroup_size: NonZeroU32| {
        domain_size.div_ceil(workgroup_size.get() as usize) as u32
    };
    let simulate_workgroups = [
        dispatch_size(options.num_cols, options.pipeline.workgroup_cols),
        dispatch_size(options.num_rows, options.pipeline.workgroup_rows),
        1,
    ];

    // Schedule the requested number of simulation steps
    cmdbuild.bind_pipeline_compute(pipelines.step.clone())?;
    for _ in 0..options.steps_per_image {
        concentrations.update(|inout_set| {
            cmdbuild.bind_descriptor_sets(
                PipelineBindPoint::Compute,
                pipelines.layout.clone(),
                INOUT_SET,
                inout_set,
            )?;
            // SAFETY: GPU shader has been checked for absence of undefined behavior
            //         given a correct execution configuration, and this is one
            unsafe {
                cmdbuild.dispatch(simulate_workgroups)?;
            }
            Ok(())
        })?;
    }
    Ok(())
}

There are a few things worth pointing out here:

  • Unlike our former build_command_buffer() function, this function does not build its own command buffer, but only adds extra commands to a caller-allocated existing command buffer. This will allow us to handle data initialization more elegantly later.
  • We are computing the compute pipeline dispatch size on each run of this function, which, depending on compiler optimizations, may or may not result in redundant work. The overhead of this computation should be so small compared to everything else in this function that we do not expect the inefficiency to matter, but we will check this when the time comes to profile our program’s CPU utilization. A worked example of the dispatch size computation follows this list.
  • We are enqueuing an unbounded amount of commands to our command buffer here, and the GPU will not start executing work until we are done building and submitting the associated command buffer. As we will later see in this course’s optimization section, this can become a problem in unusual execution configurations where thousands of simulation steps occur between each generated image. The way to fix this problem will be discussed in the corresponding course chapter, after taking care of higher-priority optimizations.
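
To make the dispatch size computation more concrete, here is a worked example with hypothetical numbers that are not necessarily those used in this course: a 1920×1080 domain covered by 8×16 workgroups.

// ceil(1920 / 8) = 240 workgroups are needed to cover 1920 columns
assert_eq!(1920_usize.div_ceil(8), 240);

// When the domain size is not a multiple of the workgroup size, div_ceil()
// rounds up: ceil(1080 / 16) = 68. The extra GPU threads of the last workgroup
// then fall outside of the simulation domain and must be taken care of, e.g.
// through bounds checks or domain padding.
assert_eq!(1080_usize.div_ceil(16), 68);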

Output processing

In the CPU simulation, the top-level run_simulation() function would unconditionally accept a process_v callback that receives the V species’ concentration as an &Array2<Float> and saves it to disk. We should change this in the GPU version for two different reasons:

  • Downloading the V species’ concentration from the GPU side to the CPU side can be expensive. By allowing the caller not to do so, we can have more focused microbenchmarks that measure our simulation’s performance in a finer-grained way:
    • One “compute” benchmark will measure the raw speed at which we perform simulation steps, taking GPU-to-CPU downloads out of the equation.
    • One “compute+download” benchmark will download GPU outputs to the CPU side, without using them. By comparing the performance of this benchmark to that of the “compute” benchmark, we will see how efficiently we handle GPU-to-CPU downloads.
    • One “compute+download+sum” benchmark will download GPU outputs to the CPU side and use them by computing their sum on the CPU side. By comparing the performance of this benchmark to that of the “compute+download” benchmark, we will see how well we can overlap GPU and CPU work through asynchronous GPU execution.
    • …and finally it will remain possible to use the simulate binary to study the simulation’s HDF5 I/O performance on a particular machine.
  • Due to the existence of padding zeroes and peculiarities of the Subbuffer API from vulkano, VBuffer::process() is unable to provide the V species’ concentration as an &Array2<Float> reference to an owned N-dimensional array. It must instead provide an ArrayView2<Float> view over its internal CPU-accessible dataset, as sketched right after this list.
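
To clarify this last point, here is a minimal sketch of how an ArrayView2 can expose only the meaningful part of a padded storage buffer through custom strides. The numbers are made up for illustration and do not reflect the actual memory layout used by VBuffer:

use ndarray::{ArrayView2, ShapeBuilder};

// Hypothetical layout: a 4x3 domain stored with one padding element per row,
// which means that consecutive rows are 4 elements apart in storage
let padded_storage = vec![0.0f32; 16];
let view = ArrayView2::from_shape(
    (4, 3).strides((4, 1)),  // 4 rows of 3 columns, rows 4 elements apart
    &padded_storage,
)?;
assert_eq!(view.dim(), (4, 3));  // The view skips over the padding elements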

Taking all this together, the run_simulation() function signature should be changed as follows…

use self::data::Float;
use ndarray::ArrayView2;

/// Simulation runner, with a user-specified output processing function
pub fn run_simulation<ProcessV: FnMut(ArrayView2<Float>) -> Result<()>>(
    options: &RunnerOptions,
    context: &Context,
    process_v: Option<ProcessV>,
) -> Result<()> {
    // [ ... simulation logic will go here ... ]
}

…but if you have some previous Rust experience, that Option<ProcessV> function parameter, which optionally refers to a generic type parameter, will likely make you uneasy.

Indeed, such function signatures have a nasty tendency to cause type inference problems, because when we set the process_v parameter to None on the caller side…

// This is an example of run_simulation() call site that you need not copy

use grayscott_exercises::run_simulation;

// The Rust compiler will reject this call as there is no way to infer ProcessV
run_simulation(options, context, None)?;

…the compiler is provided with no information to guess what the ProcessV generic type might be and will error out as a result.

In an ideal world, we could just resolve this by giving the ProcessV parameter of the run_simulation() function a default value. But we are not living in this ideal world, and Rust does not allow functions to have default type parameters yet. It has been attempted before, but the infrastructure was not ready at the time, so the unstable feature has been removed for now.

Failing that, one workaround we can use today is to define a type alias for the default type parameter that we would like to have…

/// Dummy `ProcessV` type, to be used when you do not specify a `process_v` hook
/// as an input to `run_simulation()`
pub type DummyProcessV = fn(ArrayView2<Float>) -> Result<()>;

…and advise callers to use this type alias as follows when needed:

// This is another example of run_simulation() call site that you need not copy

use grayscott_exercises::{DummyProcessV, run_simulation};

// This ProcessV type inference hint will make the Rust compiler happy
run_simulation::<DummyProcessV>(options, context, None)?;

There are other workarounds for this annoying language/compiler limitation, such as using dynamic dispatch instead of static dispatch, and those come with different tradeoffs. But for the purpose of this course, this particular workaround will be good enough.
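
For reference, here is what the dynamically dispatched alternative could look like. This is only a sketch with a hypothetical run_simulation_dyn name, not a signature that is actually used in this course:

/// Dynamically dispatched variant: the callback type is erased behind a trait
/// object, so passing `None` does not require any type inference hint anymore
pub fn run_simulation_dyn(
    options: &RunnerOptions,
    context: &Context,
    process_v: Option<&mut dyn FnMut(ArrayView2<Float>) -> Result<()>>,
) -> Result<()> {
    // [ ... same simulation logic as run_simulation() ... ]
}

The price to pay is that every simulation output would then go through an indirect function call, and callers that do want output processing would need to pass their callback in as Some(&mut callback).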

Simulation runner

Now that the signature of our run_simulation() function is fixed, it is time to ask the question: what code should we put inside of it?

Because GPU code is more complex than CPU code, doing everything inside of the body of run_simulation() as we did in the CPU course will result in a function that is rather complex, and will become inscrutably so after a few optimizations. Therefore, we will extract most logic into a SimulationRunner struct that provides the following methods:

  • A new() constructor sets up everything needed to run simulation steps
  • A schedule_next_output() method prepares GPU commands to produce one output image
  • A process_output() method is called after the work scheduled by schedule_next_output() is done executing, and handles CPU-side post-processing such as saving output data to disk

This will leave the top-level run_simulation() function focused on high-level simulation steering logic, thus making the simulation code easier to understand overall.

Definition

We will begin by centralizing all state needed to run the simulation into a single struct:

use self::data::VBuffer;

/// State of the simulation
struct SimulationRunner<'run_simulation, ProcessV> {
    /// Configuration that was passed to [`run_simulation()`]
    options: &'run_simulation RunnerOptions,

    /// Vulkan context that was passed to [`run_simulation()`]
    context: &'run_simulation Context,

    /// Compute pipelines used to perform simulation steps
    pipelines: Pipelines,

    /// Chemical concentration storage
    concentrations: Concentrations,

    /// Output processing logic, if enabled
    output_handler: Option<OutputHandler<ProcessV>>,

    /// Next command buffer to be executed
    cmdbuild: CommandBufferBuilder,
}
//
/// State associated with output downloads and post-processing
struct OutputHandler<ProcessV> {
    /// CPU-accessible location to which GPU outputs should be downloaded
    v_buffer: VBuffer,

    /// User-defined post-processing logic for this CPU data
    process_v: ProcessV,
}

While largely straightforward, this pair of struct definitions uses a couple of Rust type system features that have not been presented in this course yet:

  • A struct is allowed to contain references to external state, but these references must be associated with lifetime parameters, whose names start with a single quote. Here the two references come from parameters of the run_simulation() function, and are thus associated with a single lifetime called 'run_simulation.1
  • Generic Rust types do not need to specify all their trait bounds upfront. They can introduce a type parameter without any trait bound, then narrow down the required trait bounds where needed later on. This is a pretty useful trick for avoiding the clutter of repeated trait bounds in generic Rust code, and here we use it to avoid stating over and over again that ProcessV must be a function with a certain signature. A minimal sketch of this pattern follows this list.
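
Here is a minimal, self-contained sketch of this second pattern, which is unrelated to the simulation code:

/// No trait bound is needed at type definition time...
struct Repeater<Callback> {
    callback: Callback,
}
//
impl<Callback: FnMut() -> u32> Repeater<Callback> {
    /// ...as the bound can be introduced where the callback is actually used
    fn call_twice(&mut self) -> u32 {
        (self.callback)() + (self.callback)()
    }
}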

Initialization

A SimulationRunner is initialized by receiving all parameters of the top-level run_simulation() function and building all internal objects out of them…

impl<'run_simulation, ProcessV> SimulationRunner<'run_simulation, ProcessV>
where
    ProcessV: FnMut(ArrayView2<Float>) -> Result<()>,
{
    /// Set up the simulation
    fn new(
        options: &'run_simulation RunnerOptions,
        context: &'run_simulation Context,
        process_v: Option<ProcessV>,
    ) -> Result<Self> {
        // Set up the compute pipelines
        let pipelines = Pipelines::new(options, context)?;

        // Set up the initial command buffer builder
        let mut cmdbuild = command_buffer_builder(context)?;

        // Set up chemical concentrations storage and schedule its initialization
        let concentrations =
            Concentrations::create_and_schedule_init(options, context, &pipelines, &mut cmdbuild)?;

        // Set up the logic for post-processing V concentration, if enabled
        let output_handler = if let Some(process_v) = process_v {
            Some(OutputHandler {
                v_buffer: VBuffer::new(options, context)?,
                process_v,
            })
        } else {
            None
        };

        // We're now ready to perform simulation steps
        Ok(Self {
            options,
            context,
            pipelines,
            concentrations,
            output_handler,
            cmdbuild,
        })
    }

    // [ ... more methods coming up ... ]
}

…which, if you have been following the previous chapters, should not be terribly surprising. The main points of interest here are that…

  • The internal command buffer builder initially contains the commands needed to initialize the chemical concentration storage, which have not been executed yet.
  • An internal OutputHandler and its associated VBuffer are only set up if the user expressed interest in processing the output of the simulation. Otherwise, the internal output_handler member will forever remain None, which will disable GPU-to-CPU downloads and output post-processing in the rest of SimulationRunner.

Command buffer building

Now that the simulation has been set up, we are ready to start producing concentration images. Because GPU command execution is asynchronous, this will be a three-step process:

  1. Collect GPU commands into a command buffer.
  2. Submit the command buffer to the GPU and await its execution.
  3. Process the results on the CPU side if needed.

The schedule_next_output() method of SimulationRunner will implement the first of these three steps in the following way:

use std::sync::Arc;

impl<'run_simulation, ProcessV> SimulationRunner<'run_simulation, ProcessV>
where
    ProcessV: FnMut(ArrayView2<Float>) -> Result<()>,
{
    // [ ... ]

    /// Build a command buffer that will produce the next simulation output
    fn schedule_next_output(&mut self) -> Result<Arc<PrimaryAutoCommandBuffer>> {
        // Schedule a number of simulation steps
        schedule_simulation(
            self.options,
            &self.pipelines,
            &mut self.concentrations,
            &mut self.cmdbuild,
        )?;

        // Schedule a download of the resulting V concentration, if enabled
        if let Some(handler) = &mut self.output_handler {
            handler
                .v_buffer
                .schedule_download(&self.concentrations, &mut self.cmdbuild)?;
        }

        // Extract the old command buffer builder, replacing it with a blank one
        let old_cmdbuild =
            std::mem::replace(&mut self.cmdbuild, command_buffer_builder(self.context)?);

        // Build the command buffer
        Ok(old_cmdbuild.build()?)
    }

    // [ ... ]
}

Again, there should be nothing terribly surprising here, given the former sections of this course:

  • We schedule simulation steps in the way that was discussed earlier, after any commands initially present in the internal command buffer builder.
  • If a GPU-to-CPU download must be performed, we schedule it afterwards.
  • Finally, we replace our internal command buffer builder with a new one, and build the command buffer associated with the former command buffer builder.

This last step may seem a little convoluted when considered in isolation. What it gives us is the ability to seamlessly schedule simulation dataset initialization along with the first simulation steps.

We could instead save ourselves from the trouble of maintaining an internal command buffer builder by building a new command buffer at the start of schedule_next_output(). But then we would not be able to bundle the dataset initialization job with the first simulation steps, and thus would need a more complex initialization procedure with reduced execution efficiency.
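
If you have not seen std::mem::replace before, here is a minimal sketch of this pattern, unrelated to the simulation code. It lets us move a value out from behind a &mut reference by swapping a fresh value into its place, where a plain assignment would not suffice:

struct Logger {
    /// Messages accumulated since the last flush
    messages: Vec<String>,
}
//
impl Logger {
    /// Hand over the accumulated messages, then start accumulating a new batch
    fn flush(&mut self) -> Vec<String> {
        std::mem::replace(&mut self.messages, Vec::new())
    }
}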

Output processing

After the command buffer produced by schedule_next_output() has been submitted to the GPU and is done executing, we may need to execute some CPU-side output processing steps, such as saving the output data to an HDF5 file. This work is taken care of by the third and last method of SimulationRunner, process_output():

impl<'run_simulation, ProcessV> SimulationRunner<'run_simulation, ProcessV>
where
    ProcessV: FnMut(ArrayView2<Float>) -> Result<()>,
{
    // [ ... ]

    /// Process the simulation output, if enabled
    ///
    /// This method should be run after the command buffer produced by
    /// [`schedule_next_output()`](Self::schedule_next_output) has been
    /// submitted to the GPU and its execution has been awaited.
    fn process_output(&mut self) -> Result<()> {
        if let Some(handler) = &mut self.output_handler {
            handler.v_buffer.process(&mut handler.process_v)?;
        }
        Ok(())
    }
}

Putting it all together

Given these high-level building blocks, we can finally put them together by writing the new version of the run_simulation() entry point:

use vulkano::{
    command_buffer::PrimaryCommandBufferAbstract,
    sync::GpuFuture,
};

/// Simulation runner, with a user-specified output processing function
pub fn run_simulation<ProcessV: FnMut(ArrayView2<Float>) -> Result<()>>(
    options: &RunnerOptions,
    context: &Context,
    process_v: Option<ProcessV>,
) -> Result<()> {
    // Set up the simulation
    let mut runner = SimulationRunner::new(options, context, process_v)?;

    // Produce the requested amount of concentration tables
    for _ in 0..options.num_output_images {
        // Prepare a GPU command buffer that produces the next output
        let cmdbuf = runner.schedule_next_output()?;

        // Submit the work to the GPU and wait for it to execute
        cmdbuf
            .execute(context.queue.clone())?
            .then_signal_fence_and_flush()?
            .wait(None)?;

        // Process the simulation output, if enabled
        runner.process_output()?;
    }
    Ok(())
}

For now, it is very simple. It just sets up a SimulationRunner and proceeds to use it to produce the user-requested number of output images by repeatedly…

  • Preparing a command buffer that steps the simulation and downloads the output if needed
  • Submitting the command buffer to the GPU, then immediately waiting for it to execute
  • Performing any user-requested post-processing on the CPU side

If you have understood the importance of asynchronous work execution in GPU programs, this simple synchronous logic may set off some performance alarm bells in your head, but don’t worry: this is just a starting point, and we will improve its performance by making more things asynchronous in the subsequent optimization chapters.

For now, we are done with the parts of the simulation logic that are shared between the main binary and the microbenchmark, so you can basically replace the entire contents of exercises/src/grayscott/mod.rs with the code described above.

Main simulation binary

Because we have altered the signature of run_simulation() to make GPU-to-CPU downloads optional, we must alter the logic of the main simulation binary a little bit. Its call to run_simulation() will now look like this:

use grayscott_exercises::data::Float;
use ndarray::ArrayView2;

run_simulation(
    &options.runner,
    &context,
    Some(|v: ArrayView2<Float>| {
        // Write down the current simulation output
        hdf5.write(v)?;

        // Update the progress bar to take note that one image was produced
        progress.inc(1);
        Ok(())
    }),
)?;

The main thing worth noting here is that we now need to explicitly spell out the type of data that process_v takes as input, or else type inference will pick the wrong type and you will get a strange compiler error message about ProcessV not being generic enough.

This is a consequence of the Rust compiler’s closure parameter type inference having a couple of very annoying bugs in its handling of references, whose full explanation goes well beyond the scope of this introductory course. We will just say that sometimes you will need to nudge closure parameter type inference in the right direction, as done here, and sometimes you will need to replace closures with something else (a true function or a trait object).
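
As a self-contained illustration of this problem, consider the following toy example, which revolves around a hypothetical take_callback() function rather than the real run_simulation():

/// Toy stand-in for run_simulation(), for illustration purposes only
fn take_callback<F: FnMut(ArrayView2<Float>) -> Result<()>>(_process_v: Option<F>) {}

// Without an annotation, the closure's parameter type may be inferred with one
// specific lifetime, in which case the compiler will complain that the closure
// is "not general enough" to handle views of any lifetime:
//
//     take_callback(Some(|v| { println!("{}", v.sum()); Ok(()) }));

// With an explicit parameter type, inference is steered towards a closure that
// accepts views of any lifetime, and compilation succeeds:
take_callback(Some(|v: ArrayView2<Float>| {
    println!("{}", v.sum());
    Ok(())
}));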

Exercise

Integrate the above code into the main simulation binary (exercises/src/bin/simulate.rs), then…

  • Do a simulation test run (cargo run --release -- -n100)
  • Use mkdir -p pics && data-to-pics -o pics to convert the output data into PNG images
  • Use your favorite image viewer to check that the resulting images look about right

Beyond that, the simulate benchmark (exercises/benches/simulate.rs) has been pre-written for you in order to exercise the final simulation engine in various configurations. Check out the code to get a general idea of how it works, then run it for a while (cargo bench --bench simulate) and see how the various tunable parameters affect performance.

Do not forget that you can also pass in a regular expression argument (as in e.g. cargo bench --bench simulate -- '2048x.*compute$') in order to only benchmark specific configurations.


  1. There is a lot more to Rust lifetimes than this short description suggests. They are basically the language constructs through which a Rust API designer can express which function inputs a function output can borrow data from, so that callers can be confident that a change to a function’s implementation will not accidentally break their code without changing the function’s signature. And the fact that we can afford to use a single lifetime for two references here hides a surprising amount of complexity.