Integration

After a long journey, we are once again reaching the last mile where we almost have a complete Gray-Scott reaction simulation. In this chapter, we will proceed to walk this last mile and get everything working again, on GPU this time.

Command buffer building

As we go through the optimization chapters, we will want to build command buffers for submitting work to the GPU in a growing number of places. Because our needs are rather simple, we will always build them in the same way, which is worth extracting into a utility function and type alias:

use crate::{context::Context, Result};
use vulkano::command_buffer::{
    AutoCommandBufferBuilder, CommandBufferUsage, PrimaryAutoCommandBuffer
};

/// Convenience type alias for primary command buffer builders
type CommandBufferBuilder = AutoCommandBufferBuilder<PrimaryAutoCommandBuffer>;

/// Set up a new command buffer builder
fn command_buffer_builder(context: &Context) -> Result<CommandBufferBuilder> {
    let cmdbuild = CommandBufferBuilder::primary(
        context.comm_allocator.clone(),
        context.queue.queue_family_index(),
        CommandBufferUsage::OneTimeSubmit,
    )?;
    Ok(cmdbuild)
}

The configuration encoded in this utility function sets up command buffers that are…

  • Recorded using vulkano’s high-level and safe AutoCommandBufferBuilder API (hidden behind the CommandBufferBuilder type alias), which takes care of injecting safety-critical pipeline and memory barriers between commands for us.
  • Primary command buffers, which can be submitted to the GPU as-is. This is in contrast to secondary buffers, which must be added to a primary buffer before submission.
  • Private to a particular Vulkan queue family. This is a straightforward decision right now as we only use a single Vulkan queue, but we will need to revisit it later on.
  • Meant for one-time use only. Per GPU vendor documentation, this is the recommended default for programs that are not bottlenecked by command buffer recording, which is our case. Reusable command buffers only save command buffer recording time at the expense of increasing GPU driver overhead and/or reducing command buffer execution efficiency on the GPU side, and that is not the right tradeoff for us.
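
For concreteness, here is a minimal usage sketch of this helper that records and submits an empty command buffer, then waits for its execution. It assumes a context: &Context like the one used throughout this chapter, and mirrors the submission pattern that we will encounter again at the end of this chapter:

use vulkano::{command_buffer::PrimaryCommandBufferAbstract, sync::GpuFuture};

// Set up a builder, record commands into it (none here), then build the buffer
let cmdbuild = command_buffer_builder(context)?;
let cmdbuf = cmdbuild.build()?;

// Submit the command buffer to our queue and wait for it to finish executing
cmdbuf
    .execute(context.queue.clone())?
    .then_signal_fence_and_flush()?
    .wait(None)?;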

Simulation commands

In the previous chapter, we worked to improve separation of concerns across the simulation codebase, so that no single function is responsible for all command buffer manipulation work.

Thanks to this work, we can have a simulation scheduling function that is conceptually simpler than the build_command_buffer() function we used to have in our number-squaring program:

use self::{
    data::Concentrations,
    options::RunnerOptions,
    pipeline::{Pipelines, INOUT_SET},
};
use std::num::NonZeroU32;
use vulkano::pipeline::PipelineBindPoint;

/// Record the commands needed to run a bunch of simulation iterations
fn schedule_simulation(
    options: &RunnerOptions,
    pipelines: &Pipelines,
    concentrations: &mut Concentrations,
    cmdbuild: &mut CommandBufferBuilder,
) -> Result<()> {
    // Determine the appropriate dispatch size for the simulation
    let dispatch_size = |domain_size: usize, workgroup_size: NonZeroU32| {
        domain_size.div_ceil(workgroup_size.get() as usize) as u32
    };
    let simulate_workgroups = [
        dispatch_size(options.num_cols, options.pipeline.workgroup_cols),
        dispatch_size(options.num_rows, options.pipeline.workgroup_rows),
        1,
    ];

    // Schedule the requested number of simulation steps
    cmdbuild.bind_pipeline_compute(pipelines.step.clone())?;
    for _ in 0..options.steps_per_image {
        concentrations.update(|inout_set| {
            cmdbuild.bind_descriptor_sets(
                PipelineBindPoint::Compute,
                pipelines.layout.clone(),
                INOUT_SET,
                inout_set,
            )?;
            // SAFETY: GPU shader has been checked for absence of undefined behavior
            //         given a correct execution configuration, and this is one
            unsafe {
                cmdbuild.dispatch(simulate_workgroups)?;
            }
            Ok(())
        })?;
    }
    Ok(())
}

There are a few things worth pointing out here:

  • Unlike our former build_command_buffer() function, this function does not build its own command buffer, but only adds extra commands to a caller-allocated existing command buffer. This will allow us to handle data initialization more elegantly later.
  • We are computing the compute pipeline dispatch size on each run of this function, which, depending on compiler optimizations, may or may not result in redundant work. The overhead of this computation should be so small compared to everything else in this function that we do not expect the inefficiency to matter, but we will check this when the time comes to profile our program’s CPU utilization. A worked example of the dispatch size computation follows this list.
  • We are enqueuing an unbounded amount of commands to our command buffer here, and the GPU will not start executing work until we are done building and submitting the associated command buffer. As we will later see in this course’s optimization section, this can become a problem in unusual execution configurations where thousands of simulation steps occur between each generated image. The way to fix this problem will be discussed in the corresponding course chapter, after taking care of higher-priority optimizations.
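
To make the dispatch size computation more concrete, here is a worked example with hypothetical numbers that are not necessarily those used in this course: a 1920×1080 domain covered by 8×16 workgroups.

// ceil(1920 / 8) = 240 workgroups are needed to cover 1920 columns
assert_eq!(1920_usize.div_ceil(8), 240);

// When the domain size is not a multiple of the workgroup size, div_ceil()
// rounds up: ceil(1080 / 16) = 68. The extra GPU threads of the last workgroup
// then fall outside of the simulation domain and must be taken care of, e.g.
// through bounds checks or domain padding.
assert_eq!(1080_usize.div_ceil(16), 68);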

Output processing

In the CPU simulation, the top-level run_simulation() function would unconditionally accept a process_v callback that receives the V species’ concentration as an &Array2<Float> and saves it to disk. We should change this in the GPU version for two different reasons:

  • Downloading the V species’ concentration from the GPU side to the CPU side can be expensive. By allowing the caller not to do so, we can have more focused microbenchmarks that measure our simulation’s performance in a finer-grained way:
    • One “compute” benchmark will measure the raw speed at which we perform simulation steps, taking GPU-to-CPU downloads out of the equation.
    • One “compute+download” benchmark will download GPU outputs to the CPU side, without using them. By comparing the performance of this benchmark to that of the “compute” benchmark, we will see how efficiently we handle GPU-to-CPU downloads.
    • One “compute+download+sum” benchmark will download GPU outputs to the CPU side and use them by computing their sum on the CPU side. By comparing the performance of this benchmark to that of the “compute+download” benchmark, we will see how well we can overlap GPU and CPU work through asynchronous GPU execution.
    • …and finally it will remain possible to use the simulate binary to study the simulation’s HDF5 I/O performance on a particular machine.
  • Due to the existence of padding zeroes and peculiarities of the Subbuffer API from vulkano, VBuffer::process() is unable to provide the V species’ concentration as an &Array2<Float> reference to an owned N-dimensional array. It must instead provide an ArrayView2<Float> view over its internal CPU-accessible dataset, as sketched right after this list.
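
To clarify this last point, here is a minimal sketch of how an ArrayView2 can expose only the meaningful part of a padded storage buffer through custom strides. The numbers are made up for illustration and do not reflect the actual memory layout used by VBuffer:

use ndarray::{ArrayView2, ShapeBuilder};

// Hypothetical layout: a 4x3 domain stored with one padding element per row,
// which means that consecutive rows are 4 elements apart in storage
let padded_storage = vec![0.0f32; 16];
let view = ArrayView2::from_shape(
    (4, 3).strides((4, 1)),  // 4 rows of 3 columns, rows 4 elements apart
    &padded_storage,
)?;
assert_eq!(view.dim(), (4, 3));  // The view skips over the padding elements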

Taking all this together, the run_simulation() function signature should be changed as follows…

use self::data::Float;
use ndarray::ArrayView2;

/// Simulation runner, with a user-specified output processing function
pub fn run_simulation<ProcessV: FnMut(ArrayView2<Float>) -> Result<()>>(
    options: &RunnerOptions,
    context: &Context,
    process_v: Option<ProcessV>,
) -> Result<()> {
    // [ ... simulation logic will go here ... ]
}

…but if you have some previous Rust experience, that Option<ProcessV> function parameter, which optionally refers to a generic type parameter, will likely make you uneasy.

Indeed, such function signatures have a nasty tendency to cause type inference problems, because when we set the process_v parameter to None on the caller side…

// This is an example of run_simulation() call site that you need not copy

use grayscott_exercises::run_simulation;

// The Rust compiler will reject this call as there is no way to infer ProcessV
run_simulation(options, context, None)?;

…the compiler is provided with no information to guess what the ProcessV generic type might be and will error out as a result.

In an ideal world, we could just resolve this by giving the ProcessV parameter of the run_simulation() function a default value. But we are not living in this ideal world, and Rust does not allow functions to have default type parameters yet. It has been attempted before, but the infrastructure was not ready at the time, so the unstable feature has been removed for now.

Failing that, one workaround we can use today is to define a type alias for the default type parameter that we would like to have…

/// Dummy `ProcessV` type, to be used when you do not specify a `process_v` hook
/// as an input to `run_simulation()`
pub type DummyProcessV = fn(ArrayView2<Float>) -> Result<()>;

…and advise callers to use this type alias as follows when needed:

// This is another example of run_simulation() call site that you need not copy

use grayscott_exercises::{DummyProcessV, run_simulation};

// This ProcessV type inference hint will make the Rust compiler happy
run_simulation::<DummyProcessV>(options, context, None)?;

There are other workarounds for this annoying language/compiler limitation, such as using dynamic dispatch instead of static dispatch, and those come with different tradeoffs. But for the purpose of this course, this particular workaround will be good enough.
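
For reference, here is what the dynamically dispatched alternative could look like. This is only a sketch with a hypothetical run_simulation_dyn name, not a signature that is actually used in this course:

/// Dynamically dispatched variant: the callback type is erased behind a trait
/// object, so passing `None` does not require any type inference hint anymore
pub fn run_simulation_dyn(
    options: &RunnerOptions,
    context: &Context,
    process_v: Option<&mut dyn FnMut(ArrayView2<Float>) -> Result<()>>,
) -> Result<()> {
    // [ ... same simulation logic as run_simulation() ... ]
}

The price to pay is that every simulation output would then go through an indirect function call, and callers that do want output processing would need to pass their callback in as Some(&mut callback).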

Simulation runner

Now that the signature of our run_simulation() function is fixed, it is time to ask the question: what code should we put inside of it?

Because GPU code is more complex than CPU code, doing everything inside of the body of run_simulation() as we did in the CPU course will result in a function that is rather complex, and will become inscrutably so after a few optimizations. Therefore, we will extract most logic into a SimulationRunner struct that provides the following methods:

  • A new() constructor sets up everything needed to run simulation steps
  • A schedule_next_output() method prepares GPU commands to produce one output image
  • A process_output() method is called after the work scheduled by schedule_next_output() is done executing, and handles CPU-side post-processing such as saving output data to disk

This will leave the top-level run_simulation() function focused on high-level simulation steering logic, thus making the simulation code easier to understand overall.

Definition

We will begin by centralizing all state needed to run the simulation into a single struct:

use self::data::VBuffer;

/// State of the simulation
struct SimulationRunner<'run_simulation, ProcessV> {
    /// Configuration that was passed to [`run_simulation()`]
    options: &'run_simulation RunnerOptions,

    /// Vulkan context that was passed to [`run_simulation()`]
    context: &'run_simulation Context,

    /// Compute pipelines used to perform simulation steps
    pipelines: Pipelines,

    /// Chemical concentration storage
    concentrations: Concentrations,

    /// Output processing logic, if enabled
    output_handler: Option<OutputHandler<ProcessV>>,

    /// Next command buffer to be executed
    cmdbuild: CommandBufferBuilder,
}
//
/// State associated with output downloads and post-processing
struct OutputHandler<ProcessV> {
    /// CPU-accessible location to which GPU outputs should be downloaded
    v_buffer: VBuffer,

    /// User-defined post-processing logic for this CPU data
    process_v: ProcessV,
}

While largely straightforward, this pair of struct definitions uses a couple of Rust type system features that have not been presented in this course yet:

  • A struct is allowed to contain references to external state, but these references must be associated with lifetime parameters, whose names start with a single quote. Here the two references come from parameters of the run_simulation() function, and are thus associated with a single lifetime called 'run_simulation.1
  • Generic Rust types do not need to specify all their trait bounds upfront. They can introduce a type parameter without any trait bound, then narrow down the required trait bounds where needed later on. This is a pretty useful trick for avoiding the clutter of repeated trait bounds in generic Rust code, and here we use it to avoid stating over and over again that ProcessV must be a function with a certain signature. A minimal sketch of this pattern follows this list.
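
Here is a minimal, self-contained sketch of this second pattern, which is unrelated to the simulation code:

/// No trait bound is needed at type definition time...
struct Repeater<Callback> {
    callback: Callback,
}
//
impl<Callback: FnMut() -> u32> Repeater<Callback> {
    /// ...as the bound can be introduced where the callback is actually used
    fn call_twice(&mut self) -> u32 {
        (self.callback)() + (self.callback)()
    }
}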

Initialization

A SimulationRunner is initialized by receiving all parameters of the top-level run_simulation() function and building all internal objects out of them…

impl<'run_simulation, ProcessV> SimulationRunner<'run_simulation, ProcessV>
where
    ProcessV: FnMut(ArrayView2<Float>) -> Result<()>,
{
    /// Set up the simulation
    fn new(
        options: &'run_simulation RunnerOptions,
        context: &'run_simulation Context,
        process_v: Option<ProcessV>,
    ) -> Result<Self> {
        // Set up the compute pipelines
        let pipelines = Pipelines::new(options, context)?;

        // Set up the initial command buffer builder
        let mut cmdbuild = command_buffer_builder(context)?;

        // Set up chemical concentrations storage and schedule its initialization
        let concentrations =
            Concentrations::create_and_schedule_init(options, context, &pipelines, &mut cmdbuild)?;

        // Set up the logic for post-processing V concentration, if enabled
        let output_handler = if let Some(process_v) = process_v {
            Some(OutputHandler {
                v_buffer: VBuffer::new(options, context)?,
                process_v,
            })
        } else {
            None
        };

        // We're now ready to perform simulation steps
        Ok(Self {
            options,
            context,
            pipelines,
            concentrations,
            output_handler,
            cmdbuild,
        })
    }

    // [ ... more methods coming up ... ]
}

…which, if you have been following the previous chapters, should not be terribly surprising. The main points of interest here are that…

  • The internal command buffer builder initially contains the commands needed to initialize the chemical concentration storage, which have not been executed yet.
  • An internal OutputHandler and its associated VBuffer are only set up if the user expressed interest in processing the output of the simulation. Otherwise, the internal output_handler member will forever remain None, which will disable GPU-to-CPU downloads and output post-processing in the rest of SimulationRunner.

Command buffer building

Now that the simulation has been set up, we are ready to start producing concentration images. Because GPU command execution is asynchronous, this will be a three-step process:

  1. Collect GPU commands into a command buffer.
  2. Submit the command buffer to the GPU and await its execution.
  3. Process the results on the CPU side if needed.

The schedule_next_output() method of SimulationRunner will implement the first of these three steps in the following way:

use std::sync::Arc;

impl<'run_simulation, ProcessV> SimulationRunner<'run_simulation, ProcessV>
where
    ProcessV: FnMut(ArrayView2<Float>) -> Result<()>,
{
    // [ ... ]

    /// Build a command buffer that will produce the next simulation output
    fn schedule_next_output(&mut self) -> Result<Arc<PrimaryAutoCommandBuffer>> {
        // Schedule a number of simulation steps
        schedule_simulation(
            self.options,
            &self.pipelines,
            &mut self.concentrations,
            &mut self.cmdbuild,
        )?;

        // Schedule a download of the resulting V concentration, if enabled
        if let Some(handler) = &mut self.output_handler {
            handler
                .v_buffer
                .schedule_download(&self.concentrations, &mut self.cmdbuild)?;
        }

        // Extract the old command buffer builder, replacing it with a blank one
        let old_cmdbuild =
            std::mem::replace(&mut self.cmdbuild, command_buffer_builder(self.context)?);

        // Build the command buffer
        Ok(old_cmdbuild.build()?)
    }

    // [ ... ]
}

Again, there should be nothing terribly surprising here, given the former sections of this course:

  • We schedule simulation steps in the way that was discussed earlier, after any commands initially present in the internal command buffer builder.
  • If a GPU-to-CPU download must be performed, we schedule it afterwards.
  • Finally, we replace our internal command buffer builder with a new one, and build the command buffer associated with the former command buffer builder.

This last step may seem a little convoluted when considered in isolation. What it gives us is the ability to seamlessly schedule simulation dataset initialization along with the first simulation steps.

We could instead save ourselves from the trouble of maintaining an internal command buffer builder by building a new command buffer at the start of schedule_next_output(). But then we would not be able to bundle the dataset initialization job with the first simulation steps, and thus would need a more complex initialization procedure with reduced execution efficiency.
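
If you have not seen std::mem::replace before, here is a minimal sketch of this pattern, unrelated to the simulation code. It lets us move a value out from behind a &mut reference by swapping a fresh value into its place, where a plain assignment would not suffice:

struct Logger {
    /// Messages accumulated since the last flush
    messages: Vec<String>,
}
//
impl Logger {
    /// Hand over the accumulated messages, then start accumulating a new batch
    fn flush(&mut self) -> Vec<String> {
        std::mem::replace(&mut self.messages, Vec::new())
    }
}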

Output processing

After the command buffer produced by schedule_next_output() has been submitted to the GPU and is done executing, we may need to execute some CPU-side output processing steps, such as saving the output data to an HDF5 file. This work is taken care of by the third and last method of SimulationRunner, process_output():

impl<'run_simulation, ProcessV> SimulationRunner<'run_simulation, ProcessV>
where
    ProcessV: FnMut(ArrayView2<Float>) -> Result<()>,
{
    // [ ... ]

    /// Process the simulation output, if enabled
    ///
    /// This method should be run after the command buffer produced by
    /// [`schedule_next_output()`](Self::schedule_next_output) has been
    /// submitted to the GPU and its execution has been awaited.
    fn process_output(&mut self) -> Result<()> {
        if let Some(handler) = &mut self.output_handler {
            handler.v_buffer.process(&mut handler.process_v)?;
        }
        Ok(())
    }
}

Putting it all together

Given these high-level building blocks, we can finally put them together by writing the new version of the run_simulation() entry point:

use vulkano::{
    command_buffer::PrimaryCommandBufferAbstract,
    sync::GpuFuture,
};

/// Simulation runner, with a user-specified output processing function
pub fn run_simulation<ProcessV: FnMut(ArrayView2<Float>) -> Result<()>>(
    options: &RunnerOptions,
    context: &Context,
    process_v: Option<ProcessV>,
) -> Result<()> {
    // Set up the simulation
    let mut runner = SimulationRunner::new(options, context, process_v)?;

    // Produce the requested amount of concentration tables
    for _ in 0..options.num_output_images {
        // Prepare a GPU command buffer that produces the next output
        let cmdbuf = runner.schedule_next_output()?;

        // Submit the work to the GPU and wait for it to execute
        cmdbuf
            .execute(context.queue.clone())?
            .then_signal_fence_and_flush()?
            .wait(None)?;

        // Process the simulation output, if enabled
        runner.process_output()?;
    }
    Ok(())
}

For now, it is very simple. It just sets up a SimulationRunner and proceeds to use it to produce the user-requested number of output images by repeatedly…

  • Preparing a command buffer that steps the simulation and downloads the output if needed
  • Submitting the command buffer to the GPU, then immediately waiting for it to execute
  • Performing any user-requested post-processing on the CPU side

If you have understood the importance of asynchronous work execution in GPU programs, this simple synchronous logic may set off some performance alarm bells in your head, but don’t worry: this is just a starting point, and we will improve its performance by making more things asynchronous in the subsequent optimization chapters.

For now, we are done with the parts of the simulation logic that are shared between the main binary and the microbenchmark, so you can basically replace the entire contents of exercises/src/grayscott/mod.rs with the code described above.

Main simulation binary

Because we have altered the signature of run_simulation() to make GPU-to-CPU downloads optional, we must alter the logic of the main simulation binary a little bit. Its call to run_simulation() will now look like this:

use grayscott_exercises::data::Float;
use ndarray::ArrayView2;

run_simulation(
    &options.runner,
    &context,
    Some(|v: ArrayView2<Float>| {
        // Write down the current simulation output
        hdf5.write(v)?;

        // Update the progress bar to take note that one image was produced
        progress.inc(1);
        Ok(())
    }),
)?;

The main thing worth noting here is that we now need to explicitly spell out the type of data that process_v takes as input, or else type inference will pick the wrong type and you will get a strange compiler error message about ProcessV not being generic enough.

This is a consequence of the Rust compiler’s closure parameter type inference having a couple of very annoying bugs in its handling of references, whose full explanation goes well beyond the scope of this introductory course. We will just say that sometimes you will need to nudge closure parameter type inference in the right direction, as done here, and sometimes you will need to replace closures with something else (a true function or a trait object).
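
As a self-contained illustration of this problem, consider the following toy example, which revolves around a hypothetical take_callback() function rather than the real run_simulation():

/// Toy stand-in for run_simulation(), for illustration purposes only
fn take_callback<F: FnMut(ArrayView2<Float>) -> Result<()>>(_process_v: Option<F>) {}

// Without an annotation, the closure's parameter type may be inferred with one
// specific lifetime, in which case the compiler will complain that the closure
// is "not general enough" to handle views of any lifetime:
//
//     take_callback(Some(|v| { println!("{}", v.sum()); Ok(()) }));

// With an explicit parameter type, inference is steered towards a closure that
// accepts views of any lifetime, and compilation succeeds:
take_callback(Some(|v: ArrayView2<Float>| {
    println!("{}", v.sum());
    Ok(())
}));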

Exercise

Integrate the above code into the main simulation binary (exercises/src/bin/simulate.rs), then…

  • Do a simulation test run (cargo run --release -- -n100)
  • Use mkdir -p pics && data-to-pics -o pics to convert the output data into PNG images
  • Use your favorite image viewer to check that the resulting images look about right

Beyond that, the simulate benchmark (exercises/benches/simulate.rs) has been pre-written for you in order to exercise the final simulation engine in various configurations. Check out the code to get a general idea of how it works, then run it for a while (cargo bench --bench simulate) and see how the various tunable parameters affect performance.

Do not forget that you can also pass in a regular expression argument (as in e.g. cargo bench --bench simulate -- '2048x.*compute$') in order to only benchmark specific configurations.


  1. There is a lot more to Rust lifetimes than this short description suggests. They are basically the language constructs through which a Rust API designer can express which function inputs a function output can borrow data from, so that callers can be confident that a change to a function’s implementation will not accidentally break their code without changing the function’s signature. And the fact that we can afford to use a single lifetime for two references here hides a surprising amount of complexity.