Asynchronous storage

Identifying the bottleneck

Now that our Gray-Scott reaction simulation is up and running and seems to produce sensible results, it is time to optimize it. But this raises the question: what should we optimize first?

The author’s top suggestion here would be to use a profiling tool to analyze where time is spent. Unfortunately, the GPU profiling ecosystem is messier than it should be, and there is no single tool that will work for every environment configuration that you may use to take this course.

Therefore, we will have to resort to the slower approach of learning things about our application’s performance by asking ourselves questions and answering them through experiments.

One first question that we can ask is whether our application is most limited by the speed at which it performs computations or the speed at which it writes data to storage. On Linux, this question can easily be answered by comparing two timed runs of the application:

  • One in the default configuration, where output data is written to the main storage device.
  • One in a configuration where output data is written to RAM using tmpfs magic.

Because RAM is much faster than nonvolatile storage devices, even when it is accessed through the tmpfs filesystem, a large difference between these two timings will be a dead giveaway that our performance is limited by storage performance…

# Write output to main storage (default)
$ rm -f output.h5  \
  && cargo build --release --bin simulate  \
  && time (cargo run --release --bin simulate && sync)
    [ ... ]
real    2m23,493s
user    0m2,612s
sys     0m6,254s

# Write output to /dev/shm ramdisk
$ rm -f /dev/shm/output.h5  \
  && cargo build --release --bin simulate  \
  && time (cargo run --release --bin simulate -- -o /dev/shm/output.h5 && sync)
    [ ... ]
real    0m16,290s
user    0m2,519s
sys     0m3,592s

…and indeed, it looks like storage performance is our main bottleneck here.

By the way, notice the use of the sync command above, which waits for pending writes to be committed to the underlying storage. Without it, our sneaky operating system (in this case Linux) would not reliably wait for all writes to the target storage to be finished before declaring the job finished, which would make our I/O timing measurements unpredictable and meaningless.

Picking a strategy

Storage performance bottlenecks can be tackled in various ways. Here are some things that we could try in rough order of decreasing expected performance impact:

  1. Make sure we are using the fastest available storage device that fits our needs
  2. Install a faster storage device into the machine and use it
  3. Store less data (e.g. spend more simulation steps between two writes)
  4. Store lower-precision data (e.g. half-precision floats, other lossy compression)
  5. Store the same data more efficiently (lossless compression e.g. LZ4)
  6. Offload storage access to dedicated CPU threads so it doesn’t need to wait for compute
  7. Tune lower-level parameters of the underlying storage I/O e.g. block size, data format…

Our performance test above was arguably already an example of strategy 1 at work: since a ramdisk is almost always the fastest storage device available, it should always be considered as an option for file outputs of modest size that do not need non-volatile storage.

But because this school is focused on computation performance, we will only cover strategy 6, owing to its remarkable ease of implementation, before switching to an extreme version of option 3 where we will simply disable storage I/O and focus our attention on compute performance only.

Asynchronous I/O 101

One simple scheme for offloading I/O to a dedicated thread without changing output file contents is to have the compute and I/O threads communicate via a bounded FIFO queue.

In this scheme, the main compute thread will submit data to this queue as soon as it becomes available, while the I/O thread will fetch data from that queue and write it to the storage device. Depending on the relative speed at which each thread is working, two things may happen:

  • If the compute thread is faster than the I/O thread, the FIFO queue will quickly fill up until it reaches its maximal capacity, and then the compute thread will block. As I/O tasks complete, the compute thread will be woken up to compute more data. Overall…
    • The I/O thread will be working 100% of the time, from its perspective it will look like input data is computed instantaneously. That’s the main goal of this optimization.
    • The compute thread will be intermittently stopped to leave the I/O thread some time to process incoming data, thus preventing a scenario where data accumulates indefinitely, resulting in unbounded RAM footprint growth. This process, called backpressure, is a vital part of any well-designed asynchronous I/O implementation.
  • If the I/O thread were faster than the compute thread, then the situation would be somewhat reversed: the compute thread would be working 100% of the time, while the I/O thread would intermittently block waiting for data.
    • This is where we would have ended up if we implemented this optimization back in the CPU course, where the computation was too slow to saturate the I/O device.
    • In this situation, asynchronous I/O is a more dubious optimization because as we will see it has a small CPU cost, which we don’t want to pay when CPU computations already are the performance-limiting factor.

Real-world apps will not perform all computations and I/O transactions at the same speed, which may lead them to alternate between these two behaviors. In that case, increasing the capacity of the bounded FIFO queue may be helpful:

  • On the main compute thread side, it will allow compute to get ahead of I/O when it is faster by pushing more images into the FIFO queue…
  • …which will later allow the I/O thread to continue uninterrupted for a while if for some reason I/O transactions speed up or CPU work slows down.
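
To make the scheme above more concrete, here is a minimal, self-contained sketch of it, using hypothetical names that are unrelated to the exercise code: a fast producer is throttled by a slow consumer through a bounded channel, which is how backpressure will manifest in our own implementation.

// Illustrative sketch only: a bounded MPSC channel provides backpressure
// between a fast "compute" producer and a slow "I/O" consumer.
use std::{sync::mpsc, thread, time::Duration};

fn main() {
    // Capacity of 2: the producer may get at most 2 items ahead of the consumer
    let (sender, receiver) = mpsc::sync_channel::<usize>(2);

    // "I/O" thread: a deliberately slow consumer
    let io_thread = thread::spawn(move || {
        for item in receiver {
            thread::sleep(Duration::from_millis(100)); // simulate a slow write
            println!("wrote item {item}");
        }
    });

    // "Compute" loop: a fast producer
    for item in 0..10 {
        // Once 2 items are queued up, this send() blocks until the consumer
        // catches up: that is backpressure in action
        sender.send(item).expect("the I/O thread has hung up");
        println!("computed item {item}");
    }

    // Dropping the sender closes the channel, which ends the consumer loop
    std::mem::drop(sender);
    io_thread.join().expect("the I/O thread has crashed");
}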

Implementation

As mentioned above, one critical tuning parameter of an asynchronous I/O implementation is the size of the bounded FIFO queue that the I/O and compute thread use to communicate. Like many performance tuning parameters, we will start by exposing it as a command-line argument:

// In exercises/src/grayscott/options.rs

/// Simulation runner options
#[derive(Debug, Args)]
pub struct RunnerOptions {
    // [ ... existing entries ... ]

    /// I/O buffer size
    ///
    /// Increasing this parameter will improve the application's ability to
    /// handle jitter in the time it takes to perform computations or I/O without
    /// interrupting the I/O stream, at the expense of increasing RAM usage.
    #[arg(short = 'i', long, default_value_t = 1)]
    pub io_buffer: usize,
}
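
Assuming the usual clap derive conventions, the declaration above should surface as a -i/--io-buffer command-line flag on the simulate binary, so that the queue capacity can be tuned without recompiling, along these lines:

# Hypothetical invocation: use an I/O buffer of 4 images
$ cargo run --release --bin simulate -- -o /dev/shm/output.h5 --io-buffer 4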

Then, in the main simulation binary, we will proceed to extract all of our HDF5 I/O work into a dedicated thread, to which we can offload work via a bounded FIFO queue, which the Rust standard library provides in the form of synchronous Multi-Producer Single-Consumer (MPSC) channels:

// In exercises/src/bin/simulate.rs

use grayscott_exercises::grayscott::data::Float;
use ndarray::Array2;
use std::{sync::mpsc::SyncSender, thread::JoinHandle};

/// `SyncSender` for V species concentration
type Sender = SyncSender<Array2<Float>>;

/// `JoinHandle` for the I/O thread
type Joiner = JoinHandle<hdf5::Result<()>>;

/// Set up an I/O thread
fn setup_io_thread(options: &Options, progress: ProgressBar) -> hdf5::Result<(Sender, Joiner)> {
    let (sender, receiver) = std::sync::mpsc::sync_channel(options.runner.io_buffer);
    let mut hdf5 = HDF5Writer::create(
        &options.runner.file_name,
        [options.runner.num_rows, options.runner.num_cols],
        options.runner.num_output_steps,
    )?;
    let compute_steps_per_output_step = options.runner.compute_steps_per_output_step as u64;
    let handle = std::thread::spawn(move || {
        for v in receiver {
            hdf5.write(v)?;
            progress.inc(compute_steps_per_output_step);
        }
        hdf5.close()?;
        progress.finish();
        Ok(())
    });
    Ok((sender, handle))
}

Usage of MPSC channels aside, the main notable thing in the above code is the use of the std::thread::spawn API to spawn an I/O thread. This API returns a JoinHandle, which can later be used to wait for the I/O thread to be done processing all previously sent work.

Another thing that the astute reader will notice about the above code is that it consumes the V species’ concentration as an owned table, rather than a borrowed view. This is necessary because after sending the concentration data to the I/O thread, the compute thread will not wait for I/O, but will immediately proceed to overwrite the associated VBuffer with new data.

But this also means that we will always be sending owned data to our HDF5 writer, so we can drop our data-cloning workaround and redefine the writer’s interface to accept owned data instead:

// In exercises/src/grayscott/io.rs

use ndarray::Array2;

impl HDF5Writer {
    // [ ... ]

    /// Write a new V species concentration table to the file
    pub fn write(&mut self, v: Array2<Float>) -> hdf5::Result<()> {
        self.dataset.write_slice(&v, (self.position, .., ..))?;
        self.position += 1;
        Ok(())
    }

    // [ ... ]
}

Finally, we can rewrite our main simulation function to use the new threaded I/O infrastructure…

// In exercises/src/bin/simulate.rs

fn main() -> Result<()> {
    // Parse command line options
    let options = Options::parse();

    // Set up the progress bar
    let progress = ProgressBar::new(
        (options.runner.num_output_steps * options.runner.compute_steps_per_output_step) as u64,
    );

    // Start the I/O thread
    let (io_sender, io_handle) = setup_io_thread(&options, progress.clone())?;

    // Set up the Vulkan context
    let context = Context::new(&options.context, false, Some(progress))?;

    // Set up the CPU buffer for concentrations download
    let vbuffer = RefCell::new(VBuffer::new(&options, &context)?);

    // Run the simulation
    grayscott::run_simulation(
        &options,
        &context,
        |uv, cmdbuf| {
            // Schedule a download of the final simulation state
            vbuffer.borrow_mut().schedule_update(uv, cmdbuf)
        },
        || {
            // Schedule a write of the current simulation output
            vbuffer
                .borrow()
                .read_and_process(|v| io_sender.send(v.to_owned()))??;
            Ok(())
        },
    )?;

    // Signal the I/O thread that we are done writing, then wait for it to finish
    std::mem::drop(io_sender);
    io_handle.join().expect("The I/O thread has crashed")?;
    Ok(())
}

Most of this should be unsurprising to you if you understood the above explanations, but there is a bit of trickery at the end that is worth highlighting.

// Signal the I/O thread that we are done writing, then wait for it to finish
std::mem::drop(io_sender);
io_handle.join().expect("The I/O thread has crashed")?;

These two lines work around a surprising number of Rust standard library usability gotchas:

  • To properly handle unexpected errors in Rust threads (e.g. panics due to incorrect array indexing), it is a good idea to explicitly join them…
    • …but the associated join() method returns a Result type whose error type does not implement the standard Error trait, so we can only handle it via panicking.
  • Rust MPSC channels have a very convenient feature which ensures that we can tell a thread that we are done sending data by simply dropping the channel’s SyncSender input interface, which happens automatically when it goes out of scope…
    • …but that may be too late in the presence of an explicit .join(), as the main thread may end up waiting on the I/O thread, which is itself waiting for the main thread to stop sending data, resulting in a deadlock. To avoid this, we must explicitly drop the SyncSender somehow. Here we are using std::mem::drop() for this.
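
The following condensed sketch, which uses hypothetical names rather than the exercise code, illustrates both gotchas at once: the nested error handling that join() forces upon us, and the deadlock that would occur if the SyncSender were not dropped before joining.

// Illustrative sketch only: shutting down a channel-fed worker thread
use std::{sync::mpsc, thread};

fn main() {
    let (sender, receiver) = mpsc::sync_channel::<u32>(1);

    // Worker thread that reports failures through a Result
    let io_thread = thread::spawn(move || -> Result<(), String> {
        // This loop only ends once every SyncSender has been dropped
        for item in receiver {
            println!("writing {item}");
        }
        Ok(())
    });
    sender.send(42).expect("the I/O thread has hung up");

    // Without this drop, the join() below would wait for an I/O thread that
    // is itself waiting for more data from us, i.e. a deadlock
    std::mem::drop(sender);

    // join() only fails if the thread panicked, and its error type can only
    // be handled by panicking in turn; the thread's own Result is nested
    // inside the Ok variant and must be handled separately
    io_thread
        .join()
        .expect("The I/O thread has crashed")
        .expect("The I/O thread reported an error");
}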

In any case, we are now ready to reap the benefits of our optimization, which will be most visible on fast storage backends like tmpfs:

# Command
$ rm -f /dev/shm/output.h5  \
  && cargo build --release --bin simulate  \
  && time (cargo run --release --bin simulate -- -o /dev/shm/output.h5 && sync)

# Before
real    0m16,290s
user    0m2,519s
sys     0m3,592s

# Now
real    0m11,217s
user    0m2,750s
sys     0m5,025s

Exercise

Implement the above optimization, and study its impact on your machine for all storage devices that you have access to, starting with tmpfs where the effect should be most noticeable.

On Linux, you may experience a problem where the system intermittently locks up above a certain level of I/O pressure. If that happens, consider tuning down the number of output images that your benchmark generates (-n argument to simulate) in order to keep the system responsive.

Finally, try tuning the io_buffer parameter and see what effect it has. Note that setting this parameter to 0 is meaningful and still allows I/O and computations to overlap. It only means that the compute thread is not allowed to leave a pre-rendered image around in memory and start rendering another one; instead, it must wait for the I/O thread to pick up the newly rendered image before it can start rendering the next one.
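
If you would like to convince yourself of that last point, here is one more small sketch, again with hypothetical names, of the rendezvous behavior of a zero-capacity channel: send() only returns once the consumer is ready to pick up the value, yet the consumer can still process one image while the producer renders the next.

// Illustrative sketch only: a zero-capacity channel acts as a rendezvous point
use std::{sync::mpsc, thread, time::Duration};

fn main() {
    let (sender, receiver) = mpsc::sync_channel::<usize>(0);

    let io_thread = thread::spawn(move || {
        for item in receiver {
            // "Write" this image while the producer renders the next one
            thread::sleep(Duration::from_millis(100));
            println!("wrote image {item}");
        }
    });

    for item in 0..5 {
        println!("rendered image {item}");
        // Blocks until the I/O thread is ready to receive this image,
        // i.e. until it has finished writing the previous one
        sender.send(item).expect("the I/O thread has hung up");
    }

    std::mem::drop(sender);
    io_thread.join().expect("the I/O thread has crashed");
}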