Kernel fission
In the previous chapters, we implemented optimizations that let more parts of our simulation execute in parallel, through a mixture of asynchronous execution and pipelining.
As a result, we went from a rather complex situation, where simulation speed was limited by different hardware performance characteristics depending on the execution configuration, to a simpler one, where the simulation is more and more often bottlenecked by the raw speed at which we perform simulation steps.
This is shown by our reference configuration, where the simulation domain contains 2048×1024 pixels (about 2 million) and we perform a total of 512 simulation steps, 32 per generated image. In this configuration, the simulation speed measured by our microbenchmark no longer changes much as we go from a pure GPU compute scenario to one where we additionally download data from the GPU to the CPU and post-process it on the CPU:
run_simulation/workgroup32x16/domain2048x1024/total512/image32/compute
time: [83.137 ms 83.266 ms 83.421 ms]
thrpt: [12.871 Gelem/s 12.895 Gelem/s 12.915 Gelem/s]
run_simulation/workgroup32x16/domain2048x1024/total512/image32/compute+download
time: [102.03 ms 102.49 ms 102.95 ms]
thrpt: [10.429 Gelem/s 10.477 Gelem/s 10.524 Gelem/s]
run_simulation/workgroup32x16/domain2048x1024/total512/image32/compute+download+sum
time: [102.98 ms 103.21 ms 103.36 ms]
thrpt: [10.389 Gelem/s 10.404 Gelem/s 10.427 Gelem/s]
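As a sanity check, these throughput figures can be reproduced by hand, assuming that the benchmark counts one element per domain pixel and per simulation step (using the mid-point timing of the compute-only configuration):

2048 × 1024 pixels × 512 steps ≈ 1.07 billion elements
1.07 billion elements / 83.3 ms ≈ 12.9 Gelem/s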
To go faster, we will therefore need to find ways to make the simulation steps themselves faster, which is what the next optimization chapters of this course will focus on.
As a first step, we will try a classic GPU optimization strategy: when given the choice between more concurrent work and less redundant work, always try more concurrency first.
A concurrency opportunity
Recall that our GPU program for performing a simulation step looks like this:
void main() {
    // Map work items into 2D central region, discard out-of-bounds work items
    const uvec2 pos = uvec2(gl_GlobalInvocationID.xy) + DATA_START_POS;
    if (any(greaterThanEqual(pos, data_end_pos()))) {
        return;
    }
    // Load central value
    const vec2 uv = read(pos);
    // Compute the diffusion gradient for U and V
    const uvec2 topleft = pos - uvec2(1);
    const mat3 weights = stencil_weights();
    vec2 full_uv = vec2(0.0);
    for (int y = 0; y < 3; ++y) {
        for (int x = 0; x < 3; ++x) {
            const vec2 stencil_uv = read(topleft + uvec2(x, y));
            full_uv += weights[x][y] * (stencil_uv - uv);
        }
    }
    // Deduce the change in U and V concentration
    const float u = uv.x;
    const float v = uv.y;
    const float uv_square = u * v * v;
    const vec2 delta_uv = diffusion_rate() * full_uv + vec2(
        FEED_RATE * (1.0 - u) - uv_square,
        uv_square - (FEED_RATE + KILL_RATE) * v
    );
    write(pos, uv + delta_uv * DELTA_T);
}
If you look at it carefully, you will realize that it is almost two computations bundled into one:
- The initial full_uv diffusion gradient computation independently computes the diffusion gradient for species U and V, as two different lanes of a GLSL vec2 type.
- The final delta_uv computation is literally two different computations vectorized into one, with the exception of the uv_square product, which is only computed once but used by both the U and V components of delta_uv (see the scalar sketch below).
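To make this more concrete, here is the tail end of the main() function above rewritten in scalar form (a sketch for illustration only; it assumes that diffusion_rate() simply returns vec2(DIFFUSION_RATE_U, DIFFUSION_RATE_V), the per-species constants used later in this chapter):

// Scalar sketch of the final update from main() above (illustration only),
// assuming diffusion_rate() == vec2(DIFFUSION_RATE_U, DIFFUSION_RATE_V)
const float delta_u = DIFFUSION_RATE_U * full_uv.x
                      + FEED_RATE * (1.0 - u) - uv_square;
const float delta_v = DIFFUSION_RATE_V * full_uv.y
                      + uv_square - (FEED_RATE + KILL_RATE) * v;
write(pos, uv + vec2(delta_u, delta_v) * DELTA_T);

Apart from the shared uv_square term, the delta_u and delta_v computations do not depend on each other, which is what makes it possible to split them across two compute shaders.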
As it turns out, GPU hardware isn’t very good at exploiting this sort of explicit
vectorization,1 and GLSL vector types like vec2 are more useful for ergonomics
than for performance. Therefore, if our computation doesn’t have enough internal
concurrency to saturate the GPU’s compute units (which is likely to be the case
on higher-end GPUs), we may benefit from splitting it into two computations, one
for U and one for V, even though this will come at the expense of a bit of
redundant work.
Indeed, there is a price to pay for the extra concurrency enabled by this
alternative strategy: the computation for one chemical species will now also
need to load the central value of the other species, and to perform its own pair
of multiplications in order to compute the uv_square product.
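To put rough numbers on this overhead, here is a back-of-the-envelope count of memory accesses per pixel and per simulation step, derived from the shaders shown in this chapter (ignoring caches, and assuming that every read reaches memory):

fused shader:  (1 + 9) vec2 reads = 20 floats loaded, 2 floats stored, uv_square computed once
split shaders: 2 × (2 + 9) floats = 22 floats loaded, 2 floats stored, uv_square computed twice

In other words, the redundant work amounts to two extra loads and one extra uv_square computation per pixel, which we hope to more than recoup by keeping otherwise idle compute units busy.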
Separating the U and V computations
In the common.comp GLSL source file, which contains code shared between our
compute shaders, we will supplement the existing read() and write()
functions with alternatives that only read and write a single species’
concentration…
// Read U or V from a particular input location, pos works as in read()
float read_one(uint species, uvec2 pos) {
    const uint index = pos_to_index(pos);
    return Input[species].data[index];
}
// Write U or V to a particular output location, pos works as in read()
void write_one(uint species, uvec2 pos, float value) {
    const uint index = pos_to_index(pos);
    Output[species].data[index] = value;
}
…and we will extract a variation of our former diffusion gradient computation that only performs the computation for a single chemical species:
// Diffusion gradient computation for a single species
float diffusion_gradient(uint species, uvec2 pos, float center_value) {
    const uvec2 topleft = pos - uvec2(1);
    const mat3 weights = stencil_weights();
    float gradient = 0.0;
    for (int y = 0; y < 3; ++y) {
        for (int x = 0; x < 3; ++x) {
            const float stencil = read_one(species, topleft + uvec2(x, y));
            gradient += weights[x][y] * (stencil - center_value);
        }
    }
    return gradient;
}
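One detail worth spelling out: the step shaders below refer to the two chemical species through U and V indices when calling read_one(), write_one() and diffusion_gradient(). These constants are not shown in the common.comp excerpts above; here we will assume that they are simple indices into the Input and Output buffer arrays, along these lines:

// Hypothetical species indices, assumed to be defined in common.comp
// (not shown above); they select the matching entry of Input[]/Output[]
const uint U = 0;
const uint V = 1;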
Given this infrastructure, we will be able to write a compute shader that only
computes the concentration of the U species, which we will save to a file called
step_u.comp…
#version 460
#include "common.comp"
// Simulation step entry point for the U species
void main() {
    // Map work items into 2D central region, discard out-of-bounds work items
    const uvec2 pos = uvec2(gl_GlobalInvocationID.xy) + DATA_START_POS;
    if (any(greaterThanEqual(pos, data_end_pos()))) {
        return;
    }
    // Load central value of U and V
    const vec2 uv = read(pos);
    const float u = uv.x;
    const float v = uv.y;
    // Compute the diffusion gradient for U
    const float full_u = diffusion_gradient(U, pos, u);
    // Deduce the change in U concentration
    const float uv_square = u * v * v;
    const float delta_u = DIFFUSION_RATE_U * full_u
                          - uv_square
                          + FEED_RATE * (1.0 - u);
    write_one(U, pos, u + delta_u * DELTA_T);
}
…and another compute shader that only computes the concentration of the V
species, which we will save to a file called step_v.comp…
#version 460
#include "common.comp"
// Simulation step entry point for the V species
void main() {
    // Map work items into 2D central region, discard out-of-bounds work items
    const uvec2 pos = uvec2(gl_GlobalInvocationID.xy) + DATA_START_POS;
    if (any(greaterThanEqual(pos, data_end_pos()))) {
        return;
    }
    // Load central value of U and V
    const vec2 uv = read(pos);
    const float u = uv.x;
    const float v = uv.y;
    // Compute the diffusion gradient for V
    const float full_v = diffusion_gradient(V, pos, v);
    // Deduce the change in V concentration
    const float uv_square = u * v * v;
    const float delta_v = DIFFUSION_RATE_V * full_v
                          + uv_square
                          - (FEED_RATE + KILL_RATE) * v;
    write_one(V, pos, v + delta_v * DELTA_T);
}
…without suffering an unbearable amount of GLSL code duplication between these two compute shaders, though there will obviously be similarities.
Adapting the CPU code
To support this new style of computation, we will need to extend our
pipeline.rs module so that it builds two compute pipelines instead of one.
This can be done by adjusting our vulkano_shaders::shader macro call to cover
the two new step_u and step_v compute shaders…
/// Shader modules used for the compute pipelines
mod shader {
    vulkano_shaders::shader! {
        shaders: {
            init: {
                ty: "compute",
                path: "src/grayscott/init.comp"
            },
            step_u: {
                ty: "compute",
                path: "src/grayscott/step_u.comp"
            },
            step_v: {
                ty: "compute",
                path: "src/grayscott/step_v.comp"
            },
        }
    }
}
…and adjusting the Pipelines struct and its constructor in order to build
this new pair of compute pipelines, where it used to build a single step
compute pipeline before:
/// Initialization and simulation pipelines with common layout information
#[derive(Clone)]
pub struct Pipelines {
    /// Compute pipeline used to initialize the concentration tables
    pub init: Arc<ComputePipeline>,
    /// Compute pipeline used to perform a simulation step for U
    pub step_u: Arc<ComputePipeline>,
    /// Compute pipeline used to perform a simulation step for V
    pub step_v: Arc<ComputePipeline>,
    /// Pipeline layout shared by `init`, `step_u` and `step_v`
    pub layout: Arc<PipelineLayout>,
}
//
impl Pipelines {
    /// Set up all the compute pipelines
    pub fn new(options: &RunnerOptions, context: &Context) -> Result<Self> {
        // Common logic for setting up a compute pipeline stage
        fn setup_stage_info(
            options: &RunnerOptions,
            context: &Context,
            load: impl FnOnce(
                Arc<Device>,
            )
                -> std::result::Result<Arc<ShaderModule>, Validated<VulkanError>>,
        ) -> Result<PipelineShaderStageCreateInfo> {
            let module = load(context.device.clone())?;
            let module = setup_shader_module(options, module)?;
            Ok(setup_compute_stage(module))
        }
        // Set up the initialization and step pipeline stages, making sure that
        // they share a common pipeline layout so that they can use the same
        // resource descriptor sets later on
        let init_stage_info = setup_stage_info(options, context, shader::load_init)?;
        let step_u_stage_info = setup_stage_info(options, context, shader::load_step_u)?;
        let step_v_stage_info = setup_stage_info(options, context, shader::load_step_v)?;
        let layout = setup_pipeline_layout(
            context.device.clone(),
            [&init_stage_info, &step_u_stage_info, &step_v_stage_info],
        )?;
        // Finish setting up the initialization and stepping compute pipelines
        let setup_compute_pipeline = |stage_info: PipelineShaderStageCreateInfo| {
            let pipeline_info = ComputePipelineCreateInfo::stage_layout(stage_info, layout.clone());
            ComputePipeline::new(
                context.device.clone(),
                Some(context.pipeline_cache()),
                pipeline_info,
            )
        };
        Ok(Self {
            init: setup_compute_pipeline(init_stage_info)?,
            step_u: setup_compute_pipeline(step_u_stage_info)?,
            step_v: setup_compute_pipeline(step_v_stage_info)?,
            layout,
        })
    }
}
Finally, we will modify the schedule_simulation() function so that it uses
this pair of compute pipelines instead of a single compute pipeline:
/// Record the commands needed to run a bunch of simulation iterations
fn schedule_simulation(
    options: &RunnerOptions,
    pipelines: &Pipelines,
    concentrations: &mut Concentrations,
    cmdbuild: &mut CommandBufferBuilder,
) -> Result<()> {
    // [ ... compute dispatch size computation is unchanged ... ]
    // Schedule the requested number of simulation steps
    for _ in 0..options.steps_per_image {
        concentrations.update(|inout_set| {
            cmdbuild.bind_descriptor_sets(
                PipelineBindPoint::Compute,
                pipelines.layout.clone(),
                INOUT_SET,
                inout_set,
            )?;
            for pipeline in [&pipelines.step_u, &pipelines.step_v] {
                cmdbuild.bind_pipeline_compute(pipeline.clone())?;
                // SAFETY: GPU shader has been checked for absence of undefined behavior
                //         given a correct execution configuration, and this is one
                unsafe {
                    cmdbuild.dispatch(simulate_workgroups)?;
                }
            }
            Ok(())
        })?;
    }
    Ok(())
}
Exercise
Implement this optimization and measure its impact on the GPU(s) that you have access to.
On the author’s AMD Radeon RX 5600M laptop GPU, the net impact of this optimization is negative (a 20-30% slowdown), which means that on this GPU, the extra concurrency we gain is not worth the extra work we pay for it. But on a higher-end GPU, the tradeoff may be different.
1. Barring exceptions like half-precision floating-point numbers and special-purpose “tensor cores”.