Kernel fission
In the previous chapters, we implemented optimizations that let more parts of our simulation execute in parallel, through a mixture of asynchronous execution and pipelining.
As a result, we went from a rather complex situation, where simulation speed was limited by different hardware performance characteristics depending on the execution configuration, to a simpler one, where the simulation is more and more often bottlenecked by the raw speed at which we perform simulation steps.
This is shown by our reference configuration, where the simulation domain contains 2048×1024 pixels (about 2 million) and we perform a total of 512 simulation steps, 32 per generated image. In this configuration, the simulation speed measured by our microbenchmark no longer changes much as we go from a pure GPU compute scenario to one where we additionally download data from the GPU to the CPU and post-process it on the CPU:
run_simulation/workgroup32x16/domain2048x1024/total512/image32/compute
time: [83.137 ms 83.266 ms 83.421 ms]
thrpt: [12.871 Gelem/s 12.895 Gelem/s 12.915 Gelem/s]
run_simulation/workgroup32x16/domain2048x1024/total512/image32/compute+download
time: [102.03 ms 102.49 ms 102.95 ms]
thrpt: [10.429 Gelem/s 10.477 Gelem/s 10.524 Gelem/s]
run_simulation/workgroup32x16/domain2048x1024/total512/image32/compute+download+sum
time: [102.98 ms 103.21 ms 103.36 ms]
thrpt: [10.389 Gelem/s 10.404 Gelem/s 10.427 Gelem/s]
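As a sanity check, these throughput figures can be reproduced by hand, assuming that the benchmark counts one element per domain pixel and per simulation step (using the mid-point timing of the compute-only configuration):

2048 × 1024 pixels × 512 steps ≈ 1.07 billion elements
1.07 billion elements / 83.3 ms ≈ 12.9 Gelem/s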
To go faster, we will therefore need to find ways to make the simulation steps themselves faster, which is what the next optimization chapters of this course will focus on.
As a first step, we will try a classic GPU optimization strategy: when given the choice between more concurrent work and less redundant work, always try more concurrency first.
A concurrency opportunity
Recall that our GPU program for performing a simulation step looks like this:
void main() {
    // Map work items into 2D central region, discard out-of-bounds work items
    const uvec2 pos = uvec2(gl_GlobalInvocationID.xy) + DATA_START_POS;
    if (any(greaterThanEqual(pos, data_end_pos()))) {
        return;
    }
    // Load central value
    const vec2 uv = read(pos);
    // Compute the diffusion gradient for U and V
    const uvec2 topleft = pos - uvec2(1);
    const mat3 weights = stencil_weights();
    vec2 full_uv = vec2(0.0);
    for (int y = 0; y < 3; ++y) {
        for (int x = 0; x < 3; ++x) {
            const vec2 stencil_uv = read(topleft + uvec2(x, y));
            full_uv += weights[x][y] * (stencil_uv - uv);
        }
    }
    // Deduce the change in U and V concentration
    const float u = uv.x;
    const float v = uv.y;
    const float uv_square = u * v * v;
    const vec2 delta_uv = diffusion_rate() * full_uv + vec2(
        FEED_RATE * (1.0 - u) - uv_square,
        uv_square - (FEED_RATE + KILL_RATE) * v
    );
    write(pos, uv + delta_uv * DELTA_T);
}
If you look at it carefully, you will realize that it is almost two computations bundled into one:
- The initial full_uv diffusion gradient computation independently computes the diffusion gradient for species U and V, as two different lanes of a GLSL vec2 type.
- The final delta_uv computation is literally two different computations vectorized into one, with the exception of the uv_square product, which is only computed once but used by both the U and V components of delta_uv (see the scalar sketch below).
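To make this more concrete, here is the tail end of the main() function above rewritten in scalar form (a sketch for illustration only; it assumes that diffusion_rate() simply returns vec2(DIFFUSION_RATE_U, DIFFUSION_RATE_V), the per-species constants used later in this chapter):

// Scalar sketch of the final update from main() above (illustration only),
// assuming diffusion_rate() == vec2(DIFFUSION_RATE_U, DIFFUSION_RATE_V)
const float delta_u = DIFFUSION_RATE_U * full_uv.x
                      + FEED_RATE * (1.0 - u) - uv_square;
const float delta_v = DIFFUSION_RATE_V * full_uv.y
                      + uv_square - (FEED_RATE + KILL_RATE) * v;
write(pos, uv + vec2(delta_u, delta_v) * DELTA_T);

Apart from the shared uv_square term, the delta_u and delta_v computations do not depend on each other, which is what makes it possible to split them across two compute shaders.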
As it turns out, GPU hardware isn’t very good at exploiting this sort of explicit
vectorization,1 and GLSL vector types like vec2 are more useful for ergonomics
than for performance. Therefore, if our computation doesn’t have enough internal
concurrency to saturate the GPU’s compute units (which is likely to be the case
on higher-end GPUs), we may benefit from splitting it into two computations, one
for U and one for V, even though this will come at the expense of a bit of
redundant work.
Indeed, there is a price to pay for the extra concurrency enabled by this
alternative strategy: the computation for one chemical species will now also
need to load the central value of the other species, and to perform its own pair
of multiplications in order to compute the uv_square product.
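To put rough numbers on this overhead, here is a back-of-the-envelope count of memory accesses per pixel and per simulation step, derived from the shaders shown in this chapter (ignoring caches, and assuming that every read reaches memory):

fused shader:  (1 + 9) vec2 reads = 20 floats loaded, 2 floats stored, uv_square computed once
split shaders: 2 × (2 + 9) floats = 22 floats loaded, 2 floats stored, uv_square computed twice

In other words, the redundant work amounts to two extra loads and one extra uv_square computation per pixel, which we hope to more than recoup by keeping otherwise idle compute units busy.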
Separating the U and V computations
In the common.comp GLSL source file, which contains code shared between our
compute shaders, we will supplement the existing read() and write()
functions with alternatives that only read and write a single species’
concentration…
// Read U or V from a particular input location, pos works as in read()
float read_one(uint species, uvec2 pos) {
    const uint index = pos_to_index(pos);
    return Input[species].data[index];
}
// Write U or V to a particular output location, pos works as in read()
void write_one(uint species, uvec2 pos, float value) {
    const uint index = pos_to_index(pos);
    Output[species].data[index] = value;
}
…and we will extract a variation of our former diffusion gradient computation that only performs the computation for a single chemical species:
// Diffusion gradient computation for a single species
float diffusion_gradient(uint species, uvec2 pos, float center_value) {
    const uvec2 topleft = pos - uvec2(1);
    const mat3 weights = stencil_weights();
    float gradient = 0.0;
    for (int y = 0; y < 3; ++y) {
        for (int x = 0; x < 3; ++x) {
            const float stencil = read_one(species, topleft + uvec2(x, y));
            gradient += weights[x][y] * (stencil - center_value);
        }
    }
    return gradient;
}
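One detail worth spelling out: the step shaders below refer to the two chemical species through U and V indices when calling read_one(), write_one() and diffusion_gradient(). These constants are not shown in the common.comp excerpts above; here we will assume that they are simple indices into the Input and Output buffer arrays, along these lines:

// Hypothetical species indices, assumed to be defined in common.comp
// (not shown above); they select the matching entry of Input[]/Output[]
const uint U = 0;
const uint V = 1;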
Given this infrastructure, we will be able to write a compute shader that only
computes the concentration of the U species, which we will save to a file called
step_u.comp…
#version 460
#include "common.comp"
// Simulation step entry point for the U species
void main() {
    // Map work items into 2D central region, discard out-of-bounds work items
    const uvec2 pos = uvec2(gl_GlobalInvocationID.xy) + DATA_START_POS;
    if (any(greaterThanEqual(pos, data_end_pos()))) {
        return;
    }
    // Load central value of U and V
    const vec2 uv = read(pos);
    const float u = uv.x;
    const float v = uv.y;
    // Compute the diffusion gradient for U
    const float full_u = diffusion_gradient(U, pos, u);
    // Deduce the change in U concentration
    const float uv_square = u * v * v;
    const float delta_u = DIFFUSION_RATE_U * full_u
                          - uv_square
                          + FEED_RATE * (1.0 - u);
    write_one(U, pos, u + delta_u * DELTA_T);
}
…and another compute shader that only computes the concentration of the V
species, which we will save to a file called step_v.comp…
#version 460
#include "common.comp"
// Simulation step entry point for the V species
void main() {
    // Map work items into 2D central region, discard out-of-bounds work items
    const uvec2 pos = uvec2(gl_GlobalInvocationID.xy) + DATA_START_POS;
    if (any(greaterThanEqual(pos, data_end_pos()))) {
        return;
    }
    // Load central value of U and V
    const vec2 uv = read(pos);
    const float u = uv.x;
    const float v = uv.y;
    // Compute the diffusion gradient for V
    const float full_v = diffusion_gradient(V, pos, v);
    // Deduce the change in V concentration
    const float uv_square = u * v * v;
    const float delta_v = DIFFUSION_RATE_V * full_v
                          + uv_square
                          - (FEED_RATE + KILL_RATE) * v;
    write_one(V, pos, v + delta_v * DELTA_T);
}
…without suffering an unbearable amount of GLSL code duplication between these two compute shaders, though there will obviously be similarities.
Adapting the CPU code
To support this new style of computation, we will need to extend our
pipeline.rs module so that it builds two compute pipelines instead of one.
This can be done by adjusting our vulkano_shaders::shader macro call to cover
the two new step_u and step_v compute shaders…
/// Shader modules used for the compute pipelines
mod shader {
    vulkano_shaders::shader! {
        shaders: {
            init: {
                ty: "compute",
                path: "src/grayscott/init.comp"
            },
            step_u: {
                ty: "compute",
                path: "src/grayscott/step_u.comp"
            },
            step_v: {
                ty: "compute",
                path: "src/grayscott/step_v.comp"
            },
        }
    }
}
…and adjusting the Pipelines struct and its constructor in order to build
this new pair of compute pipelines, where it used to build a single step
compute pipeline before:
/// Initialization and simulation pipelines with common layout information
#[derive(Clone)]
pub struct Pipelines {
    /// Compute pipeline used to initialize the concentration tables
    pub init: Arc<ComputePipeline>,
    /// Compute pipeline used to perform a simulation step for U
    pub step_u: Arc<ComputePipeline>,
    /// Compute pipeline used to perform a simulation step for V
    pub step_v: Arc<ComputePipeline>,
    /// Pipeline layout shared by `init`, `step_u` and `step_v`
    pub layout: Arc<PipelineLayout>,
}
//
impl Pipelines {
    /// Set up all the compute pipelines
    pub fn new(options: &RunnerOptions, context: &Context) -> Result<Self> {
        // Common logic for setting up a compute pipeline stage
        fn setup_stage_info(
            options: &RunnerOptions,
            context: &Context,
            load: impl FnOnce(
                Arc<Device>,
            )
                -> std::result::Result<Arc<ShaderModule>, Validated<VulkanError>>,
        ) -> Result<PipelineShaderStageCreateInfo> {
            let module = load(context.device.clone())?;
            let module = setup_shader_module(options, module)?;
            Ok(setup_compute_stage(module))
        }
        // Set up the initialization and step pipeline stages, making sure that
        // they share a common pipeline layout so that they can use the same
        // resource descriptor sets later on
        let init_stage_info = setup_stage_info(options, context, shader::load_init)?;
        let step_u_stage_info = setup_stage_info(options, context, shader::load_step_u)?;
        let step_v_stage_info = setup_stage_info(options, context, shader::load_step_v)?;
        let layout = setup_pipeline_layout(
            context.device.clone(),
            [&init_stage_info, &step_u_stage_info, &step_v_stage_info],
        )?;
        // Finish setting up the initialization and stepping compute pipelines
        let setup_compute_pipeline = |stage_info: PipelineShaderStageCreateInfo| {
            let pipeline_info = ComputePipelineCreateInfo::stage_layout(stage_info, layout.clone());
            ComputePipeline::new(
                context.device.clone(),
                Some(context.pipeline_cache()),
                pipeline_info,
            )
        };
        Ok(Self {
            init: setup_compute_pipeline(init_stage_info)?,
            step_u: setup_compute_pipeline(step_u_stage_info)?,
            step_v: setup_compute_pipeline(step_v_stage_info)?,
            layout,
        })
    }
}
Finally, we will modify the schedule_simulation() function so that it uses
this pair of compute pipelines instead of a single compute pipeline:
/// Record the commands needed to run a bunch of simulation iterations
fn schedule_simulation(
    options: &RunnerOptions,
    pipelines: &Pipelines,
    concentrations: &mut Concentrations,
    cmdbuild: &mut CommandBufferBuilder,
) -> Result<()> {
    // [ ... compute dispatch size computation is unchanged ... ]
    // Schedule the requested number of simulation steps
    for _ in 0..options.steps_per_image {
        concentrations.update(|inout_set| {
            cmdbuild.bind_descriptor_sets(
                PipelineBindPoint::Compute,
                pipelines.layout.clone(),
                INOUT_SET,
                inout_set,
            )?;
            for pipeline in [&pipelines.step_u, &pipelines.step_v] {
                cmdbuild.bind_pipeline_compute(pipeline.clone())?;
                // SAFETY: GPU shader has been checked for absence of undefined behavior
                //         given a correct execution configuration, and this is one
                unsafe {
                    cmdbuild.dispatch(simulate_workgroups)?;
                }
            }
            Ok(())
        })?;
    }
    Ok(())
}
Exercise
Implement this optimization and measure its impact on the GPU(s) that you have access to.
On the author’s AMD Radeon RX 5600M laptop GPU, the net impact of this optimization is negative (a 20-30% slowdown), which means that on this GPU, the extra concurrency we gain is not worth the extra work we pay for it. But on a higher-end GPU, the tradeoff may be different.
1. Barring exceptions like half-precision floating-point numbers and special-purpose “tensor cores”.