name: rust-wgpu-functional
description: >
Idiomatic Rust skill for wgpu, bare-metal GPU programming, game engines, renderers,
and high-performance systems code. Applies functional programming principles — purity,
composition, algebraic types, zero-cost abstractions — while never sacrificing runtime
performance. Use this skill whenever the user asks to write Rust code involving wgpu,
GPU pipelines, render passes, compute shaders, ECS, game loops, real-time systems,
bare-metal performance code, or any Rust project where performance is a hard constraint.
Also trigger when the user asks for Rust code reviews, refactors, or architecture
advice where functional idioms and performance must coexist. Trigger even when the
user doesn't explicitly mention "functional" — if they paste Rust code that touches
wgpu::Device, wgpu::Queue, wgpu::CommandEncoder, or any GPU resource type, use this
skill. Trigger on mentions of: wgpu, winit, WGSL, render pipeline, compute pipeline,
bind group, GPU buffer, ECS, bevy (low-level), bare-metal Rust, zero-copy, SIMD,
cache-friendly, data-oriented design, game engine, renderer, frame loop, shader,
vertex buffer, index buffer, texture, sampler, depth buffer, compute dispatch,
particle system, scene graph, frustum culling, instanced rendering, indirect draw,
GPU profiling, or any Rust performance-critical system. Also trigger when the user
asks to "make this faster", "reduce allocations", "optimize this Rust", "review my
Rust code", or pastes Rust code with clone(), unwrap(), Box, or Vec allocations
in hot paths that could benefit from functional refactoring.
Rust · wgpu · Functional Performance Engineering
This skill produces idiomatic Rust code that is functionally composed, performance-first, and maintainable at scale. Every decision follows one meta-principle:
Make the implicit explicit. Make the illegal impossible. Make the fast path the only path.
Core Philosophy
Performance and clean architecture are not in tension; they compound. Splitting code into small pure functions gives the optimizer room for inlining, monomorphization, and branch elimination. Algebraic types with exhaustive matching make invalid states unrepresentable, which removes defensive runtime checks. Immutability enables fearless concurrency. The compiler rewards functional discipline with faster binaries.
When in doubt: prefer the approach that gives the compiler more information at compile time and less work at runtime.
Decision Framework
Before writing any code, evaluate along these axes:
- Hot path or cold path? Hot paths get inlined, unrolled, cache-optimized. Cold paths get clarity and modularity. Never optimize cold paths at the expense of readability.
- Compile-time or runtime? Push every decision possible to compile time — generics, const, type-state patterns. Runtime dynamism (dyn Trait, HashMap lookups) only when the domain genuinely requires it.
- Owned or borrowed? Default to borrowing. Own only at boundaries (async tasks, thread spawns, storage). Never clone to satisfy the borrow checker — restructure instead.
- Allocation budget? Every Vec::new(), Box, String is a decision. Pre-allocate, reuse, pool. Arena allocators for frame-scoped data.
- Measured or assumed? Distinguish "removes known work" from "measured faster". Structural wins can be reasoned about. Runtime wins must be timed, especially when work shifts across passes rather than disappearing.
Architecture Rules
Functional Core, Imperative Shell
Structure every system as:
[ Pure logic ] → produces descriptions of work
[ Thin shell ] → executes side effects (GPU commands, I/O, allocation)
The pure core is testable without a GPU. The shell is a thin translation layer. This is not optional — it is the primary architectural invariant.
For hot paths, prefer caller-provided outputs over fresh allocations:
fn collect_draw_calls(scene: &Scene, camera: &Camera, out: &mut Vec<DrawCall>) {
    out.clear();
    out.extend(scene.visible_entities(camera).map(DrawCall::from_entity));
}
Returning a new Vec is acceptable on cold paths. Repeated frame-time allocation is not.
Example — Render pipeline:
// PURE: describes what to draw (no GPU types)
fn plan_draw_calls(scene: &Scene, camera: &Camera) -> Vec<DrawCall> { ... }
// SHELL: translates to GPU commands (side-effectful, thin)
fn execute_draw_calls(
    encoder: &mut wgpu::CommandEncoder,
    calls: &[DrawCall],
    resources: &GpuResources,
) { ... }
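The split can be exercised without a GPU. The following is a minimal sketch under stated assumptions: Entity, DrawCall, Recorder, and LogRecorder are hypothetical stand-ins invented for illustration, and a real shell would drive a wgpu render pass instead of a log.

```rust
/// Hypothetical scene entity: a mesh handle plus a visibility flag.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Entity {
    pub mesh_id: u32,
    pub visible: bool,
}

/// Plain-data description of one draw. Holds no GPU handles.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct DrawCall {
    pub mesh_id: u32,
}

/// PURE: decides what to draw and writes it into a caller-owned buffer.
pub fn plan_draw_calls(entities: &[Entity], out: &mut Vec<DrawCall>) {
    out.clear();
    out.extend(
        entities
            .iter()
            .filter(|e| e.visible)
            .map(|e| DrawCall { mesh_id: e.mesh_id }),
    );
}

/// Shell-side abstraction over "something that records draws" (assumed name).
pub trait Recorder {
    fn draw_mesh(&mut self, mesh_id: u32);
}

/// SHELL: translates descriptions into side effects and does nothing else.
pub fn execute_draw_calls<R: Recorder>(recorder: &mut R, calls: &[DrawCall]) {
    for call in calls {
        recorder.draw_mesh(call.mesh_id);
    }
}

/// Test double that logs mesh ids instead of issuing GPU commands.
#[derive(Default)]
pub struct LogRecorder {
    pub drawn: Vec<u32>,
}

impl Recorder for LogRecorder {
    fn draw_mesh(&mut self, mesh_id: u32) {
        self.drawn.push(mesh_id);
    }
}
```

Because the planner only touches plain data, its output can be compared with assert_eq! in an ordinary unit test, and the shell can be checked separately with the logging recorder.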
Separation of Concerns via Composition
Each module owns exactly one responsibility. Compose via function arguments and return values, never via shared mutable state.
input → parse → validate → transform → describe_effects → execute
Every arrow is a pure function except the last. If a function both computes and side-effects, split it.
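As an illustration of the arrow chain, here is a small sketch that turns a "WIDTHxHEIGHT" string into a resize effect. The Effect and InputError types, the function names, and the max_dim limit are all assumptions made for this example, and the execute stage writes to a log where real code would reconfigure a surface.

```rust
/// Hypothetical effect description produced by the pure stages.
#[derive(Debug, PartialEq)]
pub enum Effect {
    ResizeSurface { width: u32, height: u32 },
}

#[derive(Debug, PartialEq)]
pub enum InputError {
    Malformed,
    ZeroSized,
}

/// parse: text to raw numbers. Pure.
fn parse(input: &str) -> Result<(u32, u32), InputError> {
    let (w, h) = input.split_once('x').ok_or(InputError::Malformed)?;
    let w: u32 = w.trim().parse().map_err(|_| InputError::Malformed)?;
    let h: u32 = h.trim().parse().map_err(|_| InputError::Malformed)?;
    Ok((w, h))
}

/// validate: reject sizes a surface cannot have. Pure.
fn validate((w, h): (u32, u32)) -> Result<(u32, u32), InputError> {
    if w == 0 || h == 0 {
        Err(InputError::ZeroSized)
    } else {
        Ok((w, h))
    }
}

/// transform: clamp to an assumed device limit. Pure.
fn transform((w, h): (u32, u32), max_dim: u32) -> (u32, u32) {
    (w.min(max_dim), h.min(max_dim))
}

/// describe_effects: data describing the work, not the work itself. Pure.
fn describe_effects((width, height): (u32, u32)) -> Vec<Effect> {
    vec![Effect::ResizeSurface { width, height }]
}

/// The pure part of the chain, composed with Result combinators.
pub fn plan(input: &str, max_dim: u32) -> Result<Vec<Effect>, InputError> {
    parse(input)
        .and_then(validate)
        .map(|size| transform(size, max_dim))
        .map(describe_effects)
}

/// execute: the only impure stage. Here it appends to a log.
pub fn execute(effects: &[Effect], log: &mut Vec<String>) {
    for effect in effects {
        match effect {
            Effect::ResizeSurface { width, height } => {
                log.push(format!("resize {width}x{height}"));
            }
        }
    }
}
```

Everything up to plan can be tested by comparing values; only execute needs a real or fake environment.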
Small Functions, Big Inlining
Functions of 3–10 lines are the sweet spot. LLVM inlines small functions readily, and generic ones are monomorphized per call site, so splitting logic into small functions usually costs nothing at runtime and often exposes optimizations that a large monolithic body hides.
Mark hot-path helpers #[inline] or #[inline(always)] only after profiling
confirms a cross-crate boundary blocks inlining. Within a crate, trust LLVM.
Idiomatic Patterns
Read references/patterns.md for detailed code patterns covering:
- Type-state pattern for compile-time pipeline validation
- Zero-cost iterator pipelines vs manual loops
- Algebraic error handling (no panics in library code)
- Arena allocation and frame-scoped temporaries
- SoA (Struct of Arrays) for cache-friendly GPU data
- Newtype pattern for unit safety (WorldPos vs ScreenPos)
- Builder pattern for complex GPU resource creation
Read references/wgpu-idioms.md for wgpu-specific patterns covering:
- Bind group layout composition
- Render pass and compute pass structure
- Host/shader contract review
- Buffer management and staging patterns
- Pipeline caching and lazy initialization
- WGSL shader module organization
- Surface configuration and resize handling
Read references/perf-checklist.md before finalizing any code for:
- Allocation audit
- Cache-line analysis
- Branch elimination opportunities
- Parallelism and async considerations
Read references/examples.md for full before/after refactoring walkthroughs.
Use these as templates when reviewing or restructuring user code.
Scripts
Run .github/skills/rust-wgpu-functional/scripts/generate_layout_tests.py to auto-generate memory layout tests.
Annotate structs with /// @layout size=32 align=4 (or cache_line_fit,
gpu_aligned) and the script produces #[cfg(test)] assertions for size,
alignment, and field offsets. Run it after any struct change to catch silent
layout regressions.
python .github/skills/rust-wgpu-functional/scripts/generate_layout_tests.py src/types.rs > src/types_layout_tests.rs
python .github/skills/rust-wgpu-functional/scripts/generate_layout_tests.py src/ --recursive > tests/layout_tests.rs
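For orientation, this is a hand-written sketch of the kind of assertions a layout test contains. It is not the script's actual output, which is not shown here; the PointLight struct and the helper function are hypothetical, and std::mem::offset_of! requires Rust 1.77 or newer.

```rust
use std::mem::{align_of, offset_of, size_of};

/// @layout size=32 align=4
#[repr(C)]
#[derive(Clone, Copy, Debug, Default, PartialEq)]
pub struct PointLight {
    pub position: [f32; 3],
    pub intensity: f32,
    pub color: [f32; 4],
}

/// Returns (size, alignment, field offsets in declaration order).
pub fn point_light_layout() -> (usize, usize, [usize; 3]) {
    (
        size_of::<PointLight>(),
        align_of::<PointLight>(),
        [
            offset_of!(PointLight, position),
            offset_of!(PointLight, intensity),
            offset_of!(PointLight, color),
        ],
    )
}

/// Fails as soon as a field is added, removed, or reordered.
#[test]
fn point_light_layout_is_stable() {
    assert_eq!(point_light_layout(), (32, 4, [0, 12, 16]));
}
```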
Layout tests are necessary but not sufficient for Rust/WGSL interop. Also verify:
- Rust BindGroupLayoutEntry visibility and read/write mode match WGSL declarations
- Rust bind group entries target the same binding indices as the shader
- one pass owns each read-write aggregate output such as indirect args or scan totals
- Rust dispatch sizing agrees with shader-side bounds guards and scan hierarchy limits
Treat the pipeline layout, bind group entries, and WGSL @group/@binding declarations as
one contract and review them together.
GPU Contract Rules
Host/Shader ABI Is a Single Unit
For any buffer shared with WGSL:
- use #[repr(C)] and bytemuck::Pod + Zeroable on Rust structs
- keep field order identical between Rust and WGSL
- validate size and alignment after every structural change
- prefer explicit padding fields over implicit layout assumptions
If a struct exists only in WGSL but controls Rust behavior, such as an indirect-draw struct, document its byte layout next to the Rust buffer creation code.
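The explicit-padding rule is easiest to see with a vec3<f32> field, which WGSL aligns to 16 bytes while giving it a size of 12. The sketch below uses only the standard library; CullParams and its fields are hypothetical, and production code would derive bytemuck::Pod and bytemuck::Zeroable instead of serializing by hand.

```rust
use std::mem::{offset_of, size_of};

// WGSL side, for reference (assumed shader declaration):
// struct CullParams {
//     instance_count: u32,        // offset 0
//     camera_origin: vec3<f32>,   // align 16, so offset 16; struct size 32
// }
#[repr(C)]
#[derive(Clone, Copy, Debug, Default, PartialEq)]
pub struct CullParams {
    pub instance_count: u32,
    pub _pad0: [u32; 3], // pushes camera_origin to offset 16
    pub camera_origin: [f32; 3],
    pub _pad1: u32, // rounds the struct up to 32 bytes
}

// Compile-time check: the build fails if the size drifts.
const _: () = assert!(size_of::<CullParams>() == 32);

/// Offsets of the two meaningful fields, for use in layout tests.
pub const CULL_PARAMS_OFFSETS: (usize, usize) = (
    offset_of!(CullParams, instance_count),
    offset_of!(CullParams, camera_origin),
);

impl CullParams {
    pub fn new(instance_count: u32, camera_origin: [f32; 3]) -> Self {
        Self { instance_count, _pad0: [0; 3], camera_origin, _pad1: 0 }
    }

    /// Little-endian bytes with padding zeroed, ready for a buffer upload.
    pub fn to_le_bytes(&self) -> [u8; 32] {
        let mut out = [0u8; 32];
        out[0..4].copy_from_slice(&self.instance_count.to_le_bytes());
        for (i, component) in self.camera_origin.iter().enumerate() {
            let start = 16 + i * 4;
            out[start..start + 4].copy_from_slice(&component.to_le_bytes());
        }
        out
    }
}
```

With the padding spelled out as fields, the struct has no implicit padding bytes, which is also what a Pod derive requires.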
Bindings Must Be Reviewed as a Triple
Every binding change must keep WGSL @group/@binding, Rust BindGroupLayoutEntry, and Rust
BindGroupEntry aligned. See references/wgpu-idioms.md §4 for the full review checklist.
Multi-Pass Compute Needs Explicit Ownership
In culling, scan, compaction, and indirect-draw pipelines:
- each pass should own exactly one transformation of data
- one pass writes each aggregate output; later passes only consume it
- encode scan hierarchy limits as Rust assertions near buffer sizing and dispatch math
- prefer local recomputation over oversized intermediate buffers when it reduces bandwidth
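One way to keep the scan hierarchy limit next to the dispatch math is sketched below. It assumes a two-level scan with a workgroup size of 256; the constants, ScanDispatch, ScanError, and plan_scan are hypothetical names, and the limit is returned as a Result so library code stays panic-free, with a debug_assert! restating the invariant.

```rust
/// Assumed to match @workgroup_size in the scan shader.
pub const WORKGROUP_SIZE: u32 = 256;

/// Two-level scan: a single workgroup must be able to scan all block sums.
pub const MAX_SCAN_ELEMENTS: u32 = WORKGROUP_SIZE * WORKGROUP_SIZE;

#[derive(Debug, PartialEq)]
pub struct ScanDispatch {
    /// Workgroups for the per-block pass.
    pub workgroups: u32,
    /// Size in bytes of the u32 block-sum buffer.
    pub block_sums_bytes: u64,
}

#[derive(Debug, PartialEq)]
pub enum ScanError {
    Empty,
    ExceedsHierarchy { count: u32, max: u32 },
}

/// Pure sizing logic: no device access, so it can be unit tested.
pub fn plan_scan(element_count: u32) -> Result<ScanDispatch, ScanError> {
    if element_count == 0 {
        return Err(ScanError::Empty);
    }
    if element_count > MAX_SCAN_ELEMENTS {
        return Err(ScanError::ExceedsHierarchy {
            count: element_count,
            max: MAX_SCAN_ELEMENTS,
        });
    }
    let workgroups = element_count.div_ceil(WORKGROUP_SIZE);
    // Invariant implied by the check above, stated where the math lives.
    debug_assert!(workgroups <= WORKGROUP_SIZE);
    Ok(ScanDispatch {
        workgroups,
        block_sums_bytes: u64::from(workgroups) * 4,
    })
}
```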
WGSL Safety Rules
- Every compute shader must guard storage-buffer indexing with runtime bounds checks.
- arrayLength() is for bounds and derived limits, not wishful assumptions about allocations.
- Prefix scans require the full read-barrier-write-barrier sequence at every step.
- Avoid atomics when staged scan/compaction removes global contention at scale.
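The read-then-write discipline of a prefix scan can be modelled on the CPU. The sketch below is a reference implementation of a Hillis-Steele inclusive scan in Rust, not shader code; the function name is hypothetical. Each loop iteration reads only from one buffer and writes only to the other, which is the separation that workgroupBarrier() calls have to enforce when a shader reuses a single workgroup array.

```rust
/// CPU reference for a Hillis-Steele inclusive scan with wrapping addition.
/// Useful as a test oracle for GPU scan output.
pub fn inclusive_scan_reference(input: &[u32]) -> Vec<u32> {
    let mut read = input.to_vec();
    let mut write = vec![0u32; input.len()];
    let mut stride = 1;
    while stride < read.len() {
        // Read phase and write phase touch different buffers, so no element
        // is overwritten before every "thread" has read it.
        for (i, slot) in write.iter_mut().enumerate() {
            *slot = if i >= stride {
                read[i].wrapping_add(read[i - stride])
            } else {
                read[i]
            };
        }
        // The swap plays the role of the barrier between steps.
        std::mem::swap(&mut read, &mut write);
        stride *= 2;
    }
    read
}
```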
Profiling and Verification
Optimize by proof, not by narrative.
- Separate CPU planning, compute, draw, and composite timings.
- Timestamp writes must bracket actual work, not placeholders.
- Compare before/after at fixed camera states and fixed quality settings.
- If a change removes duplicated shader math, that is a structural win.
- If a change adds passes, treat it as unproven until timings show the tradeoff is favorable.
- State uncertainty plainly when measurement is still missing.
See references/perf-checklist.md §5 for GPU-specific verification steps.
Type System as Performance Tool
Algebraic Data Types — Model the Domain Exactly
// The GPU pipeline has exactly these states. Invalid states are unrepresentable.
enum PipelineState {
    Uninitialized,
    Compiling { shader_source: ShaderSource },
    Ready { pipeline: wgpu::RenderPipeline },
    Error { reason: PipelineError },
}
Match exhaustively. Never use _ => wildcard on domain enums — you want
the compiler to break your code when you add a variant.
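A sketch of exhaustive matching on such an enum follows. To stay runnable without wgpu it swaps in hypothetical stand-ins: PipelineHandle replaces wgpu::RenderPipeline, and String replaces ShaderSource. The two functions are illustrative names, not part of any library.

```rust
/// Stand-in for wgpu::RenderPipeline (assumption for this sketch).
#[derive(Debug, PartialEq)]
pub struct PipelineHandle(pub u32);

#[derive(Debug, PartialEq)]
pub enum PipelineError {
    ShaderRejected,
}

#[derive(Debug, PartialEq)]
pub enum PipelineState {
    Uninitialized,
    Compiling { shader_source: String },
    Ready { pipeline: PipelineHandle },
    Error { reason: PipelineError },
}

/// Pure transition applied when a compile result arrives.
/// Only a Compiling state reacts; every other state is returned unchanged.
pub fn on_compile_finished(
    state: PipelineState,
    result: Result<PipelineHandle, PipelineError>,
) -> PipelineState {
    match state {
        PipelineState::Compiling { .. } => match result {
            Ok(pipeline) => PipelineState::Ready { pipeline },
            Err(reason) => PipelineState::Error { reason },
        },
        // Variants are listed instead of `_ =>` so that adding a new one
        // makes this match fail to compile until it is handled.
        unchanged @ (PipelineState::Uninitialized
        | PipelineState::Ready { .. }
        | PipelineState::Error { .. }) => unchanged,
    }
}

pub fn is_drawable(state: &PipelineState) -> bool {
    match state {
        PipelineState::Ready { .. } => true,
        PipelineState::Uninitialized
        | PipelineState::Compiling { .. }
        | PipelineState::Error { .. } => false,
    }
}
```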
Generics Over Dynamic Dispatch
// YES: monomorphized, zero-cost, inlinable
fn process_vertices<V: Vertex>(vertices: &[V]) -> BoundingBox { ... }
// AVOID on hot paths: vtable indirection per call, no inlining.
// A slice of trait objects must hold pointers; [dyn Vertex] is not a valid element type.
fn process_vertices_dyn(vertices: &[&dyn Vertex]) -> BoundingBox { ... }
Use dyn Trait only for heterogeneous collections on cold paths (plugin
systems, configuration, error handling).
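The sketch below fills in one possible body for a generic bounding-box function and shows how the same code can still serve a heterogeneous cold-path collection. The Vertex trait shape, PosVertex, and BoundingBox are assumptions made for this example.

```rust
pub trait Vertex {
    fn position(&self) -> [f32; 3];
}

/// Lets references, including &dyn Vertex, satisfy a V: Vertex bound.
impl<T: Vertex + ?Sized> Vertex for &T {
    fn position(&self) -> [f32; 3] {
        (**self).position()
    }
}

#[derive(Clone, Copy, Debug, PartialEq)]
pub struct BoundingBox {
    pub min: [f32; 3],
    pub max: [f32; 3],
}

#[derive(Clone, Copy, Debug)]
pub struct PosVertex {
    pub pos: [f32; 3],
}

impl Vertex for PosVertex {
    fn position(&self) -> [f32; 3] {
        self.pos
    }
}

/// Monomorphized per vertex type. Returns None for an empty slice.
pub fn bounding_box<V: Vertex>(vertices: &[V]) -> Option<BoundingBox> {
    let (first, rest) = vertices.split_first()?;
    let start = BoundingBox { min: first.position(), max: first.position() };
    Some(rest.iter().fold(start, |acc, v| {
        let p = v.position();
        BoundingBox {
            min: std::array::from_fn(|i| acc.min[i].min(p[i])),
            max: std::array::from_fn(|i| acc.max[i].max(p[i])),
        }
    }))
}
```

Calling bounding_box with a slice of PosVertex gets a specialized copy with static calls; calling it with a slice of &dyn Vertex reuses the same source but dispatches each position() call through a vtable.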
Const Generics for Compile-Time Dimensions
struct Matrix<const ROWS: usize, const COLS: usize> {
    data: [[f32; COLS]; ROWS],
}
// Multiplication is only valid when dimensions align; this is checked at compile time
impl<const M: usize, const N: usize, const P: usize>
    std::ops::Mul<Matrix<N, P>> for Matrix<M, N>
{
    type Output = Matrix<M, P>;
    fn mul(self, rhs: Matrix<N, P>) -> Matrix<M, P> { ... }
}
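For reference, a complete version of the idea might look like the sketch below. The naive triple loop is an assumption chosen for clarity, not a tuned kernel.

```rust
use std::ops::Mul;

#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Matrix<const ROWS: usize, const COLS: usize> {
    pub data: [[f32; COLS]; ROWS],
}

impl<const M: usize, const N: usize, const P: usize> Mul<Matrix<N, P>> for Matrix<M, N> {
    type Output = Matrix<M, P>;

    /// Row-by-column product; the inner dimension N is shared by both types.
    fn mul(self, rhs: Matrix<N, P>) -> Matrix<M, P> {
        let mut data = [[0.0f32; P]; M];
        for (out_row, lhs_row) in data.iter_mut().zip(self.data.iter()) {
            for (col, out) in out_row.iter_mut().enumerate() {
                *out = lhs_row
                    .iter()
                    .zip(rhs.data.iter())
                    .map(|(a, rhs_row)| a * rhs_row[col])
                    .sum();
            }
        }
        Matrix { data }
    }
}
```

Multiplying a Matrix<2, 3> by a Matrix<3, 2> type-checks and yields a Matrix<2, 2>; multiplying a Matrix<2, 3> by another Matrix<2, 3> is rejected by the compiler because no matching Mul impl exists.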
Iterator Pipelines — Declarative and Fast
Rust iterators often compile to the same code as hand-written loops. Prefer them when they preserve clarity and let LLVM fuse the work. Switch to explicit loops when the iterator form obscures data movement, blocks output reuse, or prevents obvious vectorization.
// Declarative; filter and map are lazy and fuse into single passes.
// The sort is the exception: it buffers into a temporary Vec, and collect allocates the result.
let visible: Vec<DrawCall> = entities.iter()
    .filter(|e| frustum.contains(e.bounds()))
    .sorted_unstable_by_key(|e| e.material_id()) // requires itertools::Itertools
    .map(|e| e.to_draw_call(camera))
    .collect();
When iterators cannot express the pattern (e.g., parallel mutation of
disjoint slices), drop to split_at_mut or chunks_exact_mut — never
to raw indexing with bounds checks disabled.
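Both slice helpers are in the standard library. The sketch below shows each on a flat f32 buffer; the function names and the data layout (packed xyz triples, a front half feeding a back half) are assumptions for illustration.

```rust
/// Scales every complete xyz triple in place.
/// Trailing elements that do not form a full triple are left untouched.
pub fn scale_positions(positions: &mut [f32], scale: [f32; 3]) {
    for xyz in positions.chunks_exact_mut(3) {
        for (component, factor) in xyz.iter_mut().zip(scale.iter()) {
            *component *= *factor;
        }
    }
}

/// Reads the front half and writes a damped copy into the back half.
/// split_at_mut yields two disjoint slices, so one can be read while the
/// other is written without unsafe code.
pub fn damp_into_back_half(buffer: &mut [f32], factor: f32) {
    let mid = buffer.len() / 2;
    let (front, back) = buffer.split_at_mut(mid);
    for (dst, src) in back.iter_mut().zip(front.iter()) {
        *dst = *src * factor;
    }
}
```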
Error Handling — Total Functions, No Panics
// Every function in library code returns Result or is infallible.
// Panics are reserved for: tests, main(), and proven-unreachable invariants.
// Domain errors are algebraic, not stringly typed
#[derive(Debug, thiserror::Error)]
enum ShaderError {
    #[error("compilation failed: {diagnostic}")]
    Compilation { diagnostic: String, line: u32 },
    #[error("unsupported feature: {feature}")]
    Unsupported { feature: &'static str },
}
// The ? operator composes errors through pure function chains
fn load_pipeline(device: &wgpu::Device, source: &str) -> Result<Pipeline, ShaderError> {
    let module = compile_shader(source)?;
    let layout = derive_layout(&module)?;
    Ok(build_pipeline(device, &module, &layout))
}
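The same shape works without thiserror. The sketch below implements Display by hand and chains two fallible steps with the ? operator; Module, check_features, compile_shader, and load_module are hypothetical, and the string checks are toy stand-ins for real shader validation.

```rust
use std::fmt;

#[derive(Debug, PartialEq)]
pub enum ShaderError {
    Compilation { diagnostic: String, line: u32 },
    Unsupported { feature: &'static str },
}

impl fmt::Display for ShaderError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ShaderError::Compilation { diagnostic, line } => {
                write!(f, "compilation failed at line {line}: {diagnostic}")
            }
            ShaderError::Unsupported { feature } => {
                write!(f, "unsupported feature: {feature}")
            }
        }
    }
}

impl std::error::Error for ShaderError {}

/// Hypothetical stand-in for a compiled shader module.
#[derive(Debug, PartialEq)]
pub struct Module {
    pub entry_points: usize,
}

/// Toy check: rejects sources that enable the f16 extension.
fn check_features(source: &str) -> Result<(), ShaderError> {
    if source.contains("enable f16;") {
        Err(ShaderError::Unsupported { feature: "f16" })
    } else {
        Ok(())
    }
}

/// Toy check: requires at least one @compute entry point.
fn compile_shader(source: &str) -> Result<Module, ShaderError> {
    let entry_points = source.matches("@compute").count();
    if entry_points == 0 {
        return Err(ShaderError::Compilation {
            diagnostic: "no @compute entry point".to_owned(),
            line: 1,
        });
    }
    Ok(Module { entry_points })
}

/// Each ? returns early with the typed error; no panics, no strings to parse.
pub fn load_module(source: &str) -> Result<Module, ShaderError> {
    check_features(source)?;
    let module = compile_shader(source)?;
    Ok(module)
}
```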
Code Generation Checklist
When generating Rust code with this skill, verify every output against:
- No unnecessary allocations: Can this Vec be an iterator? Can this String be &str?
- No clone() without justification: Each clone gets a // clone: <reason> comment
- No unwrap() in non-test code: Use ?, map, and_then, or unwrap_or
- Exhaustive matching: No _ => on domain enums
- Pure functions dominate: Side effects only at boundaries
- Types encode invariants: If it compiles, it's valid
- Iterators over indexing: Unless disjoint mutation requires slices
- Borrowing over owning: Own only at async/thread/storage boundaries
- Generics over dyn: On hot paths, always monomorphize
- Const over let: If it's known at compile time, make it const
- Rust/WGSL contract verified: Layouts, bindings, visibility, and read/write modes match
- Optimization claim is evidenced: structural proof or timing data, clearly labeled
Naming Conventions
- Types: PascalCase, descriptive noun: VertexBuffer, RenderPassDescriptor
- Functions: snake_case, verb-first: build_pipeline, compute_normals
- Pure functions returning bool: is_ or has_ prefix: is_visible, has_transparency
- Constructors: new, with_capacity, from_*
- Conversions: to_* (owned), as_* (borrowed), into_* (consuming)
- Unsafe wrappers: *_unchecked, and always document the safety invariant
Refactoring Examples
Read references/examples.md for full before/after transformations showing:
- OOP entity hierarchy → data-oriented SoA with pure systems
- Monolithic render function → functional core / imperative shell split
- Runtime type checks → type-state compile-time enforcement
- Scattered allocations → arena-scoped frame temporaries
- Stringly-typed errors → algebraic domain errors with monadic chaining
Each example shows the original code, explains what's wrong, and walks through the refactored version with performance annotations.
When NOT to Apply Functional Purity
Some patterns legitimately require mutation. Don't fight these — contain them:
- ECS update loops: Mutate components in-place, but keep system functions pure in their logic (take input components, produce output components).
- GPU buffer writes: queue.write_buffer is inherently side-effectful. Prepare the data purely, execute the write in the shell.
- Frame allocators / arenas: Bump allocation is mutable by nature. Wrap it, expose an immutable borrowing API, reset at frame boundaries.
- Hot inner loops: If a profiler shows that an iterator chain prevents auto-vectorization in a specific case, drop to a for loop with raw slice access. Comment why. This should be rare (<5% of code).
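A contained frame allocator can be as small as the sketch below. It is a simplified stand-in for a real bump arena: FrameScratch and its methods are hypothetical names, storage is a single Vec, and allocations are identified by index ranges rather than references so the immutable read API stays simple.

```rust
use std::ops::Range;

/// Frame-scoped scratch storage backed by one reusable Vec.
pub struct FrameScratch<T> {
    items: Vec<T>,
}

impl<T> FrameScratch<T> {
    pub fn with_capacity(capacity: usize) -> Self {
        Self { items: Vec::with_capacity(capacity) }
    }

    /// Mutation is confined here: appends items and returns the range that
    /// identifies them until the next reset.
    pub fn push_all(&mut self, new_items: impl IntoIterator<Item = T>) -> Range<usize> {
        let start = self.items.len();
        self.items.extend(new_items);
        start..self.items.len()
    }

    /// Immutable view of a previously returned range.
    /// Returns None if the range is stale or out of bounds.
    pub fn get(&self, range: Range<usize>) -> Option<&[T]> {
        self.items.get(range)
    }

    /// Call at the frame boundary. Drops the items but keeps the capacity,
    /// so the next frame does not reallocate.
    pub fn reset(&mut self) {
        self.items.clear();
    }

    pub fn capacity(&self) -> usize {
        self.items.capacity()
    }
}
```

Note that a stale range is only detected when it falls outside the current contents; a stricter design could add a frame generation counter to each handle.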