A demo of GPU prefix-sum (scan) implementations in Rust using wgpu, with multiple algorithms and a small benchmark harness for comparisons.
These implementations are for demonstration purposes only and do not support arrays of arbitrary length, overflow handling or any other edge cases. That means these implementations are not suitable for practical use, although they might be a good starting point for your implementation.
You can find more details in the below article.
https://yayo1.com/en/blog/webgpu-prefix-sum/
Japanese version is here.
https://zenn.dev/yayo1/articles/1273ec6ac3bc17
Run on Mac mini M4 Pro 24GB with 16 core GPU.
- CPU baseline: sequential inclusive prefix sum (
src/cpu_prefix_scan.rs). - GPU Hillis-Steele scan (inclusive) with double buffers (
src/hillis_steele_scan.rs). - GPU Blelloch scan (exclusive) in two forms:
- On global memory (
src/global_blelloch_scan.rs). - Blocked scan using shared memory (
src/block_blelloch_scan.rs).
- On global memory (
- GPU subgroup scan (exclusive) using subgroup operations (
src/subgroup_scan.rs).
WGSL shaders for each GPU path in src/*.wgsl.
- Rust toolchain with 2024 edition support.
- A GPU/driver that supports
wgpucompute and theSUBGROUPfeature (required by the subgroup scan).
Criterion benchmarks compare all implementations across powers of two from 2^1 to 2^29.
cargo benchThe benchmark entry point is benches/bench.rs.
- The Hillis-Steele implementation produces an inclusive scan.
- The Blelloch and subgroup implementations produce exclusive scans.
- GPU implementations assume the input length is a power of two.
