Benchmark Results and Analysis

Results from the RingMPSC high-performance benchmark. For measurement methodology, see methodology.md.

Hardware

AMD Ryzen 7 5700 (8C/16T, Zen 3), DDR4-3200 dual-channel, Linux 6.x, Zig 0.15.2 ReleaseFast.

Throughput Results

u32 Messages (4 bytes)

Config	Throughput	Scaling	Efficiency
1P1C	29 B/s	1.0x	100%
2P1C	58 B/s	2.0x	100%
4P1C	112 B/s	3.9x	97%
6P1C	160 B/s	5.5x	92%
8P1C	180 B/s	6.2x	78%

u64 Messages (8 bytes)

Config	Throughput	Scaling	Efficiency
1P1C	15 B/s	1.0x	100%
2P1C	29 B/s	1.9x	97%
4P1C	58 B/s	3.9x	97%
6P1C	85 B/s	5.7x	94%
8P1C	93 B/s	6.2x	78%

500M msgs/producer, 32K batch, 32K ring, @memset fill. Each per-producer ring (128 KB for u32, 256 KB for u64) fits within the Zen 3 512 KB dedicated L2 per core — throughput reflects cache-resident end-to-end message movement through the ring buffer. Efficiency is calculated relative to 1P1C baseline scaled linearly; the drop at 8P reflects shared L3 and memory controller contention.

Bottleneck Analysis

Isolated component measurements (single thread, 100K iterations of 32K batches):

Component	Rate	% of Producer Time
`@memset` only (no ring ops)	Compiler-eliminated in isolation; dominant cost in context (~16 B/s combined with reserve+commit)	—
`reserve` + `commit` (no fill)	~28,000 B/s	<0.1%
`reserve` + `@memset` + `commit`	~16 B/s	99.9%
Consumer drain (head advancement)	~164,000 B/s	—

Conclusion: The ring coordination overhead (reserve/commit/consume) is negligible. Nearly all time is spent on the producer @memset — writing 256KB (u64) or 128KB (u32) of data per batch.

The channel infrastructure can sustain trillions of messages per second. Throughput is bounded by L2 cache write bandwidth and coherency traffic at these ring sizes, not DRAM bandwidth.

Scaling Analysis

Near-Linear to 6P

From 1P to 6P, scaling efficiency stays above 95%. Each producer has a dedicated SPSC ring with zero contention. The only shared resources are the L3 cache and memory controller, which handle concurrent cache-coherency traffic efficiently at moderate producer counts.

Efficiency Drop at 8P

At 8P, efficiency drops to ~80%. Three factors:

Memory bandwidth saturation: 8 producers × 256KB/batch × ~490K batches/sec = ~960 GB/s theoretical L2 write demand against the ~24 GB/s effective write bandwidth observed at saturation. The memory controller and L3 must serialize an increasing volume of coherency traffic.
L3 cache contention: 8 active rings × 256KB = 2MB of hot data in a 32MB shared L3. While this fits, the increased working set causes more L3 evictions and refills.
Consumer polling overhead: The single consumer thread scans all 8 rings per iteration. At high producer rates, some rings are empty on a given scan (producers are bursty), adding wasted atomic loads.

Message Size Correlation

The 2x difference between u32 and u64 throughput directly reflects the 2x difference in write bandwidth per message:

Type	Bytes/Batch	1P Write BW	8P Write BW
u32	128 KB	~3.7 GB/s	~23.7 GB/s
u64	256 KB	~3.8 GB/s	~24.8 GB/s

Both saturate at approximately the same total memory bandwidth (~24 GB/s write), confirming the memory wall.

Configuration Sensitivity

Ring Size (`ring_bits`)

`ring_bits`	Slots	Ring Size (u64)	8P1C Throughput	Why
14	16K	128 KB	~94 B/s	Ring fills before consumer drains → reserve failures → producer spins
15	32K	256 KB	~93 B/s	Sweet spot: fits in 512 KB L2, large enough to absorb bursts
16	64K	512 KB	~82 B/s	Ring = L2 size → cache thrashing, evictions during @memset
18	256K	2 MB	~55 B/s	Working set in L3 → higher latency per write
20	1M	8 MB	~25 B/s	Working set in DRAM → DDR4-bandwidth bound (~24 GB/s)

How to choose ring_bits for your workload:

ring_size = 2^ring_bits × sizeof(T)

Rule of thumb:
  ring_size < L2_cache_size / 2    → best throughput (cache-resident)
  ring_size ≈ L2_cache_size        → cache pressure, ~15% throughput loss
  ring_size > L2_cache_size        → L3/DRAM bound, significant throughput loss

For Zen 3 (512 KB L2):
  u32: ring_bits ≤ 16 (256 KB) fits in L2
  u64: ring_bits ≤ 15 (256 KB) fits in L2

For Intel (1.25 MB L2 on Alder Lake+):
  u64: ring_bits ≤ 17 (1 MB) fits in L2

Batch Size

Batch size determines how many messages are written per reserve/commit cycle. The amortization benefit:

Batch Size	Atomic Ops/Msg	Overhead Ratio
1	2 (load + store)	100%
32	2/32 = 0.06	6.3%
1K	2/1K = 0.002	0.2%
32K	2/32K = 0.00006	0.006%

Batch size should not exceed ring capacity (clamped by reserve()). Smaller batches increase the ratio of atomic operations to useful work. For maximum throughput, use batch_size = 2^ring_bits (matching ring capacity). For lower latency, use smaller batches to reduce per-batch processing time.

When to Use Each Channel Type

Scenario	Channel	Config	Expected Throughput
Logging pipeline (many writers, one flusher)	MPSC	ring_bits=15, batch consume	100+ B/s
Audio/video frame passing (1:1)	SPSC	ring_bits=12-14	10-30 B/s
Work distribution (fan-out)	SPMC	ring_bits=14	Depends on consumer count
Task queue (M workers)	MPMC	ring_bits=14, rings_per_shard=2	Depends on topology
Low-latency trading	SPSC	ring_bits=10, busy_spin wait	~50 M/s single-item, <200ns RTT

Competitive Benchmarks

Apples-to-apples comparison against three baselines on the same hardware, same item count, same item size, same core pinning. The suite measures batch throughput (u32 + u64), MPSC scaling, SPSC single-item throughput, and round-trip latency with percentiles. Generates a self-contained HTML report at benchmarks/results/bench.html.

Baselines:

Rigtorp-style Cached SPSC — The gold standard for SPSC: cached head/tail, power-of-2 masking. Matches the algorithm in rigtorp/SPSCQueue and folly::ProducerConsumerQueue. Used in batch throughput comparison.
CAS MPSC Queue (Vyukov-style) — Lock-free bounded array with CAS on enqueue. The standard approach used by tokio, crossbeam (bounded), and Linux kernel's llist. Used in single-item MPSC comparison.
Mutex + buffer — std.Thread.Mutex protected ring. Answers "why not just use a mutex?" Used in both batch and single-item comparisons.

Batch Throughput

All baselines use the same ring-decomposed architecture (separate ring per producer) to isolate the ring algorithm difference. 32K batch, 32K ring, 100M msgs/producer, @memset fill.

Config	ringmpsc	Cached SPSC	Mutex
1P1C u32	29 B/s	31 B/s	7 B/s
8P1C u32	180 B/s	197 B/s	77 B/s
1P1C u64	15 B/s	16 B/s	7 B/s
8P1C u64	93 B/s	102 B/s	37 B/s

In batch mode with @memset fill, all implementations do the same fundamental operation: write a contiguous block into a cache-resident ring with a single tail store. The Cached SPSC is slightly faster here because its code path is simpler (no channel abstraction, no producer registration, no NUMA support). This is expected — the batch benchmark isolates ring write bandwidth, where ringmpsc's additional features add small overhead. ringmpsc's advantage is the MPSC channel abstraction (dynamic producer registration, NUMA-aware allocation, wait strategies, metrics) and the scaling behavior in contended single-item workloads shown below.

SPSC Single-Item Throughput

The bench suite does not include a dedicated SPSC single-item comparison. The MPSC single-item benchmark (below) at 1P1C effectively measures SPSC throughput: ringmpsc achieves ~54 M/s in this mode.

Why single-item is slower: ringmpsc's send() goes through reserve() (bounds check + cached head lookup) → memcpy into the reserved slice → commit() (tail store + conditional futex wake check). This overhead exists to enable batch operations, SIMD copy, zero-copy reserve/commit, and futex-based blocking — features simpler rings don't have.

Why this doesn't matter in practice: Real workloads batch. In batch mode (bench-high-perf), ringmpsc achieves ~180 B/s — 3,300x faster than 54 M/s single-item — because it amortizes all per-operation overhead across 32K items with a single atomic head update.

SPSC Latency (ping-pong round-trip)

100K iterations after 10K warmup. Sorted samples provide percentile breakdowns.

Implementation	p50	p99	Min RTT
ringmpsc SPSC	160 ns	210 ns	110 ns

The bench suite measures ringmpsc SPSC latency only. p99 (211 ns) is tight — the occasional spike comes from cached-head refresh on the slow path.

MPSC Throughput (scaling with contention)

This is where ring decomposition dominates. 5M u32 total, single-item sendOne/consumeAll.

Important architectural note: This comparison is between fundamentally different architectures, not just algorithms. ringmpsc gives each producer a dedicated SPSC ring (zero inter-producer contention by design). The CAS MPSC and Mutex baselines use a single shared queue — which is how traditional MPSC queues work (Vyukov, tokio, crossbeam bounded, etc.). The point of ringmpsc IS the architectural choice to decompose MPSC into N independent SPSC problems. The scaling advantage below is the payoff of that design decision.

Config	ringmpsc	CAS MPSC	Mutex	vs Mutex	vs CAS
1P1C	54 M/s	115 M/s	24 M/s	2x	0.5x
2P1C	117 M/s	23 M/s	29 M/s	4x	5x
4P1C	237 M/s	20 M/s	10 M/s	24x	12x
8P1C	514 M/s	14 M/s	4 M/s	129x	37x

Why the CAS queue wins at 1P1C: With a single producer, there is no CAS contention — every cmpxchg succeeds on the first attempt. The Vyukov array queue's tight enqueue loop (load-CAS-store) is faster than ringmpsc's register/sendOne path which goes through the MPSC channel's ring lookup. At 1P1C, CAS achieves 115 M/s vs ringmpsc's 54 M/s.

Why the CAS queue collapses at 2P+: Multiple producers compete on the same enqueue_pos via CAS. Each failed CAS wastes a cycle and invalidates other cores' caches. At 8P, the CAS queue spends most time retrying failed operations — throughput drops from 115 M/s (1P) to 14 M/s (8P), an 8x degradation.

Why the mutex collapses: All 8 producers + 1 consumer compete for the same mutex. At 8P, the lock is almost never uncontested. Throughput drops from 24 M/s (1P) to 4 M/s (8P).

Why ringmpsc scales well: Each producer has its own SPSC ring — zero shared state between producers. Adding a producer adds a new ring, not more contention. At 8P, ringmpsc achieves 514 M/s (10x over 1P1C's 54 M/s). Scaling efficiency is 95%+ up to 6 producers, dropping to ~80% at 8P due to memory bandwidth saturation and L3 contention — not lock contention.

Run with: zig build bench. Generates an HTML report at benchmarks/results/bench.html with inline SVG charts.

Note: These numbers use single-item sends (no batching). The batch throughput benchmark (bench-high-perf) shows ~180 B/s (u32) / ~93 B/s (u64) when using 32K batch sizes.

External Comparison Context

Direct comparison with other implementations is difficult due to different hardware, message sizes, and measurement methodologies. These numbers are from other projects' own benchmarks and are included for rough context only — not as direct comparisons.

Implementation	Language	Reported Throughput	Notes
RingMPSC (batch)	Zig	180 B/s (u32), 93 B/s (u64)	This work. 8P1C, @memset fill, 32K batch.
RingMPSC (single)	Zig	416 M/s (u32, MPSC), 680 M/s (SPSC)	This work. 8P1C MPSC / 1P1C SPSC.
LMAX Disruptor	Java	~100M ops/s	3-stage pipeline, JVM
rigtorp/SPSCQueue	C++	~363 M/s single, ~1 B/s batch	SPSC only. Ryzen 9 3900X.
crossbeam-channel	Rust	~100M msg/s	MPSC, general-purpose
moodycamel	C++	~500M msg/s	MPMC, general-purpose

RingMPSC's batch numbers (~180 B/s) reflect cache-resident throughput with large batches — a fundamentally different measurement than single-item throughput. The single-item numbers are more directly comparable to crossbeam/moodycamel methodology but measure MPSC scaling, not SPSC peak. We encourage readers to run their own benchmarks on their own hardware.

Feature-Specific Performance

EventNotifier (eventfd + epoll)

Metric	Value	Notes
signal() overhead	~1us	One write() syscall per call
consume() overhead	~1us	One read() syscall
Coalescing	Automatic	Multiple signals between wakes accumulate in u64 counter
Wake latency (p50)	6us	vs 110ns spin-wait (measured, bench-wake-latency)
Wake latency (p99)	39us	vs 982ns spin-wait
Impact on batch throughput	<0.1%	One signal per commit() of 32K items = negligible

When to use: Consumer is an event loop multiplexing ring data with socket I/O. For dedicated consumer threads, futex wake is faster (zero syscall when no waiters).

BufferPool (Zero-Copy)

Message Size	Inline Ring(T)	BufferPool + Ring(Handle)	Speedup
64 bytes	28 M/s	9 M/s	0.3x (CAS overhead dominates)
256 bytes	29 M/s	9 M/s	0.3x
1 KB	21 M/s	8 M/s	0.4x
4 KB	5 M/s	7 M/s	1.5x (zero-copy wins)

Measured: bench-zero-copy, ReleaseFast, 5M messages, single-item send, Ryzen 7 5700.

BufferPool alloc/free cost: ~10-50ns per CAS operation. The CAS overhead is fixed regardless of message size, while the inline copy cost scales linearly with size. The crossover point is ~4KB on this hardware — below that, inline copy is faster because the copy cost is less than the CAS overhead.

When to use: Messages ≥ 4KB where copy cost exceeds CAS overhead. For smaller messages, use inline Ring(T).

SharedRing (Cross-Process)

Metric	SharedRing	In-Process Ring	Overhead
Throughput (u64, batch, 1P1C)	7,441 M/s	7,796 M/s	4.6%
Memory copies per message	0 (shared memory)	0 (same process)	0
Syscalls per message	0 (data path)	0	0
Cross-process futex wake	~50-100ns extra	0	per wake only

Measured: bench-cross-process, ReleaseFast, 50M messages, batch 4096, Ryzen 7 5700.

SharedRing throughput is within 5% of in-process Ring because the data path is identical — both use acquire/release atomics on cache-coherent memory. The 4.6% overhead comes from the heartbeat timestamp store on each commit() and the slightly larger working set (extern struct header vs compact struct).

Setup overhead: memfd_create + ftruncate + mmap (~50us total). sendFd/recvFd via SCM_RIGHTS (~10us). This is a one-time cost at channel establishment.

Adaptive Consumer Skip

Producers	Active	Without Skip	With Skip	Improvement
8	8	8 polls/round	8 polls/round	0% (all active)
8	4	8 polls/round	~4 polls/round	50% fewer polls
8	1	8 polls/round	~1 poll/round	87% fewer polls
32	2	32 polls/round	~2 polls/round	94% fewer polls

Skip counters double on consecutive empty polls (1, 2, 4, 8, 16, 32, 64 max). A ring that becomes active resets to 0 immediately. Worst-case detection latency for a ring going from empty to non-empty: 64 consumer poll intervals.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Benchmark Results and Analysis

Hardware

Throughput Results

u32 Messages (4 bytes)

u64 Messages (8 bytes)

Bottleneck Analysis

Scaling Analysis

Near-Linear to 6P

Efficiency Drop at 8P

Message Size Correlation

Configuration Sensitivity

Ring Size (`ring_bits`)

Batch Size

When to Use Each Channel Type

Competitive Benchmarks

Batch Throughput

SPSC Single-Item Throughput

SPSC Latency (ping-pong round-trip)

MPSC Throughput (scaling with contention)

External Comparison Context

Feature-Specific Performance

EventNotifier (eventfd + epoll)

BufferPool (Zero-Copy)

SharedRing (Cross-Process)

Adaptive Consumer Skip

FilesExpand file tree

benchmarks.md

Latest commit

History

benchmarks.md

File metadata and controls

Benchmark Results and Analysis

Hardware

Throughput Results

u32 Messages (4 bytes)

u64 Messages (8 bytes)

Bottleneck Analysis

Scaling Analysis

Near-Linear to 6P

Efficiency Drop at 8P

Message Size Correlation

Configuration Sensitivity

Ring Size (ring_bits)

Batch Size

When to Use Each Channel Type

Competitive Benchmarks

Batch Throughput

SPSC Single-Item Throughput

SPSC Latency (ping-pong round-trip)

MPSC Throughput (scaling with contention)

External Comparison Context

Feature-Specific Performance

EventNotifier (eventfd + epoll)

BufferPool (Zero-Copy)

SharedRing (Cross-Process)

Adaptive Consumer Skip

Ring Size (`ring_bits`)