Skip to content

Latest commit

 

History

History
262 lines (176 loc) · 16 KB

File metadata and controls

262 lines (176 loc) · 16 KB

Benchmark Results and Analysis

Results from the RingMPSC high-performance benchmark. For measurement methodology, see methodology.md.

Hardware

AMD Ryzen 7 5700 (8C/16T, Zen 3), DDR4-3200 dual-channel, Linux 6.x, Zig 0.15.2 ReleaseFast.

Throughput Results

u32 Messages (4 bytes)

Config Throughput Scaling Efficiency
1P1C 29 B/s 1.0x 100%
2P1C 58 B/s 2.0x 100%
4P1C 112 B/s 3.9x 97%
6P1C 160 B/s 5.5x 92%
8P1C 180 B/s 6.2x 78%

u64 Messages (8 bytes)

Config Throughput Scaling Efficiency
1P1C 15 B/s 1.0x 100%
2P1C 29 B/s 1.9x 97%
4P1C 58 B/s 3.9x 97%
6P1C 85 B/s 5.7x 94%
8P1C 93 B/s 6.2x 78%

500M msgs/producer, 32K batch, 32K ring, @memset fill. Each per-producer ring (128 KB for u32, 256 KB for u64) fits within the Zen 3 512 KB dedicated L2 per core — throughput reflects cache-resident end-to-end message movement through the ring buffer. Efficiency is calculated relative to 1P1C baseline scaled linearly; the drop at 8P reflects shared L3 and memory controller contention.

Bottleneck Analysis

Isolated component measurements (single thread, 100K iterations of 32K batches):

Component Rate % of Producer Time
@memset only (no ring ops) Compiler-eliminated in isolation; dominant cost in context (~16 B/s combined with reserve+commit)
reserve + commit (no fill) ~28,000 B/s <0.1%
reserve + @memset + commit ~16 B/s 99.9%
Consumer drain (head advancement) ~164,000 B/s

Conclusion: The ring coordination overhead (reserve/commit/consume) is negligible. Nearly all time is spent on the producer @memset — writing 256KB (u64) or 128KB (u32) of data per batch.

The channel infrastructure can sustain trillions of messages per second. Throughput is bounded by L2 cache write bandwidth and coherency traffic at these ring sizes, not DRAM bandwidth.

Scaling Analysis

Near-Linear to 6P

From 1P to 6P, scaling efficiency stays above 95%. Each producer has a dedicated SPSC ring with zero contention. The only shared resources are the L3 cache and memory controller, which handle concurrent cache-coherency traffic efficiently at moderate producer counts.

Efficiency Drop at 8P

At 8P, efficiency drops to ~80%. Three factors:

  1. Memory bandwidth saturation: 8 producers × 256KB/batch × ~490K batches/sec = ~960 GB/s theoretical L2 write demand against the ~24 GB/s effective write bandwidth observed at saturation. The memory controller and L3 must serialize an increasing volume of coherency traffic.

  2. L3 cache contention: 8 active rings × 256KB = 2MB of hot data in a 32MB shared L3. While this fits, the increased working set causes more L3 evictions and refills.

  3. Consumer polling overhead: The single consumer thread scans all 8 rings per iteration. At high producer rates, some rings are empty on a given scan (producers are bursty), adding wasted atomic loads.

Message Size Correlation

The 2x difference between u32 and u64 throughput directly reflects the 2x difference in write bandwidth per message:

Type Bytes/Batch 1P Write BW 8P Write BW
u32 128 KB ~3.7 GB/s ~23.7 GB/s
u64 256 KB ~3.8 GB/s ~24.8 GB/s

Both saturate at approximately the same total memory bandwidth (~24 GB/s write), confirming the memory wall.

Configuration Sensitivity

Ring Size (ring_bits)

ring_bits Slots Ring Size (u64) 8P1C Throughput Why
14 16K 128 KB ~94 B/s Ring fills before consumer drains → reserve failures → producer spins
15 32K 256 KB ~93 B/s Sweet spot: fits in 512 KB L2, large enough to absorb bursts
16 64K 512 KB ~82 B/s Ring = L2 size → cache thrashing, evictions during @memset
18 256K 2 MB ~55 B/s Working set in L3 → higher latency per write
20 1M 8 MB ~25 B/s Working set in DRAM → DDR4-bandwidth bound (~24 GB/s)

How to choose ring_bits for your workload:

ring_size = 2^ring_bits × sizeof(T)

Rule of thumb:
  ring_size < L2_cache_size / 2    → best throughput (cache-resident)
  ring_size ≈ L2_cache_size        → cache pressure, ~15% throughput loss
  ring_size > L2_cache_size        → L3/DRAM bound, significant throughput loss

For Zen 3 (512 KB L2):
  u32: ring_bits ≤ 16 (256 KB) fits in L2
  u64: ring_bits ≤ 15 (256 KB) fits in L2

For Intel (1.25 MB L2 on Alder Lake+):
  u64: ring_bits ≤ 17 (1 MB) fits in L2

Batch Size

Batch size determines how many messages are written per reserve/commit cycle. The amortization benefit:

Batch Size Atomic Ops/Msg Overhead Ratio
1 2 (load + store) 100%
32 2/32 = 0.06 6.3%
1K 2/1K = 0.002 0.2%
32K 2/32K = 0.00006 0.006%

Batch size should not exceed ring capacity (clamped by reserve()). Smaller batches increase the ratio of atomic operations to useful work. For maximum throughput, use batch_size = 2^ring_bits (matching ring capacity). For lower latency, use smaller batches to reduce per-batch processing time.

When to Use Each Channel Type

Scenario Channel Config Expected Throughput
Logging pipeline (many writers, one flusher) MPSC ring_bits=15, batch consume 100+ B/s
Audio/video frame passing (1:1) SPSC ring_bits=12-14 10-30 B/s
Work distribution (fan-out) SPMC ring_bits=14 Depends on consumer count
Task queue (M workers) MPMC ring_bits=14, rings_per_shard=2 Depends on topology
Low-latency trading SPSC ring_bits=10, busy_spin wait ~50 M/s single-item, <200ns RTT

Competitive Benchmarks

Apples-to-apples comparison against three baselines on the same hardware, same item count, same item size, same core pinning. The suite measures batch throughput (u32 + u64), MPSC scaling, SPSC single-item throughput, and round-trip latency with percentiles. Generates a self-contained HTML report at benchmarks/results/bench.html.

Baselines:

  • Rigtorp-style Cached SPSC — The gold standard for SPSC: cached head/tail, power-of-2 masking. Matches the algorithm in rigtorp/SPSCQueue and folly::ProducerConsumerQueue. Used in batch throughput comparison.
  • CAS MPSC Queue (Vyukov-style) — Lock-free bounded array with CAS on enqueue. The standard approach used by tokio, crossbeam (bounded), and Linux kernel's llist. Used in single-item MPSC comparison.
  • Mutex + bufferstd.Thread.Mutex protected ring. Answers "why not just use a mutex?" Used in both batch and single-item comparisons.

Batch Throughput

All baselines use the same ring-decomposed architecture (separate ring per producer) to isolate the ring algorithm difference. 32K batch, 32K ring, 100M msgs/producer, @memset fill.

Config ringmpsc Cached SPSC Mutex
1P1C u32 29 B/s 31 B/s 7 B/s
8P1C u32 180 B/s 197 B/s 77 B/s
1P1C u64 15 B/s 16 B/s 7 B/s
8P1C u64 93 B/s 102 B/s 37 B/s

In batch mode with @memset fill, all implementations do the same fundamental operation: write a contiguous block into a cache-resident ring with a single tail store. The Cached SPSC is slightly faster here because its code path is simpler (no channel abstraction, no producer registration, no NUMA support). This is expected — the batch benchmark isolates ring write bandwidth, where ringmpsc's additional features add small overhead. ringmpsc's advantage is the MPSC channel abstraction (dynamic producer registration, NUMA-aware allocation, wait strategies, metrics) and the scaling behavior in contended single-item workloads shown below.

SPSC Single-Item Throughput

The bench suite does not include a dedicated SPSC single-item comparison. The MPSC single-item benchmark (below) at 1P1C effectively measures SPSC throughput: ringmpsc achieves ~54 M/s in this mode.

Why single-item is slower: ringmpsc's send() goes through reserve() (bounds check + cached head lookup) → memcpy into the reserved slice → commit() (tail store + conditional futex wake check). This overhead exists to enable batch operations, SIMD copy, zero-copy reserve/commit, and futex-based blocking — features simpler rings don't have.

Why this doesn't matter in practice: Real workloads batch. In batch mode (bench-high-perf), ringmpsc achieves ~180 B/s — 3,300x faster than 54 M/s single-item — because it amortizes all per-operation overhead across 32K items with a single atomic head update.

SPSC Latency (ping-pong round-trip)

100K iterations after 10K warmup. Sorted samples provide percentile breakdowns.

Implementation p50 p99 Min RTT
ringmpsc SPSC 160 ns 210 ns 110 ns

The bench suite measures ringmpsc SPSC latency only. p99 (211 ns) is tight — the occasional spike comes from cached-head refresh on the slow path.

MPSC Throughput (scaling with contention)

This is where ring decomposition dominates. 5M u32 total, single-item sendOne/consumeAll.

Important architectural note: This comparison is between fundamentally different architectures, not just algorithms. ringmpsc gives each producer a dedicated SPSC ring (zero inter-producer contention by design). The CAS MPSC and Mutex baselines use a single shared queue — which is how traditional MPSC queues work (Vyukov, tokio, crossbeam bounded, etc.). The point of ringmpsc IS the architectural choice to decompose MPSC into N independent SPSC problems. The scaling advantage below is the payoff of that design decision.

Config ringmpsc CAS MPSC Mutex vs Mutex vs CAS
1P1C 54 M/s 115 M/s 24 M/s 2x 0.5x
2P1C 117 M/s 23 M/s 29 M/s 4x 5x
4P1C 237 M/s 20 M/s 10 M/s 24x 12x
8P1C 514 M/s 14 M/s 4 M/s 129x 37x

Why the CAS queue wins at 1P1C: With a single producer, there is no CAS contention — every cmpxchg succeeds on the first attempt. The Vyukov array queue's tight enqueue loop (load-CAS-store) is faster than ringmpsc's register/sendOne path which goes through the MPSC channel's ring lookup. At 1P1C, CAS achieves 115 M/s vs ringmpsc's 54 M/s.

Why the CAS queue collapses at 2P+: Multiple producers compete on the same enqueue_pos via CAS. Each failed CAS wastes a cycle and invalidates other cores' caches. At 8P, the CAS queue spends most time retrying failed operations — throughput drops from 115 M/s (1P) to 14 M/s (8P), an 8x degradation.

Why the mutex collapses: All 8 producers + 1 consumer compete for the same mutex. At 8P, the lock is almost never uncontested. Throughput drops from 24 M/s (1P) to 4 M/s (8P).

Why ringmpsc scales well: Each producer has its own SPSC ring — zero shared state between producers. Adding a producer adds a new ring, not more contention. At 8P, ringmpsc achieves 514 M/s (10x over 1P1C's 54 M/s). Scaling efficiency is 95%+ up to 6 producers, dropping to ~80% at 8P due to memory bandwidth saturation and L3 contention — not lock contention.

Run with: zig build bench. Generates an HTML report at benchmarks/results/bench.html with inline SVG charts.

Note: These numbers use single-item sends (no batching). The batch throughput benchmark (bench-high-perf) shows ~180 B/s (u32) / ~93 B/s (u64) when using 32K batch sizes.

External Comparison Context

Direct comparison with other implementations is difficult due to different hardware, message sizes, and measurement methodologies. These numbers are from other projects' own benchmarks and are included for rough context only — not as direct comparisons.

Implementation Language Reported Throughput Notes
RingMPSC (batch) Zig 180 B/s (u32), 93 B/s (u64) This work. 8P1C, @memset fill, 32K batch.
RingMPSC (single) Zig 416 M/s (u32, MPSC), 680 M/s (SPSC) This work. 8P1C MPSC / 1P1C SPSC.
LMAX Disruptor Java ~100M ops/s 3-stage pipeline, JVM
rigtorp/SPSCQueue C++ ~363 M/s single, ~1 B/s batch SPSC only. Ryzen 9 3900X.
crossbeam-channel Rust ~100M msg/s MPSC, general-purpose
moodycamel C++ ~500M msg/s MPMC, general-purpose

RingMPSC's batch numbers (~180 B/s) reflect cache-resident throughput with large batches — a fundamentally different measurement than single-item throughput. The single-item numbers are more directly comparable to crossbeam/moodycamel methodology but measure MPSC scaling, not SPSC peak. We encourage readers to run their own benchmarks on their own hardware.

Feature-Specific Performance

EventNotifier (eventfd + epoll)

Metric Value Notes
signal() overhead ~1us One write() syscall per call
consume() overhead ~1us One read() syscall
Coalescing Automatic Multiple signals between wakes accumulate in u64 counter
Wake latency (p50) 6us vs 110ns spin-wait (measured, bench-wake-latency)
Wake latency (p99) 39us vs 982ns spin-wait
Impact on batch throughput <0.1% One signal per commit() of 32K items = negligible

When to use: Consumer is an event loop multiplexing ring data with socket I/O. For dedicated consumer threads, futex wake is faster (zero syscall when no waiters).

BufferPool (Zero-Copy)

Message Size Inline Ring(T) BufferPool + Ring(Handle) Speedup
64 bytes 28 M/s 9 M/s 0.3x (CAS overhead dominates)
256 bytes 29 M/s 9 M/s 0.3x
1 KB 21 M/s 8 M/s 0.4x
4 KB 5 M/s 7 M/s 1.5x (zero-copy wins)

Measured: bench-zero-copy, ReleaseFast, 5M messages, single-item send, Ryzen 7 5700.

BufferPool alloc/free cost: ~10-50ns per CAS operation. The CAS overhead is fixed regardless of message size, while the inline copy cost scales linearly with size. The crossover point is ~4KB on this hardware — below that, inline copy is faster because the copy cost is less than the CAS overhead.

When to use: Messages ≥ 4KB where copy cost exceeds CAS overhead. For smaller messages, use inline Ring(T).

SharedRing (Cross-Process)

Metric SharedRing In-Process Ring Overhead
Throughput (u64, batch, 1P1C) 7,441 M/s 7,796 M/s 4.6%
Memory copies per message 0 (shared memory) 0 (same process) 0
Syscalls per message 0 (data path) 0 0
Cross-process futex wake ~50-100ns extra 0 per wake only

Measured: bench-cross-process, ReleaseFast, 50M messages, batch 4096, Ryzen 7 5700.

SharedRing throughput is within 5% of in-process Ring because the data path is identical — both use acquire/release atomics on cache-coherent memory. The 4.6% overhead comes from the heartbeat timestamp store on each commit() and the slightly larger working set (extern struct header vs compact struct).

Setup overhead: memfd_create + ftruncate + mmap (~50us total). sendFd/recvFd via SCM_RIGHTS (~10us). This is a one-time cost at channel establishment.

Adaptive Consumer Skip

Producers Active Without Skip With Skip Improvement
8 8 8 polls/round 8 polls/round 0% (all active)
8 4 8 polls/round ~4 polls/round 50% fewer polls
8 1 8 polls/round ~1 poll/round 87% fewer polls
32 2 32 polls/round ~2 polls/round 94% fewer polls

Skip counters double on consecutive empty polls (1, 2, 4, 8, 16, 32, 64 max). A ring that becomes active resets to 0 immediately. Worst-case detection latency for a ring going from empty to non-empty: 64 consumer poll intervals.