Date: 2026-03-29
Increase end-to-end speed via Rust hardware modules that sit outside the crawler internals. Do not depend on ngram systems for wins. Bandit is the current SOTA reference while the crawler-only leg is rebuilt.
| Sprint | Modules | Goal | Gate |
|---|---|---|---|
| A | NR-01 + NR-02 | Remove Python data path bottlenecks and overlap H2D transfers | >=10% throughput gain, no metric regression |
| B | NR-03 | Accelerate sliding-window eval infra | >=25% eval wallclock reduction |
| C | NR-04 | Compress/export faster with deterministic pack pipeline | >=2x export speedup, bit-exact roundtrip |
| D | NR-05 | Reduce launch overhead with CUDA graph replay | >=10% train step reduction |
| E | NR-06 | Stabilize topology-level performance | lower p95 step jitter and +3% throughput |
| F | NR-07 | Online parameter tuning | additional >=5% gain over Sprint E |
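
The speed half of each gate above can be checked mechanically from A/B measurements. The following is a minimal sketch, not the project's benchmark harness: the function names and arguments are hypothetical, and it assumes throughput is higher-is-better while eval wallclock, export time, and train step time are lower-is-better. Only Sprints A to D are modelled, since their thresholds are fully stated in the table.

```python
def relative_gain(baseline: float, candidate: float) -> float:
    """Fractional gain of a higher-is-better metric (e.g. tokens/s)."""
    return candidate / baseline - 1.0


def relative_reduction(baseline: float, candidate: float) -> float:
    """Fractional reduction of a lower-is-better metric (e.g. seconds)."""
    return 1.0 - candidate / baseline


def speedup(baseline_s: float, candidate_s: float) -> float:
    """Speedup factor computed from two time measurements."""
    return baseline_s / candidate_s


# Thresholds are copied from the sprint table. Parity and quality checks
# are separate and are not modelled here.
def gate_a(base_tps: float, cand_tps: float) -> bool:
    return relative_gain(base_tps, cand_tps) >= 0.10  # >=10% throughput gain


def gate_b(base_eval_s: float, cand_eval_s: float) -> bool:
    return relative_reduction(base_eval_s, cand_eval_s) >= 0.25  # >=25% eval wallclock reduction


def gate_c(base_export_s: float, cand_export_s: float) -> bool:
    return speedup(base_export_s, cand_export_s) >= 2.0  # >=2x export speedup


def gate_d(base_step_s: float, cand_step_s: float) -> bool:
    return relative_reduction(base_step_s, cand_step_s) >= 0.10  # >=10% train step reduction (assumed to mean step time)
```

For example, `gate_a(1000.0, 1150.0)` passes because the candidate is 15% faster, while `gate_c(10.0, 6.0)` fails because the speedup is below 2x.
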
- Every sprint ships with A/B benchmark evidence.
- No sprint proceeds if parity checks fail.
- Any speed gain that harms baseline quality beyond tolerance is rejected.
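
These three rules can be combined into a single verdict per sprint. The sketch below is one possible encoding under stated assumptions: `ABResult`, `sprint_verdict`, and the verdict strings are hypothetical names, the quality metric is assumed to be lower-is-better (for example a loss), and the tolerance is supplied by the caller because the plan does not state a value.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ABResult:
    parity_ok: bool           # did the parity checks pass?
    speed_gate_ok: bool       # did the A/B benchmark meet the sprint gate?
    baseline_quality: float   # lower is better (assumption)
    candidate_quality: float  # same metric, measured on the candidate


def sprint_verdict(result: ABResult, tolerance: float) -> str:
    """Return 'ship', 'blocked', or 'rejected' for one sprint."""
    if not result.parity_ok:
        return "blocked"   # no sprint proceeds if parity checks fail
    regression = result.candidate_quality - result.baseline_quality
    if regression > tolerance:
        return "rejected"  # speed gain harms quality beyond tolerance
    if not result.speed_gate_ok:
        return "blocked"   # no A/B evidence that the gate was met
    return "ship"
```

Checking parity first keeps a fast but incorrect path from ever reaching the benchmark comparison.
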
Use the `experiments/Crawler_Leg_1/run.sh` profile with:
- `NGRAM_EVAL_ORDER=0`
- `USE_CRAWLER=1`
- `NUM_FLAT_LAYERS=4`
- `NUM_CRAWLER_LAYERS=1`
- `CRAWLER_LOOPS=4`
- `INST_DIM=32`
- `CRAWLER_QUANT_INT8=1`
- `DELTA_NET_HEADS=0`
- `SKIP_EMA=1`
- `SKIP_GPTQ=1`
- fixed seed and wallclock
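
Pinning the profile in one place makes A/B runs easier to reproduce. The sketch below builds the environment for a run from the settings listed above. It is an illustration only: `build_env` is a hypothetical helper, and the `SEED` and `MAX_WALLCLOCK_SECONDS` variable names are assumptions, since the plan says only that seed and wallclock are fixed and does not name the variables `run.sh` reads.

```python
import os

# Settings copied verbatim from the profile list.
PROFILE = {
    "NGRAM_EVAL_ORDER": "0",
    "USE_CRAWLER": "1",
    "NUM_FLAT_LAYERS": "4",
    "NUM_CRAWLER_LAYERS": "1",
    "CRAWLER_LOOPS": "4",
    "INST_DIM": "32",
    "CRAWLER_QUANT_INT8": "1",
    "DELTA_NET_HEADS": "0",
    "SKIP_EMA": "1",
    "SKIP_GPTQ": "1",
}


def build_env(seed: int, max_wallclock_s: int, base=None) -> dict:
    """Return a process environment with the crawler-only profile applied."""
    env = dict(os.environ if base is None else base)
    env.update(PROFILE)
    env["SEED"] = str(seed)                               # assumed variable name
    env["MAX_WALLCLOCK_SECONDS"] = str(max_wallclock_s)   # assumed variable name
    return env
```

A run could then be launched with `subprocess.run(["bash", "experiments/Crawler_Leg_1/run.sh"], env=build_env(seed, max_wallclock_s), check=True)`, using the same seed and wallclock budget for the baseline and the candidate.
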
Execute Sprint A/B/C on crawler-only lane:
- Keep the `nitrust-py` import path optional with strict parity checks.
- Benchmark Rust mmap + pinned batcher on crawler-only ablation grid.
- Add eval/export Rust path tests only after crawler baseline is stable across two seeds.
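
An optional import guarded by a parity check could look like the sketch below. It is a sketch under assumptions rather than the project's loader: the importable module name `nitrust_py` is a guess derived from `nitrust-py`, the helper names are hypothetical, and batches are assumed to be bytes-like so they can be hashed directly.

```python
import hashlib
import importlib


def load_optional(module_name: str):
    """Return the imported module, or None if it is not installed."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return None


def batch_digest(batches) -> str:
    """Hash a sequence of bytes-like batches, including batch boundaries."""
    h = hashlib.sha256()
    for batch in batches:
        data = bytes(batch)
        h.update(len(data).to_bytes(8, "little"))  # length prefix keeps boundaries distinct
        h.update(data)
    return h.hexdigest()


def choose_data_path(reference_batches, rust_batches=None) -> str:
    """Use the Rust path only if its batches match the Python reference exactly."""
    if rust_batches is None:
        return "python"  # optional module absent: fall back silently
    if batch_digest(rust_batches) != batch_digest(reference_batches):
        raise RuntimeError("parity check failed: Rust batches differ from Python reference")
    return "rust"
```

In use, `load_optional("nitrust_py")` returning `None` selects the Python path, while a successful import still has to pass `choose_data_path` on a probe set of batches before the Rust path is benchmarked.
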