154 commits
4ce0d59
X-WING 3D Cubric: 0.4820 BPB (3-seed mean, std 0.0002)
Mar 26, 2026
6c49da3
B-wing lab: port PR #809 n-gram techniques onto X-WING base
Mar 26, 2026
bee0716
B-wing II: cubric ON + entropy shift + fast TTT
Mar 26, 2026
d6d281a
B-wing III: LoRA TTT from #809 + cubric ON + all n-gram fixes
Mar 26, 2026
137432f
Record bwing_full_port seed 1337: 0.4512 BPB
Mar 26, 2026
94bb107
Replace bwing_III with copy of SOTA bwing_full_port (0.4512 BPB)
Mar 26, 2026
2c0c0ee
B-wing IV + V: fix 7→9 hash primes (order 8-9 collision bug)
Mar 26, 2026
3ebaf38
Add B-wing pod setup script (FA3 + zstandard + sp1024)
Mar 26, 2026
5a21365
Add n-gram parameter grid sweep for bwing_V
Mar 26, 2026
75dbe40
A-Wing Green: INT5 GPTQ (clip_range=15) + 9-prime hash fix
Mar 26, 2026
22eae2a
A-Wing Green: strip TTT, cubric, F1 correction, distillation
Mar 26, 2026
d6cb709
Record results: A-Wing Green 0.4576, bwing_V 0.4601
Mar 26, 2026
c37a8ab
A-Wing Green_1: Oracle Alpha — use model_p vs ngram_p directly
Mar 26, 2026
08d6b7c
Green_1: cap training at 570s to fit GPTQ in 600s budget
Mar 26, 2026
d8b6022
Green_1: add preflight checks (zstd, FA3) + zstd import warning
Mar 26, 2026
b1d45b8
A-Wing Green_2: Oracle Alpha + LoRA TTT + 9-Prime
Mar 26, 2026
88ec4ca
Fix pod setup: use system Python, no conda/PYTHONPATH hacks
Mar 26, 2026
5876cf5
NEW SOTA 0.3200 BPB: A-Wing Green_1 Oracle Alpha + 9-Prime
Mar 26, 2026
da832ba
A-Wing Purple: Learned Mixer Head for legal n-gram ceiling
Mar 26, 2026
2b38218
Add pod_launch.sh: one command for clone + setup + run
Mar 26, 2026
a37d7c3
Fix pod_launch.sh: pull from private repo (fork1), not public
Mar 26, 2026
6004ac7
Purple: reduce prefill to 20 shards (~2B tokens), restore 570s cap
Mar 26, 2026
230dfc6
Clean up repo: single pod_setup.sh, archive stale dirs
Mar 26, 2026
db300a0
Fix pod_setup.sh: workspace path is /workspace/parameter-golf
Mar 26, 2026
2a92a77
F-Wing: Frugendorff + X-WING N-gram combined concept
Mar 26, 2026
473a4b7
Fix REPO_DIR depth in F_Wing run scripts (3 levels up, not 2)
Mar 26, 2026
5e8ec28
Add A-wing RED mixer variant with bounded distributed prefill
Mar 26, 2026
4a06a37
Add A-wing RED_G GPU monster mixer path and tune RED
Mar 26, 2026
3cedb3f
Fix DDP warmup by including mixer supervision in RED variants
Mar 26, 2026
005cdc5
records: add A-WING RED_G seed1337 run summary
Mar 26, 2026
4a4be33
F-Wing: rebase train_gpt.py onto A_wing/RED (add CrawlerGPT + mixer s…
Mar 26, 2026
f09a6e5
RED_G: fix ngram blend-mode conflicts and wire order-aware eval controls
Mar 26, 2026
abe72f0
F-Wing: fix CrawlerGPT torch.compile compatibility
Mar 26, 2026
a76dda4
Add A-Wing green_3: width bump to model_dim=640
Mar 26, 2026
5e27afc
Add A-Wing green_1A: legal alpha + PR#609 improvements
Mar 27, 2026
aa0a156
Optimize green_1A selective pruning: fast zstd-1 for binary search
Mar 27, 2026
411dea1
Add Cobra base-quality 10min harness plan and tooling
Mar 27, 2026
3b4b821
Add pod_setup_cobra bootstrap script
Mar 27, 2026
90741b4
Rat Rod Green: Parallel Muon base + GPTQ stripped for pure base model…
Mar 27, 2026
e32f32b
Rat Rod Green v2: kill late QAT + enable trigram
Mar 27, 2026
ec7ab9f
Rat Rod v3: MTP_NUM_HEADS=2, revert trigram (v2 was a wash)
Mar 27, 2026
05d3990
Rat Rod v3: MTP_NUM_HEADS=2 experiment (separate dir, green untouched)
Mar 27, 2026
fd4fb31
A/B test: ROPE_DIMS 16 vs 24 (200s quick runs)
Mar 27, 2026
4d58515
Log A/B test result: ROPE_DIMS=16 control @ 200s
Mar 27, 2026
2479ced
Rat Rod v4: HS-MTP — hash-space multi-token prediction
Mar 27, 2026
aefe581
Rat Rod v4: add CPU n-gram bridge for HS-MTP weighting
Mar 27, 2026
fc73010
Fix: set _hsmtp_w during warmup phase (torch.compile NoneType crash)
Mar 27, 2026
5a6a771
Document Synapse system (HS-MTP + CPU N-gram Bridge) in PROGRESS.md
Mar 27, 2026
b14ef45
A/B test: VALUE_RESIDUAL 0 vs 1 (200s quick runs)
Mar 27, 2026
b7e9a07
Rat Rod v5: Synapse v2 — GPU-native hash bridge (<1ms/step)
Mar 27, 2026
e11298f
Freeze SOTA: A_WING_GREEN base 1.1129 ngram 0.4489 (2026-03-27)
Mar 27, 2026
7d2b520
Add SOTA folder README: only add, never delete or modify
Mar 27, 2026
8ec6a61
Fix: remove step counter that breaks torch.compile fullgraph
Mar 27, 2026
e3ae59a
Log v4/v5 Synapse + VALUE_RESIDUAL results — all dead
Mar 27, 2026
6ae55ed
Rat_Rod_Purple_1: training oracle + Dirichlet mixing + matrix_lr=0.03
Mar 27, 2026
c48c060
Purple_1: disable training oracle by default (legally gray)
Mar 27, 2026
f8caa0c
Rat Rod: add zero-cost H100 sweeps and robust trainer toggles
Mar 27, 2026
9e826d9
Purple_1: phrase cache + regime tracker + warmdown=2000 + chunk=65K
Mar 27, 2026
c185a8d
Add Siphon: ensemble-objective training + WARMDOWN2000 SOTA entry
Mar 27, 2026
63c27e1
FX-Wing: Instructed Recurrence — content-derived loop instructions fo…
Mar 27, 2026
4ab4ced
FX-Wing: add hypothesis and ablation plan
Mar 27, 2026
7a81eec
Reorganize: move master runner to experiments/Biology_concepts/run_al…
Mar 27, 2026
95e9333
Add green v6 (optimized SOTA): v1 + WARMDOWN_ITERS=2000
Mar 27, 2026
5268082
Add Biology Concepts sweep findings — tornado vs baseline analysis
Mar 27, 2026
516e2c8
Add green v7: v6 + COMPLEMENT_ALPHA=0.5
Mar 27, 2026
9a58d14
FX-Wing: fix compile — COMPILE_FULLGRAPH=0 for crawler loop
Mar 27, 2026
15c66fc
FX-Wing: CRAWLER_LOOPS=4 — exploit weight-sharing compression
Mar 27, 2026
909901e
Log v7 results: COMPLEMENT_ALPHA=0.5 worse than v1
Mar 27, 2026
812599d
FX-Wing: CRAWLER_QUANT_INT8 — int8 precision for shared crawler block
Mar 27, 2026
c641e5e
Add vast_fxwing_single.sh — single GPU FX-Wing launcher for Vast.ai
Mar 27, 2026
ce5e317
Add Cambrian: DeltaNet × Biology Concepts architecture
Mar 27, 2026
38479b9
Cambrian-0: GatedDeltaNet × Bio Seam architecture skeleton
Mar 27, 2026
2df9c72
Fix bio concept scripts: make MAX_WALLCLOCK_SECONDS env-overridable
Mar 27, 2026
8b93705
Cambrian-1: Add four bio seam controllers (Myelin, Circadian, Clonal,…
Mar 27, 2026
b0776f1
FX-Wing micro: device-flexible concept test for GB10 Blackwell DGX Spark
Mar 27, 2026
da80af1
FX-Wing: add DeltaNet associative memory to crawler reservoir
Mar 27, 2026
fa21139
FX-Wing micro: add -u flag for unbuffered stdout through tee pipe
Mar 27, 2026
531f98f
vast: blacklist offer 33510639 (103.42.50.244 — SSH never connects)
Mar 27, 2026
ff7069b
FX-Wing DeltaNet: disable compile on forward to prevent T-loop OOM
Mar 27, 2026
36845e3
FX-Wing run.sh: DELTA_NET_HEADS=0 for core concept test
Mar 27, 2026
fa4c218
FX-Wing: suppress inductor NaN in RoPE bounds analysis (PyTorch 2.4 bug)
Mar 27, 2026
b4968be
Cambrian: disable torch.compile on GatedDeltaNet.forward
Mar 27, 2026
f74175d
Cambrian run.sh: set COMPILE_FULLGRAPH=0
Mar 27, 2026
cea3b8b
Fix astrocyte gate shape bug: view(B,1,1) not unsqueeze(1).unsqueeze(2)
Mar 27, 2026
5fac1c8
GreenRod X_1: Hybrid DeltaNet + Attention engine
Mar 27, 2026
b55a421
Cambrian: forward PYTORCH_CUDA_ALLOC_CONF to torchrun (expandable_seg…
Mar 27, 2026
15714f9
Cambrian: remove @torch.compiler.disable from GDN.forward
Mar 27, 2026
24dd550
FX_Wing_Delta: flow instructions + DeltaNet + hypothesis
Mar 27, 2026
0b2164d
Cambrian: restore @torch.compiler.disable, default wallclock 600s
Mar 27, 2026
9c34b42
FX_Wing_Sigma: n-gram entropy as smoothing reference hypothesis
Mar 27, 2026
0c623c7
Add Cambrian bio seam sweep script
Mar 27, 2026
96bc2b4
FX_Wing_Delta: disable DeltaNet for flow-only test, add inductor patch
Mar 27, 2026
3adddb0
FX_Wing_Delta_DN: DeltaNet with gradient checkpointing + truncated BPTT
Mar 27, 2026
7b5e09c
Fix Cambrian bio sweep hang: SKIP_FINAL_EVAL=1 + process cleanup
Mar 27, 2026
c7ffeec
Deprecate FX_Wing* experiments; add FA_Wing_Green_1 gitignore
Mar 27, 2026
c9600c7
Add Cambrian agent instructions for Vast.ai sweep
Mar 27, 2026
03f9838
Add FA_Wing_GreenDN_1 (flow instructions + DeltaNet); gitignore both …
Mar 27, 2026
7c197c7
Add FA_Wing_Green_1 and FA_Wing_GreenDN_1 experiment code
Mar 27, 2026
0a89f4a
Fix REPO_ROOT depth in FA_Wing run.sh files (../.. not ../../..)
Mar 27, 2026
8037fce
Fix DDP unused-params crash: disable VE in FA_Wing crawler runs
Mar 27, 2026
3651d35
Add ClownCar experiment; restore FX_Wing_Delta from deprecated
Mar 27, 2026
f2a4f5f
ClownCar: disable ngram eval — sliding window baseline only
Mar 27, 2026
5ae2be5
Add ClownCar_II: canonical FLA DeltaNet + Crawler symbiotic pairing
Mar 27, 2026
ba4a2a7
Fix ClownCar/II run.sh: add missing crawler flags (USE_CRAWLER=1 etc.)
Mar 27, 2026
87ad173
ClownCar_II: add FLA ops preflight check to confirm canonical kernel …
Mar 28, 2026
e3ba281
Fix ClownCar_II: cast q/k/v/beta to x.dtype before chunk_delta_rule
Mar 28, 2026
c0cf2ac
Add ClownCar_IV: GPTQ bypass + state dtype fix
Mar 28, 2026
5d9e0b2
Fix ClownCar_IV: revert state dtype cast — only change is SKIP_GPTQ=1
Mar 28, 2026
a7d53c8
ClownCar_IV: SKIP_GPTQ only — restored from known-good e3ba281
Mar 28, 2026
baceb10
ClownCar_IV: reset to ClownCar_II base + EMA_DECAY=0.99
Mar 28, 2026
e587c91
ClownCar_IV: remove GPTQ, use naive int6
Mar 28, 2026
c262086
Add ClownCar_VI and Medusa: skip EMA + naive int6
Mar 28, 2026
07a57bf
pod_setup: add fla + attr install for DeltaNet
Mar 28, 2026
d9db34d
Add ClownCar_VII: loop-aware 2-phase GPTQ + no EMA
Mar 28, 2026
cc06d3b
Medusa: sync to ClownCar_VII (loop-aware GPTQ + no EMA)
Mar 28, 2026
ebc4b84
Add Medusa_II: late-start EMA (step 4400) + loop-aware GPTQ
Mar 28, 2026
4aa704b
Medusa_II: add short exit-only unravel A/B harness
Mar 28, 2026
9d1be62
Add Medusa_IV: copy of Medusa_III (winning 1.0366 config)
Mar 28, 2026
4b1c51c
Medusa_II: force finish-only A/B and add one-command launcher
Mar 28, 2026
d2f47e2
Add Medusa_V: fix state dtype cast (new_state.to(dtype))
Mar 28, 2026
0c38323
Medusa_II: add additional-only unravel check runner
Mar 28, 2026
d74538f
Add Medusa_V_SOTAMAXX: frozen SOTA config snapshot
Mar 28, 2026
9fa4fec
Add Medusa_VI: DeltaNet projections → CastedLinear for QAT coverage
Mar 28, 2026
0ce12a6
Records: fill Medusa_IV known results (seeds 300, 1337)
Mar 28, 2026
a4a5447
Records: Medusa Unstable README with known results
Mar 28, 2026
5f731b3
Records: Medusa_IV 3-seed complete — seed 42=0.8104 BPB (best), mean=…
Mar 28, 2026
79f45ae
Add Medusa_Legal_unstable: fix GPTQ training-data access after wallcl…
Mar 28, 2026
556b2fc
Medusa_VII: causality fix + shard header fix + DeltaNet ablation
Mar 29, 2026
3e09695
Medusa_VII: add ablation results
Mar 29, 2026
f74b9c9
Bandit: ClownCar crawler + X-WING ngram oracle
Mar 29, 2026
3a75282
Bandit: fix GPTQ wallclock violation (GPTQ_RESERVE_MS=30s)
Mar 29, 2026
4efa746
Bandit: ClownCar Crawler x Cubric Ngram9 — 0.4961 BPB (3-seed mean)
Mar 29, 2026
e6d11d8
Log JR-03 fused MLP result as loser (with Triton-node caveat)
Mar 30, 2026
1a8501a
Crawler_Leg_1: add run_all.sh sequencer for all 11 ablation arms
Mar 30, 2026
946f0a7
Rascal II: skip GPTQ + embed int6 — full 600s, target <16MB
Mar 30, 2026
f1ce7c9
SOTA: Rascal II — new best legal submission 1.10986874 BPB, 15.44MB
Mar 30, 2026
39ed402
Record: Rascal — val_bpb 1.1099 (3-seed mean)
Mar 30, 2026
1d48f9c
Add FX_Wing_Delta_safe: byte-identical backup of FX_Wing_Delta
Mar 30, 2026
9a15ace
Add ChopShop (cleaned Rascal base) and Rascal_Stripper smoke test
Mar 30, 2026
964fd8b
Add all research data: experiments, records, scripts, Nitrust, octavian
Mar 30, 2026
7de8402
Crawler_Leg_2: 5-arm sweep combining loops=3 + mlp=5.0 wins
Mar 30, 2026
da7a6b0
Bandit_Wagon: remove NGRAM code, apply optimal CL1 config
Mar 30, 2026
d2d1ecd
Rascal_Stripper: implement TurboMuon + EngramLite + TTT + CROWN-Q
Mar 30, 2026
2d7022b
Crawler_Leg_2: set wallclock to 350s (~4k steps on 8xH100)
Mar 30, 2026
b39f23c
Bandit_Wagon: rewrite HYPOTHESIS.md for pure neural crawler campaign
Mar 30, 2026
4f37849
Rascal_Stripper: fix CROWN-Q variable name collision (scale → q_scale)
Mar 30, 2026
4603c48
Rascal_Stripper: bump smoke test to 3200 steps (warmdown 800)
Mar 30, 2026
206434c
CL2 results: 1.19593 BPB — loops=3+mlp=5.0+LOOP_AWARE_GPTQ+COMPILE wi…
Mar 30, 2026
dd9f4fd
Crawler_Leg_3: full 600s run, loops=3 mlp=6.0, Rascal warmdown style
Mar 30, 2026
0e2286f
Rascal_Stripper: add ttt_calibrate.py — standalone TTT hyperparameter…
Mar 30, 2026
9de1f3b
Crawler_Leg_3: multi-seed script + submission skeleton
Mar 30, 2026
411970f
Rascal_Stripper: add ttt_sweep.sh — 3-config TTT calibration runner
Mar 30, 2026
8b17867
Rascal III: TurboMuon + EngramLite combo runner (600s production)
Mar 30, 2026
1194948
Crawler submission: 3-seed complete, 1.1874 BPB mean
Mar 30, 2026
3 changes: 2 additions & 1 deletion .gitignore
@@ -8,4 +8,5 @@ data/manifest.json
 data/docs_selected.jsonl
 .mypy_cache/
 .venv
-logs/
+logs/
+experiments/archive/checkpoints/
46 changes: 46 additions & 0 deletions Nitrust/CLOWNCAR_IV_TARGET.md
@@ -0,0 +1,46 @@
# Nitrust Target Lock — ClownCar_IV (Superseded)
Date: 2026-03-27

This target note is superseded by `Nitrust/MEDUSA_TARGET.md`.

## Baseline Contract
Optimization target is `experiments/ClownCar_IV`.

Baseline runtime knobs (from `run.sh`):
- `NGRAM_EVAL_ORDER=0`
- `USE_CRAWLER=1`
- `NUM_FLAT_LAYERS=4`
- `NUM_CRAWLER_LAYERS=1`
- `CRAWLER_LOOPS=4`
- `INST_DIM=32`
- `CRAWLER_QUANT_INT8=1`
- `DELTA_NET_HEADS=4`
- `SKIP_GPTQ=1`

## Confirmed Code Seams (train_gpt.py)
- Data shard parse/load: `load_data_shard` and `TokenStream`/`DistributedTokenLoader`.
- Training hot path: `DistributedTokenLoader.next_batch`.
- Eval hot path: `eval_val_sliding` window assembly loop.
- Export hot path: int6 pack/compress path near final export.

## Nitrust Compatibility Note
ClownCar shard files are not raw `u16`; they contain a 256x`i32` header:
- `magic=20240520`
- `version=1`
- `num_tokens` in header slot 2

`nitrust-mmap-loader` has been updated to support this format with strict size checks,
while still supporting raw `u16` legacy files.

## Complexity-Ordered Knockout Plan (NGRAM-Free)
1. NIT-A1: Swap shard reads to Rust mmap path (no model math changes).
2. NIT-A2: Rust batch assembly + pinned host buffer handoff.
3. NIT-B1: Rust sliding-window index builder for eval path.
4. NIT-C1: Rust quant/export pack pipeline.
5. NIT-D1: CUDA graph replay wrapper for fixed-shape training steps.

## Success Gates
- Primary: lower `step_avg` and higher tokens/sec at equal wallclock.
- Quality: `final_int6_roundtrip_exact` and `final_int6_sliding_window_exact`
within tolerance (`val_bpb` delta <= +0.01 unless explicitly traded for speed).
- Reproducibility: deterministic run logs with fixed seed.
43 changes: 43 additions & 0 deletions Nitrust/COMMANDER_ORDERS.md
@@ -0,0 +1,43 @@
# Nitrust Commander Orders — Crawler-Only Sprint Sequence
Date: 2026-03-29

## Command Intent
Increase end-to-end speed via Rust hardware modules outside crawler internals.
Do not depend on ngram systems for wins.
Bandit is the current SOTA reference while the crawler-only leg is rebuilt.

## Sprint Queue (In Order)

| Sprint | Modules | Goal | Gate |
|---|---|---|---|
| A | NR-01 + NR-02 | Remove Python data path bottlenecks and overlap H2D transfers | >=10% throughput gain, no metric regression |
| B | NR-03 | Accelerate sliding-window eval infra | >=25% eval wallclock reduction |
| C | NR-04 | Compress/export faster with deterministic pack pipeline | >=2x export speedup, bit-exact roundtrip |
| D | NR-05 | Reduce launch overhead with CUDA graph replay | >=10% train step reduction |
| E | NR-06 | Stabilize topology-level performance | lower p95 step jitter and +3% throughput |
| F | NR-07 | Online parameter tuning | additional >=5% gain over Sprint E |

## Non-Negotiables
1. Every sprint ships with A/B benchmark evidence.
2. No sprint proceeds if parity checks fail.
3. Any speed gain that harms baseline quality beyond tolerance is rejected.

## Benchmark Baseline Spec
Use `experiments/Crawler_Leg_1/run.sh` profile with:
- `NGRAM_EVAL_ORDER=0`
- `USE_CRAWLER=1`
- `NUM_FLAT_LAYERS=4`
- `NUM_CRAWLER_LAYERS=1`
- `CRAWLER_LOOPS=4`
- `INST_DIM=32`
- `CRAWLER_QUANT_INT8=1`
- `DELTA_NET_HEADS=0`
- `SKIP_EMA=1`
- `SKIP_GPTQ=1`
- fixed seed and wallclock

## Immediate Next Action
Execute Sprints A, B, and C on the crawler-only lane:
1. Keep `nitrust-py` import path optional with strict parity checks (a minimal sketch follows this list).
2. Benchmark Rust mmap + pinned batcher on crawler-only ablation grid.
3. Add eval/export Rust path tests only after crawler baseline is stable across two seeds.
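
To make the parity requirement in item 1 concrete, a minimal sketch follows. The loader objects and the `next_batch(idx)` signature are illustrative stand-ins, not the real `nitrust-py` API; batches are assumed to be numpy arrays.

```python
import os
import numpy as np

def next_batch_checked(py_loader, rust_loader, idx):
    """Run the Python reference path beside the optional Rust path and
    trust the Rust batch only if it is bit-exact with the reference."""
    ref = py_loader.next_batch(idx)      # Python reference data path
    if rust_loader is None:              # nitrust-py import stays optional
        return ref
    cand = rust_loader.next_batch(idx)
    if np.array_equal(ref, cand):
        return cand
    if os.environ.get("NITRUST_STRICT", "0") == "1":
        raise RuntimeError(f"nitrust parity mismatch at batch {idx}")
    return ref                           # non-strict mode falls back silently
```

Checking every batch doubles loader work, so in practice this gate would run on a sampled subset before switching the run to the Rust path alone.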
70 changes: 70 additions & 0 deletions Nitrust/CRAWLER_DELTA_BOOSTER_MATRIX.md
@@ -0,0 +1,70 @@
# Nitrust Crawler/Delta Booster Matrix
Date: 2026-03-27
Target: `experiments/Medusa` (Crawler + DeltaNet)
Scope: architecture/external-system boosters, plus Rust integration ablations
Guardrail: NGRAM disabled for all signal tests (`NGRAM_EVAL_ORDER=0`)
Active lane: crawler-only (`DELTA_NET_HEADS=0`) until DeltaNet sandbox re-validation succeeds

Update (2026-03-29):
- DeltaNet is quarantined from the main crawler run path pending re-validation.
- Bandit is treated as the current SOTA reference while the crawler-only leg is rebuilt.

## Master Hypothesis Table

| ID | Area | Booster Hypothesis | Primary Knobs | Expected Win | Risk | Smoke Ready |
|---|---|---|---|---|---|---|
| CDB-01 | Quant Bridge | Loop-aware GPTQ (flat first, crawler second) beats one-shot GPTQ. | `LOOP_AWARE_GPTQ` | Better int6 roundtrip BPB | Calibration cost | Yes |
| CDB-02 | Quant Bridge | Keep crawler tensors int8 while flat stays int6. | `CRAWLER_QUANT_INT8` + export policy | Less loop error compounding | Size creep | Yes |
| CDB-03 | Quant Bridge | Per-loop crawler dequant scales reduce distribution drift. | per-loop scale banks | Better loop stability | Metadata size | No |
| CDB-04 | Quant Bridge | Skip GPTQ for flat, GPTQ only crawler+delta. | selective GPTQ groups | Faster quant + similar BPB | Flat quality drop | No |
| CDB-05 | Delta Core | Delta head count has a sweet spot (under/over hurts). | `DELTA_NET_HEADS` sweep | Better quality/compute | Runtime cost | Yes |
| CDB-06 | Delta Core | Delta state precision policy impacts stability. | bf16/fp16/fp32 state | Fewer drift errors | Throughput hit | No |
| CDB-07 | Delta Core | Delta residual gate controls over-write chaos across loops. | residual gate scalar/schedule | Better convergence | Under-updating | No |
| CDB-08 | Delta Core | Delta state norm clipping prevents runaway memory (see sketch below the table). | clip threshold | Robustness | Lost signal | No |
| CDB-09 | Delta Core | Periodic delta state reset improves long-run conditioning. | reset cadence | More stable training | Loses long memory | No |
| CDB-10 | Delta Core | Head-dim tensor-core alignment boosts Delta throughput. | aligned dims / head_dim | Faster kernels | Architecture constraints | No |
| CDB-11 | Crawler Loop | Instruction bottleneck size has optimal range. | `INST_DIM` sweep | Better loop routing | Under/overfit | Yes |
| CDB-12 | Crawler Loop | Loop-specific low-rank adapters beat fully shared core. | loop LoRA rank | BPB gain at small bytes | Params grow | No |
| CDB-13 | Crawler Loop | Split sharing (shared attn, modulated MLP) improves regime handling. | attn shared + MLP gates | BPB gain | Complexity | No |
| CDB-14 | Crawler Loop | Last loop partial unsharing captures final-pass specialization. | unshare depth=1 | BPB gain with low byte cost | Param creep | No |
| CDB-15 | Crawler Loop | Dual-rate loops (heavy every 2nd loop) improve quality/compute. | heavy cadence | Better speed-quality frontier | Scheduler bugs | No |
| CDB-16 | Crawler Loop | Adaptive loop count by confidence reduces wasted compute. | short/long bucket policy | Throughput gain | Control overhead | No |
| CDB-17 | Crawler Loop | Loop state carry with explicit damping improves fixed-point stability. | carry decay | Better convergence | Slower adaptation | No |
| CDB-18 | Crawler Loop | Loop dropout/stochastic depth improves shared-block generalization. | loop drop prob | Better robustness | Instability | No |
| CDB-19 | Crawler Topology | Memory tokens across loops add persistent workspace. | memory token count | Better long context | Extra compute | No |
| CDB-20 | Crawler Topology | Latent funnel recurrence (T->T/2 core) is superior at equal bytes. | funnel ratio | Speed or BPB gain | Complexity | No |
| CDB-21 | Crawler Topology | Encoder/decoder depth rebalance improves compression frontier. | flat/crawler split | Better byte-efficiency | tuning overhead | Yes |
| CDB-22 | Crawler Topology | Add tiny per-loop channel gates for activation alignment. | gate width | Better loop reuse | Small extra params | No |
| CDB-23 | Rust Data Path | Rust mmap shard reader reduces loader stalls. | `NITRUST_ENABLE` | Step-time drop | bridge overhead | Yes |
| CDB-24 | Rust Data Path | Strict mode catches silent Rust-path regressions early. | `NITRUST_STRICT` | Safer ops | hard fail risk | Yes |
| CDB-25 | Rust Data Path | Pinned host batcher improves H2D overlap. | prefetch depth, pinned on/off | Throughput gain | Memory pressure | Partial |
| CDB-26 | Rust Eval | Rust sliding-window index engine slashes eval wallclock. | window engine on/off | Faster eval | parity bugs | No |
| CDB-27 | Rust Export | Rust quant pack pipeline accelerates `.ptz` creation. | quantpack on/off | Faster export | bit-exact risk | No |
| CDB-28 | Runtime | CUDA graph replay cuts launch overhead on static smoke shapes. | graph on/off | Step-time drop | graph fragility | No |
| CDB-29 | Runtime | NUMA/affinity pinning lowers p95 jitter on multi-GPU hosts. | affinity profile | Stability gain | host variance | No |
| CDB-30 | Runtime | Online autotune for batch/prefetch finds hidden headroom. | autotune budget | extra throughput | tune noise | No |
| CDB-31 | Scheduling | Warmdown/EMA/GPTQ ordering matters for final int6 quality. | `SKIP_EMA`, warmdown, GPTQ mode | Better end BPB | confounding effects | Yes |
| CDB-32 | Scheduling | Distill-after-loop-aware-GPTQ may recover quantization loss. | distill flags + GPTQ mode | Better final BPB | extra time | No |
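
As one concrete instance from the table, a minimal CDB-08 sketch is below. The state shape `(batch, heads, d_k, d_v)` and the default threshold are assumptions for illustration, not tuned values; rescaling (rather than hard truncation) keeps the state direction intact while bounding its magnitude.

```python
import torch

def clip_state_norm(state: torch.Tensor, max_norm: float = 10.0) -> torch.Tensor:
    """CDB-08 sketch: rescale the DeltaNet recurrent state whenever its
    per-head Frobenius norm exceeds max_norm, so the associative memory
    cannot grow without bound across crawler loops."""
    # norm over the trailing (d_k, d_v) state matrix, one value per head
    norm = state.flatten(start_dim=-2).norm(dim=-1, keepdim=True).unsqueeze(-1)
    scale = (max_norm / (norm + 1e-8)).clamp(max=1.0)
    return state * scale
```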

## Spark Smoke Queue (v0)

| Run ID | Ablation | Delta from baseline | Status |
|---|---|---|---|
| SMK-00 | Baseline smoke | Medusa smoke config, `NITRUST_ENABLE=0` | Completed: roundtrip `6.02582801`, sliding `5.97225220` |
| SMK-01 | Rust loader ON | `NITRUST_ENABLE=1`, `NITRUST_STRICT=1` | Completed: roundtrip `6.02584613`, sliding `5.97228266` |
| SMK-02 | Delta heads OFF | `DELTA_NET_HEADS=0` + Rust ON | Completed: roundtrip `4.91216360`, sliding `4.90379569` |
| SMK-03 | Crawler int8 OFF | `CRAWLER_QUANT_INT8=0` + Rust ON | Completed: roundtrip `6.02587901`, sliding `5.97224063` |
| SMK-04 | Instruction OFF | `INST_DIM=0` + Rust ON | Completed: roundtrip `6.00549835`, sliding `5.95337039` |

## Smoke Config Contract
- Tiny dataset clone in `/tmp/nitrust_smoke_data` (header-compatible shards)
- Single Spark GPU smoke (`NPROC=1` style run)
- `VAL_LOSS_EVERY=0` to avoid known step-0 eval/autograd conflict during smoke
- Early-stop via wallclock cap + tiny iteration budget

## Initial Spark Readout
- Rust loader ON (`SMK-01`) is numerically neutral vs baseline in smoke (difference in the 1e-5 range on BPB).
- `CRAWLER_QUANT_INT8=0` (`SMK-03`) is also neutral in this tiny smoke setup.
- `INST_DIM=0` (`SMK-04`) slightly improved smoke BPB, but this is low-confidence at smoke scale.
- `DELTA_NET_HEADS=0` (`SMK-02`) changed the task dynamics substantially and ran much faster; treat as topology sanity check, not a like-for-like quality verdict.
- Artifact logs/summary captured at `results/nitrust_spark_smoke_20260327_234343/`.
63 changes: 63 additions & 0 deletions Nitrust/HYPOTHESES.md
@@ -0,0 +1,63 @@
# Nitrust Program — Hypothesis Backlog (NGRAM-Free)
Date: 2026-03-27

## Mission
Build foundational, hardware-first architecture upgrades above the crawler line that improve:
1. Model-only quality (`val_bpb`, no ngram mixing)
2. Artifact efficiency (bytes at fixed or better quality)
3. Throughput (step time / tokens-per-second)

## Hard Rules (Nitrust Phase 1)
1. Ignore all ngram paths for training and eval.
2. Compare only model outputs (`final_int6_roundtrip`, `final_int6_sliding_window`).
3. Keep export/legal path simple while architecture is changing.

### NGRAM-Off Guardrail
Use these defaults for all Nitrust runs unless explicitly overridden (a helper sketch follows the list):
- `NGRAM_EVAL_ORDER=0`
- `NGRAM_EVAL_ADAPTIVE=0`
- `NGRAM_DIRICHLET=0`
- `PHRASE_CACHE=0`
- `REGIME_TRACKER=0`
- `NGRAM_ENTROPY_SHIFT=0`
- `TRIGRAM=0`
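
A minimal sketch of applying the guardrail in Python; `os.environ.setdefault` implements the "unless explicitly overridden" rule, since any value already exported by a run script wins.

```python
import os

# NGRAM-off defaults from the guardrail list above.
NGRAM_OFF_DEFAULTS = {
    "NGRAM_EVAL_ORDER": "0",
    "NGRAM_EVAL_ADAPTIVE": "0",
    "NGRAM_DIRICHLET": "0",
    "PHRASE_CACHE": "0",
    "REGIME_TRACKER": "0",
    "NGRAM_ENTROPY_SHIFT": "0",
    "TRIGRAM": "0",
}

for key, value in NGRAM_OFF_DEFAULTS.items():
    os.environ.setdefault(key, value)  # explicit exports take precedence
```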

## Baseline First (NIT-00)
Before every new injection, re-run a stable baseline with the exact same wallclock budget and seed policy.

Each baseline record should include:
- `step@cap`, `val_bpb@cap`
- `final_int6_roundtrip_exact`
- `final_int6_sliding_window_exact`
- `Serialized model int6+*` bytes
- step average ms

---

## Ordered Hypotheses (Low -> High Complexity)

| ID | Complexity | Hypothesis | Architecture Injection | Hardware Rationale | Success Gate | Kill Gate |
|---|---:|---|---|---|---|---|
| NIT-01 | 1 | Hopper shape locking improves throughput without quality loss. | Lock dims/head dims to tensor-core-friendly multiples; remove odd shapes in recurrent path. | Fewer kernel variants, better matmul occupancy/fusion. | >=8% faster step time, `val_bpb` delta <= +0.01 | <3% speed gain or `val_bpb` worse by >0.02 |
| NIT-02 | 2 | Loop-conditioned low-rank adapters fix shared-block regime mismatch (sketched after the table). | Shared core stays fixed, per-loop `W_k = W + A_k B_k` (small rank). | Keeps parameter compression while giving each loop a cheap specialization path. | Better `final_int6_sliding_window` by >=0.02 at <=15% artifact growth | No quality gain or artifact growth >20% |
| NIT-03 | 3 | Split sharing (shared attention, loop-specific MLP modulation) beats fully shared blocks. | Share attention weights; add tiny per-loop channel gates or low-rank MLP deltas. | Attention kernels stay reusable; cheap MLP modulation handles loop-specific distributions. | >=0.02 BPB improvement vs NIT-00 with <=20% slower step time | Regresses both speed and BPB |
| NIT-04 | 4 | Bucketed adaptive loop budget improves quality-per-compute. | Two static paths: short-loop and long-loop based on confidence bucket at sequence/window level. | Preserves static-ish execution while reducing unnecessary deep passes. | Same or better BPB with >=15% faster average step time | Control overhead removes speed gain |
| NIT-05 | 5 | Latent funnel recurrence dominates flat+bottleneck at same bytes. | Downsample sequence in bottleneck (`T -> T/2`), run recurrent core there, upsample back. | Shifts work to denser GEMMs and lowers KV bandwidth pressure. | >=0.03 BPB gain or >=20% speedup at comparable artifact size | Training instability or quality collapse |
| NIT-06 | 6 | Persistent memory tokens make recurrence actually cumulative. | Add small memory token bank carried across loops and rewritten each loop. | Small fixed memory adds global workspace without large parameter cost. | >=0.02 BPB gain over NIT-05 with <=10% speed hit | No measurable gain after two seeds |
| NIT-07 | 7 | Dual-rate recurrent superblock wins the frontier. | Heavy attention every 2 loops, lightweight update each loop (multi-rate core). | Cuts expensive attention frequency while keeping iterative refinement depth. | Better BPB and speed-vs-quality tradeoff than NIT-05/06 | Scheduling complexity causes compile/runtime fragility |
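
For NIT-02, a minimal PyTorch sketch of the adapter follows; the module name, rank, and initialization are illustrative. Zero-initializing `A_k` means every `W_k` equals the shared `W` at step 0, so baseline behavior is recovered exactly before training moves the adapters.

```python
import torch
import torch.nn as nn

class LoopAdaptedLinear(nn.Module):
    """NIT-02 sketch: shared weight plus a per-loop low-rank delta,
    effectively W_k = W + A_k @ B_k, giving each crawler loop a cheap
    specialization path at small artifact cost."""
    def __init__(self, dim: int, num_loops: int, rank: int = 8):
        super().__init__()
        self.shared = nn.Linear(dim, dim, bias=False)             # W, shared
        self.A = nn.Parameter(torch.zeros(num_loops, dim, rank))  # zero init
        self.B = nn.Parameter(torch.randn(num_loops, rank, dim) * dim**-0.5)

    def forward(self, x: torch.Tensor, loop_idx: int) -> torch.Tensor:
        delta = x @ self.A[loop_idx] @ self.B[loop_idx]  # per-loop correction
        return self.shared(x) + delta
```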

---

## Execution Order
1. NIT-00 baseline freeze
2. NIT-01 shape locking
3. NIT-02 low-rank loop adapters
4. NIT-03 split sharing
5. NIT-04 adaptive loop buckets
6. NIT-05 latent funnel
7. NIT-06 memory tokens
8. NIT-07 dual-rate superblock

## Notes
- Do not introduce ngram-dependent compensators while validating core architecture signal.
- Any candidate that wins only with ngram is considered unproven for Nitrust Phase 1.
51 changes: 51 additions & 0 deletions Nitrust/MEDUSA_TARGET.md
@@ -0,0 +1,51 @@
# Nitrust Target Lock — Crawler Mainline (Medusa Delta)
Date: 2026-03-29

## Baseline Contract
Optimization target is crawler-only mainline:
- Canonical launcher: `experiments/Crawler_Leg_1/run.sh`
- Compatibility alias: `experiments/Medusa/run.sh`

Baseline runtime knobs:
- `NGRAM_EVAL_ORDER=0`
- `USE_CRAWLER=1`
- `NUM_FLAT_LAYERS=4`
- `NUM_CRAWLER_LAYERS=1`
- `CRAWLER_LOOPS=4`
- `INST_DIM=32`
- `CRAWLER_QUANT_INT8=1`
- `DELTA_NET_HEADS=0`
- `SKIP_EMA=1`
- `SKIP_GPTQ=1`

## Confirmed Code Seams (train_gpt.py)
- Data shard parse/load: `load_data_shard` and `TokenStream`/`DistributedTokenLoader`.
- Training hot path: `DistributedTokenLoader.next_batch`.
- Eval hot path: `eval_val_sliding` window assembly loop.
- Export hot path: int6 pack/compress path near final export.
- EMA/GPTQ cut points: `SKIP_EMA` and `SKIP_GPTQ` gates in finalization section.

## Nitrust Compatibility Note
Crawler shard files are headered, not raw `u16`; they contain a 256x`i32` prefix:
- `magic=20240520`
- `version=1`
- `num_tokens` in header slot 2

`nitrust-mmap-loader` now supports this format with strict size checks,
while retaining raw `u16` fallback.
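
A minimal Python sketch of reading one headered shard, using only the layout above; the function name and error policy are illustrative, and the raw `u16` fallback is omitted.

```python
import numpy as np

def load_headered_shard(path: str) -> np.ndarray:
    """Parse a 256 x i32 header (magic, version, num_tokens in slot 2),
    then read exactly num_tokens uint16 token ids with a strict size check."""
    with open(path, "rb") as f:
        header = np.frombuffer(f.read(256 * 4), dtype=np.int32)
        if header[0] != 20240520:
            raise ValueError(f"{path}: bad magic {header[0]}")
        if header[1] != 1:
            raise ValueError(f"{path}: unsupported version {header[1]}")
        num_tokens = int(header[2])
        tokens = np.frombuffer(f.read(num_tokens * 2), dtype=np.uint16)
    if tokens.size != num_tokens:  # strict size check
        raise ValueError(f"{path}: expected {num_tokens} tokens, got {tokens.size}")
    return tokens
```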

## Complexity-Ordered Knockout Plan (NGRAM-Free)
1. NIT-A1: Swap shard reads to Rust mmap path (no model math changes).
2. NIT-A2: Rust batch assembly + pinned host buffer handoff.
3. NIT-B1: Rust sliding-window index builder for eval path.
4. NIT-C1: Rust quant/export pack pipeline.
5. NIT-D1: CUDA graph replay wrapper for fixed-shape training steps.

## Quarantine Rule
DeltaNet remains sandbox-only during this leg (`experiments/Medusa/run_delta_sandbox.sh`).

## Success Gates
- Primary: lower `step_avg` and higher tokens/sec at equal wallclock.
- Quality: `final_int6_roundtrip_exact` and `final_int6_sliding_window_exact`
within tolerance (`val_bpb` delta <= +0.01 unless explicitly traded for speed).
- Reproducibility: deterministic run logs with fixed seed.
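
The quality gate reduces to a one-sided tolerance check; a sketch, assuming per-seed BPB values compared against the frozen baseline:

```python
def passes_quality_gate(baseline_bpb: float, candidate_bpb: float,
                        tolerance: float = 0.01) -> bool:
    """Accept any improvement; reject a regression beyond the tolerance."""
    return candidate_bpb - baseline_bpb <= tolerance
```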