154 commits
4ce0d59
X-WING 3D Cubric: 0.4820 BPB (3-seed mean, std 0.0002)
Mar 26, 2026
6c49da3
B-wing lab: port PR #809 n-gram techniques onto X-WING base
Mar 26, 2026
bee0716
B-wing II: cubric ON + entropy shift + fast TTT
Mar 26, 2026
d6d281a
B-wing III: LoRA TTT from #809 + cubric ON + all n-gram fixes
Mar 26, 2026
137432f
Record bwing_full_port seed 1337: 0.4512 BPB
Mar 26, 2026
94bb107
Replace bwing_III with copy of SOTA bwing_full_port (0.4512 BPB)
Mar 26, 2026
2c0c0ee
B-wing IV + V: fix 7→9 hash primes (order 8-9 collision bug)
Mar 26, 2026
3ebaf38
Add B-wing pod setup script (FA3 + zstandard + sp1024)
Mar 26, 2026
5a21365
Add n-gram parameter grid sweep for bwing_V
Mar 26, 2026
75dbe40
A-Wing Green: INT5 GPTQ (clip_range=15) + 9-prime hash fix
Mar 26, 2026
22eae2a
A-Wing Green: strip TTT, cubric, F1 correction, distillation
Mar 26, 2026
d6cb709
Record results: A-Wing Green 0.4576, bwing_V 0.4601
Mar 26, 2026
c37a8ab
A-Wing Green_1: Oracle Alpha — use model_p vs ngram_p directly
Mar 26, 2026
08d6b7c
Green_1: cap training at 570s to fit GPTQ in 600s budget
Mar 26, 2026
d8b6022
Green_1: add preflight checks (zstd, FA3) + zstd import warning
Mar 26, 2026
b1d45b8
A-Wing Green_2: Oracle Alpha + LoRA TTT + 9-Prime
Mar 26, 2026
88ec4ca
Fix pod setup: use system Python, no conda/PYTHONPATH hacks
Mar 26, 2026
5876cf5
NEW SOTA 0.3200 BPB: A-Wing Green_1 Oracle Alpha + 9-Prime
Mar 26, 2026
da832ba
A-Wing Purple: Learned Mixer Head for legal n-gram ceiling
Mar 26, 2026
2b38218
Add pod_launch.sh: one command for clone + setup + run
Mar 26, 2026
a37d7c3
Fix pod_launch.sh: pull from private repo (fork1), not public
Mar 26, 2026
6004ac7
Purple: reduce prefill to 20 shards (~2B tokens), restore 570s cap
Mar 26, 2026
230dfc6
Clean up repo: single pod_setup.sh, archive stale dirs
Mar 26, 2026
db300a0
Fix pod_setup.sh: workspace path is /workspace/parameter-golf
Mar 26, 2026
2a92a77
F-Wing: Frugendorff + X-WING N-gram combined concept
Mar 26, 2026
473a4b7
Fix REPO_DIR depth in F_Wing run scripts (3 levels up, not 2)
Mar 26, 2026
5e8ec28
Add A-wing RED mixer variant with bounded distributed prefill
Mar 26, 2026
4a06a37
Add A-wing RED_G GPU monster mixer path and tune RED
Mar 26, 2026
3cedb3f
Fix DDP warmup by including mixer supervision in RED variants
Mar 26, 2026
005cdc5
records: add A-WING RED_G seed1337 run summary
Mar 26, 2026
4a4be33
F-Wing: rebase train_gpt.py onto A_wing/RED (add CrawlerGPT + mixer s…
Mar 26, 2026
f09a6e5
RED_G: fix ngram blend-mode conflicts and wire order-aware eval controls
Mar 26, 2026
abe72f0
F-Wing: fix CrawlerGPT torch.compile compatibility
Mar 26, 2026
a76dda4
Add A-Wing green_3: width bump to model_dim=640
Mar 26, 2026
5e27afc
Add A-Wing green_1A: legal alpha + PR#609 improvements
Mar 27, 2026
aa0a156
Optimize green_1A selective pruning: fast zstd-1 for binary search
Mar 27, 2026
411dea1
Add Cobra base-quality 10min harness plan and tooling
Mar 27, 2026
3b4b821
Add pod_setup_cobra bootstrap script
Mar 27, 2026
90741b4
Rat Rod Green: Parallel Muon base + GPTQ stripped for pure base model…
Mar 27, 2026
e32f32b
Rat Rod Green v2: kill late QAT + enable trigram
Mar 27, 2026
ec7ab9f
Rat Rod v3: MTP_NUM_HEADS=2, revert trigram (v2 was a wash)
Mar 27, 2026
05d3990
Rat Rod v3: MTP_NUM_HEADS=2 experiment (separate dir, green untouched)
Mar 27, 2026
fd4fb31
A/B test: ROPE_DIMS 16 vs 24 (200s quick runs)
Mar 27, 2026
4d58515
Log A/B test result: ROPE_DIMS=16 control @ 200s
Mar 27, 2026
2479ced
Rat Rod v4: HS-MTP — hash-space multi-token prediction
Mar 27, 2026
aefe581
Rat Rod v4: add CPU n-gram bridge for HS-MTP weighting
Mar 27, 2026
fc73010
Fix: set _hsmtp_w during warmup phase (torch.compile NoneType crash)
Mar 27, 2026
5a6a771
Document Synapse system (HS-MTP + CPU N-gram Bridge) in PROGRESS.md
Mar 27, 2026
b14ef45
A/B test: VALUE_RESIDUAL 0 vs 1 (200s quick runs)
Mar 27, 2026
b7e9a07
Rat Rod v5: Synapse v2 — GPU-native hash bridge (<1ms/step)
Mar 27, 2026
e11298f
Freeze SOTA: A_WING_GREEN base 1.1129 ngram 0.4489 (2026-03-27)
Mar 27, 2026
7d2b520
Add SOTA folder README: only add, never delete or modify
Mar 27, 2026
8ec6a61
Fix: remove step counter that breaks torch.compile fullgraph
Mar 27, 2026
e3ae59a
Log v4/v5 Synapse + VALUE_RESIDUAL results — all dead
Mar 27, 2026
6ae55ed
Rat_Rod_Purple_1: training oracle + Dirichlet mixing + matrix_lr=0.03
Mar 27, 2026
c48c060
Purple_1: disable training oracle by default (legally gray)
Mar 27, 2026
f8caa0c
Rat Rod: add zero-cost H100 sweeps and robust trainer toggles
Mar 27, 2026
9e826d9
Purple_1: phrase cache + regime tracker + warmdown=2000 + chunk=65K
Mar 27, 2026
c185a8d
Add Siphon: ensemble-objective training + WARMDOWN2000 SOTA entry
Mar 27, 2026
63c27e1
FX-Wing: Instructed Recurrence — content-derived loop instructions fo…
Mar 27, 2026
4ab4ced
FX-Wing: add hypothesis and ablation plan
Mar 27, 2026
7a81eec
Reorganize: move master runner to experiments/Biology_concepts/run_al…
Mar 27, 2026
95e9333
Add green v6 (optimized SOTA): v1 + WARMDOWN_ITERS=2000
Mar 27, 2026
5268082
Add Biology Concepts sweep findings — tornado vs baseline analysis
Mar 27, 2026
516e2c8
Add green v7: v6 + COMPLEMENT_ALPHA=0.5
Mar 27, 2026
9a58d14
FX-Wing: fix compile — COMPILE_FULLGRAPH=0 for crawler loop
Mar 27, 2026
15c66fc
FX-Wing: CRAWLER_LOOPS=4 — exploit weight-sharing compression
Mar 27, 2026
909901e
Log v7 results: COMPLEMENT_ALPHA=0.5 worse than v1
Mar 27, 2026
812599d
FX-Wing: CRAWLER_QUANT_INT8 — int8 precision for shared crawler block
Mar 27, 2026
c641e5e
Add vast_fxwing_single.sh — single GPU FX-Wing launcher for Vast.ai
Mar 27, 2026
ce5e317
Add Cambrian: DeltaNet × Biology Concepts architecture
Mar 27, 2026
38479b9
Cambrian-0: GatedDeltaNet × Bio Seam architecture skeleton
Mar 27, 2026
2df9c72
Fix bio concept scripts: make MAX_WALLCLOCK_SECONDS env-overridable
Mar 27, 2026
8b93705
Cambrian-1: Add four bio seam controllers (Myelin, Circadian, Clonal,…
Mar 27, 2026
b0776f1
FX-Wing micro: device-flexible concept test for GB10 Blackwell DGX Spark
Mar 27, 2026
da80af1
FX-Wing: add DeltaNet associative memory to crawler reservoir
Mar 27, 2026
fa21139
FX-Wing micro: add -u flag for unbuffered stdout through tee pipe
Mar 27, 2026
531f98f
vast: blacklist offer 33510639 (103.42.50.244 — SSH never connects)
Mar 27, 2026
ff7069b
FX-Wing DeltaNet: disable compile on forward to prevent T-loop OOM
Mar 27, 2026
36845e3
FX-Wing run.sh: DELTA_NET_HEADS=0 for core concept test
Mar 27, 2026
fa4c218
FX-Wing: suppress inductor NaN in RoPE bounds analysis (PyTorch 2.4 bug)
Mar 27, 2026
b4968be
Cambrian: disable torch.compile on GatedDeltaNet.forward
Mar 27, 2026
f74175d
Cambrian run.sh: set COMPILE_FULLGRAPH=0
Mar 27, 2026
cea3b8b
Fix astrocyte gate shape bug: view(B,1,1) not unsqueeze(1).unsqueeze(2)
Mar 27, 2026
5fac1c8
GreenRod X_1: Hybrid DeltaNet + Attention engine
Mar 27, 2026
b55a421
Cambrian: forward PYTORCH_CUDA_ALLOC_CONF to torchrun (expandable_seg…
Mar 27, 2026
15714f9
Cambrian: remove @torch.compiler.disable from GDN.forward
Mar 27, 2026
24dd550
FX_Wing_Delta: flow instructions + DeltaNet + hypothesis
Mar 27, 2026
0b2164d
Cambrian: restore @torch.compiler.disable, default wallclock 600s
Mar 27, 2026
9c34b42
FX_Wing_Sigma: n-gram entropy as smoothing reference hypothesis
Mar 27, 2026
0c623c7
Add Cambrian bio seam sweep script
Mar 27, 2026
96bc2b4
FX_Wing_Delta: disable DeltaNet for flow-only test, add inductor patch
Mar 27, 2026
3adddb0
FX_Wing_Delta_DN: DeltaNet with gradient checkpointing + truncated BPTT
Mar 27, 2026
7b5e09c
Fix Cambrian bio sweep hang: SKIP_FINAL_EVAL=1 + process cleanup
Mar 27, 2026
c7ffeec
Deprecate FX_Wing* experiments; add FA_Wing_Green_1 gitignore
Mar 27, 2026
c9600c7
Add Cambrian agent instructions for Vast.ai sweep
Mar 27, 2026
03f9838
Add FA_Wing_GreenDN_1 (flow instructions + DeltaNet); gitignore both …
Mar 27, 2026
7c197c7
Add FA_Wing_Green_1 and FA_Wing_GreenDN_1 experiment code
Mar 27, 2026
0a89f4a
Fix REPO_ROOT depth in FA_Wing run.sh files (../.. not ../../..)
Mar 27, 2026
8037fce
Fix DDP unused-params crash: disable VE in FA_Wing crawler runs
Mar 27, 2026
3651d35
Add ClownCar experiment; restore FX_Wing_Delta from deprecated
Mar 27, 2026
f2a4f5f
ClownCar: disable ngram eval — sliding window baseline only
Mar 27, 2026
5ae2be5
Add ClownCar_II: canonical FLA DeltaNet + Crawler symbiotic pairing
Mar 27, 2026
ba4a2a7
Fix ClownCar/II run.sh: add missing crawler flags (USE_CRAWLER=1 etc.)
Mar 27, 2026
87ad173
ClownCar_II: add FLA ops preflight check to confirm canonical kernel …
Mar 28, 2026
e3ba281
Fix ClownCar_II: cast q/k/v/beta to x.dtype before chunk_delta_rule
Mar 28, 2026
c0cf2ac
Add ClownCar_IV: GPTQ bypass + state dtype fix
Mar 28, 2026
5d9e0b2
Fix ClownCar_IV: revert state dtype cast — only change is SKIP_GPTQ=1
Mar 28, 2026
a7d53c8
ClownCar_IV: SKIP_GPTQ only — restored from known-good e3ba281
Mar 28, 2026
baceb10
ClownCar_IV: reset to ClownCar_II base + EMA_DECAY=0.99
Mar 28, 2026
e587c91
ClownCar_IV: remove GPTQ, use naive int6
Mar 28, 2026
c262086
Add ClownCar_VI and Medusa: skip EMA + naive int6
Mar 28, 2026
07a57bf
pod_setup: add fla + attr install for DeltaNet
Mar 28, 2026
d9db34d
Add ClownCar_VII: loop-aware 2-phase GPTQ + no EMA
Mar 28, 2026
cc06d3b
Medusa: sync to ClownCar_VII (loop-aware GPTQ + no EMA)
Mar 28, 2026
ebc4b84
Add Medusa_II: late-start EMA (step 4400) + loop-aware GPTQ
Mar 28, 2026
4aa704b
Medusa_II: add short exit-only unravel A/B harness
Mar 28, 2026
9d1be62
Add Medusa_IV: copy of Medusa_III (winning 1.0366 config)
Mar 28, 2026
4b1c51c
Medusa_II: force finish-only A/B and add one-command launcher
Mar 28, 2026
d2f47e2
Add Medusa_V: fix state dtype cast (new_state.to(dtype))
Mar 28, 2026
0c38323
Medusa_II: add additional-only unravel check runner
Mar 28, 2026
d74538f
Add Medusa_V_SOTAMAXX: frozen SOTA config snapshot
Mar 28, 2026
9fa4fec
Add Medusa_VI: DeltaNet projections → CastedLinear for QAT coverage
Mar 28, 2026
0ce12a6
Records: fill Medusa_IV known results (seeds 300, 1337)
Mar 28, 2026
a4a5447
Records: Medusa Unstable README with known results
Mar 28, 2026
5f731b3
Records: Medusa_IV 3-seed complete — seed 42=0.8104 BPB (best), mean=…
Mar 28, 2026
79f45ae
Add Medusa_Legal_unstable: fix GPTQ training-data access after wallcl…
Mar 28, 2026
556b2fc
Medusa_VII: causality fix + shard header fix + DeltaNet ablation
Mar 29, 2026
3e09695
Medusa_VII: add ablation results
Mar 29, 2026
f74b9c9
Bandit: ClownCar crawler + X-WING ngram oracle
Mar 29, 2026
3a75282
Bandit: fix GPTQ wallclock violation (GPTQ_RESERVE_MS=30s)
Mar 29, 2026
4efa746
Bandit: ClownCar Crawler x Cubric Ngram9 — 0.4961 BPB (3-seed mean)
Mar 29, 2026
e6d11d8
Log JR-03 fused MLP result as loser (with Triton-node caveat)
Mar 30, 2026
1a8501a
Crawler_Leg_1: add run_all.sh sequencer for all 11 ablation arms
Mar 30, 2026
946f0a7
Rascal II: skip GPTQ + embed int6 — full 600s, target <16MB
Mar 30, 2026
f1ce7c9
SOTA: Rascal II — new best legal submission 1.10986874 BPB, 15.44MB
Mar 30, 2026
39ed402
Record: Rascal — val_bpb 1.1099 (3-seed mean)
Mar 30, 2026
1d48f9c
Add FX_Wing_Delta_safe: byte-identical backup of FX_Wing_Delta
Mar 30, 2026
9a15ace
Add ChopShop (cleaned Rascal base) and Rascal_Stripper smoke test
Mar 30, 2026
964fd8b
Add all research data: experiments, records, scripts, Nitrust, octavian
Mar 30, 2026
7de8402
Crawler_Leg_2: 5-arm sweep combining loops=3 + mlp=5.0 wins
Mar 30, 2026
da7a6b0
Bandit_Wagon: remove NGRAM code, apply optimal CL1 config
Mar 30, 2026
d2d1ecd
Rascal_Stripper: implement TurboMuon + EngramLite + TTT + CROWN-Q
Mar 30, 2026
2d7022b
Crawler_Leg_2: set wallclock to 350s (~4k steps on 8xH100)
Mar 30, 2026
b39f23c
Bandit_Wagon: rewrite HYPOTHESIS.md for pure neural crawler campaign
Mar 30, 2026
4f37849
Rascal_Stripper: fix CROWN-Q variable name collision (scale → q_scale)
Mar 30, 2026
4603c48
Rascal_Stripper: bump smoke test to 3200 steps (warmdown 800)
Mar 30, 2026
206434c
CL2 results: 1.19593 BPB — loops=3+mlp=5.0+LOOP_AWARE_GPTQ+COMPILE wi…
Mar 30, 2026
dd9f4fd
Crawler_Leg_3: full 600s run, loops=3 mlp=6.0, Rascal warmdown style
Mar 30, 2026
0e2286f
Rascal_Stripper: add ttt_calibrate.py — standalone TTT hyperparameter…
Mar 30, 2026
9de1f3b
Crawler_Leg_3: multi-seed script + submission skeleton
Mar 30, 2026
411970f
Rascal_Stripper: add ttt_sweep.sh — 3-config TTT calibration runner
Mar 30, 2026
8b17867
Rascal III: TurboMuon + EngramLite combo runner (600s production)
Mar 30, 2026
1194948
Crawler submission: 3-seed complete, 1.1874 BPB mean
Mar 30, 2026
3 changes: 2 additions & 1 deletion .gitignore
@@ -8,4 +8,5 @@ data/manifest.json
 data/docs_selected.jsonl
 .mypy_cache/
 .venv
-logs/
+logs/
+experiments/archive/checkpoints/
46 changes: 46 additions & 0 deletions Nitrust/CLOWNCAR_IV_TARGET.md
@@ -0,0 +1,46 @@
# Nitrust Target Lock — ClownCar_IV (Superseded)
Date: 2026-03-27

This target note is superseded by `Nitrust/MEDUSA_TARGET.md`.

## Baseline Contract
Optimization target is `experiments/ClownCar_IV`.

Baseline runtime knobs (from `run.sh`):
- `NGRAM_EVAL_ORDER=0`
- `USE_CRAWLER=1`
- `NUM_FLAT_LAYERS=4`
- `NUM_CRAWLER_LAYERS=1`
- `CRAWLER_LOOPS=4`
- `INST_DIM=32`
- `CRAWLER_QUANT_INT8=1`
- `DELTA_NET_HEADS=4`
- `SKIP_GPTQ=1`

## Confirmed Code Seams (train_gpt.py)
- Data shard parse/load: `load_data_shard` and `TokenStream`/`DistributedTokenLoader`.
- Training hot path: `DistributedTokenLoader.next_batch`.
- Eval hot path: `eval_val_sliding` window assembly loop.
- Export hot path: int6 pack/compress path near final export.

## Nitrust Compatibility Note
ClownCar shard files are not raw `u16`; they contain a 256x`i32` header:
- `magic=20240520`
- `version=1`
- `num_tokens` in header slot 2

`nitrust-mmap-loader` has been updated to support this format with strict size checks,
while still supporting raw `u16` legacy files.

## Complexity-Ordered Knockout Plan (NGRAM-Free)
1. NIT-A1: Swap shard reads to Rust mmap path (no model math changes).
2. NIT-A2: Rust batch assembly + pinned host buffer handoff.
3. NIT-B1: Rust sliding-window index builder for eval path.
4. NIT-C1: Rust quant/export pack pipeline.
5. NIT-D1: CUDA graph replay wrapper for fixed-shape training steps.

## Success Gates
- Primary: lower `step_avg` and higher tokens/sec at equal wallclock.
- Quality: `final_int6_roundtrip_exact` and `final_int6_sliding_window_exact`
within tolerance (`val_bpb` delta <= +0.01 unless explicitly traded for speed).
- Reproducibility: deterministic run logs with fixed seed.
43 changes: 43 additions & 0 deletions Nitrust/COMMANDER_ORDERS.md
@@ -0,0 +1,43 @@
# Nitrust Commander Orders — Crawler-Only Sprint Sequence
Date: 2026-03-29

## Command Intent
Increase end-to-end speed via Rust hardware modules outside crawler internals.
Do not depend on ngram systems for wins.
Bandit is the current SOTA reference while the crawler-only leg is rebuilt.

## Sprint Queue (In Order)

| Sprint | Modules | Goal | Gate |
|---|---|---|---|
| A | NR-01 + NR-02 | Remove Python data path bottlenecks and overlap H2D transfers | >=10% throughput gain, no metric regression |
| B | NR-03 | Accelerate sliding-window eval infra | >=25% eval wallclock reduction |
| C | NR-04 | Compress/export faster with deterministic pack pipeline | >=2x export speedup, bit-exact roundtrip |
| D | NR-05 | Reduce launch overhead with CUDA graph replay | >=10% train step reduction |
| E | NR-06 | Stabilize topology-level performance | lower p95 step jitter and +3% throughput |
| F | NR-07 | Online parameter tuning | additional >=5% gain over Sprint E |

## Non-Negotiables
1. Every sprint ships with A/B benchmark evidence.
2. No sprint proceeds if parity checks fail.
3. Any speed gain that harms baseline quality beyond tolerance is rejected.

## Benchmark Baseline Spec
Use `experiments/Crawler_Leg_1/run.sh` profile with:
- `NGRAM_EVAL_ORDER=0`
- `USE_CRAWLER=1`
- `NUM_FLAT_LAYERS=4`
- `NUM_CRAWLER_LAYERS=1`
- `CRAWLER_LOOPS=4`
- `INST_DIM=32`
- `CRAWLER_QUANT_INT8=1`
- `DELTA_NET_HEADS=0`
- `SKIP_EMA=1`
- `SKIP_GPTQ=1`
- fixed seed and wallclock

## Immediate Next Action
Execute Sprints A, B, and C on the crawler-only lane:
1. Keep `nitrust-py` import path optional with strict parity checks (a minimal sketch follows this list).
2. Benchmark Rust mmap + pinned batcher on crawler-only ablation grid.
3. Add eval/export Rust path tests only after crawler baseline is stable across two seeds.
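
To make the parity requirement in item 1 concrete, a minimal sketch follows. The loader objects and the `next_batch(idx)` signature are illustrative stand-ins, not the real `nitrust-py` API; batches are assumed to be numpy arrays.

```python
import os
import numpy as np

def next_batch_checked(py_loader, rust_loader, idx):
    """Run the Python reference path beside the optional Rust path and
    trust the Rust batch only if it is bit-exact with the reference."""
    ref = py_loader.next_batch(idx)      # Python reference data path
    if rust_loader is None:              # nitrust-py import stays optional
        return ref
    cand = rust_loader.next_batch(idx)
    if np.array_equal(ref, cand):
        return cand
    if os.environ.get("NITRUST_STRICT", "0") == "1":
        raise RuntimeError(f"nitrust parity mismatch at batch {idx}")
    return ref                           # non-strict mode falls back silently
```

Checking every batch doubles loader work, so in practice this gate would run on a sampled subset before switching the run to the Rust path alone.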
70 changes: 70 additions & 0 deletions Nitrust/CRAWLER_DELTA_BOOSTER_MATRIX.md
@@ -0,0 +1,70 @@
# Nitrust Crawler/Delta Booster Matrix
Date: 2026-03-27
Target: `experiments/Medusa` (Crawler + DeltaNet)
Scope: architecture/external-system boosters, plus Rust integration ablations
Guardrail: NGRAM disabled for all signal tests (`NGRAM_EVAL_ORDER=0`)
Active lane: crawler-only (`DELTA_NET_HEADS=0`) until DeltaNet sandbox re-validation succeeds

Update (2026-03-29):
- DeltaNet is quarantined from the main crawler run path pending re-validation.
- Bandit is treated as the current SOTA reference while the crawler-only leg is rebuilt.

## Master Hypothesis Table

| ID | Area | Booster Hypothesis | Primary Knobs | Expected Win | Risk | Smoke Ready |
|---|---|---|---|---|---|---|
| CDB-01 | Quant Bridge | Loop-aware GPTQ (flat first, crawler second) beats one-shot GPTQ. | `LOOP_AWARE_GPTQ` | Better int6 roundtrip BPB | Calibration cost | Yes |
| CDB-02 | Quant Bridge | Keep crawler tensors int8 while flat stays int6. | `CRAWLER_QUANT_INT8` + export policy | Less loop error compounding | Size creep | Yes |
| CDB-03 | Quant Bridge | Per-loop crawler dequant scales reduce distribution drift. | per-loop scale banks | Better loop stability | Metadata size | No |
| CDB-04 | Quant Bridge | Skip GPTQ for flat, GPTQ only crawler+delta. | selective GPTQ groups | Faster quant + similar BPB | Flat quality drop | No |
| CDB-05 | Delta Core | Delta head count has a sweet spot (under/over hurts). | `DELTA_NET_HEADS` sweep | Better quality/compute | Runtime cost | Yes |
| CDB-06 | Delta Core | Delta state precision policy impacts stability. | bf16/fp16/fp32 state | Fewer drift errors | Throughput hit | No |
| CDB-07 | Delta Core | Delta residual gate controls over-write chaos across loops. | residual gate scalar/schedule | Better convergence | Under-updating | No |
| CDB-08 | Delta Core | Delta state norm clipping prevents runaway memory (see sketch below the table). | clip threshold | Robustness | Lost signal | No |
| CDB-09 | Delta Core | Periodic delta state reset improves long-run conditioning. | reset cadence | More stable training | Loses long memory | No |
| CDB-10 | Delta Core | Head-dim tensor-core alignment boosts Delta throughput. | aligned dims / head_dim | Faster kernels | Architecture constraints | No |
| CDB-11 | Crawler Loop | Instruction bottleneck size has optimal range. | `INST_DIM` sweep | Better loop routing | Under/overfit | Yes |
| CDB-12 | Crawler Loop | Loop-specific low-rank adapters beat fully shared core. | loop LoRA rank | BPB gain at small bytes | Params grow | No |
| CDB-13 | Crawler Loop | Split sharing (shared attn, modulated MLP) improves regime handling. | attn shared + MLP gates | BPB gain | Complexity | No |
| CDB-14 | Crawler Loop | Last loop partial unsharing captures final-pass specialization. | unshare depth=1 | BPB gain with low byte cost | Param creep | No |
| CDB-15 | Crawler Loop | Dual-rate loops (heavy every 2nd loop) improve quality/compute. | heavy cadence | Better speed-quality frontier | Scheduler bugs | No |
| CDB-16 | Crawler Loop | Adaptive loop count by confidence reduces wasted compute. | short/long bucket policy | Throughput gain | Control overhead | No |
| CDB-17 | Crawler Loop | Loop state carry with explicit damping improves fixed-point stability. | carry decay | Better convergence | Slower adaptation | No |
| CDB-18 | Crawler Loop | Loop dropout/stochastic depth improves shared-block generalization. | loop drop prob | Better robustness | Instability | No |
| CDB-19 | Crawler Topology | Memory tokens across loops add persistent workspace. | memory token count | Better long context | Extra compute | No |
| CDB-20 | Crawler Topology | Latent funnel recurrence (T->T/2 core) is superior at equal bytes. | funnel ratio | Speed or BPB gain | Complexity | No |
| CDB-21 | Crawler Topology | Encoder/decoder depth rebalance improves compression frontier. | flat/crawler split | Better byte-efficiency | tuning overhead | Yes |
| CDB-22 | Crawler Topology | Add tiny per-loop channel gates for activation alignment. | gate width | Better loop reuse | Small extra params | No |
| CDB-23 | Rust Data Path | Rust mmap shard reader reduces loader stalls. | `NITRUST_ENABLE` | Step-time drop | bridge overhead | Yes |
| CDB-24 | Rust Data Path | Strict mode catches silent Rust-path regressions early. | `NITRUST_STRICT` | Safer ops | hard fail risk | Yes |
| CDB-25 | Rust Data Path | Pinned host batcher improves H2D overlap. | prefetch depth, pinned on/off | Throughput gain | Memory pressure | Partial |
| CDB-26 | Rust Eval | Rust sliding-window index engine slashes eval wallclock. | window engine on/off | Faster eval | parity bugs | No |
| CDB-27 | Rust Export | Rust quant pack pipeline accelerates `.ptz` creation. | quantpack on/off | Faster export | bit-exact risk | No |
| CDB-28 | Runtime | CUDA graph replay cuts launch overhead on static smoke shapes. | graph on/off | Step-time drop | graph fragility | No |
| CDB-29 | Runtime | NUMA/affinity pinning lowers p95 jitter on multi-GPU hosts. | affinity profile | Stability gain | host variance | No |
| CDB-30 | Runtime | Online autotune for batch/prefetch finds hidden headroom. | autotune budget | extra throughput | tune noise | No |
| CDB-31 | Scheduling | Warmdown/EMA/GPTQ ordering matters for final int6 quality. | `SKIP_EMA`, warmdown, GPTQ mode | Better end BPB | confounding effects | Yes |
| CDB-32 | Scheduling | Distill-after-loop-aware-GPTQ may recover quantization loss. | distill flags + GPTQ mode | Better final BPB | extra time | No |
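
As one concrete instance from the table, a minimal CDB-08 sketch is below. The state shape `(batch, heads, d_k, d_v)` and the default threshold are assumptions for illustration, not tuned values; rescaling (rather than hard truncation) keeps the state direction intact while bounding its magnitude.

```python
import torch

def clip_state_norm(state: torch.Tensor, max_norm: float = 10.0) -> torch.Tensor:
    """CDB-08 sketch: rescale the DeltaNet recurrent state whenever its
    per-head Frobenius norm exceeds max_norm, so the associative memory
    cannot grow without bound across crawler loops."""
    # norm over the trailing (d_k, d_v) state matrix, one value per head
    norm = state.flatten(start_dim=-2).norm(dim=-1, keepdim=True).unsqueeze(-1)
    scale = (max_norm / (norm + 1e-8)).clamp(max=1.0)
    return state * scale
```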

## Spark Smoke Queue (v0)

| Run ID | Ablation | Delta from baseline | Status |
|---|---|---|---|
| SMK-00 | Baseline smoke | Medusa smoke config, `NITRUST_ENABLE=0` | Completed: roundtrip `6.02582801`, sliding `5.97225220` |
| SMK-01 | Rust loader ON | `NITRUST_ENABLE=1`, `NITRUST_STRICT=1` | Completed: roundtrip `6.02584613`, sliding `5.97228266` |
| SMK-02 | Delta heads OFF | `DELTA_NET_HEADS=0` + Rust ON | Completed: roundtrip `4.91216360`, sliding `4.90379569` |
| SMK-03 | Crawler int8 OFF | `CRAWLER_QUANT_INT8=0` + Rust ON | Completed: roundtrip `6.02587901`, sliding `5.97224063` |
| SMK-04 | Instruction OFF | `INST_DIM=0` + Rust ON | Completed: roundtrip `6.00549835`, sliding `5.95337039` |

## Smoke Config Contract
- Tiny dataset clone in `/tmp/nitrust_smoke_data` (header-compatible shards)
- Single Spark GPU smoke (`NPROC=1` style run)
- `VAL_LOSS_EVERY=0` to avoid known step-0 eval/autograd conflict during smoke
- Early-stop via wallclock cap + tiny iteration budget

## Initial Spark Readout
- Rust loader ON (`SMK-01`) is numerically neutral vs baseline in smoke (difference in the 1e-5 range on BPB).
- `CRAWLER_QUANT_INT8=0` (`SMK-03`) is also neutral in this tiny smoke setup.
- `INST_DIM=0` (`SMK-04`) slightly improved smoke BPB, but this is low-confidence at smoke scale.
- `DELTA_NET_HEADS=0` (`SMK-02`) changed the task dynamics substantially and ran much faster; treat as topology sanity check, not a like-for-like quality verdict.
- Artifact logs/summary captured at `results/nitrust_spark_smoke_20260327_234343/`.
63 changes: 63 additions & 0 deletions Nitrust/HYPOTHESES.md
@@ -0,0 +1,63 @@
# Nitrust Program — Hypothesis Backlog (NGRAM-Free)
Date: 2026-03-27

## Mission
Build foundational, hardware-first architecture upgrades above the crawler line that improve:
1. Model-only quality (`val_bpb`, no ngram mixing)
2. Artifact efficiency (bytes at fixed or better quality)
3. Throughput (step time / tokens-per-second)

## Hard Rules (Nitrust Phase 1)
1. Ignore all ngram paths for training and eval.
2. Compare only model outputs (`final_int6_roundtrip`, `final_int6_sliding_window`).
3. Keep export/legal path simple while architecture is changing.

### NGRAM-Off Guardrail
Use these defaults for all Nitrust runs unless explicitly overridden (a helper sketch follows the list):
- `NGRAM_EVAL_ORDER=0`
- `NGRAM_EVAL_ADAPTIVE=0`
- `NGRAM_DIRICHLET=0`
- `PHRASE_CACHE=0`
- `REGIME_TRACKER=0`
- `NGRAM_ENTROPY_SHIFT=0`
- `TRIGRAM=0`
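
A minimal sketch of applying the guardrail in Python; `os.environ.setdefault` implements the "unless explicitly overridden" rule, since any value already exported by a run script wins.

```python
import os

# NGRAM-off defaults from the guardrail list above.
NGRAM_OFF_DEFAULTS = {
    "NGRAM_EVAL_ORDER": "0",
    "NGRAM_EVAL_ADAPTIVE": "0",
    "NGRAM_DIRICHLET": "0",
    "PHRASE_CACHE": "0",
    "REGIME_TRACKER": "0",
    "NGRAM_ENTROPY_SHIFT": "0",
    "TRIGRAM": "0",
}

for key, value in NGRAM_OFF_DEFAULTS.items():
    os.environ.setdefault(key, value)  # explicit exports take precedence
```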

## Baseline First (NIT-00)
Before every new injection, re-run a stable baseline with the exact same wallclock budget and seed policy.

Each baseline record should include:
- `step@cap`, `val_bpb@cap`
- `final_int6_roundtrip_exact`
- `final_int6_sliding_window_exact`
- `Serialized model int6+*` bytes
- step average ms

---

## Ordered Hypotheses (Low -> High Complexity)

| ID | Complexity | Hypothesis | Architecture Injection | Hardware Rationale | Success Gate | Kill Gate |
|---|---:|---|---|---|---|---|
| NIT-01 | 1 | Hopper shape locking improves throughput without quality loss. | Lock dims/head dims to tensor-core-friendly multiples; remove odd shapes in recurrent path. | Fewer kernel variants, better matmul occupancy/fusion. | >=8% faster step time, `val_bpb` delta <= +0.01 | <3% speed gain or `val_bpb` worse by >0.02 |
| NIT-02 | 2 | Loop-conditioned low-rank adapters fix shared-block regime mismatch (sketched after the table). | Shared core stays fixed, per-loop `W_k = W + A_k B_k` (small rank). | Keeps parameter compression while giving each loop a cheap specialization path. | Better `final_int6_sliding_window` by >=0.02 at <=15% artifact growth | No quality gain or artifact growth >20% |
| NIT-03 | 3 | Split sharing (shared attention, loop-specific MLP modulation) beats fully shared blocks. | Share attention weights; add tiny per-loop channel gates or low-rank MLP deltas. | Attention kernels stay reusable; cheap MLP modulation handles loop-specific distributions. | >=0.02 BPB improvement vs NIT-00 with <=20% slower step time | Regresses both speed and BPB |
| NIT-04 | 4 | Bucketed adaptive loop budget improves quality-per-compute. | Two static paths: short-loop and long-loop based on confidence bucket at sequence/window level. | Preserves static-ish execution while reducing unnecessary deep passes. | Same or better BPB with >=15% faster average step time | Control overhead removes speed gain |
| NIT-05 | 5 | Latent funnel recurrence dominates flat+bottleneck at same bytes. | Downsample sequence in bottleneck (`T -> T/2`), run recurrent core there, upsample back. | Shifts work to denser GEMMs and lowers KV bandwidth pressure. | >=0.03 BPB gain or >=20% speedup at comparable artifact size | Training instability or quality collapse |
| NIT-06 | 6 | Persistent memory tokens make recurrence actually cumulative. | Add small memory token bank carried across loops and rewritten each loop. | Small fixed memory adds global workspace without large parameter cost. | >=0.02 BPB gain over NIT-05 with <=10% speed hit | No measurable gain after two seeds |
| NIT-07 | 7 | Dual-rate recurrent superblock wins the frontier. | Heavy attention every 2 loops, lightweight update each loop (multi-rate core). | Cuts expensive attention frequency while keeping iterative refinement depth. | Better BPB and speed-vs-quality tradeoff than NIT-05/06 | Scheduling complexity causes compile/runtime fragility |
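
For NIT-02, a minimal PyTorch sketch of the adapter follows; the module name, rank, and initialization are illustrative. Zero-initializing `A_k` means every `W_k` equals the shared `W` at step 0, so baseline behavior is recovered exactly before training moves the adapters.

```python
import torch
import torch.nn as nn

class LoopAdaptedLinear(nn.Module):
    """NIT-02 sketch: shared weight plus a per-loop low-rank delta,
    effectively W_k = W + A_k @ B_k, giving each crawler loop a cheap
    specialization path at small artifact cost."""
    def __init__(self, dim: int, num_loops: int, rank: int = 8):
        super().__init__()
        self.shared = nn.Linear(dim, dim, bias=False)             # W, shared
        self.A = nn.Parameter(torch.zeros(num_loops, dim, rank))  # zero init
        self.B = nn.Parameter(torch.randn(num_loops, rank, dim) * dim**-0.5)

    def forward(self, x: torch.Tensor, loop_idx: int) -> torch.Tensor:
        delta = x @ self.A[loop_idx] @ self.B[loop_idx]  # per-loop correction
        return self.shared(x) + delta
```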

---

## Execution Order
1. NIT-00 baseline freeze
2. NIT-01 shape locking
3. NIT-02 low-rank loop adapters
4. NIT-03 split sharing
5. NIT-04 adaptive loop buckets
6. NIT-05 latent funnel
7. NIT-06 memory tokens
8. NIT-07 dual-rate superblock

## Notes
- Do not introduce ngram-dependent compensators while validating core architecture signal.
- Any candidate that wins only with ngram is considered unproven for Nitrust Phase 1.
51 changes: 51 additions & 0 deletions Nitrust/MEDUSA_TARGET.md
@@ -0,0 +1,51 @@
# Nitrust Target Lock — Crawler Mainline (Medusa Delta)
Date: 2026-03-29

## Baseline Contract
Optimization target is crawler-only mainline:
- Canonical launcher: `experiments/Crawler_Leg_1/run.sh`
- Compatibility alias: `experiments/Medusa/run.sh`

Baseline runtime knobs:
- `NGRAM_EVAL_ORDER=0`
- `USE_CRAWLER=1`
- `NUM_FLAT_LAYERS=4`
- `NUM_CRAWLER_LAYERS=1`
- `CRAWLER_LOOPS=4`
- `INST_DIM=32`
- `CRAWLER_QUANT_INT8=1`
- `DELTA_NET_HEADS=0`
- `SKIP_EMA=1`
- `SKIP_GPTQ=1`

## Confirmed Code Seams (train_gpt.py)
- Data shard parse/load: `load_data_shard` and `TokenStream`/`DistributedTokenLoader`.
- Training hot path: `DistributedTokenLoader.next_batch`.
- Eval hot path: `eval_val_sliding` window assembly loop.
- Export hot path: int6 pack/compress path near final export.
- EMA/GPTQ cut points: `SKIP_EMA` and `SKIP_GPTQ` gates in finalization section.

## Nitrust Compatibility Note
Crawler shard files are headered, not raw `u16`; they contain a 256x`i32` prefix:
- `magic=20240520`
- `version=1`
- `num_tokens` in header slot 2

`nitrust-mmap-loader` now supports this format with strict size checks,
while retaining raw `u16` fallback.
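
A minimal Python sketch of reading one headered shard, using only the layout above; the function name and error policy are illustrative, and the raw `u16` fallback is omitted.

```python
import numpy as np

def load_headered_shard(path: str) -> np.ndarray:
    """Parse a 256 x i32 header (magic, version, num_tokens in slot 2),
    then read exactly num_tokens uint16 token ids with a strict size check."""
    with open(path, "rb") as f:
        header = np.frombuffer(f.read(256 * 4), dtype=np.int32)
        if header[0] != 20240520:
            raise ValueError(f"{path}: bad magic {header[0]}")
        if header[1] != 1:
            raise ValueError(f"{path}: unsupported version {header[1]}")
        num_tokens = int(header[2])
        tokens = np.frombuffer(f.read(num_tokens * 2), dtype=np.uint16)
    if tokens.size != num_tokens:  # strict size check
        raise ValueError(f"{path}: expected {num_tokens} tokens, got {tokens.size}")
    return tokens
```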

## Complexity-Ordered Knockout Plan (NGRAM-Free)
1. NIT-A1: Swap shard reads to Rust mmap path (no model math changes).
2. NIT-A2: Rust batch assembly + pinned host buffer handoff.
3. NIT-B1: Rust sliding-window index builder for eval path.
4. NIT-C1: Rust quant/export pack pipeline.
5. NIT-D1: CUDA graph replay wrapper for fixed-shape training steps.

## Quarantine Rule
DeltaNet remains sandbox-only during this leg (`experiments/Medusa/run_delta_sandbox.sh`).

## Success Gates
- Primary: lower `step_avg` and higher tokens/sec at equal wallclock.
- Quality: `final_int6_roundtrip_exact` and `final_int6_sliding_window_exact`
within tolerance (`val_bpb` delta <= +0.01 unless explicitly traded for speed).
- Reproducibility: deterministic run logs with fixed seed.
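
The quality gate reduces to a one-sided tolerance check; a sketch, assuming per-seed BPB values compared against the frozen baseline:

```python
def passes_quality_gate(baseline_bpb: float, candidate_bpb: float,
                        tolerance: float = 0.01) -> bool:
    """Accept any improvement; reject a regression beyond the tolerance."""
    return candidate_bpb - baseline_bpb <= tolerance
```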