| 1 | +# Spec 046L — deploy-time quant repair (use eval headroom, bypass byte cap) |
| 2 | + |
| 3 | +**Slug:** `046L-deploy-time-quant-repair` |
| 4 | +**Created:** 2026-04-27 |
| 5 | +**Status:** DRAFT; needs ~50-100 lines of code (mostly wiring existing 046E/F) |
| 6 | +**Branch:** `exp/046-quant-repair` |
| 7 | +**Commit:** TBD (after code lands) |
| 8 | +**Parent:** `research/ideas/quant-repair-fundamentally-new.md` (Idea A — highest EV) |
| 9 | + |
| 10 | +## The fundamental shift |
| 11 | + |
| 12 | +Every quant-repair lever we tried (046A-K) was **artifact-time**: the repair must |
| 13 | +fit in the 16,000,000-byte cap. The SDClip wins (-0.00086 to -0.00216 BPB) all |
| 14 | +overflow that cap because each tightening step costs ~250 KB of artifact size. |
| 15 | + |
| 16 | +**This spec moves the repair to deploy-time**: it runs on the leaderboard hardware |
| 17 | +at eval time, using the **100-180s of unused eval budget** that PR #1797 leaves |
| 18 | +on the table (the PR uses 423-495s of the 600s cap). |
| 19 | + |
| 20 | +Deploy-time repair pays compute, not bytes. **Bypasses the byte-cap problem entirely.** |
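The headroom figure follows from the timings quoted above. A small sketch of that arithmetic, where the ~60s repair cost is this spec's own estimate (not a measurement) and the variable names are illustrative only:

```python
# Eval-time headroom left by PR #1797, from the numbers quoted in this spec.
EVAL_CAP_S = 600              # leaderboard eval cap
PR_EVAL_USED_S = (423, 495)   # fastest and slowest eval times quoted for the PR
REPAIR_COST_S = 60            # estimated repair cost (assumption, unmeasured)


def headroom(cap_s, used_range_s):
    """Return (worst_case, best_case) unused seconds under the cap."""
    fastest, slowest = used_range_s
    return cap_s - slowest, cap_s - fastest


worst_case, best_case = headroom(EVAL_CAP_S, PR_EVAL_USED_S)
margin_after_repair = worst_case - REPAIR_COST_S

print(worst_case, best_case)   # 105 177
print(margin_after_repair)     # 45
```

Under these numbers the slowest run keeps about 45s of slack after the repair step, so the ~60s estimate is worth confirming on the first arm.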
| 21 | + |
| 22 | +## Hypothesis |
| 23 | + |
| 24 | +After deserialize loads the quantized model, run a small fit step that updates |
| 25 | +the in-memory model's passthrough fp16 params (`attn_scale`, `mlp_scale`, |
| 26 | +`resid_mix`, etc.) to compensate for accumulated quant error. Use AR self-generated |
| 27 | +text as both calibration data AND optimization target (next-token CE loss). |
| 28 | + |
| 29 | +This is similar in spirit to TTT but: |
| 30 | +- Updates passthrough params (not LoRA adapters) |
| 31 | +- Uses AR-self-gen text (not val data — no leak) |
| 32 | +- Runs ONCE at deserialize, not per-document |
| 33 | +- Total deploy-time cost: ~60s (fits in 100-180s headroom) |
| 34 | + |
| 35 | +## Why this should work where 046E failed |
| 36 | + |
| 37 | +046E's `fit_passthrough_params_to_match_base` fit AT ARTIFACT TIME on training |
| 38 | +data, then DISCARDED the fitted values (in-memory only, didn't propagate to |
| 39 | +artifact). Result: cost +0.0004 BPB with no shipping value. |
| 40 | + |
| 41 | +Deploy-time fit: |
| 42 | +- Values applied to the CURRENTLY-RUNNING model |
| 43 | +- Directly impacts the eval being scored |
| 44 | +- TTT also updates the running model at eval time, and it recovers ~0.013 BPB |
| 45 | + |
| 46 | +Whether the small per-channel param fit can do anything similar to TTT's full |
| 47 | +LoRA fit is the open question. |
| 48 | + |
| 49 | +## Code components |
| 50 | + |
| 51 | +We already have most of the pieces from 046E + 046F: |
| 52 | + |
| 53 | +1. **`fit_passthrough_params_to_match_base`** (046E, in train_gpt.py at commit |
| 54 | + 381baf2) — currently fits to match `base_model.forward_logits`. Need to |
| 55 | + refactor: fit against AR-self-gen data + CE loss on next tokens. |
| 56 | + |
| 57 | +2. **`ARSelfGenCalibLoader`** (046F, same commit) — already generates AR |
| 58 | + samples without val data leak. Reuse directly. |
| 59 | + |
| 60 | +3. **New wiring** in `train_and_eval()` after `deserialize()`: |
| 61 | +   ```python |
| 62 | +   eval_model = deserialize(h, device) |
| 63 | +   if h.num_loops > 0: |
| 64 | +       eval_model.looping_active = True |
| 65 | +       eval_model.looping_depth = h.num_loops |
| 66 | + |
| 67 | +   # NEW: deploy-time quant repair |
| 68 | +   if h.deploy_time_repair_enabled: |
| 69 | +       repair_calib = generate_ar_calib(eval_model, h, n_batches=8, seq_len=512) |
| 70 | +       fit_passthrough_to_self_consistency(eval_model, repair_calib, h) |
| 71 | + |
| 72 | +   # then proceed to TTT + eval |
| 73 | +   ``` |
| 74 | + |
| 75 | +4. **New objective**: instead of MSE-vs-teacher, use cross-entropy on next-token |
| 76 | + prediction over AR-generated tokens. The model trained on next-token |
| 77 | + prediction; if quant noise degraded that, fitting on its own AR samples |
| 78 | + should recover. |
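One property of the CE-on-self-generated-tokens objective is worth checking before spending arms on it. When targets are sampled from the model itself at temperature 1.0 and the context is held fixed, the expected gradient of the CE loss with respect to the logits is zero (the score-function identity), so the update signal at the starting parameters is sampling noise. At other temperatures the expected gradient is nonzero and pulls the model toward the sampling distribution. The toy check below illustrates this on a single 4-way softmax in pure Python. It is a sketch under those assumptions, not code from `train_gpt.py`, and all names in it are illustrative.

```python
import math
import random


def softmax(logits, temp=1.0):
    """Numerically stable softmax of logits / temp."""
    scaled = [x / temp for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def mean_ce_grad(logits, sample_temp, n_samples, seed):
    """Average d(CE)/d(logits) over targets sampled from the model itself.

    For one target y the CE gradient w.r.t. the logits is p - onehot(y),
    where p = softmax(logits) is the temperature-1 distribution being fit.
    """
    rng = random.Random(seed)
    p = softmax(logits)                      # distribution being trained
    q = softmax(logits, temp=sample_temp)    # distribution used to sample
    counts = [0] * len(logits)
    for _ in range(n_samples):
        y = rng.choices(range(len(logits)), weights=q)[0]
        counts[y] += 1
    return [p_i - c / n_samples for p_i, c in zip(p, counts)]


logits = [2.0, 1.0, 0.0, -1.0]
grad_t1 = mean_ce_grad(logits, sample_temp=1.0, n_samples=20000, seed=0)
grad_t05 = mean_ce_grad(logits, sample_temp=0.5, n_samples=20000, seed=0)

print([round(g, 3) for g in grad_t1])    # all entries close to 0
print([round(g, 3) for g in grad_t05])   # first entry clearly negative
```

If this carries over to the full model, `DEPLOY_TIME_REPAIR_AR_TEMP=1.0` may give a near-zero expected signal, which would be one reason to try a lower temperature in the sweep.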
| 79 | + |
| 80 | +## Env vars |
| 81 | + |
| 82 | +```bash |
| 83 | +DEPLOY_TIME_REPAIR_ENABLED=1 # gate (default 0) |
| 84 | +DEPLOY_TIME_REPAIR_ITERS=5 # AdamW iter count |
| 85 | +DEPLOY_TIME_REPAIR_LR=1e-3 # learning rate |
| 86 | +DEPLOY_TIME_REPAIR_BATCHES=8 # batches per iter |
| 87 | +DEPLOY_TIME_REPAIR_AR_SEQ_LEN=512 # AR generation length |
| 88 | +DEPLOY_TIME_REPAIR_AR_TEMP=1.0 # AR sampling temperature |
| 89 | +``` |
| 90 | + |
| 91 | +Default OFF so existing runs are unaffected. |
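A minimal sketch of how these variables could be read into a typed config. The `DeployRepairConfig` dataclass and `load_repair_config` helper are hypothetical names, not existing code in `train_gpt.py`; only the variable names and defaults come from the list above.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class DeployRepairConfig:
    """Hypothetical container for the DEPLOY_TIME_REPAIR_* settings."""
    enabled: bool = False
    iters: int = 5
    lr: float = 1e-3
    batches: int = 8
    ar_seq_len: int = 512
    ar_temp: float = 1.0


def load_repair_config(env=None):
    """Build the config from an environment mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    prefix = "DEPLOY_TIME_REPAIR_"
    return DeployRepairConfig(
        enabled=env.get(prefix + "ENABLED", "0") == "1",
        iters=int(env.get(prefix + "ITERS", "5")),
        lr=float(env.get(prefix + "LR", "1e-3")),
        batches=int(env.get(prefix + "BATCHES", "8")),
        ar_seq_len=int(env.get(prefix + "AR_SEQ_LEN", "512")),
        ar_temp=float(env.get(prefix + "AR_TEMP", "1.0")),
    )


# With no variables set, the gate stays off and the defaults apply.
default_cfg = load_repair_config({})
# An arm overrides only what it changes.
arm_cfg = load_repair_config({"DEPLOY_TIME_REPAIR_ENABLED": "1",
                              "DEPLOY_TIME_REPAIR_LR": "3e-4"})
```

Passing the mapping explicitly keeps the parser testable without touching the process environment.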
| 92 | + |
| 93 | +## Rules legality (CRITICAL — verify before committing) |
| 94 | + |
| 95 | +Deploy-time techniques are a gray area. Need to verify in the challenge rules: |
| 96 | + |
| 97 | +1. **Is updating model parameters at eval time allowed?** TTT does this clearly. This is similar in spirit but updates a different set of params. |
| 98 | +2. **AR-self-gen calibration data**: should be unambiguously legal (model generates from BOS, no val data touched) |
| 99 | +3. **Total eval time**: must fit in 600s budget. Repair cost ~60s + TTT ~300-450s + final eval ~30s = ~390-540s. Fits. |
| 100 | +4. **Determinism**: AR-gen with fixed seed should be deterministic. Verify. |
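The determinism check in item 4 can be phrased as: generate twice with the same seed and compare the token sequences. The sketch below does this for a stand-in sampler in pure Python. `toy_ar_generate` is hypothetical and only mimics seeded temperature sampling, so a real check would call the actual AR generation path on the target hardware, where GPU kernels can introduce nondeterminism that this toy does not capture.

```python
import math
import random


def toy_ar_generate(seed, seq_len, vocab_size=16, temp=1.0):
    """Stand-in for AR generation: seeded temperature sampling, token by token.

    The "model" is a fixed toy rule that makes the logits depend on the
    previous token; it is NOT the real network.
    """
    rng = random.Random(seed)
    tokens = [0]  # start from a BOS-like token id
    for _ in range(seq_len):
        prev = tokens[-1]
        logits = [-abs(v - (prev + 1) % vocab_size) / temp
                  for v in range(vocab_size)]
        m = max(logits)
        weights = [math.exp(x - m) for x in logits]
        tokens.append(rng.choices(range(vocab_size), weights=weights)[0])
    return tokens


def is_deterministic(generate, seed, **kwargs):
    """True if two runs with the same seed give identical token sequences."""
    return generate(seed, **kwargs) == generate(seed, **kwargs)


same_seed_ok = is_deterministic(toy_ar_generate, seed=1234, seq_len=512)
print(same_seed_ok)  # True
```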
| 101 | + |
| 102 | +**Action item before code work**: read `records/track_10min_16mb/README.md` and |
| 103 | +the official challenge rules for what's legal at eval time. If unclear, ask in |
| 104 | +the openai/parameter-golf discussions. |
| 105 | + |
| 106 | +## Arms (after code lands) |
| 107 | + |
| 108 | +### Phase 1 — sanity / cost check (1 arm) |
| 109 | + |
| 110 | +| Arm | Config | Tests | |
| 111 | +|---|---|---| |
| 112 | +| **046L-baseline** | DEPLOY_TIME_REPAIR_ENABLED=1, defaults (5 iter × 8 batches × 512 seq) | does it run cleanly? does val_bpb improve at all? | |
| 113 | + |
| 114 | +### Phase 2 — sweep iters/lr (3 arms) |
| 115 | + |
| 116 | +| Arm | Iters | LR | Batches | |
| 117 | +|---|---|---|---| |
| 118 | +| **046L-iters3-lr1e3** | 3 | 1e-3 | 8 | |
| 119 | +| **046L-iters5-lr3e4** | 5 | 3e-4 | 8 | |
| 120 | +| **046L-iters10-lr1e3** | 10 | 1e-3 | 16 | |
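A sketch of how the Phase 2 arms could be turned into env overrides for a launcher. The `PHASE2_ARMS` table mirrors the rows above; the `arm_env` helper is hypothetical, and how the overrides reach the actual run script is left open.

```python
# Phase 2 arms, copied from the table in this spec.
PHASE2_ARMS = {
    "046L-iters3-lr1e3":  {"iters": 3,  "lr": "1e-3", "batches": 8},
    "046L-iters5-lr3e4":  {"iters": 5,  "lr": "3e-4", "batches": 8},
    "046L-iters10-lr1e3": {"iters": 10, "lr": "1e-3", "batches": 16},
}


def arm_env(name):
    """Return the DEPLOY_TIME_REPAIR_* overrides for one arm as strings."""
    arm = PHASE2_ARMS[name]
    return {
        "DEPLOY_TIME_REPAIR_ENABLED": "1",
        "DEPLOY_TIME_REPAIR_ITERS": str(arm["iters"]),
        "DEPLOY_TIME_REPAIR_LR": arm["lr"],
        "DEPLOY_TIME_REPAIR_BATCHES": str(arm["batches"]),
    }


for name in PHASE2_ARMS:
    settings = " ".join(f"{k}={v}" for k, v in arm_env(name).items())
    print(f"{name}: {settings}")
```

Note that the third arm changes both iters and batches relative to the defaults, so a difference against the baseline cannot be attributed to either knob alone.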
| 121 | + |
| 122 | +### Phase 3 — combine with SDClip win (1 arm) |
| 123 | + |
| 124 | +If 046L-baseline shows a real BPB improvement, deploy-time repair could be |
| 125 | +stacked with an OVER-CAP SDClip variant, but only once bytes are saved elsewhere |
| 126 | +to bring that variant under the cap. Until those byte savings exist, just test: |
| 127 | + |
| 128 | +| Arm | Config | |
| 129 | +|---|---| |
| 130 | +| **046L-on-baseline-stack** | deploy-time repair on current legal baseline | |
| 131 | + |
| 132 | +Total: 5 arms, ~$5, ~30 min wallclock. |
| 133 | + |
| 134 | +## Acceptance |
| 135 | + |
| 136 | +Reference = 046 verification (1.07467 quantized, no deploy repair). |
| 137 | + |
| 138 | +Per arm: |
| 139 | +- **Strong win**: quantized < 1.0735 (-0.0012) |
| 140 | +- **Win**: quantized < 1.0739 (-0.0008) |
| 141 | +- **Marginal**: 1.0739–1.0746 |
| 142 | +- **Null**: 1.0746–1.0750 |
| 143 | +- **Hurts**: > 1.0750 |
| 144 | +- **Bug**: NaN or fails to deserialize |
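The bands above can be applied mechanically when reading results. A minimal sketch, assuming the lower bound of each band is inclusive and that exactly 1.0750 counts as null; the spec does not say how shared boundary values are assigned, and `classify_arm` is a hypothetical helper:

```python
import math


def classify_arm(val_bpb):
    """Map a quantized val_bpb to the acceptance label used in this spec.

    Pass None when the run failed to deserialize or produced no score.
    Boundary handling is an assumption, see the note above.
    """
    if val_bpb is None or math.isnan(val_bpb):
        return "bug"
    if val_bpb < 1.0735:
        return "strong win"
    if val_bpb < 1.0739:
        return "win"
    if val_bpb < 1.0746:
        return "marginal"
    if val_bpb <= 1.0750:
        return "null"
    return "hurts"


# The 046 reference itself lands in the null band, as expected.
print(classify_arm(1.07467))  # null
```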
| 145 | + |
| 146 | +Eval time check: `eval_time` log line MUST stay < 600s on the leaderboard simulation. |
| 147 | + |
| 148 | +## Risk assessment |
| 149 | + |
| 150 | +- **Rules legality**: medium — could be ruled out, kills direction |
| 151 | +- **Implementation**: low-medium — most code exists in 046E/F; mainly refactoring |
| 152 | +- **Outcome uncertainty**: medium-high — 046E artifact-time was negative; deploy-time MIGHT also be null if the passthrough params genuinely can't compensate for matrix quant error |
| 153 | +- **Eval-time budget overflow**: low — 60s addition fits in 100-180s headroom |
| 154 | + |
| 155 | +## What success looks like |
| 156 | + |
| 157 | +If 046L-baseline gives a net win of more than 0.0005 BPB, this is a paradigm |
| 158 | +unlock: |
| 159 | +- Future submissions can spend ~60s of eval time for free BPB |
| 160 | +- Combined with eventual SDClip+byte-saving wins, could be substantial |
| 161 | +- New direction worth deeper exploration (different objectives, more iters, |
| 162 | + different param sets) |
| 163 | + |
| 164 | +If 046L-baseline gives ~null: |
| 165 | +- Suggests passthrough param capacity is too small to meaningfully repair |
| 166 | + quant error (matches 046E artifact-time finding) |
| 167 | +- Direction closes: deploy-time only helps if you have MORE expressive |
| 168 | + params to update (e.g., LoRA, which is what TTT does) |
| 169 | +- We then have strong evidence, from both artifact-time and deploy-time, that |
| 170 | + quant repair via passthrough params is a dead direction |
| 171 | + |
| 172 | +## Cost summary |
| 173 | + |
| 174 | +- Code: ~30-60 min (refactoring existing 046E/F into eval path + new objective) |
| 175 | +- Test: ~$5 across 5 arms |
| 176 | +- Total time: ~1 hour engineering + ~30 min wallclock |
| 177 | + |
| 178 | +## Decision tree |
| 179 | + |
| 180 | +| Outcome | Next | |
| 181 | +|---|---| |
| 182 | +| Rules-illegal | Direction closed; pivot to other fundamentally-new ideas (B, C, E from quant-repair-fundamentally-new.md) | |
| 183 | +| 046L-baseline improves by ≥ 0.001 BPB | Sweep params (iters, LR); explore combining with TTT | |
| 184 | +| 046L-baseline ~null | Try a richer param set (e.g., a tiny LoRA-style adapter); else close direction | |
| 185 | +| 046L hurts | Bug or fundamental incompatibility; investigate | |
| 186 | + |
| 187 | +## Pre-requisite checklist |
| 188 | + |
| 189 | +- [ ] Verify rules-legality of deploy-time param updates (beyond TTT's LoRA) |
| 190 | +- [ ] Read challenge README for any restrictions |
| 191 | +- [ ] Confirm AR-self-gen calib qualifies as no-val-leak |
| 192 | +- [ ] Plan code: refactor `fit_passthrough_params_to_match_base` to use CE-on-AR objective |
| 193 | +- [ ] Verify deserialize leaves passthrough params trainable (currently they're loaded as fp16 buffers; need to mark as Parameter) |