Commit 331139b

spec 046L: deploy-time quant repair using eval headroom
Bypasses the 16,000,000 byte cap by running quant repair AT DEPLOY TIME on the leaderboard hardware, using PR openai#1797's unused 100-180s of eval budget.

Mechanism:
1. After deserialize, generate AR self-gen calib (~30s, reuse 046F code)
2. Fit passthrough fp16 params (attn_scale, mlp_scale, resid_mix, ...) to minimize next-token CE loss on AR samples (~30s, refactor 046E code)
3. Then run TTT + final eval as normal

Total eval-time addition: ~60s. Fits in 100-180s headroom.

Code reuses 80% of existing 046E + 046F infrastructure (commit 381baf2). Estimated ~50-100 lines new code (mostly wiring + new CE objective).

5 arms across 3 phases, ~$5 total, ~30 min wallclock to validate.

CRITICAL pre-req: verify rules-legality of deploy-time param updates beyond TTT's LoRA pattern. Spec includes pre-requisite checklist.

If this works, paradigm unlock: future submissions get free BPB from idle eval-time compute, bypassing the byte-cap entirely.
1 parent d3a2e43 commit 331139b

1 file changed

Lines changed: 193 additions & 0 deletions

@@ -0,0 +1,193 @@
# Spec 046L: deploy-time quant repair (use eval headroom, bypass byte cap)

**Slug:** `046L-deploy-time-quant-repair`
**Created:** 2026-04-27
**Status:** DRAFT, needs ~50-100 lines code (mostly wiring existing 046E/F)
**Branch:** `exp/046-quant-repair`
**Commit:** TBD (after code lands)
**Parent:** `research/ideas/quant-repair-fundamentally-new.md` (Idea A, highest EV)

## The fundamental shift

Every quant-repair lever we tried (046A-K) was **artifact-time**: the repair must
fit in the 16,000,000 byte cap. SDClip wins (-0.00086 to -0.00216 BPB) all
overflow the cap because each tightening step costs ~250 KB of artifact size.

**This spec moves the repair to deploy-time**: it runs on the leaderboard hardware
at eval time, using the **100-180s of unused eval budget** that PR #1797 leaves
on the table (PR uses 423-495s of 600s cap).

Deploy-time repair pays compute, not bytes. **Bypasses the byte-cap problem entirely.**

## Hypothesis

After deserialize loads the quantized model, run a small fit step that updates
the in-memory model's passthrough fp16 params (`attn_scale`, `mlp_scale`,
`resid_mix`, etc.) to compensate for accumulated quant error. Use AR self-generated
text as both calibration data AND optimization target (next-token CE loss).

This is similar in spirit to TTT but:
- Updates passthrough params (not LoRA adapters)
- Uses AR-self-gen text (not val data, so no leak)
- Runs ONCE at deserialize, not per-document
- Total deploy-time cost: ~60s (fits in 100-180s headroom)

## Why this should work where 046E failed

046E's `fit_passthrough_params_to_match_base` fit AT ARTIFACT TIME on training
data, then DISCARDED the fitted values (in-memory only, didn't propagate to
artifact). Result: cost +0.0004 BPB with no shipping value.

Deploy-time fit:
- Values applied to the CURRENTLY-RUNNING model
- Directly impacts the eval being scored
- TTT works exactly this way and recovers ~0.013 BPB (a delta of about -0.013)

Whether the small per-channel param fit can do anything similar to TTT's full
LoRA fit is the open question.

## Code components

We already have most of the pieces from 046E + 046F:

1. **`fit_passthrough_params_to_match_base`** (046E, in train_gpt.py at commit
   381baf2): currently fits to match `base_model.forward_logits`. Need to
   refactor: fit against AR-self-gen data + CE loss on next tokens.

2. **`ARSelfGenCalibLoader`** (046F, same commit): already generates AR
   samples without val data leak. Reuse directly.

3. **New wiring** in `train_and_eval()` after `deserialize()`:
   ```python
   eval_model = deserialize(h, device)
   if h.num_loops > 0:
       eval_model.looping_active = True
       eval_model.looping_depth = h.num_loops

   # NEW: deploy-time quant repair
   if h.deploy_time_repair_enabled:
       # 8 and 512 mirror the DEPLOY_TIME_REPAIR_BATCHES / _AR_SEQ_LEN defaults
       repair_calib = generate_ar_calib(eval_model, h, n_batches=8, seq_len=512)
       fit_passthrough_to_self_consistency(eval_model, repair_calib, h)

   # then proceed to TTT + eval
   ```

4. **New objective**: instead of MSE-vs-teacher, use cross-entropy on next-token
   prediction over AR-generated tokens. The model was trained on next-token
   prediction; if quant noise degraded that, fitting on its own AR samples
   is expected to recover some of the loss.
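
One property of the objective in item 4 is worth checking on paper before spending compute. The derivation below is a sketch under assumptions the spec does not state explicitly: the AR samples $x = x_{1:T}$ are drawn at temperature 1.0 from the same quantized model that is being fit, $\theta$ denotes the passthrough params, and $\theta_0$ denotes their values right after deserialize.

```latex
% Expected next-token CE on sequences sampled from the model itself
L(\theta)
  = -\,\mathbb{E}_{x \sim p_{\theta_0}}\!\left[\sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t})\right]
  = H(p_{\theta_0}) + \mathrm{KL}\!\left(p_{\theta_0} \,\|\, p_\theta\right)

% Gradient at the starting point (score-function identity)
\nabla_\theta L(\theta)\big|_{\theta=\theta_0}
  = -\sum_{x} p_{\theta_0}(x)\,\nabla_\theta \log p_\theta(x)\big|_{\theta_0}
  = -\nabla_\theta \sum_{x} p_\theta(x)\big|_{\theta_0}
  = -\nabla_\theta 1
  = 0
```

Under those assumptions the loaded parameters already minimize the expected loss, so updates from a finite batch would be driven by sampling noise rather than by quant error. If that reading of the setup is right, the repair signal has to come from elsewhere, for example a sampling temperature other than 1.0. This is a point to confirm against the 046F loader, not a finding.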

## Env vars

```bash
DEPLOY_TIME_REPAIR_ENABLED=1        # gate (default 0)
DEPLOY_TIME_REPAIR_ITERS=5          # AdamW iter count
DEPLOY_TIME_REPAIR_LR=1e-3          # learning rate
DEPLOY_TIME_REPAIR_BATCHES=8        # batches per iter
DEPLOY_TIME_REPAIR_AR_SEQ_LEN=512   # AR generation length
DEPLOY_TIME_REPAIR_AR_TEMP=1.0      # AR sampling temperature
```

Default OFF so existing runs are unaffected.
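
A minimal sketch of how these variables could be read into one config object. `DeployRepairConfig` and `load_deploy_repair_config` are hypothetical names, not existing code in `train_gpt.py`; the only assumptions about parsing are that the gate is on when the variable equals `"1"` and that the defaults are the values listed in the env var block.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class DeployRepairConfig:
    """Hypothetical container for the DEPLOY_TIME_REPAIR_* settings."""
    enabled: bool = False      # gate, OFF by default
    iters: int = 5             # AdamW iteration count
    lr: float = 1e-3           # learning rate
    batches: int = 8           # batches per iteration
    ar_seq_len: int = 512      # AR generation length
    ar_temp: float = 1.0       # AR sampling temperature


def load_deploy_repair_config(env=None) -> DeployRepairConfig:
    """Read DEPLOY_TIME_REPAIR_* from `env` (defaults to os.environ)."""
    env = os.environ if env is None else env
    d = DeployRepairConfig()  # source of default values
    prefix = "DEPLOY_TIME_REPAIR_"
    return DeployRepairConfig(
        enabled=env.get(prefix + "ENABLED", "0") == "1",
        iters=int(env.get(prefix + "ITERS", d.iters)),
        lr=float(env.get(prefix + "LR", d.lr)),
        batches=int(env.get(prefix + "BATCHES", d.batches)),
        ar_seq_len=int(env.get(prefix + "AR_SEQ_LEN", d.ar_seq_len)),
        ar_temp=float(env.get(prefix + "AR_TEMP", d.ar_temp)),
    )


# Example: enable the repair and override only the learning rate
cfg = load_deploy_repair_config({"DEPLOY_TIME_REPAIR_ENABLED": "1",
                                 "DEPLOY_TIME_REPAIR_LR": "3e-4"})
print(cfg)
```

Passing a plain dict instead of `os.environ` keeps the parsing testable without touching the process environment.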

## Rules legality (CRITICAL: verify before committing)

Deploy-time techniques are a gray area. Need to verify in the challenge rules:

1. **Is updating model parameters at eval time allowed?** TTT clearly does this. Deploy-time repair is similar in spirit but updates a different set of params.
2. **AR-self-gen calibration data**: should be unambiguously legal (model generates from BOS, no val data touched).
3. **Total eval time**: must fit in 600s budget. Repair cost ~60s + TTT ~300-450s + final eval ~30s = ~390-540s. Fits.
4. **Determinism**: AR-gen with fixed seed should be deterministic. Verify.
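
The budget arithmetic in item 3 can be checked with a few lines. This is only a sketch that re-adds the spec's own estimates; the second calculation, which adds ~60s to the measured 423-495s range quoted for PR #1797, is an extra cross-check and an assumption about how the two figures combine.

```python
EVAL_CAP_S = 600  # eval-time cap in seconds, from item 3

# (low, high) per-stage estimates in seconds, copied from item 3
STAGES = {"deploy_repair": (60, 60), "ttt": (300, 450), "final_eval": (30, 30)}


def total_range(stages):
    """Sum the low and high estimates across all stages."""
    low = sum(lo for lo, _ in stages.values())
    high = sum(hi for _, hi in stages.values())
    return low, high


low, high = total_range(STAGES)
worst_case_slack = EVAL_CAP_S - high

# Cross-check: measured PR #1797 range (423-495s) plus ~60s of repair
pr_low, pr_high = 423 + 60, 495 + 60

print(f"estimate: {low}-{high}s, worst-case slack {worst_case_slack}s")
print(f"PR-based: {pr_low}-{pr_high}s, worst-case slack {EVAL_CAP_S - pr_high}s")
```

Both views stay under the cap, but the PR-based worst case leaves only 45s, so the ~60s repair estimate has little room to grow.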

**Action item before code work**: read `records/track_10min_16mb/README.md` and
the official challenge rules for what's legal at eval time. If unclear, ask in
the openai/parameter-golf discussions.

## Arms (after code lands)

### Phase 1: sanity / cost check (1 arm)

| Arm | Config | Tests |
|---|---|---|
| **046L-baseline** | DEPLOY_TIME_REPAIR_ENABLED=1, defaults (5 iter × 8 batches × 512 seq) | does it run cleanly? does val_bpb improve at all? |

### Phase 2: sweep iters/lr/batches (3 arms)

| Arm | Iters | LR | Batches |
|---|---|---|---|
| **046L-iters3-lr1e3** | 3 | 1e-3 | 8 |
| **046L-iters5-lr3e4** | 5 | 3e-4 | 8 |
| **046L-iters10-lr1e3** | 10 | 1e-3 | 16 |
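
The Phase 2 table maps directly onto the env vars, so the arms can be generated rather than typed by hand. A small sketch follows; the `launch` string is a placeholder, since the spec does not give the real launch command, and `arm_env` / `arm_command` are hypothetical helper names.

```python
import shlex

# Phase 2 arms from the table above: name -> (iters, lr, batches)
PHASE2_ARMS = {
    "046L-iters3-lr1e3": (3, "1e-3", 8),
    "046L-iters5-lr3e4": (5, "3e-4", 8),
    "046L-iters10-lr1e3": (10, "1e-3", 16),
}


def arm_env(iters, lr, batches):
    """Build the env var overrides for one arm."""
    return {
        "DEPLOY_TIME_REPAIR_ENABLED": "1",
        "DEPLOY_TIME_REPAIR_ITERS": str(iters),
        "DEPLOY_TIME_REPAIR_LR": lr,
        "DEPLOY_TIME_REPAIR_BATCHES": str(batches),
    }


def arm_command(name, launch="python train_gpt.py"):
    """Render one arm as a shell command line (launch is a placeholder)."""
    env = arm_env(*PHASE2_ARMS[name])
    prefix = " ".join(f"{k}={shlex.quote(v)}" for k, v in env.items())
    return f"{prefix} {launch}"


for name in PHASE2_ARMS:
    print(name, "->", arm_command(name))
```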

### Phase 3: combine with SDClip win (1 arm)

If the 046L baseline shows a real BPB improvement at deploy time, deploy-time
repair could compensate for an OVER-CAP SDClip variant, provided we can also
save bytes elsewhere. Until those byte savings exist, just test:

| Arm | Config |
|---|---|
| **046L-on-baseline-stack** | deploy-time repair on current legal baseline |

Total: 5 arms, ~$5, ~30 min wallclock.

## Acceptance

Reference = 046 verification (1.07467 quantized, no deploy repair).

Per arm:
- **Strong win**: quantized < 1.0735 (-0.0012)
- **Win**: quantized < 1.0739 (-0.0008)
- **Marginal**: 1.0739–1.0746
- **Null**: 1.0746–1.0750
- **Hurts**: > 1.0750
- **Bug**: NaN or fails to deserialize

Eval time check: `eval_time` log line MUST stay < 600s on the leaderboard simulation.
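
The acceptance buckets can be turned into a small classifier so every arm is labeled the same way. This is a sketch with one assumption the list above leaves open: a value that lands exactly on a shared boundary (1.0739, 1.0746, 1.0750) is assigned to the better of the two adjacent buckets. `classify_arm` is a hypothetical helper name.

```python
import math


def classify_arm(quantized_bpb):
    """Map a quantized val_bpb to the acceptance buckets listed above."""
    if quantized_bpb is None or math.isnan(quantized_bpb):
        return "bug"  # NaN, or the run failed to deserialize
    if quantized_bpb < 1.0735:
        return "strong win"
    if quantized_bpb < 1.0739:
        return "win"
    if quantized_bpb <= 1.0746:
        return "marginal"  # assumption: boundary values go to the better bucket
    if quantized_bpb <= 1.0750:
        return "null"
    return "hurts"


# The reference run itself (1.07467) should land in the null bucket
print(classify_arm(1.07467))
```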

## Risk assessment

- **Rules legality**: medium. Could be ruled out, which kills the direction
- **Implementation**: low-medium. Most code exists in 046E/F; mainly refactoring
- **Outcome uncertainty**: medium-high. 046E artifact-time was negative; deploy-time MIGHT also be null if the passthrough params genuinely can't compensate for matrix quant error
- **Eval-time budget overflow**: low. 60s addition fits in 100-180s headroom

## What success looks like

If 046L-baseline gives a net positive (> 0.0005 BPB improvement), this is a paradigm
unlock:
- Future submissions can spend ~60s of eval time for free BPB
- Combined with eventual SDClip+byte-saving wins, could be substantial
- New direction worth deeper exploration (different objectives, more iters,
  different param sets)

If 046L-baseline gives ~null:
- Confirms passthrough param capacity is too small to meaningfully repair
  quant error (matches 046E artifact-time finding)
- Direction closes: deploy-time only helps if you have MORE expressive
  params to update (e.g., LoRA, which is what TTT does)
- We then have evidence from both artifact-time and deploy-time that quant
  repair via passthrough params is a dead direction

## Cost summary

- Code: ~30-60 min (refactoring existing 046E/F into eval path + new objective)
- Test: ~$5 across 5 arms
- Total time: ~1 hour engineering + ~30 min wallclock

## Decision tree

| Outcome | Next |
|---|---|
| Rules-illegal | Direction closed; pivot to other fundamentally-new ideas (B, C, E from quant-repair-fundamentally-new.md) |
| 046L-baseline improves BPB by ≥ 0.001 | Sweep params (iters, LR); explore combining with TTT |
| 046L-baseline ~null | Try a richer param set (e.g., a tiny LoRA-style adapter); else close direction |
| 046L hurts | Bug or fundamental incompatibility; investigate |

## Pre-requisite checklist

- [ ] Verify rules-legality of deploy-time param updates (beyond TTT's LoRA)
- [ ] Read challenge README for any restrictions
- [ ] Confirm AR-self-gen calib qualifies as no-val-leak
- [ ] Plan code: refactor `fit_passthrough_params_to_match_base` to use CE-on-AR objective
- [ ] Verify deserialize leaves passthrough params trainable (currently they're loaded as fp16 buffers; need to mark as Parameter)
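
For the last checklist item, one possible way to turn fp16 buffers into trainable parameters in PyTorch is sketched below. `promote_buffers_to_parameters` and `ToyBlock` are hypothetical and do not come from `train_gpt.py`; matching buffers by name suffix and storing the new parameters in fp32 are both assumptions made for illustration.

```python
import torch
from torch import nn


def promote_buffers_to_parameters(model: nn.Module, name_suffixes) -> list:
    """Re-register matching buffers as trainable fp32 nn.Parameters.

    Returns the new parameters, e.g. to hand to an optimizer.
    """
    promoted = []
    for module in model.modules():
        # Snapshot the items: the module's buffers are mutated inside the loop.
        for name, buf in list(module.named_buffers(recurse=False)):
            if not name.endswith(tuple(name_suffixes)):
                continue
            delattr(module, name)  # drop the buffer entry first
            # fp32 copy is an assumption, chosen to keep optimizer state stable
            param = nn.Parameter(buf.detach().clone().float(), requires_grad=True)
            module.register_parameter(name, param)
            promoted.append(param)
    return promoted


class ToyBlock(nn.Module):  # stand-in for a real block
    def __init__(self, dim=4):
        super().__init__()
        self.register_buffer("attn_scale", torch.ones(dim, dtype=torch.float16))
        self.register_buffer("mlp_scale", torch.ones(dim, dtype=torch.float16))
        self.register_buffer("rope_cache", torch.zeros(dim))  # stays a buffer


block = ToyBlock()
params = promote_buffers_to_parameters(block, ["attn_scale", "mlp_scale"])
print(sorted(n for n, _ in block.named_parameters()))
```

If the rest of the model runs in fp16, the forward pass may need an explicit cast where these fp32 parameters are applied; whether that is needed depends on the real model code.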
