[non-record / wishlist] E2E TTT — full-model SGD per chunk, val_bpb 1.07063, demonstrates "healing property"#1837
Conversation
@X-Abhishek-X — great work on the healing-property observation and the distributed lockstep gradient sync. This is exactly the E2E TTT framing that makes the wishlist item concrete. We just shipped a complementary implementation in #2059 (feat/golden-sunflowers-jepa-universal-nta): an env-var-gated … There are a few open questions where your 1.07063 result is the only public data point we know of.
We're building a multi-seed sweep fleet (Railway × 8 accounts, Neon embargo ledger, 3-seed mean gates at BPB < 1.85 and < 1.50). If you'd like to co-run your checkpoint through the same pipeline and get a reproducible 3-seed mean plus a Zenodo DOI for the healing-property result, we're open to it. For the inner-LR band, there's a Coq.Reals theorem …
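For concreteness, a gate of the kind described above might look like the sketch below. `passes_gate` is a hypothetical helper name of mine, not code from the actual fleet; only the "mean over exactly three seeds, below a BPB threshold" rule comes from the description above.

```python
def passes_gate(seed_bpbs, threshold):
    """Hypothetical 3-seed mean gate: a run clears a tier only if the
    mean val_bpb over exactly three seeds is below the tier threshold."""
    if len(seed_bpbs) != 3:
        raise ValueError("gate is defined over exactly three seeds")
    return sum(seed_bpbs) / 3 < threshold
```

Under this reading, a run like the 1.07063 result here would clear both the < 1.85 and the < 1.50 tiers, provided two more seeds land nearby.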
## Summary
This is a non-record / wishlist submission addressing the openai/parameter-golf README §Requests for PRs item: "State-space models, E2E TTT, super long context for evaluation or training".
It adds a working full-model E2E TTT implementation with distributed lockstep gradient synchronization, demonstrating the wishlist item end-to-end on top of my existing record submission, PR #1695.
## Result
val_bpb **1.07063** (full-model E2E TTT over the SpinQuant + GPTQ checkpoint).
## My original contributions in this submission
- `eval_val_e2e_ttt` + `_select_e2e_ttt_params` (in `train_gpt.py`) — full-model SGD per chunk; generalizes chunk-LoRA Phased TTT (PR #1695, "[Record] Stage 3 + SpinQuant V1 + MP-SGD-TTT — val_bpb 1.0759") to the full parameter set
- Distributed lockstep gradient sync — `all_reduce(MEAN)` across all 8 ranks before `optimizer.step`, ensuring every rank's E2E TTT trajectory stays byte-identical
### Key observation — "healing property"
SpinQuant + GPTQ degraded the post-quant model from a pre-quant val_bpb of 1.07125 to 6.47968 (a 5.4 BPB regression — model is essentially broken on cold inference). E2E TTT recovered it to 1.07063 within the eval window — fully healing the quantization damage and slightly exceeding the pre-quant ceiling.
This suggests aggressive quantization may be more recoverable than commonly assumed when paired with full-model TTT. Worth further investigation as a wishlist research direction.
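The mechanism involved — score each chunk first, then take one full-parameter SGD step on it, with per-rank gradients mean-reduced before the optimizer step — can be sketched on a toy linear model. This is an illustrative reconstruction, not the actual `eval_val_e2e_ttt` code: NumPy stands in for the model, and the shard loop simulates `all_reduce(MEAN)` across 8 ranks in a single process.

```python
import numpy as np

def e2e_ttt_sketch(chunks, w0, lr=0.1, n_ranks=8):
    """Toy full-model E2E TTT loop on a linear model x @ w ~ y.

    Score-first: each chunk is scored with the current weights, then the
    model takes one full-parameter SGD step on that chunk. Per-rank shard
    gradients are averaged (simulating all_reduce(MEAN)) before the step,
    so every rank would follow one identical trajectory.
    """
    w = w0.copy()
    chunk_losses = []
    for x, y in chunks:
        chunk_losses.append(float(np.mean((x @ w - y) ** 2)))  # score first
        grads = [
            2.0 * sx.T @ (sx @ w - sy) / len(sx)               # one shard's gradient
            for sx, sy in zip(np.array_split(x, n_ranks),
                              np.array_split(y, n_ranks))
        ]
        w -= lr * np.mean(grads, axis=0)                       # mean-reduce, then SGD step
    return w, chunk_losses
```

On a stream generated from a fixed linear map, later chunks score better than earlier ones — the same per-chunk adaptation effect the healing observation relies on, just without quantization damage in the picture.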
## Concurrent / related work — @taka6745 #1818
@taka6745's concurrent PR #1818 ("Post-Quantization Damage Gap — 11L GPTQ Int6 + Curriculum + Sliding TTT") characterizes a related effect from a different angle: GPTQ-int6 → pre-quant 1.1009, post-quant 3.4620, post-TTT 2.7663 (3-seed). Their submission documents partial TTT recovery (~30 % of the damage gap closed by sliding-window TTT) on a smaller initial damage (+2.36 BPB).
This submission is a complementary data point: a more aggressive quantization regime (SpinQuant + GPTQ, +5.4 BPB damage) paired with a stronger TTT variant (full-model E2E SGD with distributed lockstep grad-sync), yielding complete recovery — the damage gap is fully closed, with the post-TTT 1.07063 slightly below the pre-quant 1.07125. Together the two PRs suggest that the recoverability of post-quant damage scales meaningfully with TTT capacity.
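The comparison can be made precise with the numbers quoted above; this is just arithmetic on those figures, with `gap_closed` a helper name of my own rather than anything from either PR.

```python
def gap_closed(pre_quant, post_quant, post_ttt):
    """Fraction of the quantization damage gap recovered by TTT:
    (post_quant - post_ttt) / (post_quant - pre_quant)."""
    return (post_quant - post_ttt) / (post_quant - pre_quant)

# Figures quoted above:
this_pr = gap_closed(1.07125, 6.47968, 1.07063)  # slightly above 1.0: TTT ends below the pre-quant bpb
pr_1818 = gap_closed(1.1009, 3.4620, 2.7663)     # roughly 0.29: the ~30% partial recovery in #1818
```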
## Files
## Companion record
PR #1695 — Stage3 + SpinQuant V1 + MP-SGD-TTT, val_bpb 1.07590 (3-seed mean, std 0.00019).
## Lineage credit
This submission builds on the bigbag #1493 architectural lineage (the standard parameter-golf base most top PRs fork from), and uses the score-first TTT framework established as legal via @valerio-oai (Issue #402) and the original chunked TTT pattern from @abaybektursun (#549). The novel contribution here is the full-model SGD generalization with distributed lockstep grad-sync, plus the healing-property observation.