
[non-record / wishlist] E2E TTT — full-model SGD per chunk, val_bpb 1.07063, demonstrates "healing property" #1837

Open
X-Abhishek-X wants to merge 1 commit into openai:main from X-Abhishek-X:e2e-ttt-wishlist-non-record

Conversation


X-Abhishek-X commented Apr 26, 2026

Summary

This is a non-record / wishlist submission addressing the openai/parameter-golf README §Requests for PRs item: "State-space models, E2E TTT, super long context for evaluation or training".

This PR adds a working full-model E2E TTT implementation with distributed lockstep gradient synchronization, demonstrating the wishlist item end-to-end on top of my existing record submission, PR #1695.
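For readers who want the shape of the inner loop without opening train_gpt.py, here is a minimal sketch of a full-model per-chunk TTT step with lockstep grad-sync. It is a paraphrase of the description above, not the shipped code; the function name `e2e_ttt_chunk_step`, the `model(inputs) -> logits` interface, and the chunk layout are assumptions.

```python
import torch
import torch.distributed as dist
import torch.nn.functional as F

def e2e_ttt_chunk_step(model, chunk, lr_inner):
    """One full-model SGD step on a single eval chunk (illustrative).

    Every rank computes gradients on its own chunk, then all ranks
    average them before updating, so the replicas stay bit-identical
    ("lockstep") across the fleet.
    """
    model.train()
    inputs, targets = chunk[:, :-1], chunk[:, 1:]
    logits = model(inputs)  # assumed interface: token ids in, logits out
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)), targets.reshape(-1)
    )
    model.zero_grad(set_to_none=True)
    loss.backward()

    world = dist.get_world_size() if dist.is_initialized() else 1
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is None:
                continue
            if world > 1:
                dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
                p.grad.div_(world)  # lockstep: every rank applies the same averaged grad
            p.add_(p.grad, alpha=-lr_inner)  # plain SGD, no optimizer state
    model.eval()
    return loss.item()
```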

Result

Post-TTT val_bpb 1.07063 (1-seed) on the eval window, starting from a SpinQuant + GPTQ quantized checkpoint (pre-quant 1.07125, post-quant 6.47968; details below).

My original contributions in this submission

- The full-model SGD generalization of chunked TTT, with distributed lockstep gradient synchronization.
- The "healing property" observation below: full-model E2E TTT recovering essentially all of an aggressive quantization regression.

Key observation — "healing property"

SpinQuant + GPTQ degraded the model from a pre-quant val_bpb of 1.07125 to a post-quant 6.47968, a 5.4 BPB regression that leaves the model essentially broken at cold inference. E2E TTT recovered it to 1.07063 within the eval window, fully healing the quantization damage and landing slightly below (i.e. better than) the pre-quant value.

This suggests aggressive quantization may be more recoverable than commonly assumed when paired with full-model TTT. Worth further investigation as a wishlist research direction.

Concurrent / related work — @taka6745 #1818

@taka6745's concurrent PR #1818 ("Post-Quantization Damage Gap — 11L GPTQ Int6 + Curriculum + Sliding TTT") characterizes a related effect from a different angle: GPTQ-int6 → pre-quant 1.1009, post-quant 3.4620, post-TTT 2.7663 (3-seed). Their submission documents partial TTT recovery (~30 % of the damage gap closed by sliding-window TTT) on a smaller initial damage (+2.36 BPB).

This submission is a complementary data point: a more aggressive quantization regime (SpinQuant + GPTQ, +5.4 BPB damage) paired with a stronger TTT variant (full-model E2E SGD with distributed lockstep grad-sync), yielding complete recovery (the full damage gap closed, with post-TTT val_bpb slightly better than pre-quant). Together the two PRs suggest that the recoverability of post-quant damage scales meaningfully with TTT capacity.
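For reference, the damage-gap arithmetic behind these percentages, using only the val_bpb numbers quoted above (`recovery_fraction` is just an illustrative helper):

```python
def recovery_fraction(pre_quant, post_quant, post_ttt):
    """Fraction of the quantization damage gap closed by TTT."""
    return (post_quant - post_ttt) / (post_quant - pre_quant)

# This PR: SpinQuant + GPTQ, full-model E2E TTT
print(recovery_fraction(1.07125, 6.47968, 1.07063))  # ~1.0001 (gap fully closed, slightly past pre-quant)

# #1818: GPTQ-int6, sliding-window TTT
print(recovery_fraction(1.1009, 3.4620, 2.7663))     # ~0.29 (~30% of the gap closed)
```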

Files

| File | Purpose |
| --- | --- |
| README.md | Submission readme |
| PORTFOLIO_SUMMARY.md | Full writeup with attribution + negative-result context |
| submission.json | Metadata, scores, hyperparameters |
| train_gpt.py | Patched training/eval script (MD5 4397db0c9025478d0251434044f0df44) |
| e2e_proof.log | Run log proving val_bpb 1.07063 (MD5 6e6bd78df1e1acb2a1f9a0b45123865b) |

Companion record

PR #1695 — Stage3 + SpinQuant V1 + MP-SGD-TTT, val_bpb 1.07590 (3-seed mean, std 0.00019).

Lineage credit

This submission builds on the bigbag #1493 architectural lineage (the standard parameter-golf base most top PRs fork from), and uses the legal score-first TTT framework established via @valerio-oai (Issue #402) and the original chunked TTT pattern from abaybektursun (#549). The novel contribution here is the full-model SGD generalization with distributed lockstep grad-sync, plus the healing-property observation.


gHashTag commented May 2, 2026

@X-Abhishek-X — great work on the healing-property observation and the distributed lockstep gradient sync. This is exactly the E2E TTT framing that makes the wishlist item concrete.

We just shipped a complementary implementation in #2059 (feat/golden-sunflowers-jepa-universal-nta): an env-var-gated _e2e_ttt_inner_step() that wraps the same full-model per-chunk SGD approach, gated on the TTT_INNER_STEPS env var (default 0 = no-op; state dict byte-identical to the merged baseline, 3/3 equivalence proof committed).
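Roughly, the gate looks like this (a sketch from memory, not a verbatim excerpt from #2059; `loss_fn` and the surrounding plumbing are simplified):

```python
import os
import torch

TTT_INNER_STEPS = int(os.environ.get("TTT_INNER_STEPS", "0"))
TTT_LR_INNER = float(os.environ.get("TTT_LR_INNER", "0.1180"))

def _e2e_ttt_inner_step(model, loss_fn, chunk):
    # Default TTT_INNER_STEPS=0 makes this a strict no-op, so the
    # state dict stays byte-identical to the merged baseline.
    if TTT_INNER_STEPS == 0:
        return
    for _ in range(TTT_INNER_STEPS):
        loss = loss_fn(model, chunk)
        model.zero_grad(set_to_none=True)
        loss.backward()
        with torch.no_grad():
            for p in model.parameters():
                if p.grad is not None:
                    p.add_(p.grad, alpha=-TTT_LR_INNER)
```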

A few open questions where your 1.07063 result is the only public data point we know of:

  1. 3-seed reproducibility. Your submission is 1-seed. Do you have runs for seeds F_18 = 2584 / F_19 = 4181 sitting around, even partial ones?

  2. Interaction with quantization depth. You observed near-complete recovery (~99 %) with SpinQuant + GPTQ; @taka6745 in #1818 saw ~30 % recovery with GPTQ-int6 alone. The TTT-capacity-vs-damage hypothesis you raise seems worth a controlled ablation.

  3. Inner-step LR schedule. TTT_LR_INNER in our impl is a flat scalar; did you tune a schedule, or is the improvement mostly from the full-model vs. chunk-LoRA scope? (A sketch of the schedule variant we have in mind follows this list.)
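The schedule variant referenced in question 3, as a sketch (function names and shapes are ours, not from either implementation):

```python
import math

def flat_inner_lr(step, total, base_lr):
    # What we ship today: one scalar for every inner step.
    return base_lr

def cosine_inner_lr(step, total, base_lr, floor=0.0):
    # Candidate ablation: cosine decay from base_lr down to floor
    # across the inner steps of a single chunk.
    return floor + 0.5 * (base_lr - floor) * (
        1 + math.cos(math.pi * step / max(total - 1, 1))
    )
```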

We're building a multi-seed sweep fleet (Railway × 8 accounts, Neon embargo ledger, 3-seed mean gates at BPB < 1.85 and < 1.50). If you'd like to co-run your checkpoint through the same pipeline and get a reproducible 3-seed mean plus a Zenodo DOI for the healing-property result, we're open to it.
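The gate itself is nothing exotic; in sketch form (function name and return shape are ours):

```python
def passes_gates(seed_bpbs, gates=(1.85, 1.50)):
    """3-seed mean gate check used by the sweep fleet (illustrative)."""
    assert len(seed_bpbs) == 3, "gates are defined on a 3-seed mean"
    mean = sum(seed_bpbs) / len(seed_bpbs)
    return mean, {gate: mean < gate for gate in gates}
```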

For the inner-LR band, there's a Coq.Reals theorem alpha_phi_times_phi_cubed (SAC-1, Qed, Print Assumptions clean) that pins α_φ · φ³ = 1/2, i.e. α_φ = φ⁻³/2 ≈ 0.1180, which gave us a formal starting point for TTT_LR_INNER. Happy to share the .v file if useful as a formal baseline for your hyperparameter write-up; no obligation.
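A quick numeric check of that constant (plain Python, independent of the Coq proof):

```python
import math

phi = (1 + math.sqrt(5)) / 2   # golden ratio
alpha_phi = phi ** -3 / 2      # pinned by alpha_phi * phi**3 = 1/2
print(alpha_phi)               # ≈ 0.1180, the TTT_LR_INNER starting point
print(alpha_phi * phi ** 3)    # ≈ 0.5 up to float rounding
```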

