[non-record / wishlist] E2E TTT — full-model SGD per chunk, val_bpb 1.07063, demonstrates "healing property"#1837
Conversation
@X-Abhishek-X — great work on the healing-property observation and the distributed lockstep gradient sync. This is exactly the E2E TTT framing that makes the wishlist item concrete. We just shipped a complementary implementation in #2059 (feat/golden-sunflowers-jepa-universal-nta): an env-var-gated … There are a few open questions where your 1.07063 result is the only public data point we know of.
We're building a multi-seed sweep fleet (Railway × 8 accounts, Neon embargo ledger, 3-seed mean gates at BPB < 1.85 and < 1.50). If you'd like to co-run your checkpoint through the same pipeline and get a reproducible 3-seed mean plus a Zenodo DOI for the healing-property result, we're open to it. For the inner-LR band, there's a Coq.Reals theorem …
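For concreteness, a gate of the kind described above might look like the sketch below. `passes_gate` is a hypothetical helper name of mine, not code from the actual fleet; only the "mean over exactly three seeds, below a BPB threshold" rule comes from the description above.

```python
def passes_gate(seed_bpbs, threshold):
    """Hypothetical 3-seed mean gate: a run clears a tier only if the
    mean val_bpb over exactly three seeds is below the tier threshold."""
    if len(seed_bpbs) != 3:
        raise ValueError("gate is defined over exactly three seeds")
    return sum(seed_bpbs) / 3 < threshold
```

Under this reading, a run like the 1.07063 result here would clear both the < 1.85 and the < 1.50 tiers, provided two more seeds land nearby.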
## Summary
This is a non-record / wishlist submission addressing the openai/parameter-golf README §Requests for PRs item: "State-space models, E2E TTT, super long context for evaluation or training".
It adds a working full-model E2E TTT implementation with distributed lockstep gradient synchronization, demonstrating the wishlist item end-to-end on top of my existing record submission, PR #1695.
## Result
val_bpb **1.07063** (full-model E2E TTT over the SpinQuant + GPTQ checkpoint).
## My original contributions in this submission
- `eval_val_e2e_ttt` + `_select_e2e_ttt_params` (in `train_gpt.py`) — full-model SGD per chunk; generalizes chunk-LoRA Phased TTT (PR #1695, "[Record] Stage 3 + SpinQuant V1 + MP-SGD-TTT — val_bpb 1.0759") to the full parameter set
- Distributed lockstep gradient sync — `all_reduce(MEAN)` across all 8 ranks before `optimizer.step`, ensuring every rank's E2E TTT trajectory stays byte-identical
### Key observation — "healing property"
SpinQuant + GPTQ degraded the post-quant model from a pre-quant val_bpb of 1.07125 to 6.47968 (a 5.4 BPB regression — model is essentially broken on cold inference). E2E TTT recovered it to 1.07063 within the eval window — fully healing the quantization damage and slightly exceeding the pre-quant ceiling.
This suggests aggressive quantization may be more recoverable than commonly assumed when paired with full-model TTT. Worth further investigation as a wishlist research direction.
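The mechanism involved — score each chunk first, then take one full-parameter SGD step on it, with per-rank gradients mean-reduced before the optimizer step — can be sketched on a toy linear model. This is an illustrative reconstruction, not the actual `eval_val_e2e_ttt` code: NumPy stands in for the model, and the shard loop simulates `all_reduce(MEAN)` across 8 ranks in a single process.

```python
import numpy as np

def e2e_ttt_sketch(chunks, w0, lr=0.1, n_ranks=8):
    """Toy full-model E2E TTT loop on a linear model x @ w ~ y.

    Score-first: each chunk is scored with the current weights, then the
    model takes one full-parameter SGD step on that chunk. Per-rank shard
    gradients are averaged (simulating all_reduce(MEAN)) before the step,
    so every rank would follow one identical trajectory.
    """
    w = w0.copy()
    chunk_losses = []
    for x, y in chunks:
        chunk_losses.append(float(np.mean((x @ w - y) ** 2)))  # score first
        grads = [
            2.0 * sx.T @ (sx @ w - sy) / len(sx)               # one shard's gradient
            for sx, sy in zip(np.array_split(x, n_ranks),
                              np.array_split(y, n_ranks))
        ]
        w -= lr * np.mean(grads, axis=0)                       # mean-reduce, then SGD step
    return w, chunk_losses
```

On a stream generated from a fixed linear map, later chunks score better than earlier ones — the same per-chunk adaptation effect the healing observation relies on, just without quantization damage in the picture.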
## Concurrent / related work — @taka6745 #1818
@taka6745's concurrent PR #1818 ("Post-Quantization Damage Gap — 11L GPTQ Int6 + Curriculum + Sliding TTT") characterizes a related effect from a different angle: GPTQ-int6 → pre-quant 1.1009, post-quant 3.4620, post-TTT 2.7663 (3-seed). Their submission documents partial TTT recovery (~30 % of the damage gap closed by sliding-window TTT) on a smaller initial damage (+2.36 BPB).
This submission is a complementary data point: a more aggressive quantization regime (SpinQuant + GPTQ, +5.4 BPB damage) paired with a stronger TTT variant (full-model E2E SGD with distributed lockstep grad-sync), yielding complete recovery — the damage gap is fully closed, with the post-TTT 1.07063 slightly below the pre-quant 1.07125. Together the two PRs suggest that the recoverability of post-quant damage scales meaningfully with TTT capacity.
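The comparison can be made precise with the numbers quoted above; this is just arithmetic on those figures, with `gap_closed` a helper name of my own rather than anything from either PR.

```python
def gap_closed(pre_quant, post_quant, post_ttt):
    """Fraction of the quantization damage gap recovered by TTT:
    (post_quant - post_ttt) / (post_quant - pre_quant)."""
    return (post_quant - post_ttt) / (post_quant - pre_quant)

# Figures quoted above:
this_pr = gap_closed(1.07125, 6.47968, 1.07063)  # slightly above 1.0: TTT ends below the pre-quant bpb
pr_1818 = gap_closed(1.1009, 3.4620, 2.7663)     # roughly 0.29: the ~30% partial recovery in #1818
```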
## Files
## Companion record
PR #1695 — Stage3 + SpinQuant V1 + MP-SGD-TTT, val_bpb 1.07590 (3-seed mean, std 0.00019).
## Lineage credit
This submission builds on the bigbag #1493 architectural lineage (the standard parameter-golf base most top PRs fork from), and uses the score-first TTT framework established as legal via @valerio-oai (Issue #402) and the original chunked TTT pattern from @abaybektursun (#549). The novel contribution here is the full-model SGD generalization with distributed lockstep grad-sync, plus the healing-property observation.