
Record: SP8192 MP-SGD TTT (4 phases) + QK-Gain 5.25 — val_bpb 1.07217 (3-seed mean) #1727

Open
yahya010 wants to merge 1 commit into openai:main from yahya010:submit/yahya010-mpsgd-phases4-qk525

Conversation

@yahya010

Summary

3-seed mean val_bpb = 1.07217 (std 0.00114) on Track A (10min/16MB), beating merged SOTA (1.0810, @bigbag PR #1493) by 0.00883 BPB / ~0.00612 nats, clearing the 0.005-nat / ~0.0072 BPB threshold.
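For reference, BPB and nats differ by a factor of ln 2; a quick sanity check of the margin (assuming the 0.005 threshold is stated in nats per byte):

```python
import math

sota, this_pr = 1.0810, 1.07217        # 3-seed mean val_bpb
delta_bpb = sota - this_pr             # 0.00883 bits per byte
delta_nats = delta_bpb * math.log(2)   # bits -> nats: multiply by ln 2
print(f"delta = {delta_bpb:.5f} BPB = {delta_nats:.5f} nats; clears 0.005: {delta_nats > 0.005}")
# delta = 0.00883 BPB = 0.00612 nats; clears 0.005: True
```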

This PR extends the Multi-Phase Global SGD + Phased LoRA TTT stack from open PR #1700 (@jorge-asenjo) with two hyperparameter tunes:

  1. PHASED_TTT_NUM_PHASES=4 (up from PR #1700's 3) — one additional multi-phase SGD adaptation pass at eval time. PR #1700's 3-phase eval used ~352s of the 600s budget; the 4th phase adds ~25-70s and fits comfortably (observed eval times: 348-396s across all 3 seeds).
  2. QK_GAIN_INIT=5.25 (up from PR #1700's 5.0) — matches merged SOTA PR #1493.

train_gpt.py is byte-identical to PR #1700's. The only changes are the two environment variables above.
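A minimal sketch of how such knobs are typically picked up as environment overrides (illustrative only; the parsing helper below is hypothetical, not the actual code in train_gpt.py, while the variable names and defaults come from this PR and #1700):

```python
import os

def env_num(name, default):
    # Hypothetical helper: read a numeric hyperparameter from the environment,
    # falling back to the code default and preserving int vs float.
    raw = os.environ.get(name)
    return default if raw is None else type(default)(raw)

# The only two knobs this PR changes relative to PR #1700:
QK_GAIN_INIT = env_num("QK_GAIN_INIT", 5.0)                  # 5.25 in this submission
PHASED_TTT_NUM_PHASES = env_num("PHASED_TTT_NUM_PHASES", 3)  # 4 in this submission
```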

Results (8×H100 SXM, torch 2.9.1+cu128)

| Seed | val_bpb | artifact bytes |
| --- | --- | --- |
| 42 | 1.07310 | 15,933,641 |
| 0 | 1.07090 | 15,938,690 |
| 1234 | 1.07250 | 15,930,318 |
| mean | 1.07217 | |
| std | 0.00114 | |

Comparison with PR #1700 (same code, same seeds, phases=3):

| Seed | #1700 val_bpb | this PR (phases=4) | delta |
| --- | --- | --- | --- |
| 42 | 1.07332 | 1.07310 | -0.00022 |
| 0 | 1.07115 | 1.07090 | -0.00025 |
| 1234 | 1.07211 | 1.07250 | +0.00039 |
| mean | 1.07219 | 1.07217 | -0.00002 |

The extra 4th phase yields a very small (within-noise) mean delta versus PR #1700. The submission clears the merged-SOTA threshold through PR #1700's stack; the novelty this PR contributes is (a) independent reproduction on different hardware, (b) verification that an additional 4th MP-SGD phase is legal and fits the budget, and (c) the QK-Gain 5.25 import from merged SOTA.
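The within-noise reading is visible directly from the paired per-seed deltas (a rough check only; three seeds is far too few for a real significance test):

```python
import statistics

deltas = [-0.00022, -0.00025, +0.00039]  # this PR minus #1700, per seed
mean = statistics.mean(deltas)           # ~ -0.000027 BPB
std = statistics.stdev(deltas)           # ~ 0.00036 BPB, an order of magnitude above |mean|
print(f"mean delta {mean:+.6f}, sample std {std:.6f}")
```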

Compliance (Issue #1017 Track A)

  • Condition 1 (Causality): standard causal attention
  • Condition 2 (Normalized): standard softmax over full vocab
  • Condition 3 (Score-before-update): each of the 4 phases is scored under torch.no_grad() before any SGD update
  • Condition 4 (Single pass): each token is scored exactly once per phase, no rescoring across phases
  • No SLOT, no n-gram cache, no ETLB, no pre-quant TTT
  • All 3 artifacts < 16 MB (max 15,938,690 B)
  • All 3 trainings < 600s (wallclock-capped at 596s)
  • All 3 evals < 600s (348s / 349s / 396s)
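For concreteness, a minimal sketch of the score-before-update loop that Conditions 3 and 4 describe (hypothetical names and shapes; the real implementation is PR #1700's train_gpt.py, and how per-phase scores feed the reported val_bpb is not spelled out here):

```python
import torch
import torch.nn.functional as F

def phased_ttt_eval(model, optimizer, eval_batches, num_phases=4):
    """Illustrative multi-phase score-before-update eval (not the actual code).

    Condition 3: every batch is scored under torch.no_grad() BEFORE the SGD
    step that adapts on it, so no score ever reflects an update computed from
    the tokens being scored.
    Condition 4: within a phase, each token is scored exactly once.
    """
    mean_loss = float("nan")
    for _ in range(num_phases):                      # PHASED_TTT_NUM_PHASES
        total_loss, total_tokens = 0.0, 0
        for inputs, targets in eval_batches:
            with torch.no_grad():                    # score first...
                logits = model(inputs)
                loss = F.cross_entropy(
                    logits.view(-1, logits.size(-1)), targets.view(-1))
            total_loss += loss.item() * targets.numel()
            total_tokens += targets.numel()
            optimizer.zero_grad(set_to_none=True)    # ...then one test-time SGD step
            F.cross_entropy(
                model(inputs).view(-1, logits.size(-1)),
                targets.view(-1)).backward()
            optimizer.step()
        mean_loss = total_loss / total_tokens        # nats/token for this phase
    return mean_loss
```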

Reproduction

pip install torch==2.9.1 --index-url https://download.pytorch.org/whl/cu128
pip install flash_attn_3 --no-deps --find-links https://windreamer.github.io/flash-attention3-wheels/cu128_torch291/
pip install brotli sentencepiece python-minifier numpy huggingface-hub zstandard einops ninja

MATCHED_FINEWEB_REPO_ID=kevclark/parameter-golf \
  python3 data/cached_challenge_fineweb.py --variant sp8192

for seed in 42 0 1234; do
  SEED=$seed \
  QK_GAIN_INIT=5.25 \
  PHASED_TTT_ENABLED=1 PHASED_TTT_PREFIX_DOCS=2000 PHASED_TTT_NUM_PHASES=4 \
  MLP_CLIP_SIGMAS=12.0 ATTN_CLIP_SIGMAS=13.0 EMBED_BITS=7 EMBED_CLIP_SIGMAS=15.0 \
  MATRIX_LR=0.026 GPTQ_RESERVE_SECONDS=4 GPTQ_CALIBRATION_BATCHES=16 \
  torchrun --standalone --nproc_per_node=8 train_gpt.py
done

Attribution

Full credit to @jorge-asenjo (PR #1700) for the base stack: Multi-Phase Global SGD TTT + Phased LoRA TTT, VarLen flash attention, fused Triton MLP, SP8192 pipeline, and all quantization/optimization choices. train_gpt.py in this submission is byte-identical to PR #1700's.

Extended lineage:

Test plan

🤖 Generated with Claude Code

resouer added a commit to resouer/parameter-golf that referenced this pull request Apr 19, 2026
Round34's openai#1727 lane depends on evaluator-side source inspection to pick the
correct dataset/tokenizer surface. This patch makes the SP8192 defaults explicit
in the source header so the remote launcher cannot silently fall back to sp1024.

Constraint: W96 must test the current SP8192 frontier, not a launcher-derived variant
Rejected: Trust regex on the existing code shape | local detect-vocab still falls back to 1024 here
Confidence: high
Scope-risk: narrow
Directive: Any future faithful replay should make source-visible defaults explicit when the evaluator infers runtime from source text
Tested: python3 -m py_compile train_gpt.py evaluate.py remote_helper.py; python3 remote_helper.py detect-vocab
Not-tested: remote rerun after relaunch
resouer added a commit to resouer/parameter-golf that referenced this pull request Apr 19, 2026
The first round34 W96 run only fixed the packed-wrapper SP8192 autodetect, but it
still replayed code-default hyperparameters rather than the specific env-derived
surface claimed in PR openai#1727. This patch makes the claimed QK gain, phased TTT,
and quantization defaults explicit so the rerun tests the advertised stack.

Constraint: Round34 is validating PR openai#1727 as claimed, not a weaker code-default derivative
Rejected: Keep the original defaults | would test the wrong surface again
Confidence: high
Scope-risk: narrow
Directive: Treat W96 results before this commit as non-faithful evidence only
Tested: python3 -m py_compile train_gpt.py evaluate.py remote_helper.py; python3 remote_helper.py detect-vocab
Not-tested: remote rerun after relaunch
sunnypatneedi pushed a commit to sunnypatneedi/parameter-golf that referenced this pull request Apr 19, 2026
…ad, MP-SGD TTT 4-phase

- PR openai#1698 (GDN FLA, claimed 1.00995): BPB bug confirmed by dexhunter
  (~1.189 actual) + artifact size violation; effectively dead
- New technique: CaseOps bijective tokenizer (PR openai#1729/openai#1736/openai#1738) —
  reversible case-factoring with byte sidecar; stronger legality than
  casefold; await Issue openai#1604 ruling
- PR openai#1735 (pre-quant TTT 21ep) flagged illegal by dexhunter; PR openai#1738
  builds on it, both likely void
- PR openai#1727 (MP-SGD TTT 4 phases, 1.07217): appears legal, stackable
- Merged SOTA 1.0810 Day 10 plateau; 11 days to deadline

https://claude.ai/code/session_012mo6412sGQRVjF7TDmfx31
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 28, 2026
Four post-training specs to stack on 060A's openai#1855 port:

- 060I: port PR openai#1908's activation-aware mixed-bit GPTQ (3-seed validated
  −0.000265 BPB on openai#1855 itself). 4 env vars + ~100 LOC port.
- 060J: PHASED_TTT_NUM_PHASES 3→4 (low confidence; openai#1727 measured noise on
  weaker base, never tested with 2500 prefix).
- 060L: PHASED_TTT_PREFIX_DOCS 2500→3000 (high confidence; codemath3000
  greedy-validated 2000→2500 on this exact stack in openai#1855).
- 060M: TTT_EPOCHS 3→4 (highest predicted Δ; PR openai#1812 reported −0.008 on
  weaker base; never tested on phased+SmearGate stack like openai#1855).

All four are eval-only via RESUME_FROM_CKPT on 060A's seed_42_4h .pt. No code change
for 060J/L/M. 060K (rank-up) was deleted; it ran counter to openai#1855's own greedy
direction (which decreased rank 96→80).

Idea files: research/ideas/{1908-awq-lite-mixed-bit-gptq,ttt-budget-reinvestment}.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
