
Record: SP8192 MP-SGD TTT (4 phases) + QK-Gain 5.25 — val_bpb 1.07217 (3-seed mean) #1727

Open
yahya010 wants to merge 1 commit into openai:main from yahya010:submit/yahya010-mpsgd-phases4-qk525

Conversation

@yahya010

Summary

3-seed mean val_bpb = 1.07217 (std 0.00114) on Track A (10min/16MB), beating merged SOTA (1.0810, @bigbag PR #1493) by 0.00883 BPB / ~0.00612 nats, clearing the 0.005-nat / ~0.0072 BPB threshold.
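For reference, BPB and nats differ by a factor of ln 2; a quick sanity check of the margin (assuming the 0.005 threshold is stated in nats per byte):

```python
import math

sota, this_pr = 1.0810, 1.07217        # 3-seed mean val_bpb
delta_bpb = sota - this_pr             # 0.00883 bits per byte
delta_nats = delta_bpb * math.log(2)   # bits -> nats: multiply by ln 2
print(f"delta = {delta_bpb:.5f} BPB = {delta_nats:.5f} nats; clears 0.005: {delta_nats > 0.005}")
# delta = 0.00883 BPB = 0.00612 nats; clears 0.005: True
```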

This PR extends the Multi-Phase Global SGD + Phased LoRA TTT stack from open PR #1700 (@jorge-asenjo) with two hyperparameter tunes:

  1. PHASED_TTT_NUM_PHASES=4 (up from PR #1700's 3) — one additional multi-phase SGD adaptation pass at eval time. PR #1700's 3-phase eval used ~352s of the 600s budget; the 4th phase adds ~25-70s and fits comfortably (observed eval times: 348-396s across all 3 seeds).
  2. QK_GAIN_INIT=5.25 (up from PR #1700's 5.0) — matches merged SOTA PR #1493.

train_gpt.py is byte-identical to PR #1700's. The only changes are the two environment variables above.
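A minimal sketch of how such knobs are typically picked up as environment overrides (illustrative only; the parsing helper below is hypothetical, not the actual code in train_gpt.py, while the variable names and defaults come from this PR and #1700):

```python
import os

def env_num(name, default):
    # Hypothetical helper: read a numeric hyperparameter from the environment,
    # falling back to the code default and preserving int vs float.
    raw = os.environ.get(name)
    return default if raw is None else type(default)(raw)

# The only two knobs this PR changes relative to PR #1700:
QK_GAIN_INIT = env_num("QK_GAIN_INIT", 5.0)                  # 5.25 in this submission
PHASED_TTT_NUM_PHASES = env_num("PHASED_TTT_NUM_PHASES", 3)  # 4 in this submission
```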

Results (8×H100 SXM, torch 2.9.1+cu128)

| Seed | val_bpb | artifact bytes |
| --- | --- | --- |
| 42 | 1.07310 | 15,933,641 |
| 0 | 1.07090 | 15,938,690 |
| 1234 | 1.07250 | 15,930,318 |
| mean | 1.07217 | |
| std | 0.00114 | |

Comparison with PR #1700 (same code, same seeds, phases=3):

| Seed | #1700 val_bpb | this PR (phases=4) | delta |
| --- | --- | --- | --- |
| 42 | 1.07332 | 1.07310 | -0.00022 |
| 0 | 1.07115 | 1.07090 | -0.00025 |
| 1234 | 1.07211 | 1.07250 | +0.00039 |
| mean | 1.07219 | 1.07217 | -0.00002 |

The extra 4th phase yields a very small (within-noise) mean delta versus PR #1700. The submission clears the merged-SOTA threshold through PR #1700's stack; the novelty this PR contributes is (a) independent reproduction on different hardware, (b) verification that an additional 4th MP-SGD phase is legal and fits the budget, and (c) the QK-Gain 5.25 import from merged SOTA.
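The within-noise reading is visible directly from the paired per-seed deltas (a rough check only; three seeds is far too few for a real significance test):

```python
import statistics

deltas = [-0.00022, -0.00025, +0.00039]  # this PR minus #1700, per seed
mean = statistics.mean(deltas)           # ~ -0.000027 BPB
std = statistics.stdev(deltas)           # ~ 0.00036 BPB, an order of magnitude above |mean|
print(f"mean delta {mean:+.6f}, sample std {std:.6f}")
```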

Compliance (Issue #1017 Track A)

  • Condition 1 (Causality): standard causal attention
  • Condition 2 (Normalized): standard softmax over full vocab
  • Condition 3 (Score-before-update): each of the 4 phases is scored under torch.no_grad() before any SGD update
  • Condition 4 (Single pass): each token is scored exactly once per phase, no rescoring across phases
  • No SLOT, no n-gram cache, no ETLB, no pre-quant TTT
  • All 3 artifacts < 16 MB (max 15,938,690 B)
  • All 3 trainings < 600s (wallclock-capped at 596s)
  • All 3 evals < 600s (348s / 349s / 396s)
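For concreteness, a minimal sketch of the score-before-update loop that Conditions 3 and 4 describe (hypothetical names and shapes; the real implementation is PR #1700's train_gpt.py, and how per-phase scores feed the reported val_bpb is not spelled out here):

```python
import torch
import torch.nn.functional as F

def phased_ttt_eval(model, optimizer, eval_batches, num_phases=4):
    """Illustrative multi-phase score-before-update eval (not the actual code).

    Condition 3: every batch is scored under torch.no_grad() BEFORE the SGD
    step that adapts on it, so no score ever reflects an update computed from
    the tokens being scored.
    Condition 4: within a phase, each token is scored exactly once.
    """
    mean_loss = float("nan")
    for _ in range(num_phases):                      # PHASED_TTT_NUM_PHASES
        total_loss, total_tokens = 0.0, 0
        for inputs, targets in eval_batches:
            with torch.no_grad():                    # score first...
                logits = model(inputs)
                loss = F.cross_entropy(
                    logits.view(-1, logits.size(-1)), targets.view(-1))
            total_loss += loss.item() * targets.numel()
            total_tokens += targets.numel()
            optimizer.zero_grad(set_to_none=True)    # ...then one test-time SGD step
            F.cross_entropy(
                model(inputs).view(-1, logits.size(-1)),
                targets.view(-1)).backward()
            optimizer.step()
        mean_loss = total_loss / total_tokens        # nats/token for this phase
    return mean_loss
```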

Reproduction

pip install torch==2.9.1 --index-url https://download.pytorch.org/whl/cu128
pip install flash_attn_3 --no-deps --find-links https://windreamer.github.io/flash-attention3-wheels/cu128_torch291/
pip install brotli sentencepiece python-minifier numpy huggingface-hub zstandard einops ninja

MATCHED_FINEWEB_REPO_ID=kevclark/parameter-golf \
  python3 data/cached_challenge_fineweb.py --variant sp8192

for seed in 42 0 1234; do
  SEED=$seed \
  QK_GAIN_INIT=5.25 \
  PHASED_TTT_ENABLED=1 PHASED_TTT_PREFIX_DOCS=2000 PHASED_TTT_NUM_PHASES=4 \
  MLP_CLIP_SIGMAS=12.0 ATTN_CLIP_SIGMAS=13.0 EMBED_BITS=7 EMBED_CLIP_SIGMAS=15.0 \
  MATRIX_LR=0.026 GPTQ_RESERVE_SECONDS=4 GPTQ_CALIBRATION_BATCHES=16 \
  torchrun --standalone --nproc_per_node=8 train_gpt.py
done

Attribution

Full credit to @jorge-asenjo (PR #1700) for the base stack: Multi-Phase Global SGD TTT + Phased LoRA TTT, VarLen flash attention, fused Triton MLP, SP8192 pipeline, and all quantization/optimization choices. train_gpt.py in this submission is byte-identical to PR #1700's.

Extended lineage:

Test plan

🤖 Generated with Claude Code

resouer added a commit to resouer/parameter-golf that referenced this pull request Apr 19, 2026
Round34's openai#1727 lane depends on evaluator-side source inspection to pick the
correct dataset/tokenizer surface. This patch makes the SP8192 defaults explicit
in the source header so the remote launcher cannot silently fall back to sp1024.

Constraint: W96 must test the current SP8192 frontier, not a launcher-derived variant
Rejected: Trust regex on the existing code shape | local detect-vocab still falls back to 1024 here
Confidence: high
Scope-risk: narrow
Directive: Any future faithful replay should make source-visible defaults explicit when the evaluator infers runtime from source text
Tested: python3 -m py_compile train_gpt.py evaluate.py remote_helper.py; python3 remote_helper.py detect-vocab
Not-tested: remote rerun after relaunch
resouer added a commit to resouer/parameter-golf that referenced this pull request Apr 19, 2026
The first round34 W96 run only fixed the packed-wrapper SP8192 autodetect, but it
still replayed code-default hyperparameters rather than the specific env-derived
surface claimed in PR openai#1727. This patch makes the claimed QK gain, phased TTT,
and quantization defaults explicit so the rerun tests the advertised stack.

Constraint: Round34 is validating PR openai#1727 as claimed, not a weaker code-default derivative
Rejected: Keep the original defaults | would test the wrong surface again
Confidence: high
Scope-risk: narrow
Directive: Treat W96 results before this commit as non-faithful evidence only
Tested: python3 -m py_compile train_gpt.py evaluate.py remote_helper.py; python3 remote_helper.py detect-vocab
Not-tested: remote rerun after relaunch
sunnypatneedi pushed a commit to sunnypatneedi/parameter-golf that referenced this pull request Apr 19, 2026
…ad, MP-SGD TTT 4-phase

- PR openai#1698 (GDN FLA, claimed 1.00995): BPB bug confirmed by dexhunter
  (~1.189 actual) + artifact size violation; effectively dead
- New technique: CaseOps bijective tokenizer (PR openai#1729/openai#1736/openai#1738) —
  reversible case-factoring with byte sidecar; stronger legality than
  casefold; await Issue openai#1604 ruling
- PR openai#1735 (pre-quant TTT 21ep) flagged illegal by dexhunter; PR openai#1738
  builds on it, both likely void
- PR openai#1727 (MP-SGD TTT 4 phases, 1.07217): appears legal, stackable
- Merged SOTA 1.0810 Day 10 plateau; 11 days to deadline

https://claude.ai/code/session_012mo6412sGQRVjF7TDmfx31
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 28, 2026
Four post-training specs to stack on 060A's openai#1855 port:

- 060I: port PR openai#1908's activation-aware mixed-bit GPTQ (3-seed validated
  −0.000265 BPB on openai#1855 itself). 4 env vars + ~100 LOC port.
- 060J: PHASED_TTT_NUM_PHASES 3→4 (low confidence; openai#1727 measured noise on
  weaker base, never tested with 2500 prefix).
- 060L: PHASED_TTT_PREFIX_DOCS 2500→3000 (high confidence; codemath3000
  greedy-validated 2000→2500 on this exact stack in openai#1855).
- 060M: TTT_EPOCHS 3→4 (highest predicted Δ; PR openai#1812 reported −0.008 on
  weaker base; never tested on phased+SmearGate stack like openai#1855).

All four are eval-only via RESUME_FROM_CKPT on 060A's seed_42_4h .pt. No code change
for 060J/L/M. 060K (rank-up) was deleted; it ran counter to openai#1855's own greedy
direction (which decreased rank 96→80).

Idea files: research/ideas/{1908-awq-lite-mixed-bit-gptq,ttt-budget-reinvestment}.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
