Record: SP8192 MP-SGD TTT (4 phases) + QK-Gain 5.25 — val_bpb 1.07217 (3-seed mean) #1727

Open

yahya010 wants to merge 1 commit into openai:main from
Conversation
resouer added a commit to resouer/parameter-golf that referenced this pull request on Apr 19, 2026
Round34's openai#1727 lane depends on evaluator-side source inspection to pick the correct dataset/tokenizer surface. This patch makes the SP8192 defaults explicit in the source header so the remote launcher cannot silently fall back to sp1024.

Constraint: W96 must test the current SP8192 frontier, not a launcher-derived variant
Rejected: Trust regex on the existing code shape | local detect-vocab still falls back to 1024 here
Confidence: high
Scope-risk: narrow
Directive: Any future faithful replay should make source-visible defaults explicit when the evaluator infers runtime from source text
Tested: python3 -m py_compile train_gpt.py evaluate.py remote_helper.py; python3 remote_helper.py detect-vocab
Not-tested: remote rerun after relaunch
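For illustration, a minimal sketch of the pattern this commit describes — an explicit, source-visible default that an evaluator regexing the source text will find, instead of a launcher-derived value. The variable name and parsing here are assumptions, not code from the repo:

```python
import os

# Hedged illustration of a "source-visible default": the SP8192 value
# appears literally in this file, so an evaluator that infers runtime
# settings by inspecting source text cannot silently fall back to sp1024.
VOCAB_SIZE = int(os.environ.get("VOCAB_SIZE", "8192"))
```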
resouer added a commit to resouer/parameter-golf that referenced this pull request on Apr 19, 2026
The first round34 W96 run only fixed the packed-wrapper SP8192 autodetect, but it still replayed code-default hyperparameters rather than the specific env-derived surface claimed in PR openai#1727. This patch makes the claimed QK gain, phased TTT, and quantization defaults explicit so the rerun tests the advertised stack.

Constraint: Round34 is validating PR openai#1727 as claimed, not a weaker code-default derivative
Rejected: Keep the original defaults | would test the wrong surface again
Confidence: high
Scope-risk: narrow
Directive: Treat W96 results before this commit as non-faithful evidence only
Tested: python3 -m py_compile train_gpt.py evaluate.py remote_helper.py; python3 remote_helper.py detect-vocab
Not-tested: remote rerun after relaunch
sunnypatneedi pushed a commit to sunnypatneedi/parameter-golf that referenced this pull request on Apr 19, 2026
…ad, MP-SGD TTT 4-phase

- PR openai#1698 (GDN FLA, claimed 1.00995): BPB bug confirmed by dexhunter (~1.189 actual) + artifact size violation; effectively dead
- New technique: CaseOps bijective tokenizer (PR openai#1729/openai#1736/openai#1738) — reversible case-factoring with byte sidecar; stronger legality than casefold; await Issue openai#1604 ruling
- PR openai#1735 (pre-quant TTT 21ep) flagged illegal by dexhunter; PR openai#1738 builds on it, both likely void
- PR openai#1727 (MP-SGD TTT 4 phases, 1.07217): appears legal, stackable
- Merged SOTA 1.0810 Day 10 plateau; 11 days to deadline

https://claude.ai/code/session_012mo6412sGQRVjF7TDmfx31
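For readers unfamiliar with the CaseOps idea, a hedged ASCII-level sketch of reversible case-factoring with a byte sidecar: lowercase the text and record one case bit per character, so the original string is recoverable exactly. Full Unicode bijectivity (e.g. 'ß', 'İ') is precisely what the Issue openai#1604 ruling has to settle; all names here are illustrative, not from those PRs:

```python
def case_factor(text: str) -> tuple[str, bytes]:
    # Record one case bit per character, MSB-first within each byte.
    bits = [1 if c.isupper() else 0 for c in text]
    bits += [0] * (-len(bits) % 8)  # pad to a whole number of bytes
    sidecar = bytearray()
    for i in range(0, len(bits), 8):
        byte = 0
        for b in bits[i:i + 8]:
            byte = (byte << 1) | b
        sidecar.append(byte)
    return text.lower(), bytes(sidecar)

def case_unfactor(lowered: str, sidecar: bytes) -> str:
    # Unpack bits MSB-first; zip() drops the padding bits.
    bits = [(byte >> (7 - j)) & 1 for byte in sidecar for j in range(8)]
    return "".join(c.upper() if b else c for c, b in zip(lowered, bits))

assert case_unfactor(*case_factor("Hello GPT")) == "Hello GPT"
```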
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request on Apr 28, 2026
Four post-training specs to stack on 060A's openai#1855 port:

- 060I: port PR openai#1908's activation-aware mixed-bit GPTQ (3-seed validated −0.000265 BPB on openai#1855 itself). 4 env vars + ~100 LOC port.
- 060J: PHASED_TTT_NUM_PHASES 3→4 (low confidence; openai#1727 measured noise on weaker base, never tested with 2500 prefix).
- 060L: PHASED_TTT_PREFIX_DOCS 2500→3000 (high confidence; codemath3000 greedy-validated 2000→2500 on this exact stack in openai#1855).
- 060M: TTT_EPOCHS 3→4 (highest predicted Δ; PR openai#1812 reported −0.008 on weaker base; never tested on phased+SmearGate stack like openai#1855).

All eval-only via RESUME_FROM_CKPT on 060A's seed_42_4h pt. No code change for 060J/L/M. 060K (rank-up) deleted — rowed against openai#1855's own greedy direction (which decreased rank 96→80).

Idea files: research/ideas/{1908-awq-lite-mixed-bit-gptq,ttt-budget-reinvestment}.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
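A hedged sketch of the eval-only replay path these specs rely on — resuming from a finished checkpoint so that only the TTT/eval settings (060J/L/M) vary between runs. The checkpoint key layout is an assumption:

```python
import os
import torch

# If RESUME_FROM_CKPT names a finished checkpoint, skip training and
# re-run evaluation only, with the new TTT env-var overrides in effect.
ckpt_path = os.environ.get("RESUME_FROM_CKPT")
if ckpt_path is not None:
    state = torch.load(ckpt_path, map_location="cpu")
    # model.load_state_dict(state["model"])  # then evaluate() only;
    # no optimizer state or training step is ever touched.
```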
Summary
3-seed mean val_bpb = 1.07217 (std 0.00114) on Track A (10min/16MB), beating merged SOTA (1.0810, @bigbag PR #1493) by 0.00883 BPB / ~0.00612 nats, clearing the 0.005-nat / ~0.0072 BPB threshold.
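As a quick sanity check, the margin arithmetic in the claim above, using only the numbers already quoted (nats = BPB × ln 2):

```python
import math

sota_bpb = 1.0810      # merged SOTA, PR #1493
this_bpb = 1.07217     # 3-seed mean of this PR

margin_bpb = sota_bpb - this_bpb          # 0.00883 bits/byte
margin_nats = margin_bpb * math.log(2)    # ~0.00612 nats/byte

# Record threshold: 0.005 nats/byte ~= 0.0072 bits/byte.
threshold_bpb = 0.005 / math.log(2)
assert margin_nats > 0.005
```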
This PR extends the Multi-Phase Global SGD + Phased LoRA TTT stack from open PR #1700 (@jorge-asenjo) with two hyperparameter tunes:
- PHASED_TTT_NUM_PHASES=4 (up from PR #1700's 3) — one additional multi-phase SGD adaptation pass at eval time. PR #1700's 3-phase eval used ~352s of the 600s budget; adding a 4th phase costs ~25-70s and fits comfortably (observed eval times: 349-396s across all 3 seeds).
- QK_GAIN_INIT=5.25 (up from PR #1700's 5.0) — matches merged SOTA PR #1493.

train_gpt.py is byte-identical to PR #1700's; the only changes are the two environment variables above (sketched below).
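Since train_gpt.py is unchanged, the submission's entire delta is these two environment variables. A minimal sketch of how such env-var overrides are typically read — the variable names are the ones quoted above, but the actual parsing inside train_gpt.py may differ:

```python
import os

# PR #1700's values are the defaults; this PR overrides both at launch.
PHASED_TTT_NUM_PHASES = int(os.environ.get("PHASED_TTT_NUM_PHASES", "3"))  # this PR: 4
QK_GAIN_INIT = float(os.environ.get("QK_GAIN_INIT", "5.0"))                # this PR: 5.25
```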
Results (8×H100 SXM, torch 2.9.1+cu128)

Comparison with PR #1700 (same code, same seeds, phases=3):
The extra 4th phase yields a very small (within-noise) mean delta versus PR #1700. The submission clears the merged-SOTA threshold through PR #1700's stack; the novelty this PR contributes is (a) independent reproduction on different hardware, (b) verification that an additional 4th MP-SGD phase is legal and fits the budget, and (c) the QK-Gain 5.25 import from merged SOTA.
Compliance (Issue #1017 Track A)
- torch.no_grad() before any SGD update (see the sketch below)
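A hedged sketch of what this compliance item looks like in practice — gradients come from an ordinary backward pass, and the parameter update itself runs under torch.no_grad(). The real MP-SGD TTT loop in train_gpt.py is more involved; this only illustrates the legality pattern:

```python
import torch

def sgd_ttt_step(model: torch.nn.Module, loss: torch.Tensor, lr: float) -> None:
    loss.backward()              # gradients for the adaptation step
    with torch.no_grad():        # parameter updates under no_grad,
        for p in model.parameters():  # per the Track A rule
            if p.grad is not None:
                p.add_(p.grad, alpha=-lr)  # plain SGD update
                p.grad = None              # clear for the next phase
```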
Reproduction

Attribution
Full credit to @jorge-asenjo (PR #1700) for the base stack: Multi-Phase Global SGD TTT + Phased LoRA TTT, VarLen flash attention, fused Triton MLP, SP8192 pipeline, and all quantization/optimization choices.
train_gpt.py in this submission is byte-identical to PR #1700's.

Extended lineage:
Test plan
🤖 Generated with Claude Code