Add SP8192 Multi-Phase Global SGD + Phased TTT (1.07219 bpb) by jorge-asenjo · Pull Request #1700 · openai/parameter-golf

jorge-asenjo · 2026-04-17T20:53:40Z

Summary

3-seed mean val_bpb 1.07219 on Track A (10min/16MB) using multi-phase global SGD at test-time combined with phased LoRA TTT, SP-8192 tokenization, int7 embeddings, per-layer GPTQ with sigma clipping, Muon optimizer, depth recurrence, VarLen flash attention, and fused triton MLP.

Results

Seed	val_bpb	artifact
42	1.07332	15,930,192 B
0	1.07115	15,939,461 B
1234	1.07211	15,930,004 B
mean	1.07219	all <16MB

Approach

Multi-phase global SGD splits validation into N phases. Within each phase:

All chunks are scored under torch.no_grad() (score-first)
Base model weights updated with SGD on scored tokens (train on already-scored tokens only)

This cycles for 3 phases, letting the base model progressively adapt to validation distribution while remaining legal under Issue #1017 (causal, normalized softmax, score-before-update, single pass).

Legal compliance

Causal: standard causal attention
Normalized softmax: yes
Score-before-update: each chunk fully scored under torch.no_grad() BEFORE any SGD update
Single pass: each token scored exactly once

Reproduction

8x H100 SXM, torch 2.9.1+cu128, flash_attn_3 (Hopper wheel). Env vars and seeds documented in records/track_10min_16mb/2026-04-16_SP8192_MultiPhaseGlobalSGD_PhasedTTT/README.md.

Test plan

3-seed run completed with full training and evaluation logs
All artifacts fit under 16MB cap
Scores consistent across seeds (spread: 1.07115 - 1.07332)

3-seed mean val_bpb 1.07219 (seeds 42/0/1234 = 1.07332/1.07115/1.07211). All artifacts <16MB, legal under Issue openai#1017 (score-first, single pass). Multi-phase global SGD at test-time: within each phase, chunks scored under torch.no_grad() before any weight update, then SGD on scored tokens. Combined with SP-8192, int7 embeddings, per-layer GPTQ + sigma clipping, Muon optimizer, depth recurrence, VarLen FA3, and fused triton MLP.

W78 showed that the raw default surface is nowhere near the claimed score, but openai#1700 differs from openai#1667 because its attached train logs and README do agree on the eval-time mechanism. This branch bakes in the surfaced settings from the PR materials: phased TTT enabled with 3 phases, int7 embeddings, tighter MLP/embed clip sigmas, and an 80-shard training view matching the attached logs. Constraint: Keep the architecture fixed and change only the public surface defaults needed to match the PR's own materials Rejected: Jump straight to new architecture tuning | the unresolved question is still whether openai#1700's claimed public surface is reproducible Confidence: medium Scope-risk: narrow Reversibility: clean Directive: Treat this as a claimed/log-aligned reproduction lane, not as an original tuning line Tested: python3 -m py_compile train_gpt.py Not-tested: Remote train/eval on Lepton

himanshudongre · 2026-04-18T13:16:48Z

Reading through this PR I am getting three different per-seed mean candidates depending
on which artifact I look at, and I want to make sure I am reading them right.

The README per-seed table gives 1.07294 / 1.07213 / 1.07259 (mean 1.07255). The
committed seed logs give 1.07332 / 1.07115 / 1.07211 (mean 1.07219). The
submission.json reports 1.07219, matching the logs.

Could you confirm which set is canonical? If the logs are authoritative the headline
should probably say 1.07219 to match submission.json; if the README table is right
something needs to be reconciled with the logs.

README had stale numbers from an earlier run whose logs were lost during a pod restart; the committed per-seed logs and submission.json are from the second 3-seed run. Headline and table now match the logs: seeds 42/0/1234 = 1.07332/1.07115/1.07211, mean 1.07219.

jorge-asenjo · 2026-04-18T13:19:19Z

Good catch — the logs and submission.json are authoritative (1.07219). The stale README table was from an earlier 3-seed run whose per-seed logs were lost during a pod restart; I then re-ran all three seeds and that's the run that is committed (train_seed{42,0,1234}.log). I've just pushed 5f54d26 updating the headline and table to 1.07219 / seeds 1.07332 / 1.07115 / 1.07211 so README, logs and submission.json all agree.

…verlay Supersedes the 2026-04-18 post-Ultrareview pin (876bb36). Rev 5 adds provenance auto-capture, repo-type=model fix, exact-SHA env-var pin, and run_all.sh/README alignment; new pin reflects the pipeline-patch commit. Also records the live-guidance absolute-BPB overlay and 04b deprecation driven by open-PR competitive intel (openai#1700 / openai#1716 / openai#1707 / openai#1693). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…base Stage 1 of cross-stack port: minimal model-level additions on top of PR openai#1700 (Multi-Phase Global SGD + Phased TTT + VarLen + DepthRec, 1.07219 mean) without touching the weight-bank attention path. Changes: - QK_GAIN_INIT default 5.0 -> 5.25 - SmearGate (modded-nanogpt forward-1 token smear) added at model level, inserted between tok_emb and rms_norm in both forward_logits and forward_ttt. New params (smear_gate.weight, smear_lambda) auto passthrough quant via numel<=65536 rule and registered with the scalar AdamW optimizer. AttnOutGate (the larger of the two gates from PR openai#1667) is deferred to Stage 2 since it needs surgery inside the attention/bank forward. If Stage 1 lands <=1.0710 it validates the port + motivates Stage 2.

…ed mean) 3-seed mean 1.01080 (std 0.00115), seeds 42/314/999: - seed 42: 1.01205 - seed 314: 1.00978 (below PR openai#1698's entire 3-seed mean) - seed 999: 1.01056 Beats merged SOTA (1.0810, PR openai#1493) by -0.07020 BPB. Macro-phase SGD TTT hook added from PR openai#1700's Multi-Phase Global SGD design but disabled in the scored run (ttt_macro_phases=0) — on seed 42 it was wash vs vanilla per-chunk SGD (-0.00999 vs -0.01012 TTT gain), so not worth the extra eval time. All artifacts < 16,000,000 bytes. All train < 600s. All eval < 600s.

…base Builds on Stage 1 (SmearGate + QK-Gain 5.25, seed 42 = 1.07219). Adds per-head multiplicative gate inside attention (g = 2*sigmoid(W x[:,:12]) broadcast across head_dim, applied between flash_attn output and out_proj). Zero-init projection so gate starts at ~1.0 — Stage 2 is numerically identical to Stage 1 at step 0. Wired into: - CausalSelfAttention.forward (forward_logits path) - _block_with_lora (sequential TTT path) - _parallel_block_with_lora (parallel TTT path, layers >= parallel_start_layer) Param footprint: 96 floats per layer (8 heads x 12 width), 1152 total across 12 layers. Auto-passthrough via numel <= 65536 quant rule. Routed to scalar AdamW via attn_gate_proj entry in CONTROL_TENSOR_NAME_PATTERNS. Hypothesis: AttnOutGate adds ~0.0010-0.0015 BPB on top of Stage 1. Combined with Stage 1 gain (0.0011 over PR openai#1700), full PR openai#1667 -> PR openai#1700 cross-stack port should reach ~1.0707-1.0710 (seed 42).

Drops RoPE entirely from the SmearGate + AttnOutGate + QK525 frontier to test whether absolute-position bias is bottlenecking the PR openai#1700 TTT + PR openai#1667 gates stack at ~1.071 BPB. PR openai#1718 explicitly flagged relative-position attention as the next architectural axis, and no PR has tried NoPE at frontier. ALiBi was the first choice, but FA3 (Dao-AILab/flash-attention/hopper/flash_attn_interface.py) has no alibi_slopes parameter, and FA2 fallback breaks the 600s budget under TTT. NoPE is the cheapest position-axis test under FA3. NOPE env knob (default 1) gates apply_rotary_emb in three attn paths: forward(), _block_with_lora(), _parallel_block_with_lora(). Rotary module is still constructed so warmup calls remain harmless and the diff is reversible by NOPE=0 (reproduces Stage 2 numerics). Zero new params, submission size unchanged.

yahya010 mentioned this pull request Apr 18, 2026

Record: SP8192 MP-SGD TTT (4 phases) + QK-Gain 5.25 — val_bpb 1.07217 (3-seed mean) #1727

Open

12 tasks

yahya010 mentioned this pull request Apr 19, 2026

Record: GatedDeltaNet + Legal TTT + Brotli-11 — val_bpb 1.01080 (3-seed mean, VALID artifacts) #1734

Closed

12 tasks

miaoyuxun mentioned this pull request Apr 23, 2026

Record: SP8192 + SmearGate + AttnOutGate(w24) + LoRA-TTT Improvements + Phased TTT — val_bpb 1.06991 (3-seed mean) #1790

Open

abi2024 mentioned this pull request Apr 24, 2026

Audit 1698 lineage bpb bytecount #1804

Open

AjAnubolu mentioned this pull request Apr 27, 2026

Record: SP8192 + Polar Express NS + MIN_LR + LQER Asym Rank-4 — val_bpb 1.06766 (3-seed mean) #1874

Open

4 tasks

jorge-asenjo mentioned this pull request Apr 29, 2026

Record: SP8192 #1855 Base + Asymmetric Logit Rescale + AWQ-lite — val_bpb 1.05971 (3-seed mean, full val) #1923

Open

3 tasks

deborahnelson8788726 mentioned this pull request Apr 29, 2026

Record: SP8192 + yahya010 NN base + byte-PPM mixer — val_bpb 0.99145 … #1933

Closed

7 tasks

sahiee-dev mentioned this pull request Apr 30, 2026

SP8192 + PolarExpressNS + MIN_LR + LQER Asym Rank-4 | val_bpb=1.07302 (3-seed mean) #1977

Open

9 tasks

Kbediako mentioned this pull request Apr 30, 2026

Non-record: final-day fallbacks incl. StageB scalar-control no-TTT #1980

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add SP8192 Multi-Phase Global SGD + Phased TTT (1.07219 bpb)#1700

Add SP8192 Multi-Phase Global SGD + Phased TTT (1.07219 bpb)#1700
jorge-asenjo wants to merge 2 commits intoopenai:mainfrom
jorge-asenjo:submit/multiphase-sgd-ttt-1.07219

jorge-asenjo commented Apr 17, 2026

Uh oh!

himanshudongre commented Apr 18, 2026

Uh oh!

jorge-asenjo commented Apr 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

jorge-asenjo commented Apr 17, 2026

Summary

Results

Approach

Legal compliance

Reproduction

Test plan

Uh oh!

himanshudongre commented Apr 18, 2026

Uh oh!

jorge-asenjo commented Apr 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants