Add SP8192 Multi-Phase Global SGD + Phased TTT (1.07219 bpb) #1700
jorge-asenjo wants to merge 2 commits into openai:main
Conversation
3-seed mean val_bpb 1.07219 (seeds 42/0/1234 = 1.07332/1.07115/1.07211). All artifacts <16MB, legal under Issue openai#1017 (score-first, single pass). Multi-phase global SGD at test time: within each phase, chunks are scored under torch.no_grad() before any weight update, then SGD runs on the scored tokens. Combined with SP-8192, int7 embeddings, per-layer GPTQ + sigma clipping, Muon optimizer, depth recurrence, VarLen FA3, and fused triton MLP.
W78 showed that the raw default surface is nowhere near the claimed score, but openai#1700 differs from openai#1667 because its attached train logs and README do agree on the eval-time mechanism. This branch bakes in the surfaced settings from the PR materials: phased TTT enabled with 3 phases, int7 embeddings, tighter MLP/embed clip sigmas, and an 80-shard training view matching the attached logs.

Constraint: keep the architecture fixed and change only the public-surface defaults needed to match the PR's own materials.
Rejected: jumping straight to new architecture tuning; the unresolved question is still whether openai#1700's claimed public surface is reproducible.
Confidence: medium. Scope-risk: narrow. Reversibility: clean.
Directive: treat this as a claimed/log-aligned reproduction lane, not as an original tuning line.
Tested: python3 -m py_compile train_gpt.py
Not tested: remote train/eval on Lepton.
Reading through this PR I am getting three different per-seed mean candidates depending on where I look: the headline, the README per-seed table, and the committed logs each give a different set of numbers. Could you confirm which set is canonical? If the logs are authoritative, the headline and README table need updating to match.
Good catch: the logs and submission.json are authoritative (1.07219). The stale README table was from an earlier 3-seed run whose per-seed logs were lost during a pod restart; I then re-ran all three seeds, and that is the run that is committed (train_seed{42,0,1234}.log). I've just pushed 5f54d26 updating the headline and table to 1.07219 / seeds 1.07332 / 1.07115 / 1.07211, so README, logs, and submission.json all agree.
…verlay Supersedes the 2026-04-18 post-Ultrareview pin (876bb36). Rev 5 adds provenance auto-capture, repo-type=model fix, exact-SHA env-var pin, and run_all.sh/README alignment; new pin reflects the pipeline-patch commit. Also records the live-guidance absolute-BPB overlay and 04b deprecation driven by open-PR competitive intel (openai#1700 / openai#1716 / openai#1707 / openai#1693). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…base Stage 1 of cross-stack port: minimal model-level additions on top of PR openai#1700 (Multi-Phase Global SGD + Phased TTT + VarLen + DepthRec, 1.07219 mean) without touching the weight-bank attention path.

Changes:
- QK_GAIN_INIT default 5.0 -> 5.25
- SmearGate (modded-nanogpt forward-1 token smear) added at model level, inserted between tok_emb and rms_norm in both forward_logits and forward_ttt (see the sketch below). New params (smear_gate.weight, smear_lambda) auto-passthrough quant via the numel <= 65536 rule and are registered with the scalar AdamW optimizer.

AttnOutGate (the larger of the two gates from PR openai#1667) is deferred to Stage 2 since it needs surgery inside the attention/bank forward. If Stage 1 lands <= 1.0710 it validates the port and motivates Stage 2.
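A minimal PyTorch sketch of the SmearGate idea, assuming a per-channel gate vector (smear_gate.weight) and a scalar smear_lambda as named in the commit message; the exact parameterization in train_gpt.py and in modded-nanogpt may differ:

```python
import torch
import torch.nn as nn

class SmearGate(nn.Module):
    """Mixes a gated fraction of token t-1's embedding into token t
    (the modded-nanogpt "forward-1 token smear"), applied between
    tok_emb and rms_norm. A sketch: `dim` and the exact gating form
    are assumptions, not the PR's code."""
    def __init__(self, dim: int):
        super().__init__()
        self.weight = nn.Parameter(torch.zeros(dim))       # per-channel gate logits
        self.smear_lambda = nn.Parameter(torch.zeros(()))  # scalar mix strength

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim) token embeddings
        prev = torch.cat([torch.zeros_like(x[:, :1]), x[:, :-1]], dim=1)  # token t-1
        gate = torch.sigmoid(self.weight) * self.smear_lambda
        # smear_lambda is zero-initialized, so this is an identity at step 0
        return x + gate * prev
```

With dim around the model width, both parameters fall well under the numel <= 65536 passthrough rule mentioned above.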
…ed mean) 3-seed mean 1.01080 (std 0.00115), seeds 42/314/999:
- seed 42: 1.01205
- seed 314: 1.00978 (below PR openai#1698's entire 3-seed mean)
- seed 999: 1.01056

Beats merged SOTA (1.0810, PR openai#1493) by -0.07020 BPB. Macro-phase SGD TTT hook added from PR openai#1700's Multi-Phase Global SGD design but disabled in the scored run (ttt_macro_phases=0): on seed 42 it was a wash vs vanilla per-chunk SGD (-0.00999 vs -0.01012 TTT gain), so not worth the extra eval time. All artifacts < 16,000,000 bytes. All train < 600s. All eval < 600s.
…base Builds on Stage 1 (SmearGate + QK-Gain 5.25, seed 42 = 1.07219). Adds a per-head multiplicative gate inside attention (g = 2*sigmoid(W x[:,:12]), broadcast across head_dim, applied between the flash_attn output and out_proj). The projection is zero-initialized so the gate starts at ~1.0, making Stage 2 numerically identical to Stage 1 at step 0; a sketch follows the list below.

Wired into:
- CausalSelfAttention.forward (forward_logits path)
- _block_with_lora (sequential TTT path)
- _parallel_block_with_lora (parallel TTT path, layers >= parallel_start_layer)

Param footprint: 96 floats per layer (8 heads x 12 width), 1152 total across 12 layers. Auto-passthrough via the numel <= 65536 quant rule; routed to scalar AdamW via the attn_gate_proj entry in CONTROL_TENSOR_NAME_PATTERNS.

Hypothesis: AttnOutGate adds ~0.0010-0.0015 BPB on top of Stage 1. Combined with the Stage 1 gain (0.0011 over PR openai#1700), the full PR openai#1667 -> PR openai#1700 cross-stack port should reach ~1.0707-1.0710 (seed 42).
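A sketch of the per-head gate under the shapes stated above (8 heads, 12-channel gate input, 96 floats per layer); the class and argument names are placeholders, not the PR's actual code:

```python
import torch
import torch.nn as nn

class AttnOutGate(nn.Module):
    """g = 2*sigmoid(W x[..., :12]) per head, broadcast across head_dim,
    applied between the flash_attn output and out_proj. Zero-init =>
    sigmoid(0) = 0.5 => g = 1.0, so the gate starts as an identity."""
    def __init__(self, n_heads: int = 8, gate_width: int = 12):
        super().__init__()
        # 8 heads x 12 input channels = 96 floats per layer
        self.attn_gate_proj = nn.Linear(gate_width, n_heads, bias=False)
        nn.init.zeros_(self.attn_gate_proj.weight)

    def forward(self, attn_out: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
        # attn_out: (batch, seq, n_heads, head_dim) from flash_attn
        # x:        (batch, seq, dim) block input; only its first 12 channels feed the gate
        g = 2.0 * torch.sigmoid(self.attn_gate_proj(x[..., :12]))  # (batch, seq, n_heads)
        return attn_out * g.unsqueeze(-1)  # broadcast across head_dim
```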
Drops RoPE entirely from the SmearGate + AttnOutGate + QK525 frontier to test whether absolute-position bias is bottlenecking the PR openai#1700 TTT + PR openai#1667 gates stack at ~1.071 BPB. PR openai#1718 explicitly flagged relative-position attention as the next architectural axis, and no PR has tried NoPE at the frontier. ALiBi was the first choice, but FA3 (Dao-AILab/flash-attention/hopper/flash_attn_interface.py) has no alibi_slopes parameter, and the FA2 fallback breaks the 600s budget under TTT. NoPE is the cheapest position-axis test under FA3.

A NOPE env knob (default 1) gates apply_rotary_emb in the three attention paths: forward(), _block_with_lora(), _parallel_block_with_lora() (see the sketch below). The rotary module is still constructed, so warmup calls remain harmless and the diff is reversible via NOPE=0 (reproduces Stage 2 numerics). Zero new params, submission size unchanged.
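A sketch of how such an env knob could gate the rotary call; the real apply_rotary_emb call sites in train_gpt.py are not shown in this PR, so `rotary` below is a stand-in with an assumed signature:

```python
import os

# NOPE=1 (default) skips rotary entirely; NOPE=0 reproduces Stage 2 numerics.
NOPE = os.environ.get("NOPE", "1") == "1"

def maybe_apply_rotary(q, k, rotary):
    """Gate rotary position embeddings behind the NOPE env knob.
    The rotary module is still constructed by the caller, so warmup
    calls stay harmless and the diff stays reversible."""
    if NOPE:
        return q, k  # NoPE: no position information injected into q/k
    return rotary(q), rotary(k)
```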
Summary
3-seed mean val_bpb 1.07219 on Track A (10min/16MB) using multi-phase global SGD at test time combined with phased LoRA TTT, SP-8192 tokenization, int7 embeddings, per-layer GPTQ with sigma clipping, Muon optimizer, depth recurrence, VarLen flash attention, and fused triton MLP.
Results

- seed 42: 1.07332
- seed 0: 1.07115
- seed 1234: 1.07211
- mean: 1.07219
Approach
Multi-phase global SGD splits validation into N phases. Within each phase:
1. All chunks in the phase are scored under torch.no_grad() (score-first).
2. SGD then updates the weights on the already-scored tokens.

This cycles for 3 phases, letting the base model progressively adapt to the validation distribution while remaining legal under Issue #1017 (causal, normalized softmax, score-before-update, single pass). A minimal sketch of the loop follows.
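A sketch of the phase loop under these constraints, assuming model(chunk) returns mean cross-entropy in nats/token and val_chunks is an ordered list of token tensors; the real loop lives in train_gpt.py and its details (chunking, lr, optimizer state) are not shown in this PR:

```python
import math
import torch

def multi_phase_global_sgd_eval(model, val_chunks, n_phases=3, lr=1e-4):
    """Score-first multi-phase TTT eval (sketch, assumptions as above).
    Legality: every chunk's bits are recorded under no_grad BEFORE any
    weight update sees its tokens, and each chunk is scored exactly once."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    per_phase = math.ceil(len(val_chunks) / n_phases)
    total_bits, total_tokens = 0.0, 0

    for p in range(n_phases):
        phase = val_chunks[p * per_phase:(p + 1) * per_phase]

        # 1) Score-first: record this phase's bits, no gradients
        with torch.no_grad():
            for chunk in phase:
                nats = model(chunk).item()              # mean loss, nats/token
                total_bits += nats / math.log(2) * chunk.numel()
                total_tokens += chunk.numel()

        # 2) Only then adapt: SGD on the already-scored tokens
        for chunk in phase:
            opt.zero_grad()
            model(chunk).backward()
            opt.step()

    # Single pass over validation, so this is directly val_bpb
    return total_bits / total_tokens
```

The score-first ordering inside each phase is what keeps the run legal: gradients never touch a token before its bits are recorded.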
Legal compliance
- Every chunk is scored under torch.no_grad() BEFORE any SGD update touches the weights (score-first).
- Attention is causal and the softmax is normalized throughout.
- Single pass: each validation token is scored exactly once.

Reproduction
8x H100 SXM, torch 2.9.1+cu128, flash_attn_3 (Hopper wheel). Env vars and seeds documented in records/track_10min_16mb/2026-04-16_SP8192_MultiPhaseGlobalSGD_PhasedTTT/README.md.

Test plan