
Add SP8192 Multi-Phase Global SGD + Phased TTT (1.07219 bpb) #1700

Open
jorge-asenjo wants to merge 2 commits into openai:main from jorge-asenjo:submit/multiphase-sgd-ttt-1.07219

Conversation

@jorge-asenjo

Summary

3-seed mean val_bpb 1.07219 on Track A (10min/16MB) using multi-phase global SGD at test-time combined with phased LoRA TTT, SP-8192 tokenization, int7 embeddings, per-layer GPTQ with sigma clipping, Muon optimizer, depth recurrence, VarLen flash attention, and fused triton MLP.
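Of the size-reduction pieces above, the int7 embeddings pair a 7-bit integer format with sigma clipping. A minimal sketch, assuming a symmetric per-tensor scale and a 3-sigma clip (both are assumptions, not details confirmed by this PR):

```python
import torch

def quantize_int7_sigma_clip(w: torch.Tensor, k: float = 3.0):
    """Clip w to mean +/- k*std, then quantize to signed 7-bit ints."""
    mu, sigma = w.mean(), w.std()
    w_clipped = w.clamp(mu - k * sigma, mu + k * sigma)   # sigma clipping
    scale = w_clipped.abs().max() / 63.0                  # int7 range [-64, 63]
    q = torch.round(w_clipped / scale).clamp(-64, 63).to(torch.int8)
    return q, scale           # store int7 codes plus one fp32 scale

def dequantize_int7(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(torch.float32) * scale
```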

Results

Seed    val_bpb    artifact size
42      1.07332    15,930,192 B
0       1.07115    15,939,461 B
1234    1.07211    15,930,004 B
mean    1.07219    all < 16 MB

Approach

Multi-phase global SGD splits validation into N phases. Within each phase:

  1. All chunks are scored under torch.no_grad() (score-first)
  2. Base model weights are updated with SGD on the scored tokens (training only on already-scored tokens)

This cycles for 3 phases, letting the base model progressively adapt to the validation distribution while remaining legal under Issue #1017 (causal, normalized softmax, score-before-update, single pass).
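A minimal sketch of this loop, assuming chunks arrive as (inputs, targets, num_bytes) tuples and plain torch.optim.SGD; the actual implementation lives in train_gpt.py and may differ in detail:

```python
import math
import torch
import torch.nn.functional as F

def multiphase_global_sgd(model, val_chunks, num_phases=3, lr=1e-4):
    """Score-first, phase-by-phase test-time SGD. Each token is scored
    exactly once, always under frozen weights (score-before-update)."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    total_nll, total_bytes = 0.0, 0
    phase_size = math.ceil(len(val_chunks) / num_phases)
    for p in range(num_phases):
        phase = val_chunks[p * phase_size:(p + 1) * phase_size]
        # Phase step 1: score every chunk before ANY weight update.
        scored = []
        with torch.no_grad():
            for inputs, targets, num_bytes in phase:
                logits = model(inputs)                     # causal forward
                nll = F.cross_entropy(                     # normalized softmax
                    logits.view(-1, logits.size(-1)),
                    targets.view(-1), reduction="sum")
                total_nll += nll.item()
                total_bytes += num_bytes
                scored.append((inputs, targets))
        # Phase step 2: SGD only on tokens already scored in this phase,
        # so no token is ever scored after influencing the weights.
        for inputs, targets in scored:
            logits = model(inputs)
            loss = F.cross_entropy(
                logits.view(-1, logits.size(-1)), targets.view(-1))
            opt.zero_grad()
            loss.backward()
            opt.step()
    return total_nll / math.log(2) / total_bytes           # val_bpb
```

Later phases therefore run against a progressively adapted model, which is where the multi-phase variant differs from vanilla per-chunk TTT.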

Legal compliance

  • Causal: standard causal attention
  • Normalized softmax: yes
  • Score-before-update: each chunk fully scored under torch.no_grad() BEFORE any SGD update
  • Single pass: each token scored exactly once

Reproduction

8x H100 SXM, torch 2.9.1+cu128, flash_attn_3 (Hopper wheel). Env vars and seeds documented in records/track_10min_16mb/2026-04-16_SP8192_MultiPhaseGlobalSGD_PhasedTTT/README.md.

Test plan

  • 3-seed run completed with full training and evaluation logs
  • All artifacts fit under 16MB cap
  • Scores consistent across seeds (spread: 1.07115–1.07332)

3-seed mean val_bpb 1.07219 (seeds 42/0/1234 = 1.07332/1.07115/1.07211).
All artifacts <16MB, legal under Issue openai#1017 (score-first, single pass).

Multi-phase global SGD at test-time: within each phase, chunks scored
under torch.no_grad() before any weight update, then SGD on scored tokens.
Combined with SP-8192, int7 embeddings, per-layer GPTQ + sigma clipping,
Muon optimizer, depth recurrence, VarLen FA3, and fused triton MLP.
resouer added a commit to resouer/parameter-golf that referenced this pull request Apr 18, 2026
W78 showed that the raw default surface is nowhere near the claimed score, but
openai#1700 differs from openai#1667 because its attached train logs and README do agree on
the eval-time mechanism. This branch bakes in the surfaced settings from the PR
materials: phased TTT enabled with 3 phases, int7 embeddings, tighter MLP/embed
clip sigmas, and an 80-shard training view matching the attached logs.

Constraint: Keep the architecture fixed and change only the public surface defaults needed to match the PR's own materials
Rejected: Jump straight to new architecture tuning | the unresolved question is still whether openai#1700's claimed public surface is reproducible
Confidence: medium
Scope-risk: narrow
Reversibility: clean
Directive: Treat this as a claimed/log-aligned reproduction lane, not as an original tuning line
Tested: python3 -m py_compile train_gpt.py
Not-tested: Remote train/eval on Lepton
@himanshudongre
Copy link
Copy Markdown

Reading through this PR I am getting two different sets of per-seed numbers
depending on which artifact I look at, and I want to make sure I am reading them right.

The README per-seed table gives 1.07294 / 1.07213 / 1.07259 (mean 1.07255). The
committed seed logs give 1.07332 / 1.07115 / 1.07211 (mean 1.07219). The
submission.json reports 1.07219, matching the logs.

Could you confirm which set is canonical? If the logs are authoritative the headline
should probably say 1.07219 to match submission.json; if the README table is right
something needs to be reconciled with the logs.

README had stale numbers from an earlier run whose logs were lost during
a pod restart; the committed per-seed logs and submission.json are from
the second 3-seed run. Headline and table now match the logs:
seeds 42/0/1234 = 1.07332/1.07115/1.07211, mean 1.07219.
@jorge-asenjo
Author

Good catch — the logs and submission.json are authoritative (1.07219). The stale README table was from an earlier 3-seed run whose per-seed logs were lost during a pod restart; I then re-ran all three seeds and that's the run that is committed (train_seed{42,0,1234}.log). I've just pushed 5f54d26 updating the headline and table to 1.07219 / seeds 1.07332 / 1.07115 / 1.07211 so README, logs and submission.json all agree.

amrayach added a commit to amrayach/parameter-golf that referenced this pull request Apr 18, 2026
…verlay

Supersedes the 2026-04-18 post-Ultrareview pin (876bb36). Rev 5 adds
provenance auto-capture, repo-type=model fix, exact-SHA env-var pin, and
run_all.sh/README alignment; new pin reflects the pipeline-patch commit.

Also records the live-guidance absolute-BPB overlay and 04b deprecation
driven by open-PR competitive intel (openai#1700 / openai#1716 / openai#1707 / openai#1693).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
sergeevii123 added a commit to sergeevii123/parameter-golf that referenced this pull request Apr 18, 2026
…base

Stage 1 of cross-stack port: minimal model-level additions on top of
PR openai#1700 (Multi-Phase Global SGD + Phased TTT + VarLen + DepthRec,
1.07219 mean) without touching the weight-bank attention path.

Changes:
- QK_GAIN_INIT default 5.0 -> 5.25
- SmearGate (modded-nanogpt forward-1 token smear, sketched below)
  added at model level, inserted between tok_emb and rms_norm in both
  forward_logits and forward_ttt. New params (smear_gate.weight,
  smear_lambda) get auto-passthrough under the numel<=65536 quant rule
  and are registered with the scalar AdamW optimizer.

AttnOutGate (the larger of the two gates from PR openai#1667) is deferred
to Stage 2 since it needs surgery inside the attention/bank forward.

If Stage 1 lands <=1.0710 it validates the port + motivates Stage 2.
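A minimal sketch of the SmearGate this commit describes. The parameter names (smear_gate, smear_lambda) and the insertion point between tok_emb and rms_norm come from the commit message; the sigmoid-bounded gating form is an assumption:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SmearGate(nn.Module):
    """Blend each token's embedding with the previous token's (forward-1
    smear), through a learned per-position gate and a global strength."""
    def __init__(self, dim: int):
        super().__init__()
        self.smear_gate = nn.Linear(dim, 1, bias=False)    # per-position gate
        self.smear_lambda = nn.Parameter(torch.zeros(()))  # global strength

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim) token embeddings, applied before rms_norm.
        prev = F.pad(x, (0, 0, 1, 0))[:, :-1]   # token t-1 (zeros at t=0)
        g = torch.sigmoid(self.smear_gate(x))   # (batch, seq, 1), in (0, 1)
        return x + torch.sigmoid(self.smear_lambda) * g * prev
```

The smear only looks backward one token, so causality is preserved.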
yahya010 added a commit to yahya010/parameter-golf that referenced this pull request Apr 19, 2026
…ed mean)

3-seed mean 1.01080 (std 0.00115), seeds 42/314/999:
  - seed 42:  1.01205
  - seed 314: 1.00978 (below PR openai#1698's entire 3-seed mean)
  - seed 999: 1.01056
Beats merged SOTA (1.0810, PR openai#1493) by -0.07020 BPB.

Macro-phase SGD TTT hook added from PR openai#1700's Multi-Phase Global
SGD design but disabled in the scored run (ttt_macro_phases=0) — on
seed 42 it was a wash vs vanilla per-chunk SGD (-0.00999 vs -0.01012
TTT gain), so it was not worth the extra eval time.

All artifacts < 16,000,000 bytes. All train < 600s. All eval < 600s.
sergeevii123 added a commit to sergeevii123/parameter-golf that referenced this pull request Apr 19, 2026
…base

Builds on Stage 1 (SmearGate + QK-Gain 5.25, seed 42 = 1.07219). Adds a
per-head multiplicative gate inside attention (g = 2*sigmoid(W x[:,:12])
broadcast across head_dim, applied between the flash_attn output and
out_proj). The projection is zero-initialized so the gate starts at
exactly 1.0, making Stage 2 numerically identical to Stage 1 at step 0.

Wired into:
- CausalSelfAttention.forward (forward_logits path)
- _block_with_lora (sequential TTT path)
- _parallel_block_with_lora (parallel TTT path, layers >= parallel_start_layer)

Param footprint: 96 floats per layer (8 heads x 12 width), 1152 total
across 12 layers. Auto-passthrough via numel <= 65536 quant rule.
Routed to scalar AdamW via attn_gate_proj entry in
CONTROL_TENSOR_NAME_PATTERNS.

Hypothesis: AttnOutGate adds ~0.0010-0.0015 BPB on top of Stage 1.
Combined with Stage 1 gain (0.0011 over PR openai#1700), full PR openai#1667 ->
PR openai#1700 cross-stack port should reach ~1.0707-1.0710 (seed 42).
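A minimal sketch of this AttnOutGate. The 8x12 zero-initialized projection and the g = 2*sigmoid(W x[:,:12]) form come from the commit; the tensor layout around flash_attn is an assumption:

```python
import torch
import torch.nn as nn

class AttnOutGate(nn.Module):
    def __init__(self, n_heads: int = 8, gate_width: int = 12):
        super().__init__()
        self.gate_width = gate_width
        # 96 params per layer (n_heads x gate_width); zero-init makes
        # g = 2*sigmoid(0) = 1.0, so the gate starts as the identity.
        self.attn_gate_proj = nn.Linear(gate_width, n_heads, bias=False)
        nn.init.zeros_(self.attn_gate_proj.weight)

    def forward(self, attn_out: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
        # attn_out: (batch, seq, n_heads, head_dim) flash-attn output;
        # x: (batch, seq, dim) block input used to compute the gate.
        g = 2.0 * torch.sigmoid(self.attn_gate_proj(x[..., :self.gate_width]))
        return attn_out * g.unsqueeze(-1)   # broadcast across head_dim
```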
sergeevii123 added a commit to sergeevii123/parameter-golf that referenced this pull request Apr 20, 2026
Drops RoPE entirely from the SmearGate + AttnOutGate + QK525 frontier
to test whether absolute-position bias is bottlenecking the PR openai#1700
TTT + PR openai#1667 gates stack at ~1.071 BPB. PR openai#1718 explicitly flagged
relative-position attention as the next architectural axis, and no PR
has tried NoPE at the frontier.

ALiBi was the first choice, but FA3
(Dao-AILab/flash-attention/hopper/flash_attn_interface.py) has no
alibi_slopes parameter, and FA2 fallback breaks the 600s budget under
TTT. NoPE is the cheapest position-axis test under FA3.

NOPE env knob (default 1) gates apply_rotary_emb in three attn paths:
forward(), _block_with_lora(), _parallel_block_with_lora(). Rotary
module is still constructed so warmup calls remain harmless and the
diff is reversible by NOPE=0 (reproduces Stage 2 numerics). Zero new
params, submission size unchanged.
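A minimal sketch of the NOPE gating described here, with a generic rotary callable standing in for the repo's apply_rotary_emb:

```python
import os

NOPE = int(os.environ.get("NOPE", "1"))   # default 1, per the commit

def maybe_apply_rotary(q, k, rotary):
    """Gate rotary application on the NOPE env knob. The rotary module is
    still constructed by the caller, so NOPE=0 reproduces the RoPE
    numerics exactly and the change is fully reversible."""
    if NOPE:
        return q, k              # NoPE: no positional encoding at all
    return rotary(q), rotary(k)  # NOPE=0: original RoPE path
```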
