
Record: 11L Backout + Int6 + SWA (val_bpb: 1.1364)#339

Open
sheeki03 wants to merge 1 commit into openai:main from sheeki03:submission/11l-backout-1.1364

Conversation

@sheeki03

Record: 11L Backout + Int6 + SWA (val_bpb: 1.1364)

val_bpb: 1.1364 (sliding window, stride=64) | 16.17 MB | 8xH100 SXM, 600s

Known Issue

Artifact is 16,170,051 bytes, about 170 KB over the 16,000,000-byte cap. The code supports INT5_MLP=1, which switches MLP quantization from int6 to int5, saving 1-2 MB. A follow-up run is planned to bring the artifact under the cap.
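As a back-of-envelope check (the MLP shapes below are assumptions inferred from the architecture section, not taken from the PR's code), dropping MLP weights from 6 to 5 bits per parameter saves roughly:

```python
def packed_bytes(n_params: int, bits: int) -> int:
    """Bytes needed to bit-pack n_params values at `bits` bits each."""
    return (n_params * bits + 7) // 8

# Assumed: 11 layers, dim 512, 3x MLP with two projection matrices per layer.
dim, mlp_mult, layers = 512, 3, 11
mlp_params = layers * 2 * dim * (dim * mlp_mult)   # ~17.3M parameters
saving = packed_bytes(mlp_params, 6) - packed_bytes(mlp_params, 5)
print(f"{saving / 1e6:.2f} MB")                    # on the order of the 1-2 MB claimed
```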

Progress from prior submissions

| Metric | PR #198 | This PR | Delta |
|---|---|---|---|
| val_bpb (sliding) | 1.1318 (s64) | 1.1364 (s64) | +0.0046 |
| Steps (600s) | 7412 | 6642 | -770 |
| Step time | 81ms | 90ms | +9ms |
| Artifact | 15.7 MB | 16.2 MB | +0.5 MB |

Note: Our baseline replication of PR #198's config yielded 1.1435 (vs their reported 1.1318), likely due to hardware/driver differences (RunPod community cloud vs dedicated). Relative to our own baseline, Backout improves by -0.0071.

What's new

Backout Connection — A learned residual subtraction from a mid-network hidden state. After the U-Net encoder-decoder forward pass, the model subtracts lambda * h_mid from the final representation, where lambda is a learned scalar (initialized at 0.2) and h_mid is the hidden state at layer num_layers // 2.

This acts as a learned negative residual that removes redundant mid-network information, sharpening the final representation for the language modeling head. Zero additional matrix parameters — only one learned scalar.
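A minimal PyTorch sketch of the mechanism (the `Linear` blocks are illustrative stand-ins for the real transformer layers, not the PR's code; only the `lambda * h_mid` subtraction mirrors the description above):

```python
import torch
import torch.nn as nn

class BackoutStack(nn.Module):
    """Illustrative residual stack with a Backout Connection.

    Only the backout_lambda * h_mid subtraction mirrors the PR; the
    Linear blocks stand in for the actual transformer layers.
    """
    def __init__(self, num_layers: int = 11, dim: int = 512, lambda_init: float = 0.2):
        super().__init__()
        self.layers = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_layers))
        self.backout_layer = num_layers // 2                           # layer 5 for 11 layers
        self.backout_lambda = nn.Parameter(torch.tensor(lambda_init))  # one learned scalar

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        h_mid = h
        for i, layer in enumerate(self.layers):
            h = h + layer(h)                   # residual block
            if i == self.backout_layer:
                h_mid = h                      # tap the mid-network hidden state
        # learned negative residual: back out redundant mid-network signal
        return h - self.backout_lambda * h_mid

model = BackoutStack()
out = model(torch.randn(2, 8, 512))            # (batch, seq, dim)
```

Because `backout_lambda` is an `nn.Parameter`, it trains with the rest of the model but adds only a single scalar to the parameter count.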

Controlled comparison (same hardware, same run)

| Metric | Baseline (PR #198 config) | + Backout | Delta |
|---|---|---|---|
| val_bpb (sliding, s=64) | 1.1435 | 1.1364 | -0.0071 |
| val_loss | 1.9307 | 1.9188 | -0.0119 |
| Steps (600s) | 5246 | 6642 | +1396 |
| Step time | 114ms | 90ms | -24ms |
| Artifact | 17.1 MB (zlib) | 16.2 MB (zstd) | -0.9 MB |

Results

| Metric | Value |
|---|---|
| Pre-quant val_bpb | 1.1544 |
| Int6 roundtrip val_bpb | 1.1588 |
| Int6 sliding val_bpb (s64) | 1.1364 |
| Steps completed (600s cap) | 6642 |
| Step time | 90ms |
| Artifact size | 16,170,051 bytes |
| Code size | 70,854 bytes |
| SWA checkpoints averaged | 6 |
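The 6-checkpoint SWA can be pictured as a uniform average over saved state dicts (a generic sketch; the PR's actual snapshot schedule and weighting are not shown here):

```python
import torch

def average_checkpoints(state_dicts):
    """Uniform average of a list of model state_dicts (float tensors)."""
    n = len(state_dicts)
    avg = {k: torch.zeros_like(v, dtype=torch.float32) for k, v in state_dicts[0].items()}
    for sd in state_dicts:
        for k, v in sd.items():
            avg[k] += v.float() / n
    return avg

# toy example: 6 snapshots, as in the run above
ckpts = [{"w": torch.full((2, 2), float(i))} for i in range(6)]
avg = average_checkpoints(ckpts)   # avg["w"] is 2.5 everywhere (mean of 0..5)
```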

Architecture

11 layers, 512 dim, 8 heads / 4 KV heads, MLP 3x, relu-squared, SmearGate, BigramHash(4096), OrthoInit, Muon + AdamW with WD=0.04, SWA, int6 mixed quant + zstd, FA3, seq 2048, sliding window eval stride=64.
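The stride-64 sliding-window eval can be sketched as follows (`nll_fn` is a hypothetical stand-in returning the model's per-token NLLs for a window; this is the generic pattern, not the PR's eval code):

```python
import torch

def sliding_window_nll(token_ids, nll_fn, seq_len=2048, stride=64):
    """Mean per-token NLL where each window scores only its last `stride`
    tokens, so every scored token sees close to `seq_len` of left context.
    Assumes len(token_ids) - seq_len is a multiple of `stride`."""
    total_nll, total_tokens = 0.0, 0
    for end in range(seq_len, len(token_ids) + 1, stride):
        window = token_ids[end - seq_len:end]
        nll = nll_fn(window)                       # per-token NLL in nats
        total_nll += nll[-stride:].sum().item()    # score only the fresh tail
        total_tokens += stride
    return total_nll / total_tokens
```

Converting mean NLL (nats/token) to bits-per-byte then divides by ln(2) and by the tokenizer's mean bytes per token.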

Backout layer: num_layers // 2 (layer 5). Lambda: learned scalar, initialized at 0.2.

Run command

```shell
NUM_LAYERS=11 BIGRAM_VOCAB_SIZE=4096 \
MUON_WD=0.04 ADAM_WD=0.04 \
MATRIX_LR=0.025 SCALAR_LR=0.025 TIED_EMBED_LR=0.035 \
MUON_MOMENTUM=0.99 MUON_MOMENTUM_WARMUP_START=0.92 \
MUON_MOMENTUM_WARMUP_STEPS=1500 WARMDOWN_ITERS=3000 \
ITERATIONS=9000 MAX_WALLCLOCK_SECONDS=600 EVAL_STRIDE=64 \
BACKOUT_ENABLED=1 BACKOUT_LAMBDA_INIT=0.2 \
LAWA_ENABLED=0 INT5_MLP=0 VE_ENABLED=0 \
python3 -m torch.distributed.run --standalone --nproc_per_node=8 train_gpt.py
```

Hardware

8xH100 SXM 80GB HBM3 (RunPod, EUR-IS-3)

Next steps

  1. Run with INT5_MLP=1 to bring artifact under 16MB
  2. Multi-seed validation (3 seeds)
  3. Combine Backout with XSA + EMA + TTT from PR #315 (Record: 11L Partial RoPE + LN Scale + EMA + XSA4, val_bpb: 1.1248) and PR #338 (Record: 11L XSA+EMA+TTT, sliding val_bpb=1.1254, 3-seed mean 1.1256)

Adds Backout Connection — learned residual subtraction from mid-network
hidden state. Improves val_bpb by 0.0071 over PR openai#198 baseline with
zero additional matrix parameters (one learned scalar).

val_bpb: 1.1364 (sliding window, stride=64)
Artifact: 16,170,051 bytes (170KB over cap, fixable with INT5_MLP=1)
Hardware: 8xH100 SXM, 600s wallclock

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
kasimte pushed a commit to kasimte/parameter-golf that referenced this pull request Mar 22, 2026
Three low-risk additions:
- Memory Tokens (64 learnable embeddings, -0.014 A/B, PR openai#352)
- Backout Connection (learned mid-layer subtraction, -0.007, PR openai#339)
- Tight SWA (scale<0.2, every 50, replacing EMA; PR openai#374)

Bugs found and fixed during review:
- memory_tokens/backout_lambda not in optimizer groups (code review)
- memory_tokens appended to embed_params AFTER optimizer creation (/refine)
- Dead encoder-loop h_mid check removed

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@MatoTeziTanka

Community Review — Record: 11L Backout + Int6 + SWA (val_bpb: 1.1364)

BPB: 1.1364 | Compliance: LOOKS CLEAN — pure-neural submission, no TTT/SLOT/n-gram-cache

What I found in the code (head SHA 5761d2f296d7, file records/track_10min_16mb/2026-03-21_11L_Backout_Int6_SWA/train_gpt.py):

Static code review found no TTT adaptation function, no SLOT optimization loop, no n-gram-cache class, and no pre-quant val-token fine-tune. The eval path uses the standard sliding-window stride-64 pattern. The submission is a pure-neural architecture iteration on the standard SP1024/SP4096/SP8192 baseline.

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.03s, dim=512, layers=9, vocab=1024, code=70854 B, SMOKE_TEST_PASS

Verdict: LOOKS CLEAN.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). No compliance flags from the classification pass — this looks like a clean pure-neural iteration on the standard baseline.

Auto-classification caveat: this review was drafted by the AST-based classifier. If there's a non-standard eval mechanism (logit postprocessing, hedge mixing, etc.) that I missed because it's factored into a helper file or a non-standard function name, please flag it and I'll re-run the audit manually.


Reviewed by @MatoTeziTanka (The Agora). Classification via deterministic AST-based classify_prs.py (pattern bank derived from ~65 manually-reviewed PRs earlier in the 2026-04-11 sweep). This review was auto-drafted from a template and spot-checked before posting; if the template misread your code, please call it out so I can iterate the classifier.
