
Int6+zstd MLP1488 + Sliding Window + QAT + Tuned LR (val_bpb=1.1648)#107

Open
m0at wants to merge 3 commits into openai:main from m0at:submission/fa3-qat

Conversation

@m0at m0at commented Mar 19, 2026

Summary

Stacked improvements over the NaiveBaseline, taking post-quant val_bpb from 1.2244 to 1.1648:

  1. Int6 mixed quantization + zstd: MLP/Q/V/proj weights in 6-bit, zstd level 22. 21.3M params in 15.93MB.
  2. MLP hidden=1488 (2.91x): Wider MLP enabled by int6 savings.
  3. Sliding window eval (stride=64, seq_len=2048): ~960 context tokens per scored position.
  4. Post-training QAT (30s): STE reduces quantization penalty.
  5. Tuned LRs: scalar_lr=0.02, tied_embed_lr=0.03, matrix_lr=0.02 (from PR submission: Int6 MLP3x + Late-K Passthrough + SlidingWindow (val_bpb: 1.1605) #99).
  6. FA3 fallback: Graceful fallback to SDPA when FA3 unavailable.
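The int6 step in item 1 can be sketched as follows. This is a minimal sketch assuming per-tensor symmetric quantization (the PR's train_gpt.py may scale per-channel, and the real pipeline additionally bit-packs the 6-bit values before zstd level-22 compression; function names here are illustrative):

```python
import numpy as np

def quantize_int6(w, qmax=31):
    """Per-tensor symmetric quantization to the 6-bit range [-31, 31]."""
    scale = np.abs(w).max() / qmax          # one fp scale per tensor
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize_int6(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64)).astype(np.float32)
q, s = quantize_int6(w)
w_hat = dequantize_int6(q, s)
err = float(np.abs(w - w_hat).max())        # bounded by scale / 2
```

The payoff is the rounding-error bound of half a quantization step, while the narrow integer range makes the serialized weights highly compressible.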

Results (8xH100 SXM, RunPod)

| Metric | Baseline | This submission |
| --- | --- | --- |
| post-quant BPB | 1.2244 | 1.1648 |
| improvement | -- | -0.0596 |
| params | 17.1M | 21.3M |
| compressed | 15.86MB | 15.93MB |
| steps | 13,780 | 9,918 |
| ms/step | 43.5 | 57.5 |
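For reference, bits-per-byte comes from the mean cross-entropy loss. A minimal sketch of the conversion (the token-to-byte rescaling is an assumption about how this benchmark defines BPB; for a byte-level vocabulary the ratio is 1):

```python
import math

def bits_per_byte(mean_loss_nats, n_tokens, n_bytes):
    """Convert mean cross-entropy (nats/token) to bits-per-byte.

    Nats become bits by dividing by ln 2, then the result is rescaled
    by how many tokens the eval text uses per byte.
    """
    return mean_loss_nats / math.log(2) * (n_tokens / n_bytes)
```

With a byte-level vocabulary (`n_tokens == n_bytes`) a loss of exactly ln 2 nats per token is 1.0 BPB.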

Configuration

MLP_HIDDEN=1488 TRAIN_SEQ_LEN=2048 LOWBIT_BITS=6
LOWBIT_NAME_PATTERNS=.mlp.,.attn.c_q.,.attn.c_v.,.attn.proj.
SERIAL_COMPRESSOR=zstd EVAL_STRIDE=64
MATRIX_LR=0.02 SCALAR_LR=0.02 TIED_EMBED_LR=0.03
MUON_MOMENTUM=0.99 MUON_MOMENTUM_WARMUP_START=0.92
MUON_MOMENTUM_WARMUP_STEPS=1500 WARMDOWN_ITERS=3000
QK_GAIN_INIT=1.7 GRAD_CLIP_NORM=0.3 POST_QAT_SECONDS=30
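The sliding-window eval (EVAL_STRIDE=64) amounts to scoring only the trailing stride tokens of each window, so every scored position keeps a long left context. A minimal sketch of the span bookkeeping only (the function name and exact first-window handling are assumptions, not the PR's code):

```python
def sliding_window_spans(n_tokens, seq_len=2048, stride=64):
    """Return (start, end, score_from) triples covering [0, n_tokens).

    The model sees tokens [start, end) but only positions
    [score_from, end) contribute to the loss, so after the first
    window each scored token sees roughly seq_len - stride tokens of
    context instead of whatever naive chunking would leave it.
    """
    spans, scored = [], 0
    while scored < n_tokens:
        if scored == 0:
            end = min(seq_len, n_tokens)          # first window scores everything
        else:
            end = min(scored + stride, n_tokens)  # later windows score the tail
        start = max(0, end - seq_len)
        spans.append((start, end, scored))
        scored = end
    return spans
```

Each token is scored exactly once; the cost is one overlapping forward pass per stride tokens, which is why sliding-window eval is slower than chunked eval.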

Test plan

  • 8xH100 SXM run: 1.1648 BPB
  • Compressed artifact 15.93MB < 16.00MB budget
  • train_gpt.py under 1500 lines (1492)
  • No network calls during training
  • Full train.log included

m0at added 2 commits March 19, 2026 10:42
Flash Attention 3 (Hopper kernels) for ~8% faster steps + post-training
quantization-aware training reduces int8+zlib penalty from +0.007 to +0.002 BPB.
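The straight-through estimator mentioned here rounds weights to the quantization grid in the forward pass while backpropagating as if rounding were the identity. A numpy toy with a hand-written gradient standing in for autograd (the scale, targets, and loss are illustrative, not the PR's):

```python
import numpy as np

def fake_quant(w, scale):
    # Forward: snap to the 6-bit grid [-31, 31] * scale.
    return np.clip(np.round(w / scale), -31, 31) * scale

w = np.array([0.23, -0.51])      # toy "weights"
target = np.zeros_like(w)        # toy regression target
scale, lr = 0.05, 0.1
for _ in range(60):
    wq = fake_quant(w, scale)
    grad_wq = 2.0 * (wq - target) / wq.size  # d loss / d w_q
    w = w - lr * grad_wq                     # STE: d w_q / d w treated as 1
```

Without the straight-through backward, the gradient of round() is zero almost everywhere and the weights would never move; with it, the quantized weights converge onto the target grid point.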

3 seed runs (1337, 42, 7) with mean post-quant val_bpb=1.2245.
Improvement vs local baseline: -0.0055 BPB / -0.0093 val_loss nats.
Post-quant sliding window BPB: 1.1747 (-0.0497 vs baseline 1.2244)
21.4M params in 15.98MB via int6 quantization + zstd compression.
8xH100 SXM, 9473 steps in 570s + 30s QAT + sliding window eval.
@m0at m0at changed the title from "FA3 + Post-Training QAT" to "Int6+zstd MLP1500 + FA3 + Sliding Window + QAT (val_bpb=1.1747)" Mar 19, 2026
MLP_HIDDEN=1488, 15.93MB. 9918 steps in 570s (57ms/step).
LR tuning from PR openai#99: scalar_lr 0.04->0.02, embed_lr 0.05->0.03.
Improvement vs baseline: -0.0596 BPB.
@m0at m0at changed the title from "Int6+zstd MLP1500 + FA3 + Sliding Window + QAT (val_bpb=1.1747)" to "Int6+zstd MLP1488 + Sliding Window + QAT + Tuned LR (val_bpb=1.1648)" Mar 20, 2026
@MatoTeziTanka

PR #107 Review

Title: Int6+zstd MLP1488 + Sliding Window + QAT + Tuned LR (val_bpb=1.1648)
State: open
Date Reviewed: 2026-04-11

Code Analysis

train_gpt.py Checks

  • target-in-key pattern: not found
  • TTT (Temporal Token Tagging): not found
  • SLOT (Slot MoE): not found
  • Custom Tokenizer: not found

Verdict

Classification: PURE_NEURAL_CLEAN

Recommendation:

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: ✓ MERGE


Draft by @MatoTeziTanka for parameter-golf review sweep (2026-04-11)


Reviewed by @MatoTeziTanka (The Agora). Classification via sibling-session agent (Haiku-backed). This review was drafted from a template and spot-checked before posting; if the template misread your code, please call it out so I can iterate the classifier.

jzmyres pushed a commit to jzmyres/parameter-golf that referenced this pull request Apr 29, 2026
…stic gates + AdaSplash kernel wrap

Applies the principled remediation for the remaining recompile vectors found
in profile_v2 (post commit 7343c06):

Fix openai#3 — _ROUTER_DIAGNOSTICS_ACTIVE flag-flip recompiles. The flag toggles at
every diagnostic emission (~every 10 train steps). dynamo guards on the global
inside compiled forward → toggle invalidates cache slot. Solution: hoist each
gated diagnostic block into a `@dynamo_disable`-decorated helper method on the
owning class. dynamo treats the call as an opaque op (one fixed graph break,
no internal-state guards). Sites refactored:
  - `_should_diag` (L111)             — function itself decorated
  - `SoftDenseRouter._maybe_record_diag` (L1548-1568 inline → helper at L1657)
  - `CausalSelfAttention._capture_attn_out_ortho` (L1889 inline → helper)
  - `MLP._capture_mlp_out_ortho` (L1985 inline → helper)
  - `MoSLowRankOutputHead._capture_mos_diagnostics` (L2300-2332 inline → helper)
The `is_master` computation (dist.get_rank()) is moved inside the MoS helper
so the dist call is also out of the compiled mos_head.forward graph.

Fix openai#4 — iter 104 AdaSplash kernel SIGABRT under compile+DDP+RevDEQ. The
Triton kernel + dynamo's stream/context management + DDP all-reduce + RevDEQ
custom autograd interact at C-level producing signal 6 (no Python traceback)
when α first exceeded 1.0 post-warmup. Solution: factor the kernel call into
`_adasplash_kernel_call` decorated with `@dynamo_disable`. The outer dispatch
function keeps the alpha<=1.0 dense fallback compile-traceable; only the
Triton invocation runs in pure eager. Adds head_dim ∈ {16,32,64,128,256}
guard with dense-SDPA fallback (kernel asserts on other dims, e.g. 96 = 768/8).

Result (verification profile v3, 15 iters, dev 2× L40S):
  step_avg     28.7s → 23.45s = -5.25s (-18.3%)
  recompile    6+ → 4 (residual is list-length guard from K-jitter, not flag)
  graph_break  N → 0
  smoke_test   PASS (loss 7.04 → 4.67 over 300 steps)
Cumulative throughput vs original: 28.7s → 23.45s (1000-step training:
8.0h → 6.5h, ~1.5h saved).

Iter 104 path is now SAFE: AdaSplash kernel runs without SIGABRT under
compile+DDP+RevDEQ; head_dim guard prevents kernel-assert at d_head=96. To
actually fire AdaSplash, num_heads must be set so head_dim ∈ {16,32,64,128,256}.

Also adds GPU-preflight rule (per user directive 2026-04-28): future sessions
must `pgrep` + `nvidia-smi` before any GPU launch — silent contention masked
prior speedup measurements. CLAUDE.md §0 + memory MEMORY.md indexed.

Residual recompile vector (deferred to Fix #3b, task openai#107): list-length
guards on `self._mlp_expert_weights_per_iter` at L2509 from K-jitter ×
incremental append inside the compiled K-loop. Predicted incremental:
23.45s → 18-20s (~10-15% additional) via dynamo_disabled append helper.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>