
Int6+zstd MLP1488 + Sliding Window + QAT + Tuned LR (val_bpb=1.1648)#107

Open
m0at wants to merge 3 commits into openai:main from m0at:submission/fa3-qat

Conversation

@m0at m0at commented Mar 19, 2026

Summary

Stacked improvements over the NaiveBaseline, taking post-quant val_bpb from 1.2244 to 1.1648:

  1. Int6 mixed quantization + zstd: MLP/Q/V/proj weights in 6-bit, zstd level 22. 21.3M params in 15.93MB.
  2. MLP hidden=1488 (2.91x): Wider MLP enabled by int6 savings.
  3. Sliding window eval (stride=64, seq_len=2048): ~960 context tokens per scored position.
  4. Post-training QAT (30s): STE reduces quantization penalty.
  5. Tuned LRs: scalar_lr=0.02, tied_embed_lr=0.03, matrix_lr=0.02 (from PR submission: Int6 MLP3x + Late-K Passthrough + SlidingWindow (val_bpb: 1.1605) #99).
  6. FA3 fallback: Graceful fallback to SDPA when FA3 unavailable.
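The int6 step in item 1 can be sketched as follows. This is a minimal sketch assuming per-tensor symmetric quantization (the PR's train_gpt.py may scale per-channel, and the real pipeline additionally bit-packs the 6-bit values before zstd level-22 compression; function names here are illustrative):

```python
import numpy as np

def quantize_int6(w, qmax=31):
    """Per-tensor symmetric quantization to the 6-bit range [-31, 31]."""
    scale = np.abs(w).max() / qmax          # one fp scale per tensor
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize_int6(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64)).astype(np.float32)
q, s = quantize_int6(w)
w_hat = dequantize_int6(q, s)
err = float(np.abs(w - w_hat).max())        # bounded by scale / 2
```

The payoff is the rounding-error bound of half a quantization step, while the narrow integer range makes the serialized weights highly compressible.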

Results (8xH100 SXM, RunPod)

| Metric | Baseline | This submission |
| --- | --- | --- |
| post-quant BPB | 1.2244 | 1.1648 |
| improvement | -- | -0.0596 |
| params | 17.1M | 21.3M |
| compressed | 15.86MB | 15.93MB |
| steps | 13,780 | 9,918 |
| ms/step | 43.5 | 57.5 |
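For reference, bits-per-byte comes from the mean cross-entropy loss. A minimal sketch of the conversion (the token-to-byte rescaling is an assumption about how this benchmark defines BPB; for a byte-level vocabulary the ratio is 1):

```python
import math

def bits_per_byte(mean_loss_nats, n_tokens, n_bytes):
    """Convert mean cross-entropy (nats/token) to bits-per-byte.

    Nats become bits by dividing by ln 2, then the result is rescaled
    by how many tokens the eval text uses per byte.
    """
    return mean_loss_nats / math.log(2) * (n_tokens / n_bytes)
```

With a byte-level vocabulary (`n_tokens == n_bytes`) a loss of exactly ln 2 nats per token is 1.0 BPB.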

Configuration

MLP_HIDDEN=1488 TRAIN_SEQ_LEN=2048 LOWBIT_BITS=6
LOWBIT_NAME_PATTERNS=.mlp.,.attn.c_q.,.attn.c_v.,.attn.proj.
SERIAL_COMPRESSOR=zstd EVAL_STRIDE=64
MATRIX_LR=0.02 SCALAR_LR=0.02 TIED_EMBED_LR=0.03
MUON_MOMENTUM=0.99 MUON_MOMENTUM_WARMUP_START=0.92
MUON_MOMENTUM_WARMUP_STEPS=1500 WARMDOWN_ITERS=3000
QK_GAIN_INIT=1.7 GRAD_CLIP_NORM=0.3 POST_QAT_SECONDS=30
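The sliding-window eval (EVAL_STRIDE=64) amounts to scoring only the trailing stride tokens of each window, so every scored position keeps a long left context. A minimal sketch of the span bookkeeping only (the function name and exact first-window handling are assumptions, not the PR's code):

```python
def sliding_window_spans(n_tokens, seq_len=2048, stride=64):
    """Return (start, end, score_from) triples covering [0, n_tokens).

    The model sees tokens [start, end) but only positions
    [score_from, end) contribute to the loss, so after the first
    window each scored token sees roughly seq_len - stride tokens of
    context instead of whatever naive chunking would leave it.
    """
    spans, scored = [], 0
    while scored < n_tokens:
        if scored == 0:
            end = min(seq_len, n_tokens)          # first window scores everything
        else:
            end = min(scored + stride, n_tokens)  # later windows score the tail
        start = max(0, end - seq_len)
        spans.append((start, end, scored))
        scored = end
    return spans
```

Each token is scored exactly once; the cost is one overlapping forward pass per stride tokens, which is why sliding-window eval is slower than chunked eval.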

Test plan

  • 8xH100 SXM run: 1.1648 BPB
  • Compressed artifact 15.93MB < 16.00MB budget
  • train_gpt.py under 1500 lines (1492)
  • No network calls during training
  • Full train.log included

m0at added 2 commits March 19, 2026 10:42
Flash Attention 3 (Hopper kernels) for ~8% faster steps + post-training
quantization-aware training reduces int8+zlib penalty from +0.007 to +0.002 BPB.
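The straight-through estimator mentioned here rounds weights to the quantization grid in the forward pass while backpropagating as if rounding were the identity. A numpy toy with a hand-written gradient standing in for autograd (the scale, targets, and loss are illustrative, not the PR's):

```python
import numpy as np

def fake_quant(w, scale):
    # Forward: snap to the 6-bit grid [-31, 31] * scale.
    return np.clip(np.round(w / scale), -31, 31) * scale

w = np.array([0.23, -0.51])      # toy "weights"
target = np.zeros_like(w)        # toy regression target
scale, lr = 0.05, 0.1
for _ in range(60):
    wq = fake_quant(w, scale)
    grad_wq = 2.0 * (wq - target) / wq.size  # d loss / d w_q
    w = w - lr * grad_wq                     # STE: d w_q / d w treated as 1
```

Without the straight-through backward, the gradient of round() is zero almost everywhere and the weights would never move; with it, the quantized weights converge onto the target grid point.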

3 seed runs (1337, 42, 7) with mean post-quant val_bpb=1.2245.
Improvement vs local baseline: -0.0055 BPB / -0.0093 val_loss nats.
Post-quant sliding window BPB: 1.1747 (-0.0497 vs baseline 1.2244)
21.4M params in 15.98MB via int6 quantization + zstd compression.
8xH100 SXM, 9473 steps in 570s + 30s QAT + sliding window eval.
@m0at m0at changed the title from "FA3 + Post-Training QAT" to "Int6+zstd MLP1500 + FA3 + Sliding Window + QAT (val_bpb=1.1747)" Mar 19, 2026
MLP_HIDDEN=1488, 15.93MB. 9918 steps in 570s (57ms/step).
LR tuning from PR openai#99: scalar_lr 0.04->0.02, embed_lr 0.05->0.03.
Improvement vs baseline: -0.0596 BPB.
@m0at m0at changed the title from "Int6+zstd MLP1500 + FA3 + Sliding Window + QAT (val_bpb=1.1747)" to "Int6+zstd MLP1488 + Sliding Window + QAT + Tuned LR (val_bpb=1.1648)" Mar 20, 2026
@MatoTeziTanka

PR #107 Review

Title: Int6+zstd MLP1488 + Sliding Window + QAT + Tuned LR (val_bpb=1.1648)
State: open
Date Reviewed: 2026-04-11

Code Analysis

train_gpt.py Checks

  • target-in-key pattern: not found
  • TTT (Temporal Token Tagging): not found
  • SLOT (Slot MoE): not found
  • Custom Tokenizer: not found

Verdict

Classification: PURE_NEURAL_CLEAN

Recommendation:

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: ✓ MERGE


Draft by @MatoTeziTanka for parameter-golf review sweep (2026-04-11)


Reviewed by @MatoTeziTanka (The Agora). Classification via sibling-session agent (Haiku-backed). This review was drafted from a template and spot-checked before posting; if the template misread your code, please call it out so I can iterate the classifier.

jzmyres pushed a commit to jzmyres/parameter-golf that referenced this pull request Apr 29, 2026
…stic gates + AdaSplash kernel wrap

Applies the principled remediation for the remaining recompile vectors found
in profile_v2 (post commit 7343c06):

Fix openai#3 — _ROUTER_DIAGNOSTICS_ACTIVE flag-flip recompiles. The flag toggles at
every diagnostic emission (~every 10 train steps). dynamo guards on the global
inside compiled forward → toggle invalidates cache slot. Solution: hoist each
gated diagnostic block into a `@dynamo_disable`-decorated helper method on the
owning class. dynamo treats the call as an opaque op (one fixed graph break,
no internal-state guards). Sites refactored:
  - `_should_diag` (L111)             — function itself decorated
  - `SoftDenseRouter._maybe_record_diag` (L1548-1568 inline → helper at L1657)
  - `CausalSelfAttention._capture_attn_out_ortho` (L1889 inline → helper)
  - `MLP._capture_mlp_out_ortho` (L1985 inline → helper)
  - `MoSLowRankOutputHead._capture_mos_diagnostics` (L2300-2332 inline → helper)
The `is_master` computation (dist.get_rank()) is moved inside the MoS helper
so the dist call is also out of the compiled mos_head.forward graph.

Fix openai#4 — iter 104 AdaSplash kernel SIGABRT under compile+DDP+RevDEQ. The
Triton kernel + dynamo's stream/context management + DDP all-reduce + RevDEQ
custom autograd interact at C-level producing signal 6 (no Python traceback)
when α first exceeded 1.0 post-warmup. Solution: factor the kernel call into
`_adasplash_kernel_call` decorated with `@dynamo_disable`. The outer dispatch
function keeps the alpha<=1.0 dense fallback compile-traceable; only the
Triton invocation runs in pure eager. Adds head_dim ∈ {16,32,64,128,256}
guard with dense-SDPA fallback (kernel asserts on other dims, e.g. 96 = 768/8).

Result (verification profile v3, 15 iters, dev 2× L40S):
  step_avg     28.7s → 23.45s = -5.25s (-18.3%)
  recompile    6+ → 4 (residual is list-length guard from K-jitter, not flag)
  graph_break  N → 0
  smoke_test   PASS (loss 7.04 → 4.67 over 300 steps)
Cumulative throughput vs original: 28.7s → 23.45s (1000-step training:
8.0h → 6.5h, ~1.5h saved).

Iter 104 path is now SAFE: AdaSplash kernel runs without SIGABRT under
compile+DDP+RevDEQ; head_dim guard prevents kernel-assert at d_head=96. To
actually fire AdaSplash, num_heads must be set so head_dim ∈ {16,32,64,128,256}.

Also adds GPU-preflight rule (per user directive 2026-04-28): future sessions
must `pgrep` + `nvidia-smi` before any GPU launch — silent contention masked
prior speedup measurements. CLAUDE.md §0 + memory MEMORY.md indexed.

Residual recompile vector (deferred to Fix #3b, task openai#107): list-length
guards on `self._mlp_expert_weights_per_iter` at L2509 from K-jitter ×
incremental append inside the compiled K-loop. Predicted incremental:
23.45s → 18-20s (~10-15% additional) via dynamo_disabled append helper.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>