Int6+zstd MLP1488 + Sliding Window + QAT + Tuned LR (val_bpb=1.1648) #107
Open
m0at wants to merge 3 commits into openai:main
Conversation
Flash Attention 3 (Hopper kernels) for ~8% faster steps, plus post-training quantization-aware training that cuts the int8+zlib penalty from +0.007 to +0.002 BPB. Three seed runs (1337, 42, 7) give a mean post-quant val_bpb of 1.2245. Improvement vs. the local baseline: -0.0055 BPB / -0.0093 nats of val loss.
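For context, the QAT step here amounts to briefly fine-tuning with fake-quantized weights so the network adapts to the rounding error before the real post-training quantization. A minimal sketch using a straight-through estimator; the symmetric int8 scheme and names below are illustrative, not the PR's exact code:

```python
import torch

def fake_quant_int8(w: torch.Tensor) -> torch.Tensor:
    """Round-trip w through symmetric int8; gradients pass straight through."""
    scale = w.abs().max().clamp(min=1e-8) / 127.0
    w_q = (w / scale).round().clamp(-128, 127) * scale
    # Straight-through estimator: forward sees w_q, backward sees identity.
    return w + (w_q - w).detach()

# During the short QAT phase each linear layer uses fake-quantized weights,
# so the loss already reflects (and optimizes away) the quantization error.
def qat_linear(x: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
    return x @ fake_quant_int8(weight).t()
```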
Sliding-window eval: post-quant BPB 1.1747 (-0.0497 vs. the 1.2244 baseline). 21.4M params in 15.98 MB via int6 quantization + zstd compression. On 8xH100 SXM: 9473 steps in 570s, plus 30s of QAT and the sliding-window eval.
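For scale: 21.4M params at 6 bits each is about 21.4e6 * 6 / 8 ≈ 16.05 MB raw, so zstd on the packed stream only needs to shave off a little to hit the reported 15.98 MB. A minimal packing sketch; the per-tensor symmetric scaling, the four-values-into-three-bytes layout, and the `zstandard` package are assumptions, not the PR's exact scheme:

```python
import numpy as np
import zstandard  # pip install zstandard

def quantize_int6_zstd(weights: np.ndarray) -> bytes:
    """Symmetric per-tensor int6 quantization, bit-packed, then zstd-compressed."""
    scale = np.abs(weights).max() / 31.0           # int6 range is [-32, 31]
    q = np.clip(np.round(weights / scale), -32, 31)
    u = (q + 32).astype(np.uint8)                  # shift to unsigned [0, 63]
    u = u.reshape(-1, 4)                           # assumes size % 4 == 0
    # Pack four 6-bit values into three bytes (4 * 6 = 24 bits).
    packed = np.empty((u.shape[0], 3), dtype=np.uint8)
    packed[:, 0] = (u[:, 0] << 2) | (u[:, 1] >> 4)
    packed[:, 1] = ((u[:, 1] & 0x0F) << 4) | (u[:, 2] >> 2)
    packed[:, 2] = ((u[:, 2] & 0x03) << 6) | u[:, 3]
    return zstandard.ZstdCompressor(level=19).compress(packed.tobytes())
```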
MLP_HIDDEN=1488, 15.93 MB. 9918 steps in 570s (57 ms/step). LR tuning from PR openai#99: scalar_lr 0.04 -> 0.02, embed_lr 0.05 -> 0.03. Improvement vs. baseline: -0.0596 BPB.
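Expressed as the kind of per-group optimizer config these train scripts typically use (the group membership below is an illustrative stand-in; only the two rates come from the PR):

```python
import torch

# Stand-ins for the script's actual parameter collections.
scalar_params = [torch.nn.Parameter(torch.zeros(()))]
embed_params = [torch.nn.Parameter(torch.zeros(50304, 768))]

optimizer = torch.optim.AdamW([
    dict(params=scalar_params, lr=0.02),  # scalar_lr: 0.04 -> 0.02 (openai#99 sweep)
    dict(params=embed_params, lr=0.03),   # embed_lr:  0.05 -> 0.03
])
```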
PR #107 Review

Title: Int6+zstd MLP1488 + Sliding Window + QAT + Tuned LR (val_bpb=1.1648)

Code Analysis: train_gpt.py checks

Verdict

Classification: PURE_NEURAL_CLEAN

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: ✓ MERGE

Draft by @MatoTeziTanka for parameter-golf review sweep (2026-04-11). Reviewed by @MatoTeziTanka — The Agora. Classification via sibling-session agent (Haiku-backed). This review was drafted from a template and spot-checked before posting — if the template misread your code, please call it out so I can iterate the classifier.
jzmyres pushed a commit to jzmyres/parameter-golf that referenced this pull request on Apr 29, 2026:
…stic gates + AdaSplash kernel wrap

Applies the principled remediation for the remaining recompile vectors found in profile_v2 (post commit 7343c06):

Fix openai#3 — _ROUTER_DIAGNOSTICS_ACTIVE flag-flip recompiles. The flag toggles at every diagnostic emission (~every 10 train steps); dynamo guards on the global inside the compiled forward, so each toggle invalidates a cache slot. Solution: hoist each gated diagnostic block into a `@dynamo_disable`-decorated helper method on the owning class, so dynamo treats the call as an opaque op (one fixed graph break, no internal-state guards). Sites refactored:

- `_should_diag` (L111) — function itself decorated
- `SoftDenseRouter._maybe_record_diag` (L1548-1568 inline → helper at L1657)
- `CausalSelfAttention._capture_attn_out_ortho` (L1889 inline → helper)
- `MLP._capture_mlp_out_ortho` (L1985 inline → helper)
- `MoSLowRankOutputHead._capture_mos_diagnostics` (L2300-2332 inline → helper)

The `is_master` computation (`dist.get_rank()`) moves inside the MoS helper so the dist call is also out of the compiled mos_head.forward graph.

Fix openai#4 — iter 104 AdaSplash kernel SIGABRT under compile+DDP+RevDEQ. The Triton kernel, dynamo's stream/context management, the DDP all-reduce, and the RevDEQ custom autograd interact at C level, producing signal 6 (no Python traceback) when α first exceeded 1.0 post-warmup. Solution: factor the kernel call into `_adasplash_kernel_call`, decorated with `@dynamo_disable`. The outer dispatch function keeps the alpha <= 1.0 dense fallback compile-traceable; only the Triton invocation runs in pure eager. Also adds a head_dim ∈ {16, 32, 64, 128, 256} guard with a dense-SDPA fallback (the kernel asserts on other dims, e.g. 96 = 768/8).

Result (verification profile v3, 15 iters, dev 2× L40S):

- step_avg 28.7s → 23.45s = -5.25s (-18.3%)
- recompile 6+ → 4 (residual is a list-length guard from K-jitter, not the flag)
- graph_break N → 0
- smoke_test PASS (loss 7.04 → 4.67 over 300 steps)

Cumulative throughput vs. original: 28.7s → 23.45s (1000-step training: 8.0h → 6.5h, ~1.5h saved). The iter 104 path is now SAFE: the AdaSplash kernel runs without SIGABRT under compile+DDP+RevDEQ, and the head_dim guard prevents the kernel assert at d_head=96. To actually fire AdaSplash, num_heads must be set so head_dim ∈ {16, 32, 64, 128, 256}.

Also adds a GPU-preflight rule (per user directive 2026-04-28): future sessions must `pgrep` + `nvidia-smi` before any GPU launch — silent contention masked prior speedup measurements. CLAUDE.md §0 + MEMORY.md indexed.

Residual recompile vector (deferred to Fix #3b, task openai#107): list-length guards on `self._mlp_expert_weights_per_iter` at L2509, from K-jitter × incremental append inside the compiled K-loop. Predicted incremental gain: 23.45s → 18-20s (~10-15% more) via a dynamo-disabled append helper.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
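The flag-flip fix follows a standard pattern: keep the hot path compiled and push anything that reads mutable global state behind a compiler-disabled helper. A minimal sketch using PyTorch's public `torch.compiler.disable` (the commit's `@dynamo_disable` is presumably a thin alias; the class body here is illustrative, not the repo's code):

```python
import torch
import torch.nn as nn

_ROUTER_DIAGNOSTICS_ACTIVE = False  # flipped by the diagnostics scheduler

class SoftDenseRouter(nn.Module):
    def __init__(self, dim: int, n_experts: int):
        super().__init__()
        self.gate = nn.Linear(dim, n_experts, bias=False)
        self.diag_buffer: list = []

    @torch.compiler.disable  # opaque to dynamo: one fixed graph break,
    def _maybe_record_diag(self, probs):  # and no guard on the global flag
        if _ROUTER_DIAGNOSTICS_ACTIVE:
            self.diag_buffer.append(probs.detach().float().mean(dim=0).cpu())

    def forward(self, x):
        probs = torch.softmax(self.gate(x), dim=-1)
        self._maybe_record_diag(probs)  # flag flips no longer invalidate the cache
        return probs
```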
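The shape of Fix openai#4, sketched: the dispatch function stays traceable and falls back to dense SDPA whenever the kernel cannot run, while the Triton launch itself is compiler-disabled. `adasplash_attention` is a hypothetical stand-in for the real kernel entry point, and the causal flag is an assumption:

```python
import torch
import torch.nn.functional as F

_SUPPORTED_HEAD_DIMS = {16, 32, 64, 128, 256}  # other dims trip the kernel's assert

def adasplash_attention(q, k, v, alpha):
    """Stand-in for the real Triton alpha-entmax attention kernel."""
    raise NotImplementedError

@torch.compiler.disable  # run the Triton launch in pure eager mode
def _adasplash_kernel_call(q, k, v, alpha):
    return adasplash_attention(q, k, v, alpha)

def adasplash_dispatch(q, k, v, alpha: float):
    # The dense fallback stays compile-traceable; at alpha = 1.0 the
    # alpha-entmax attention reduces to ordinary softmax attention anyway.
    if alpha <= 1.0 or q.shape[-1] not in _SUPPORTED_HEAD_DIMS:
        return F.scaled_dot_product_attention(q, k, v, is_causal=True)
    return _adasplash_kernel_call(q, k, v, alpha)
```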
Summary
Stacked improvements on the NaiveBaseline take 1.2244 BPB down to 1.1648 BPB (-0.0596 BPB); the three commits above detail the stack.
Results (8xH100 SXM, RunPod)
Configuration
Test plan