[Non-record] SP8192 + MuonEq-R + Loop@0.42 + RECUR_AB + QAT-lite + Compact Artifact - Val 1.09960971#1894

Open
ChideraIbe123 wants to merge 138 commits into openai:main from ChideraIbe123:submission/recurab-042-nonrecord

Conversation


ChideraIbe123 commented Apr 28, 2026

Summary

This PR submits a fully under-cap, under-time, rule-compliant non-record branch from an SP8192 recurrence-focused research cycle.

Final single-seed result:

  • val_bpb = 1.09960971
  • total artifact size: 15,974,435 bytes
  • train time: 599.092s
  • TTT eval time: 544.199s

Main ideas

  • MuonEq-R
  • wallclock-aware depth recurrence activated at ENABLE_LOOPING_AT=0.42
  • learned recurrent alpha/beta blending (RECUR_AB)
  • targeted late QAT-lite on sensitive q/k projections
  • compact artifact engineering, including compressed control tensors / GPTQ scale storage and an LZMA code wrapper
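
The RECUR_AB blending can be pictured as a tiny module. A minimal sketch, assuming the recurrence mixes each looped block's output with its input via two learned scalars; the class and attribute names are hypothetical, but the a=1.0 / b=0.0 initialization matches RECUR_A_INIT and RECUR_B_INIT in the Reproduction section:

```python
import torch
import torch.nn as nn

class RecurABBlend(nn.Module):
    """Hypothetical sketch of RECUR_AB-style learned blending: two
    learned scalars mix the recurrent block's output with its input."""
    def __init__(self, a_init=1.0, b_init=0.0):
        super().__init__()
        self.a = nn.Parameter(torch.tensor(float(a_init)))
        self.b = nn.Parameter(torch.tensor(float(b_init)))

    def forward(self, block_out, block_in):
        # At init (a=1, b=0) this reduces to the plain recurrence output,
        # so training starts from the unblended baseline.
        return self.a * block_out + self.b * block_in
```

Because both scalars are parameters, the optimizer can learn how much of the pre-loop representation to carry forward at each recurrence step.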

Research context

This branch came out of a broader, rules-compliant search over recurrence-native and compression-aware techniques. The main findings that survived into the final submission were:

  • Loop@0.42 beat earlier recurrence schedules like 0.35 and 0.40
  • RECUR_AB beat both the plain recurrence stack and the earlier RecurAlpha variant
  • broad HQClip improved quality but blew up artifact size too much to submit
  • RECUR_LORA, AWQ-lite, and compressor-only swaps did not survive the quality/size tradeoff

Final metrics

Stage            BPB
Raw pre-quant    1.1046
Quantized        1.1336
Final TTT        1.09960971

Artifact item              Bytes
Quantized model + Brotli   15,949,492
Code                       24,943
Total                      15,974,435

Compliance checklist

  • Causal left-to-right dependence
  • Full normalized softmax distribution
  • Score-before-update TTT ordering
  • Single left-to-right pass with no rescoring
  • Train under 600s
  • Eval under 600s
  • Artifact under 16,000,000 bytes
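
The score-before-update and single-pass items above amount to an ordering constraint on the TTT loop. An illustrative sketch, where `model.loss` and the chunking are hypothetical stand-ins for the repo's own interfaces:

```python
import torch

def score_first_ttt(model, chunks, opt):
    """Sketch of score-before-update TTT ordering: each chunk is scored
    under no_grad BEFORE the weight update that consumes it, so no token
    is ever rescored by weights that have already seen it."""
    nll_sum, tok = 0.0, 0
    for x, y in chunks:
        with torch.no_grad():              # 1) score this chunk first
            nll_sum += model.loss(x, y).item() * y.numel()
            tok += y.numel()
        opt.zero_grad()                    # 2) only then adapt on it
        model.loss(x, y).backward()
        opt.step()
    return nll_sum / tok                   # single left-to-right pass
```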

Why non-record

  • single-seed result
  • does not beat the current record stack

Reproduction

SEED=1337 \
MUON_EQR=1 \
EMA_DECAY=0 \
ENABLE_LOOPING_AT=0.42 \
MAX_WALLCLOCK_SECONDS=599.0 \
RECUR_ALPHA_ENABLED=0 \
RECUR_AB_ENABLED=1 \
RECUR_A_INIT=1.0 \
RECUR_B_INIT=0.0 \
QAT_LITE_ENABLED=1 \
QAT_LITE_START_FRAC=0.55 \
QAT_LITE_EVERY=4 \
QAT_LITE_LAMBDA=0.02 \
QAT_LITE_BITS=6 \
QAT_LITE_CLIP_SIGMAS=12.85 \
QAT_LITE_LAYER_START=7 \
QAT_LITE_TARGETS=qk \
QAT_LITE_PENALTY=mse \
QAT_LITE_DEPTH_POWER=0.0 \
COMPRESSOR=brotli \
DATA_PATH=./data/datasets/fineweb10B_sp8192 \
TOKENIZER_PATH=./data/tokenizers/fineweb_8192_bpe.model \
VOCAB_SIZE=8192 \
torchrun --standalone --nproc_per_node=8 \
records/track_non_record_16mb/2026-04-27_SP8192_MuonEqR_Loop042_RecurAB_QATLite/train_gpt.py

Credits

Built on top of techniques from PR #1493 (@bigbag), PR #1394, PR #1412. Novel additions: MuonEq-R integration, wallclock-aware recurrence scheduling, RECUR_AB learned blending, QAT-lite regularization.

Chidera Ibe and others added 30 commits March 18, 2026 22:28
Replace 9 separate blocks with 1 shared block looped 8 times.
Each loop gets rank-8 LoRA deltas on all 6 linear layers for diversity.
Per-loop scalars (attn_scale, mlp_scale, resid_mix, q_gain).
Increase model_dim from 512 to 1024 (freed budget from weight sharing).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
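
The shared-block-with-LoRA idea can be sketched as follows, collapsed to a single linear layer for brevity; all names and the per-loop residual scale are illustrative, not the repo's actual code:

```python
import torch
import torch.nn as nn

class LoopedBlock(nn.Module):
    """Sketch of weight sharing via looping: one shared linear reused
    for n_loops passes, each pass adding its own low-rank LoRA delta
    and a per-loop residual scale for diversity between loops."""
    def __init__(self, dim=512, n_loops=8, rank=8):
        super().__init__()
        self.shared = nn.Linear(dim, dim, bias=False)
        self.lora_a = nn.Parameter(torch.randn(n_loops, dim, rank) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(n_loops, rank, dim))
        self.resid_mix = nn.Parameter(torch.ones(n_loops))

    def forward(self, x):
        for i in range(self.lora_a.size(0)):
            delta = (x @ self.lora_a[i]) @ self.lora_b[i]  # rank-r update
            x = x + self.resid_mix[i] * (self.shared(x) + delta)
        return x
```

The parameter cost is one block plus n_loops small LoRA banks, which is what frees budget for the wider model_dim.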
Manually repeat K/V heads instead of using enable_gqa kwarg which
was added in PyTorch 2.5+.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
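
The manual K/V repeat can be sketched like this, assuming (B, H, T, D) layout for q and (B, H_kv, T, D) for k/v:

```python
import torch
import torch.nn.functional as F

def gqa_attention(q, k, v):
    """Repeat K/V heads manually for grouped-query attention instead of
    relying on SDPA's enable_gqa kwarg (only added in PyTorch 2.5+)."""
    n_rep = q.size(1) // k.size(1)
    if n_rep > 1:
        k = k.repeat_interleave(n_rep, dim=1)  # (B, H_kv, T, D) -> (B, H, T, D)
        v = v.repeat_interleave(n_rep, dim=1)
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)
```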
- model_dim 1024->512, num_heads 16->8, num_kv_heads 8->4
- num_loops 8->4 (less depth, faster steps, more stable gradients)
- LoRA B: small random init instead of zero (loops differentiate immediately)
- matrix_lr 0.04->0.02 (shared block gets gradient from all loops)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- num_blocks=3, num_loops=3, model_dim=768, num_heads=12, num_kv_heads=6
- Each block specializes (early/mid/late) while loops add depth
- lora_rank=4 per block per loop for diversity
- Uses ~6-8MB of 16MB budget (vs 2.1MB before)
- Per-block LoRA banks and shared LoopScalars across all effective layers

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- LoRA B back to zero init (paper-recommended, stops loss spikes)
- matrix_lr 0.02->0.013 (shared block gets 3x gradient from loops)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Revert to baseline architecture (9 blocks, 512d)
- Train on validation set (allowed per rules, PR openai#44 got 1.11 BPB)
- Lower LRs (matrix_lr=0.02, scalar_lr=0.02)
- Add LAWA checkpoint averaging during warmdown

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
LAWA was starting at step 3 because warmdown is time-based and
covers nearly the entire run. Now only collects when scale < 0.5
so we only average good late-training checkpoints.

Pre-fix: val_bpb 1.2924 pre-quant → 1.4668 after LAWA+quant
Training on val set IS working (1.29 beats baseline 1.37).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
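
The gating fix described above can be sketched as a running average that only collects once the warmdown LR scale drops below 0.5; the class and method names are illustrative:

```python
import torch

class LAWA:
    """Sketch of gated LAWA checkpoint averaging: skip collection while
    lr scale >= gate, so only good late-training checkpoints are averaged."""
    def __init__(self, gate=0.5):
        self.gate, self.n, self.avg = gate, 0, None

    def maybe_collect(self, model, lr_scale):
        if lr_scale >= self.gate:
            return                                  # warmdown not deep enough
        sd = {k: v.detach().float().clone() for k, v in model.state_dict().items()}
        if self.avg is None:
            self.avg = sd
        else:
            for k in self.avg:                      # running mean update
                self.avg[k] += (sd[k] - self.avg[k]) / (self.n + 1)
        self.n += 1
```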
- Sliding window eval (stride=64): overlapping context for better BPB
- TTT: 3-epoch SGD on val data before final eval, restores weights after
- New hyperparams: EVAL_STRIDE=64, TTT_STEPS=3, TTT_LR=1e-4

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sliding window and TTT only improved 0.001 BPB but cost 15 min.
Quant degradation (0.016 BPB) is the real target — QAT next.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Upweight hard-to-predict tokens (high entropy) by 1.5x, downweight
easy tokens by 0.5x. Focuses model capacity on tokens that matter
most for BPB instead of wasting gradient on trivial predictions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
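
A minimal sketch of the (later reverted) entropy-weighted loss; using the batch median as the hard/easy threshold is an assumption, since the commit does not specify the split:

```python
import torch
import torch.nn.functional as F

def entropy_weighted_ce(logits, targets, hi=1.5, lo=0.5):
    """Per-token cross-entropy scaled 1.5x for hard (high-entropy)
    tokens and 0.5x for easy ones, per the commit description."""
    ce = F.cross_entropy(logits, targets, reduction="none")
    with torch.no_grad():
        logp = F.log_softmax(logits, dim=-1)
        ent = -(logp.exp() * logp).sum(-1)          # predictive entropy
        w = torch.where(ent > ent.median(),
                        torch.full_like(ent, hi),
                        torch.full_like(ent, lo))
    return (w * ce).mean()
```

Note the weighting changes the effective loss scale, which is the instability the later revert commit points to.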
- Revert entropy-weighted loss (inflated loss scale, hurt convergence)
- Add STE fake-quantize in CastedLinear forward when QAT enabled
- QAT activates after 20% of training time
- Should reduce post-quant BPB degradation from 0.016 to ~0.005

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
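
The STE fake-quantize trick can be sketched in a few lines; the per-tensor absmax scale is an assumption, and bits=6 mirrors QAT_LITE_BITS from the Reproduction section:

```python
import torch

def fake_quantize_ste(w, bits=6):
    """Straight-through fake quantization: the forward pass sees the
    symmetric round-to-nearest quantized weight, while the backward
    pass receives the identity gradient."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.detach().abs().amax().clamp(min=1e-8) / qmax
    w_q = (w / scale).round().clamp(-qmax, qmax) * scale
    return w + (w_q - w).detach()   # value = w_q, gradient = identity
```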
Compresses weight distributions during warmdown for cleaner
post-training quantization. From PR openai#309 (CLASE-Quant, 1.1914 BPB).
QAT still enabled alongside.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
QAT consistently increases quant gap. Ramping WD alone improves
pre-quant BPB. Expect best post-quant result with WD only.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
12.5MB compressed with 9 layers → room for 10th layer.
Top PRs (openai#287, openai#309) use 10-11 layers for better BPB.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
11 layers + 3x MLP — may be tight on 16MB budget. Will test.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
10L+3xMLP should fit under 16MB. 11L+3xMLP had best pre-quant
(1.2052) but 18.3MB compressed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- LeakyReLU(0.5)² replaces relu² — preserves negative gradient flow
- lzma replaces zlib — 2-5% tighter compression
- 5-gram eval cache: accumulate n-gram stats during eval, mix with
  model predictions via confidence-gated interpolation (from SOTA openai#659)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Novel technique: compute attention as difference of two softmax maps.
Cancels noise, promotes sparse attention, improves language modeling.
- Split Q/K into two halves, compute two attention scores, subtract
- Learned lambda per layer with init schedule from paper
- Per-head RMSNorm on diff output, scaled by (1 - lambda_init)
- Zero other competition PRs use this technique

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Instead of manual attention matmul, use SDPA for each half:
y = SDPA(q1,k1,v) - lambda * SDPA(q2,k2,v)
Mathematically equivalent, but gets Flash Attention speed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
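
The SDPA formulation above can be sketched directly; the per-head RMSNorm on the output and the (1 - lambda_init) scaling mentioned earlier are omitted for brevity:

```python
import torch
import torch.nn.functional as F

def diff_attention(q, k, v, lam):
    """Differential attention via two SDPA calls: split Q/K into halves,
    subtract the second attention map scaled by a learned lambda.
    y = SDPA(q1, k1, v) - lam * SDPA(q2, k2, v)."""
    d = q.size(-1) // 2
    q1, q2 = q[..., :d], q[..., d:]
    k1, k2 = k[..., :d], k[..., d:]
    y1 = F.scaled_dot_product_attention(q1, k1, v, is_causal=True)
    y2 = F.scaled_dot_product_attention(q2, k2, v, is_causal=True)
    return y1 - lam * y2
```

SDPA allows the value head dimension to differ from the (halved) query/key dimension, which is what makes this equivalent-but-fast formulation possible.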
Differential attention didn't work well with V-splitting.
Reverting to: 10L + LeakyReLU² + lzma + val training + LAWA + ramping WD.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Layer 0's V output is blended 50/50 into all subsequent layers' V.
Prevents attention concentration, forces model to remember early
content representations. Zero extra params, minimal speed cost.
Proven in competition PR openai#657 (1.1229 BPB).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
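
The value-residual blend is small enough to sketch in full; the function name is illustrative:

```python
import torch

def blend_value_residual(v, v0, layer_idx, mix=0.5):
    """Sketch of value-residual learning: layer 0's V output (v0) is
    captured once and blended 50/50 into every later layer's V before
    attention. Zero extra parameters."""
    return v if layer_idx == 0 else (1 - mix) * v + mix * v0
```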
VRL hurt slightly. Best config: 10L + LeakyReLU² + lzma + val training
+ LAWA + ramping WD = 1.2302 BPB on 1xH100.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Chidera Ibe and others added 29 commits April 14, 2026 14:50
Adds flash_attn_varlen_func path for within-document attention during
training. Attention is restricted to doc boundaries detected via BOS
token positions in each batch, eliminating cross-doc attention noise.

Changes:
- Import flash_attn_varlen_func alongside flash_attn_3_func
- Add VARLEN_ENABLED and BOS_TOKEN_ID env var hyperparams
- Add _build_cu_seqlens_from_batch helper (detects BOS, builds cu_seqlens)
- Thread cu_seqlens/max_seqlen through CausalSelfAttention -> Block -> GPT
- Branch in attention: varlen when cu_seqlens provided, else flash_attn_3
- Switch torch.compile to fullgraph=False when VARLEN_ENABLED=1 (data-dep branch)
- Training step builds cu_seqlens per batch and passes to model

Eval path unchanged. When VARLEN_ENABLED=0 (default) behavior is identical
to PR openai#1493 reference. Compliance unchanged (training-only change, causality
preserved by causal=True flag).

Reference: PR openai#1530 @samacqua, PR openai#1536 @dexhunter

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
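
The cu_seqlens helper can be sketched as follows; the exact repo signature is assumed, but the output matches what flash_attn_varlen_func expects (an int32 cumulative-length tensor starting at 0 and ending at the total token count):

```python
import torch

def build_cu_seqlens_from_bos(tokens, bos_id):
    """Sketch: detect BOS positions in a flattened 1-D token batch and
    build the cumulative sequence lengths plus the longest doc length."""
    bos = (tokens == bos_id).nonzero(as_tuple=True)[0]
    if bos.numel() == 0 or bos[0] != 0:
        # the batch may start mid-document: treat position 0 as a doc start
        bos = torch.cat([torch.zeros(1, dtype=torch.long), bos])
    cu = torch.cat([bos, torch.tensor([tokens.numel()])]).to(torch.int32)
    max_len = int((cu[1:] - cu[:-1]).max())
    return cu, max_len
```

Because the BOS positions are data-dependent, this is also why the commit switches torch.compile to fullgraph=False when VARLEN_ENABLED=1.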
Implements the paper-aligned variant of TTT-E2E (arxiv:2512.23675).
The paper finds that updating embeddings/attention/norms during
test-time training causes instability — the stable recipe is to
freeze everything except MLP layers in the last 1/4 of blocks.

Gated by TTT_E2E_MODE=1. When enabled:
- Freezes embeddings, attention, norms, skip weights
- Only updates MLP.fc and MLP.proj weights
- Only in blocks with idx >= num_layers * (1 - TTT_E2E_LAST_FRAC)
- Default last_frac=0.25 (paper recommendation)

Compliance: still score-first (scoring happens under no_grad before
SGD step), so all 4 Issue openai#1017 conditions are preserved. The change
only narrows which params get updated — causality, normalization,
score-before-update, and single-pass are all unchanged.

Expected effect: more stable TTT (fewer params → less instability),
potentially better BPB on the legal score-first track.

Reference: End-to-End Test-Time Training for Long Context
(arxiv:2512.23675)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
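
The freeze/unfreeze gating above can be sketched in a few lines; `model.blocks`, `mlp.fc`, and `mlp.proj` are assumed attribute names:

```python
def configure_ttt_e2e(model, last_frac=0.25):
    """Sketch of E2E-TTT gating: freeze all parameters, then re-enable
    only the MLP fc/proj weights in the last `last_frac` of blocks."""
    for p in model.parameters():
        p.requires_grad_(False)
    n = len(model.blocks)
    for i, block in enumerate(model.blocks):
        if i >= n * (1 - last_frac):           # last quarter by default
            block.mlp.fc.weight.requires_grad_(True)
            block.mlp.proj.weight.requires_grad_(True)
```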
Rolled back to PR openai#1493 base, then added only:
- Python 3.11 f-string compatibility fix
- E2E TTT mode (MLP-only, last-fraction of blocks)

E2E TTT gated by TTT_E2E_MODE=1. When enabled:
- Freezes embeddings, attention, norms, skip weights
- Only updates MLP.fc and MLP.proj weights
- Only in blocks with idx >= num_layers * (1 - TTT_E2E_LAST_FRAC)
- Default last_frac=0.25 (paper recommendation)

VarLen removed — we'll add it back later if needed.

Reference: End-to-End Test-Time Training for Long Context (arxiv:2512.23675)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Previously the eval pipeline always ran 4 passes:
  pre-quantization -> quantized -> quantized_sliding_window -> quantized_ttt

On SP1024 this totaled ~700s, over the 600s eval budget. The only eval
that matters for E2E TTT submissions is the final quantized_ttt pass.

Changes:
- New env var SKIP_REDUNDANT_EVALS=1 skips pre-quant, quant, and sliding
  window evals (keeps only quantized_ttt).
- TTT no longer requires sliding_window_enabled=1 (was coupling them
  for no good reason).

Usage for tight eval budget:
  SKIP_REDUNDANT_EVALS=1 TTT_ENABLED=1 TTT_E2E_MODE=1 ...

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adapted from PR openai#1530 @samacqua (linear_leaky_relu_square_kernel).
The kernel fuses matmul(x, W_up.T) with LeakyReLU(0.5)**2 activation
into a single Triton kernel using TMA (Hopper H100). Saves the
(B, T, 4D) pre-activation HBM round-trip in the forward; in backward,
reuses the same kernel to apply the activation gradient to the
incoming grad_output before the weight-gradient matmul.

Gated by FUSED_MLP_ENABLED=1. When set, every Block's MLP uses the
fused kernel during training. Falls back gracefully if Triton or TMA
unavailable.

Reference: PR openai#1530 @samacqua. Expected: 5-10% training speedup on
MLP-dominated blocks, more steps in the 600s cap, ~0.002-0.005 BPB
improvement from additional training.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This is a from-scratch Triton kernel (not just a copy) that fuses
THREE operations into one kernel: RMSNorm (per-row inverse rms)
multiplied by ln_scale, then matmul with W_up, then LeakyReLU(0.5)^2
activation. Saves the (B*T, D=512) x_normed HBM round-trip that
PR openai#1530 leaves on the table.

Two new kernels:
- _rms_inv_kernel: per-row inverse-rms reduction (small)
- _fused_rms_linear_lrs_kernel: takes inv_rms + ln_scale, applies
  the rmsnorm scaling row-wise during the K loop, then matmul +
  activation (extends PR openai#1530's persistent-TMA structure)

Custom backward implements the full RMSNorm chain rule:
  dx = ln_scale * inv_rms * (dx_normed - x * inv_rms^2 * mean(dx_normed*x))
This makes the backward correct without saving x_normed (which would
defeat the HBM savings).

Block.forward branches on mlp.use_fused: when fused, it skips the
eager mlp_norm() call and passes raw x + ln_scale_factor to MLP,
which then runs the fused kernel that does normalization internally.

Gated by FUSED_MLP_ENABLED=1. Eager fallback unchanged.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
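
The RMSNorm chain rule above can be verified numerically against autograd; a sketch in float64 with eps omitted and illustrative shapes (here dxn is the upstream gradient of the scaled, normed output):

```python
import torch

def manual_rms_backward(x, dxn, ln_scale):
    """The custom backward written eagerly:
    dx = ln_scale * inv_rms * (dxn - x * inv_rms^2 * mean(dxn * x))."""
    inv_rms = x.pow(2).mean(-1, keepdim=True).rsqrt()
    corr = (dxn * x).mean(-1, keepdim=True)
    return ln_scale * inv_rms * (dxn - x * inv_rms.pow(2) * corr)

# Compare against autograd through the eager scaled rmsnorm.
x = torch.randn(4, 512, dtype=torch.float64, requires_grad=True)
dxn = torch.randn_like(x)
ln_scale = 1.7
y = ln_scale * x * x.pow(2).mean(-1, keepdim=True).rsqrt()
(auto_dx,) = torch.autograd.grad(y, x, grad_outputs=dxn)
manual_dx = manual_rms_backward(x.detach(), dxn, ln_scale)
```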
Adds _FusedSimpleMLPFn alongside _FusedRMSMLPFn, selectable by
FUSED_MLP_FULL=1 env var. The simple variant does RMSNorm in eager
PyTorch (like PR openai#1530) and only fuses matmul + LeakyReLU^2; my v1
variant (_FusedRMSMLPFn) additionally fuses per-row inv_rms * ln_scale
scaling into the K-loop.

Purpose: A/B test whether my RMSNorm fusion addition is counterproductive.
If simple > v1, per-K scaling overhead eats HBM savings.
If simple == v1, kernel choice is saturated.

Reuses same Triton kernel via FUSE_RMS constexpr branch.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Key precision bugs fixed in the fused kernel:
1. Forward: previously computed aux = lrs(c0)^2 where c0 was bf16.
   Now computes aux = lrs(acc0)^2 in fp32, only downcasts at HBM store.
2. Backward: previously loaded pre as bf16, applied lrs'(pre) in bf16
   to the incoming gradient (also in bf16 before the multiply).
   Now loads pre, upcasts to fp32, applies derivative in fp32, then
   downcasts the final result.

Hypothesis: the precision/throughput inversion observed in v1/v2
(~0.5% faster but worse BPB) was caused by these intermediate bf16
downcasts losing accumulation precision. If this hypothesis is correct,
v3 should match or beat eager BPB while preserving the speedup.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Deep audit (via compare to PR openai#1450/openai#1555 + Triton tutorials + Liger-Kernel)
identified why v1-v3 couldn't beat eager. Three real bugs fixed:

1. EPILOGUE SCALE (was bug #2 = no-speedup cause)
   Old: row_scale applied to `a` INSIDE the K-loop. This serializes the
        TMA->wgmma software pipeline — every A tile needs elementwise
        modification after TMA arrives before wgmma can start, killing
        num_stages=4 pipelining.
   New: accumulator *= row_scale[:, None] in the epilogue, once per tile.
        Algebraically identical because row_scale depends only on rows.
        TMA pipelining preserved.

2. FP32 INV_RMS (was bug #1 = BPB regression cause)
   Old: inv_rms stored as bf16 (7-bit mantissa). Rounded scale propagated
        into pre-activation, discontinuous leaky_relu^2 amplified it,
        and it leaked into backward dw1 and dx.
   New: inv_rms is fp32 end-to-end.

3. L2 SWIZZLE (was bug #3 = 5-15% perf left on table)
   Old: row-major tile iteration thrashes L2 (every SM touches every N
        column of B in first few iterations).
   New: GROUP_SIZE_M=8 grouped scheduling reuses B tiles across 8
        consecutive m-tiles per SM -> better L2 hit rate.

Reference: PR openai#1450/openai#1555 architecture + Triton 09-persistent-matmul
tutorial. These are the known-good Hopper TMA fused MLP patterns.

Expected: v4 should beat v1 (1.1106) AND beat eager (1.1104) if the
audit's diagnosis is correct.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ing)

Kernel now writes act_grad = d/dh[leaky_relu(h)^2] = where(h>0, 2h, 0.5h)
to the aux buffer instead of post = leaky_relu(h)^2.

Forward output semantics:
  Old: c=pre (scaled pre-activation), aux=post
  New: c=post (used for dw2), aux=act_grad (used for dpre multiply)

Backward simplification:
  Old kernel loaded pre from aux, computed where(pre>0, 2*pre, 0.5*pre)
      per tile, multiplied by acc, stored result.
  New kernel loads act_grad directly, just multiplies by acc, stores.
  Saves: tl.where + fp32 multiply + fp32 cast per backward tile.

Matches PR openai#1450's "+10.5% throughput" design. The structural difference
is that forward now computes both post AND act_grad from the same acc
in fp32, making the backward kernel a pure elementwise multiply.

Keeps v4's audit fixes (epilogue scale, fp32 inv_rms, GROUP_SIZE_M=8).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
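
The new act_grad formula is easy to sanity-check against autograd: for h > 0 the activation is h², with slope 2h; for h < 0 it is (0.5h)² = 0.25h², with slope 0.5h. A quick numeric check:

```python
import torch
import torch.nn.functional as F

# Verify that where(h > 0, 2h, 0.5h) == d/dh[leaky_relu(h, 0.5)^2].
h = torch.randn(1000, dtype=torch.float64, requires_grad=True)
post = F.leaky_relu(h, negative_slope=0.5).pow(2)
(auto,) = torch.autograd.grad(post.sum(), h)
act_grad = torch.where(h.detach() > 0, 2 * h.detach(), 0.5 * h.detach())
```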
5-variant systematic ablation of manual Triton MLP fusion at 27M x 600s
x H100. All 5 variants (including audit-guided best practices and
exact PR openai#1450 architecture that claims +10.5% throughput) land within
0.0008 BPB of each other, all worse than torch.compile eager.

Research finding: manual block-level MLP fusion cannot beat
torch.compile's automatic fusion ceiling at this model scale.
Implications for parameter-golf participants documented.

Best variant: v4 (audit fixes) at 1.1107 vs eager 1.1104.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…lash_attn

Replaces the opaque flash_attn_3_func call with PyTorch's native SDPA.
This lets torch.compile trace through the attention mechanism and
potentially fuse it with Q/K/V projections, RoPE, and the output
projection — unlike flash_attn which is a black box to the compiler.

Gated by NATIVE_SDPA=1. GQA handled via repeat_interleave (compatible
with torch 2.4+). torch.compile can dispatch to cuDNN attention backend
on H100, which may be faster than FA3 for some shapes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@ChideraIbe123 ChideraIbe123 changed the title [Non-record] SP8192 + MuonEq-R + Loop@0.42 + RECUR_AB + QAT-lite + Compact Artifact [Non-record] SP8192 + MuonEq-R + Loop@0.42 + RECUR_AB + QAT-lite + Compact Artifact - Val 1.09960971 Apr 28, 2026