
LAWA-EMA frontier fork (pr198 base, SWA -> LAWA val_bpb=1.1551) #201

Open
machdragon wants to merge 1 commit into openai:main from machdragon:submission/lawa-frontier-int6-mlp3x

Conversation


@machdragon machdragon commented Mar 20, 2026

Summary

  • val_bpb = 1.1551 (int6 sliding window, stride=64) | 12.7 MB artifact | 8xH100 SXM, 600s
  • Based on PR #198, "11-Layer Int6 + WD=0.04 + SWA + FA3" (val_bpb: 1.1318): 11L, int6, MLP3x, relu², FA3, SmearGate, BigramHash, OrthoInit, U-Net skips, WD=0.04
  • SWA replaced by LAWA-EMA (exponential moving average, decay=0.995, float32
    shadow, every-step update)
  • Overtone init added (SVD power-law embedding spectrum for smoother int6
    quantization)
  • Two bug fixes: bigram proj zero-init override, sliding window partial-window overlap
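A minimal sketch of the every-step LAWA-EMA update described above, using NumPy for illustration. Only decay=0.995 and the float32 shadow come from the PR text; the function names and interface are assumptions, not the submission's actual `train_gpt.py` code:

```python
import numpy as np

def ema_init(params):
    # Float32 shadow copies of the live parameters (the PR keeps the
    # averaged weights in a float32 shadow regardless of training precision).
    return [np.asarray(p, dtype=np.float32).copy() for p in params]

def ema_update(shadow, params, decay=0.995):
    # Every-step update: s <- decay * s + (1 - decay) * p.
    # decay=0.995 is the value stated in the PR summary.
    for s, p in zip(shadow, params):
        s *= decay
        s += (1.0 - decay) * np.asarray(p, dtype=np.float32)
    return shadow
```

At eval time the shadow weights would presumably be copied back into the model before int6 quantization.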
┌────────────────────────────┬─────────┬─────────┐
│ Metric                     │ PR #198 │ This PR │
├────────────────────────────┼─────────┼─────────┤
│ Int6 sliding val_bpb (s64) │ 1.1318  │ 1.1551  │
│ Int6 roundtrip val_bpb     │ 1.1543  │ 1.1779  │
│ Artifact size              │ 15.7 MB │ 12.7 MB │
│ Steps (600s)               │ 7,412   │ 6,715   │
└────────────────────────────┴─────────┴─────────┘
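The "Overtone init" change is only described as an SVD power-law embedding spectrum. One way that could look, as a hedged sketch (the exponent `alpha`, the seed handling, and the overall scale are illustrative assumptions, not taken from the submission):

```python
import numpy as np

def overtone_init(shape, alpha=1.0, seed=0):
    # Draw a Gaussian matrix, take its SVD, and replace the singular
    # values with a power-law spectrum s_k = k^(-alpha), k = 1..min(shape).
    # The stated motivation in the PR is a smoother spectrum that
    # quantizes to int6 with less error; the exact exponent and scaling
    # used there are not documented.
    rng = np.random.default_rng(seed)
    a = rng.standard_normal(shape)
    u, _, vt = np.linalg.svd(a, full_matrices=False)
    s = np.arange(1, min(shape) + 1, dtype=np.float64) ** (-alpha)
    return (u * s) @ vt
```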

Single-seed run (seed=1337). Additional seed runs are pending for statistical validation.

Test plan

  • Maintainer re-evaluation of train_gpt.py on 8xH100 with official harness
  • Multi-seed validation (seeds 1337, 42, 2025)

Our numbers:

┌────────────────────────────┬────────┐
│ Metric │ Value │
├────────────────────────────┼────────┤
│ Pre-quant val_bpb │ 1.1622 │
├────────────────────────────┼────────┤
│ Int6 roundtrip val_bpb │ 1.1779 │
├────────────────────────────┼────────┤
│ Int6 sliding val_bpb (s64) │ 1.1551 │
└────────────────────────────┴────────┘
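For context on the "Int6 roundtrip" row: it measures val_bpb after quantizing the weights to int6 and dequantizing them. A minimal symmetric per-row int6 quantizer sketch (the scheme, including the [-31, 31] level range and per-row scales, is an assumption; the PR does not document its exact quantizer):

```python
import numpy as np

def int6_quantize(w):
    # Symmetric int6: 6 bits -> integer levels in [-31, 31], one scale per row.
    scale = np.abs(w).max(axis=1, keepdims=True) / 31.0
    scale = np.where(scale == 0.0, 1.0, scale)  # zero rows stay exactly zero
    q = np.clip(np.rint(w / scale), -31, 31).astype(np.int8)
    return q, scale

def int6_dequantize(q, scale):
    return q.astype(np.float32) * scale
```

Under this scheme the roundtrip error is bounded by half a quantization step per row, which is why a smoother weight spectrum (the Overtone-init motivation) can reduce quantization damage.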

Note: our 1.1551 is worse than PR #198's 1.1318. The artifact is 3 MB smaller (12.7 vs
15.7 MB), but the BPB regressed.

11-Layer Int6 + LAWA-EMA (decay=0.995) + Overtone Init, based on PR openai#198.
Replaces SWA with every-step EMA averaging. Fixes bigram proj zero-init
override and sliding window partial-window overlap. 12.7 MB artifact.

8xH100 SXM, 600s, seed=1337, 6715 steps.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@machdragon machdragon force-pushed the submission/lawa-frontier-int6-mlp3x branch from 3693d95 to 31c16a6 Compare March 20, 2026 19:14
@machdragon machdragon changed the title from "Staging: LAWA-EMA frontier fork (pr162 base, SWA→LAWA)" to "LAWA-EMA frontier fork (pr198 base, LAWA-EMA + Overtone Init (val_bpb=1.1551)" Mar 20, 2026
@machdragon machdragon changed the title from "LAWA-EMA frontier fork (pr198 base, LAWA-EMA + Overtone Init (val_bpb=1.1551)" to "LAWA-EMA frontier fork (pr198 base, SWA -> LAWA + Overtone Init (val_bpb=1.1551)" Mar 20, 2026
@machdragon machdragon changed the title from "LAWA-EMA frontier fork (pr198 base, SWA -> LAWA + Overtone Init (val_bpb=1.1551)" to "LAWA-EMA frontier fork (pr198 base, SWA -> LAWA (val_bpb=1.1551)" Mar 20, 2026
@machdragon machdragon changed the title from "LAWA-EMA frontier fork (pr198 base, SWA -> LAWA (val_bpb=1.1551)" to "LAWA-EMA frontier fork (pr198 base, SWA -> LAWA val_bpb=1.1551)" Mar 20, 2026
@machdragon machdragon marked this pull request as ready for review March 20, 2026 19:22
machdragon added a commit to machdragon/parameter-golf that referenced this pull request Mar 20, 2026
Built on PR openai#201 (LAWA-EMA + Int6 + Overtone + MLP3x, val_bpb=1.1551).
Adds four improvements targeting quantization fidelity and eval-time adaptation:

- KURE kurtosis regularization + R2 outlier penalty for int6-friendly weights
- Tanh weight reparameterization bounding effective weights to [-1,1]
- Parallel EMA tracks (0.995/0.999/0.9995) with proxy-eval selection
- Causal LoRA TTT (rank 8) ported from PR openai#77 for eval-time adaptation

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
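The parallel-EMA-tracks idea in the follow-up commit can be sketched as follows. Only the three decay values (0.995/0.999/0.9995) and the proxy-eval selection come from the commit message; the storage layout and the `proxy_eval` interface are assumptions:

```python
import numpy as np

def run_parallel_ema(weight_stream, decays=(0.995, 0.999, 0.9995)):
    # One float32 shadow per decay, all updated on every step.
    it = iter(weight_stream)
    first = np.asarray(next(it), dtype=np.float32)
    shadows = {d: first.copy() for d in decays}
    for p in it:
        p32 = np.asarray(p, dtype=np.float32)
        for d, s in shadows.items():
            s *= d
            s += (1.0 - d) * p32
    return shadows

def select_track(shadows, proxy_eval):
    # Keep the decay whose shadow scores best (lower is better) on a
    # cheap proxy evaluation, e.g. loss on a held-out shard.
    return min(shadows, key=lambda d: proxy_eval(shadows[d]))
```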
@MatoTeziTanka

Community Review — LAWA-EMA frontier fork (pr198 base, SWA -> LAWA val_bpb=1.1551)

BPB: 1.1551 | Compliance: LOOKS CLEAN — pure-neural submission, no TTT/SLOT/n-gram-cache

What I found in the code (head SHA 31c16a6abb90, file records/track_10min_16mb/2026-03-20_LAWA_EMA_Int6_MLP3x_OvertoneInit/train_gpt.py):

Static code review found no TTT adaptation function, no SLOT optimization loop, no n-gram-cache class, and no pre-quant val-token fine-tune. The eval path uses the standard sliding-window stride-64 pattern. The submission is a pure-neural architecture iteration on the standard SP1024/SP4096/SP8192 baseline.
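The "standard sliding-window stride-64 pattern" referenced here scores each validation token with a long left context while advancing only 64 tokens per forward pass. A schematic sketch (the window size, the `nll_fn` interface, and the bookkeeping are assumptions about the harness, not code from the submission):

```python
import numpy as np

def sliding_window_mean_nll(nll_fn, tokens, window=1024, stride=64):
    # nll_fn(chunk) -> per-token negative log-likelihood array for `chunk`.
    # Only the last `stride` positions of each window are newly scored,
    # so every token is evaluated with up to `window` tokens of left context.
    total, count = 0.0, 0
    for start in range(0, len(tokens), stride):
        lo = max(0, start + stride - window)
        chunk = tokens[lo:start + stride]
        nll = np.asarray(nll_fn(chunk))
        new = min(stride, len(tokens) - start)
        total += float(nll[-new:].sum())
        count += new
    return total / count  # mean per-token nll; val_bpb additionally converts nats to bits and normalizes per byte
```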

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.08s, dim=512, layers=9, vocab=1024, code=65258 B, SMOKE_TEST_PASS

Verdict: LOOKS CLEAN.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). No compliance flags from the classification pass — this looks like a clean pure-neural iteration on the standard baseline.

Auto-classification caveat: this review was drafted by the AST-based classifier. If there's a non-standard eval mechanism (logit postprocessing, hedge mixing, etc.) that I missed because it's factored into a helper file or a non-standard function name, please flag it and I'll re-run the audit manually.


Reviewed by @MatoTeziTanka (The Agora). Classification via deterministic AST-based classify_prs.py (pattern bank derived from ~65 manually-reviewed PRs earlier in the 2026-04-11 sweep). This review was auto-drafted from a template and spot-checked before posting; if the template misread your code, please call it out so I can iterate the classifier.

