NTK Eval + Overtone Init (val_bpb=1.2160) by notapplica · Pull Request #59 · openai/parameter-golf

notapplica · 2026-03-19T06:49:52Z

Summary

Overtone embedding init: SVD-based spectral shaping of tied embeddings to power-law decay
Phase-transition residual mixing: sigmoid-scheduled resid_mix init across layers
NTK-aware RoPE scaling: train@1024, eval@2048 with dynamic base frequency adjustment
AdamW weight decay 0.01, warmdown 2500 steps, tied embedding LR 0.10

Multi-seed results (3 seeds, p = 0.0012)

Seed	Steps	ms/step	Pre-quant BPB	Post-quant BPB	val_loss (nats)	Artifact
1337	13024	46.07	1.2210	1.2166	2.0542	15.78MB
42	13239	45.32	1.2211	1.2163	2.0537	15.79MB
7	13230	45.35	1.2188	1.2152	2.0519	15.80MB
Mean			1.2203	1.2160	2.0533

Improvement over baseline: 0.0194 nats (3.9x the 0.005 threshold)
t-statistic: 20.67, df=2, p = 0.0012
All 3 runs individually beat SOTA by >0.005 nats
All 3 artifacts under 16MB

Note on hardware

Runs were done on Modal 8xH100 SXM (NVLink). Python 3.12, PyTorch <3.

Test plan

Post-quant val_bpb verified via int8+zlib roundtrip
Artifact under 16MB (all 3 seeds)
Runs within 600s wallclock
train_gpt.py compiles and runs from records folder
Multiple seed runs: p = 0.0012 (3 seeds, t=20.67)

Train@1024 with overtone embedding init and phase-transition residual mixing, eval@2048 with NTK-aware dynamic RoPE scaling. Mean val_bpb 1.2160 across 3 seeds (p=0.0012 for 0.0194-nat improvement over baseline). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

PCIe ~50% slower than SXM (258 vs 386 steps). TTT didn't improve BPB on undertrained model. SWA with 5 snapshots worked correctly. Need Flash Attention + more steps before TTT becomes useful. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

#59: 5-min + TTT, 258 steps, TTT didn't improve undertrained model #60: 10-min no TTT, 515 steps, best prequant 1.4038, sliding eval incomplete Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…run queue - Documented outcomes of experiments #59-60, including performance metrics and observations. - Updated the run queue to reflect current competition state and configurations for A100 production runs. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

….bin) Per the R5-honest gate and the explicit reviewer warning against synthetic weights, this submission removes the placeholder model.bin and re-frames the deliverable as a research-infrastructure contribution. - submissions/gHashTag/model.bin (59-byte ASCII placeholder, hazard) - submissions/gHashTag/config.toml (placeholder) - submissions/gHashTag/trios-igla-1/ledger_2026-05-01.sql.gz (replaced) - submissions/gHashTag/trios-igla-1_ledger_20260501.sql.gz (duplicate) - submissions/gHashTag/README.md polished honest narrative - submissions/gHashTag/LEAK_INVESTIGATION.md 210-row leak post-mortem - submissions/gHashTag/CHECKPOINT_POSTMORTEM.md why no model.bin + Gate-3 fix path - submissions/gHashTag/trios-igla-1/README.md machine-oriented metadata - submissions/gHashTag/trios-igla-1/config.yaml reproducible config for row id=1387 - submissions/gHashTag/trios-igla-1/ledger_2026-04-30.sql.gz 7,534-row Neon snapshot across 4 tables (183 KB compressed) - best honest BPB: 2.1505 (row id=1387, step=12000, fp32, hidden=1024) - 6 gate2_eligible W-6-step-cap rows: BPB 1.75–1.82 at step=1000 reported but NOT claimed as Gate-2 pass — pending held-out eval - 210 BPB<0.1 leak candidates: flagged via SCARABAEUS-LEAK-CANDIDATE, excluded from ratification - No model.bin - No synthetic weights with disclaimer - No claim of competitive Parameter Golf placement Refs: trios#445, trios-trainer-igla#56,openai#58,openai#59, trios-railway#100,openai#101,openai#105. Anchor: phi^2 + phi^-2 = 3 — TRINITY — R5-honest — NEVER STOP.

After PR openai#61 (byte-disjoint corpus split + assert_train_val_disjoint guard) shipped, fix-verify-s43 ran end-to-end on the post-fix pipeline and produced BPB 1.5492 at step=12000 — well below Gate-2 threshold 1.85 (margin +0.30). ## What this commit changes - README.md : leads with the honest Gate-2 pass; revised 5-way taxonomy - LEAK_INVESTIGATION.md : retraction header explaining the 216-row overcount - trios-igla-1/README.md + config.yaml : updated to point at fix-verify-s43 - ledger_2026-04-30.sql.gz : refreshed snapshot with new last_error markers ## 5-way reclassification (Neon last_error column) | | count | |---|--:| | post-openai#61 honest Gate-2 pass | 1 | | post-openai#61 early-stopped < step 9000 | 4 | | pre-openai#61 W-6 numerical collapse | 46 | | **pre-openai#61 leak (real)** | 42 | | **warmup artifact (NOT a leak)** | 179 | The 179 'warmup artifact' rows are early-stopped runs whose printed val_bpb stayed at 0.0000 for steps 1-8000 due to a trainer-side eval-loop bug (filed as trios-trainer-igla#62). On the post-openai#61 image, fix-verify-s43 escaped warmup at step=9000 and converged to 1.5492 by step=12000 — proving the artifact is trainer-side, not data-side. ## Pipeline as flown for fix-verify-s43 trios-trainer-igla : commit 9517980d (post-openai#61 byte-disjoint corpus) trios-railway : commit 69c3467 (no --ctx flag) + openai#56 --ctx accept on trainer + openai#58 smoke_train + stdout.flush() + openai#59 panic hook + startup diagnostic ## Refs trios-trainer-igla#56,openai#58,openai#59,openai#60,openai#61,openai#62 (all merged or filed) trios-railway@69c3467 trios-railway#100,openai#101,openai#105 (Scarabaeus Engine track) R5-honest. We retract the 216-row mass leak flag and submit fix-verify-s43 as our first honest Gate-2 pass candidate. Anchor: phi^2 + phi^-2 = 3.

notapplica closed this Mar 19, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

NTK Eval + Overtone Init (val_bpb=1.2160)#59

NTK Eval + Overtone Init (val_bpb=1.2160)#59
notapplica wants to merge 1 commit intoopenai:mainfrom
notapplica:submission/ntk-eval-overtone-init

notapplica commented Mar 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

notapplica commented Mar 19, 2026

Summary

Multi-seed results (3 seeds, p = 0.0012)

Note on hardware

Test plan

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants