Record: #1787 + Sparse Gate + Updated Frozen Carry — val_bpb 1.06287 #1798
Closed
leon2k2k2k wants to merge 136 commits into openai:main from
Conversation
Reduces num_layers default from 9 to 7. Adds a RECUR_LAYERS env var (default "3,4") that repeats the specified physical layers once after their last occurrence.

With NUM_LAYERS=7, RECUR_LAYERS=3,4: block_schedule = [0,1,2,3,4, 3,4, 5,6] (9 virtual passes, 7 physical blocks).
Param count: ~13.3M vs baseline ~17M — saves ~3.7M params.

To run baseline (no recurrence): RECUR_LAYERS="" NUM_LAYERS=9
To run this experiment: defaults work (NUM_LAYERS=7, RECUR_LAYERS=3,4)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
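A minimal sketch of how such a schedule could be derived from the two env vars, assuming the repeated span is inserted right after its last occurrence; names are illustrative, not the actual train_gpt.py code:

```python
import os

# Hypothetical reconstruction of the block schedule described above.
num_layers = int(os.environ.get("NUM_LAYERS", "7"))
recur = [int(i) for i in os.environ.get("RECUR_LAYERS", "3,4").split(",") if i]

block_schedule = list(range(num_layers))
if recur:
    # Repeat the listed physical layers once, immediately after the
    # last occurrence of the deepest repeated layer.
    insert_at = max(recur) + 1
    block_schedule[insert_at:insert_at] = recur

# NUM_LAYERS=7, RECUR_LAYERS=3,4 -> [0, 1, 2, 3, 4, 3, 4, 5, 6]
# (9 virtual passes over 7 physical blocks, matching the message above)
print(block_schedule)
```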
Keep NUM_LAYERS=9 (same as baseline ~17M params). RECUR_LAYERS=3,4 gives block_schedule=[0,1,2,3,4,3,4,5,6,7,8] — 11 virtual passes, 9 physical layers. Apples-to-apples vs baseline: same param budget, more effective depth. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
4 structural changes toward SOTA:
1. LeakyReLU(0.5)² instead of ReLU² — better gradient flow
2. Parallel residuals (GPT-J style) from PARALLEL_START_LAYER onward
3. Staged recurrence via RECUR_START_STEP — train plain first, add recurrence later
4. block_schedule_plain stored for the pre-recurrence phase

All controlled by env vars:
PARALLEL_START_LAYER=7 (default, -1 to disable)
RECUR_START_STEP=0 (default, 0 = always on)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
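A sketch of changes 1 and 2 in isolation — a literal reading of "LeakyReLU(0.5)²" and a standard GPT-J parallel-residual block; both are assumptions about the shape of the change, not the repo's code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def leaky_relu_squared(x: torch.Tensor) -> torch.Tensor:
    # Literal LeakyReLU(0.5)^2: unlike ReLU^2, the gradient for x < 0 is
    # nonzero (d/dx (0.5x)^2 = 0.5x), which is the "better gradient flow"
    # claim above.
    y = F.leaky_relu(x, negative_slope=0.5)
    return y * y

class ParallelBlock(nn.Module):
    # GPT-J style: attention and MLP read the same normalized input and
    # their outputs are summed into a single residual update.
    def __init__(self, dim: int, attn: nn.Module, mlp: nn.Module):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn, self.mlp = attn, mlp

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm(x)
        return x + self.attn(h) + self.mlp(h)
```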
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Two stable structural improvements:
1. EMA weights (decay=0.997): keep an exponential moving average of all parameters during training; swap the EMA weights in before save/quantize. Smooths out training noise. ~0.002-0.005 bpb expected improvement.
2. Partial RoPE (ROPE_DIMS=16): only apply rotary position embeddings to the first 16 of 64 head dimensions. The remaining 48 dims are position-free, encoding only content. Top submissions all use this.

Config: ROPE_DIMS=16 (default), EMA_DECAY=0.997 (default).
Set ROPE_DIMS=64 to disable partial RoPE (full rotation). Set EMA_DECAY=0 to disable EMA.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
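Both pieces are standard techniques; a compact sketch under the stated defaults (decay 0.997, rotate 16 of 64 head dims), with simplified shapes rather than the repo's exact implementation:

```python
import torch

class ParamEMA:
    """Shadow copy of all parameters, updated each step; swapped in
    before save/quantize as described above."""
    def __init__(self, model, decay: float = 0.997):
        self.decay = decay
        self.shadow = {k: v.detach().clone().float()
                       for k, v in model.state_dict().items()}

    @torch.no_grad()
    def update(self, model):
        for k, v in model.state_dict().items():
            # shadow = decay * shadow + (1 - decay) * v
            self.shadow[k].lerp_(v.float(), 1.0 - self.decay)

    @torch.no_grad()
    def copy_to(self, model):
        for k, v in model.state_dict().items():
            v.copy_(self.shadow[k].to(v.dtype))

def rotate_half(x: torch.Tensor) -> torch.Tensor:
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat([-x2, x1], dim=-1)

def partial_rope(q, cos, sin, rope_dims: int = 16):
    # Rotate only the first rope_dims of each 64-dim head; the remaining
    # dims stay position-free. cos/sin broadcast over (T, rope_dims).
    q_rot, q_pass = q[..., :rope_dims], q[..., rope_dims:]
    q_rot = q_rot * cos + rotate_half(q_rot) * sin
    return torch.cat([q_rot, q_pass], dim=-1)
```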
Replace fixed INT8 clip percentile with per-row search over 5 candidates [0.999, 0.9995, 0.9999, 0.99999, 1.0]. Pick the clip that minimizes reconstruction MSE per row. Fixes EMA quantization catastrophe (1.2570 pre-quant → 1.3485 post-quant) by adapting to EMA's different weight distribution. Zero training cost — only runs at save time. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
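A sketch of the per-row search as described — five clip quantiles per row, keep the one with the lowest INT8 reconstruction MSE. Illustrative names and a symmetric-quantization assumption, not the repo's exact quantizer:

```python
import torch

CANDIDATES = [0.999, 0.9995, 0.9999, 0.99999, 1.0]

def quantize_rows(w: torch.Tensor):
    """Symmetric INT8 per-row quantization with per-row clip search."""
    rows = w.shape[0]
    best_err = torch.full((rows,), float("inf"))
    best_q = torch.zeros_like(w, dtype=torch.int8)
    best_scale = torch.ones(rows)
    for p in CANDIDATES:
        clip = torch.quantile(w.abs(), p, dim=1)            # per-row clip
        scale = (clip / 127.0).clamp(min=1e-12)
        q = (w / scale[:, None]).round().clamp(-127, 127)
        err = ((q * scale[:, None] - w) ** 2).mean(dim=1)   # recon MSE
        better = err < best_err
        best_err = torch.where(better, err, best_err)
        best_scale = torch.where(better, scale, best_scale)
        best_q[better] = q[better].to(torch.int8)
    return best_q, best_scale
```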
EMA weights quantize catastrophically with both basic INT8 (0.09 bpb loss) and GPTQ-lite (0.11 bpb loss). Root cause unknown — likely dtype or torch.compile interaction. Disable until proper GPTQ is implemented. Partial RoPE 16/64 stays enabled — training curve shows -0.003 improvement. GPTQ-lite stays in code (helps non-EMA quantization too). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Decoded SOTA submission (1.0810 bpb) from LZMA blob into readable train_gpt_sota.py
- Added checkpoint saving at event boundaries (momentum warmup, warmdown, recurrence, EMA)
- Added temporal checkpoints via CKPT_STEPS env var
- Created run_8xh100_10m.sh for competition conditions
- Updated 2xH100 scripts with auto-log + auto-stop
- Saved baseline as train_gpt_baseline.py for reference

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- BigramHash embedding (3072×112) added to train_gpt_sota.py, disabled by default
- hotstart.py: resume/requant/reeval/reema from checkpoints
- run_8xh100_10m.sh: auto-installs brotli, saves to /workspace/runs/
- Tested on 1×H100: checkpoint saving works (8 checkpoints), resume works
- Infrastructure: US-NE-1 volume (hvpdph5i3g), parameter-golf template needs PUBLIC_KEY env var

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Flattens parameter-golf-upstream into the parameter-golf repo root and adds
the research-ops scaffold on a new research branch:
- CLAUDE.md — repo conventions, two session modes, three-phase loop
- EXECUTION.md — execution protocol (hardware ladder, interview, preflight,
artifact shape, stop protocol)
- .claude/skills/{research,execution}.md — role activators
- research/{ideas,specs,evaluations}/ — idea/spec/eval lifecycle dirs
- research/specs/000-sota-replication.md — first spec (baseline validation)
- research/ideas/*.md — 6 Stage 1/2 candidates
- runs/ — execution artifact root (checkpoints ignored, stored on NA-1 volume)
- diary/2026-04-19-record-track-kickoff.md — session narrative
- Existing research notes (experiments.md, sota_analysis.md, ideas.md,
roadmap.md, notes.md, annotations/, logs/) pulled into the repo
- .gitignore merged: upstream data/cache ignores + scaffold runs/ rules
Branching model:
- research (this branch): long-lived, holds scaffold + accumulated
specs/runs/evaluations
- exp/<slug>: short-lived, one per idea, forked from research; the commit
hash gets pinned into the corresponding spec
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Claude Code discovers skills at .claude/skills/<name>/SKILL.md, not .claude/skills/<name>.md. Restructured research and execution accordingly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- CLAUDE.md: add "Branches & worktrees" section documenting the branch model (research long-lived, exp/<slug> short-lived per code-change idea) and the worktree layout (parameter-golf/worktrees/<slug>).
- EXECUTION.md: explicit note that execution sessions do NOT use worktrees — they git checkout the spec's pinned commit on a pod's own clone.
- .gitignore: ignore worktrees/ dir.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- CLAUDE.md: research sets the smoke requirement in the spec's hardware ladder. Required for code changes; may be skipped for hyperparam-only specs on already-validated commits (with citation). When in doubt, smoke.
- EXECUTION.md: execution cannot silently skip a rung. If a spec marks smoke skipped, verify the cited prior run is still current. Otherwise ask the user.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1×H100 smoke rung removed as a default. At near-parity cost with 2×H100, a quick 2×H100 mini gives the same bug-catching signal plus a real bpb datapoint. Historical diary/experiments entries that mention 1×H100 are left untouched as historical record. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ndary set

Resolved during execution-session interview: pin to 01e6fcf on research, switch checkpoint policy from "final only" to the 9-file phase-boundary set for downstream hotstart reuse, and clarify the hardware-ladder smoke waiver.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Both confirmed by decoding the SOTA submission source. Dead config (ttt_hash_buckets, ttt_hash_embed) noted explicitly. Diff section clarified as hyperparam-only. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The cost estimate was training-only (12 min / $3.50). The real wall is ~20 min / $6 once TTT eval (~6 min), sliding-window eval (~2 min), and EMA/quant are counted.

architecture.md: full description of the current SOTA model — dimensions, depth recurrence, parallel residuals, TTT, GPTQ, optimizer stack, checkpointing. Confirmed faithful vs the decoded SOTA source.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Execution: launched spec 000 on an 8×H100 NA-1 pod, commit 01e6fcf, seed 42, with env overrides BIGRAM_VOCAB_SIZE=0 QK_GAIN_INIT=5.25 TTT_ENABLED=1.

Final post-TTT val_bpb 1.08622 — outside the accept window [1.079, 1.083] by +0.0032. The miss is pure throughput: our pod ran at ~85% of the SOTA pod's step rate in the same 588s training window (3849 steps vs 4550), and the ~0.005 bpb deficit tracked cleanly through every eval stage (EMA → quant → sliding → TTT). The code is faithful; the gap is hardware variance in Runpod's H100 pool.

Adds:
- Discord monitor helpers at .claude/scripts/{discord_post,discord_post_table}.sh
- EXECUTION.md Pod operations playbook — runpodctl new-form CLI, SSH access, setsid launch pattern, env persistence, wallclock budgets, throughput variance mitigations, data-path gotcha, kill-fast principle
- EXECUTION.md preflight + stop-protocol updates (brotli, real data path, rsync-before-stop ordering)
- runs/000-sota-replication/{final.json, train.log, launch.out, notes.md, checkpoints.md} — full run artifacts; 9 phase-boundary checkpoints remain on the NA-1 volume (2.7 GB, usable as hotstart seeds)

Plus research-session work already in the working tree:
- research/evaluations/000-sota-replication.md — eval writeup
- research/ideas/{hessian-sdclip, per-group-bit-allocation}.md — new ideas
- research/ideas/{bigram-hash, progressive-recurrence}.md — refined
- research/ideas/per-group-quant.md — removed (superseded)
- experiments.md — row appended

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Hotstart-only screen on 1×H100 using spec 000's ckpt_final_pre_ema_step3849.
Initial λ values: {0.00, 0.05, 0.10} — conservative low-end probe to test
"does it change anything at all?" before filling in higher values.
Code change on exp/hessian-sdclip @ 74c8385. Hessian reuse required across
λ to halve cost. Execution keeps the pod alive after the 3 initial runs
so the user can drive follow-up λ values live without paying re-setup cost.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Covers the 1.08622-vs-1.0810 gap (pure throughput artifact, not code), the decision to adopt 1.08622 as operating baseline, the transition into actual research phase, and the reasoning behind choosing Hessian-SDClip as spec 001 (cheapest screen, throughput-independent, clean A/B). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
All 6 λ values {0.00, 0.05, 0.10, 0.20, 0.40, 0.60} measured on spec-000's
ckpt_final_pre_ema_step3849.pt. Monotonic worsening: Δ from +0.00009 at
λ=0.05 to +0.00158 at λ=0.60. Signal gate not met.
Secondary finding: artifact size exceeds 16MB leaderboard limit at λ≥0.40
(16.02MB, 16.06MB). The `adj = 1 + λ(r−1)` row-scale multiplier hurts
brotli compressibility of int6 matrices.
Validity gate caveat: λ=0.00 produced 1.10518 vs spec-000's 1.10430.
Not a code bug — 1×H100 sees rank-0 calibration shard only vs spec-000's
distributed 8-rank calibration → different Hessian → different GPTQ error
correction. Intra-sweep Δ (same Hessian across all 6) remains valid.
Artifacts:
- runs/001-hessian-sdclip/{summary.md, notes.md, sweep.py, sweep.out,
lambda_*.json, lambdas.txt}
- On NA-1 volume (not in git): hessians.pt (232 MB, reusable), 6 ×
lambda_*.ptz (~96 MB total)
Research: evaluation + experiments.md row + promote/iterate/kill decision
is yours.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
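For reference, the multiplier itself is tiny. What r is per row isn't restated in this message, so the sketch below treats it as an abstract per-row ratio (hypothetical row_ratio) — enough to see why λ=0 is a no-op and why increasing λ spreads the per-row scales:

```python
import torch

def adjusted_scales(scales: torch.Tensor, row_ratio: torch.Tensor,
                    lam: float) -> torch.Tensor:
    # adj = 1 + λ(r − 1): identity at λ=0 (the validity-gate no-op path),
    # full ratio at λ=1. The spread this adds to per-row scales is the
    # suspected cause of the Brotli-size regression noted above.
    return scales * (1.0 + lam * (row_ratio - 1.0))
```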
Cheap post-training screen on 1×H100 using spec-000's post-recurrence
warmdown checkpoints {1500, 2275, 3412, 3849}. 6 configs: EMA-only
control + pure SWA (all 4 / late 3) + three SWA/EMA blend ratios.
Both quant and sliding-window eval per config. Hessian reused across
configs (screening approximation).
Code on exp/swa-plus-ema @ 46c2a92. Baseline is in-sweep C0 (~1.10518
expected per spec 001's 1×H100 Hessian calibration).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Monotonic worsening across all 6 λ values (+0.00009 to +0.00158 vs
control). Kill signal, not noise. Idea shelved with updated status.
Two side-findings worth preserving:
1. Artifact size grows with λ (row-scale multiplier reduces Brotli
efficiency; λ≥0.40 exceeds 16MB limit).
2. 1-GPU vs 8-GPU calibration gives ~+0.0009 bpb offset on the
λ=0 no-op path. Cross-hardware absolute bpb is not comparable;
only intra-sweep Δ is valid. Already accounted for in spec 002.
Cost: $1.90 (~4× over the $0.45 estimate) due to a device-mismatch
bug in the sweep.py Hessian cache reload. Correct pattern ported
to spec 002's swa_sweep.py.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
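The device-mismatch class of bug usually comes down to reloading a cache onto whatever device it was saved from; a minimal sketch of the corrected pattern (path and device are assumptions, not the actual sweep.py fix):

```python
import torch

device = "cuda:0"
# Pin the cached Hessians to the compute device at load time instead of
# trusting the device recorded in the file.
hessians = torch.load("hessians.pt", map_location=device)
```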
Hyperparam-only spec (no code change). Two paired 2×H100 from-scratch runs, same pod, same seed, same TRAIN_LOG_EVERY=200. Control has BIGRAM_VOCAB_SIZE=0 (matches the spec 000 baseline); variant has BIGRAM_VOCAB_SIZE=3072, BIGRAM_DIM=112.

Screens "does BigramHash help?" via matched-step train_loss comparison AND end-of-training pre-quant val_bpb Δ. The artifact will be oversized (~16.2MB) — that's fine; this is a signal screen, not a submission. Budget-fit engineering deferred to spec 004 only if this wins.

Cost: ~$8, ~90 min wall. Early-kill at step 1000 if the variant clearly hurts (saves ~$3). Can run before/after/parallel to spec 002 — independent.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
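A sketch of a bigram-hash embedding consistent with BIGRAM_VOCAB_SIZE=3072 / BIGRAM_DIM=112 — hash each (previous, current) token pair into a bucket and look up a small embedding. The hash mix and how the output joins the token embedding are assumptions, not the repo's code:

```python
import torch
import torch.nn as nn

class BigramHashEmbedding(nn.Module):
    def __init__(self, buckets: int = 3072, dim: int = 112):
        super().__init__()
        self.buckets = buckets
        self.table = nn.Embedding(buckets, dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, T) int64 token ids
        prev = torch.roll(tokens, shifts=1, dims=1)
        prev[:, 0] = 0                           # no bigram at position 0
        h = (prev * 1000003 + tokens) % self.buckets
        return self.table(h)                     # (B, T, dim)
```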
Reuse Exp 24's log as the control instead of running a paired baseline. Variant matches Exp 24's config exactly (QK_GAIN_INIT=5.0, TTT_ENABLED=0, SEED=1337, TRAIN_LOG_EVERY=100, 40-min wallclock cap) — the only difference is BIGRAM_VOCAB_SIZE=3072. Saves ~$4 and ~45 min of pod time.

Caveat: this screens BigramHash on QK=5.0 instead of QK=5.25 (our spec 000 baseline); the two interventions are architecturally orthogonal, so the signal should transfer, but it's not bulletproof. Spec 004 (if this promotes) is the proper full-stack 8×H100 run with the spec-000 config.

Compare train_loss at matched step milestones against Exp 24's log. Accept: variant pre-quant ≤ 1.0847 (Δ ≤ −0.002 vs Exp 24's 1.08670).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
All 6 configs (C0 EMA-only control, C1 SWA-all-4, C2 SWA-late-3,
C3/C4/C5 SWA/EMA blends at 0.5/0.25/0.75) measured on spec-000's
ckpt_final_pre_ema_step3849.pt with cached Hessian. Quant-only eval
for C1–C5 (sliding skipped after C0's sliding took ~12 min on 1×H100,
4× the spec estimate; kept C0's sliding number as reference).
Clean monotonic worsening with SWA fraction:
100% EMA (C0): 1.10518 (base)
75% EMA (C4): 1.11108 (+0.006)
50% EMA (C3): 1.12251 (+0.017)
25% EMA (C5): 1.13532 (+0.030)
0% EMA (C1): 1.14694 (+0.042)
Signal gate NOT met — all Δ positive. Pure EMA beats every SWA
variant. Likely SOTA's EMA(0.9965) over ~3849 steps is already a
much richer moving average than 4-snapshot uniform SWA, and the
warmdown-era snapshots (1500/2275/3412/3849) are from very
different loss-landscape regions.
Validity gate: C0 reproduced spec-001's λ=0 result BITWISE-EXACTLY
(1.1051789806396541). Pipeline is deterministic on fixed
inputs (checkpoint + seed + calibration).
Cost: ~$3.25 (~2× spec estimate, mostly due to an aborted 8×H100
parallel test — swa_sweep.py hardcodes cuda:0, not DDP-aware, so
torchrun --nproc_per_node=8 made 8 ranks race on GPU 0. $1.60
wasted before I caught it).
Artifacts on volume (not in git):
- /workspace/runs/002-swa-plus-ema-1h-c0/hessians.pt (232 MB)
- quantized_C{0..5}.ptz (~96 MB total)
Research: evaluation + experiments.md row + promote/iterate/kill
decision is yours.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
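The six configs reduce to one knob; a sketch of the blend being swept (checkpoint loading and key handling simplified, float-cast assumption):

```python
import torch

def swa_ema_blend(ckpt_paths, ema_state, ema_frac: float):
    """ema_frac=1.0 is C0 (EMA-only); 0.0 is C1 (pure SWA over the
    warmdown checkpoints); 0.75/0.5/0.25 are C4/C3/C5."""
    states = [torch.load(p, map_location="cpu") for p in ckpt_paths]
    swa = {k: torch.stack([s[k].float() for s in states]).mean(dim=0)
           for k in states[0]}
    return {k: ema_frac * ema_state[k].float()
               + (1.0 - ema_frac) * swa[k]
            for k in swa}
```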
Linear monotonic worsening with SWA fraction (+0.006 at 25% SWA,
+0.042 at 100% SWA). Clean kill. EMA-only remains best config.
Two post-training candidates now dead (Hessian-SDClip + SWA+EMA),
both with clean monotonic-worsening signatures. Post-training
ceiling is very low on our stack — SOTA pipeline is near-optimal.
Spec 003 (BigramHash) now load-bearing for the record push.
Secondary findings:
1. 1×H100 sliding eval is ~12 min/config, not 3. Recalibrated.
2. swa_sweep.py (and sweep.py) aren't DDP-aware — multi-GPU
sweep runs would need ~10 lines of LOCAL_RANK + rank-0 guards.
3. C0 reproduced spec 001's λ=0 bitwise-exactly — sweep
infra is deterministic, useful for cross-sweep fingerprinting.
Cost: $2.60 total including $1.60 on an aborted 8×H100 A/B attempt.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
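Finding 2's "~10 lines" would look roughly like the standard torchrun idiom — bind each rank to its own GPU and guard side effects to rank 0. A sketch, not the actual patch:

```python
import os
import torch
import torch.distributed as dist

local_rank = int(os.environ.get("LOCAL_RANK", "0"))
torch.cuda.set_device(local_rank)               # instead of hardcoded cuda:0
device = torch.device(f"cuda:{local_rank}")

if "LOCAL_RANK" in os.environ and not dist.is_initialized():
    dist.init_process_group(backend="nccl")

is_main = (not dist.is_initialized()) or dist.get_rank() == 0
if is_main:
    print("rank-0-only: logging, saving, Hessian cache writes")
```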
- Spec: switch to seed 314 (dexhunter's best), add a 4×H100 screen rung, update accept criteria vs openai#1769, fix the commit description (025c not 025b), fix sanity greps to match d70888f's actual per-pass constants
- Eval 026 seed_42: documents the full three-stage gap analysis — the gap vs openai#1769 is entirely in float (seed quality); GPTQ/TTT are equivalent or better
- Experiments: add row 026 with seed 314 queued
- Ideas: mark match-1769-baseline resolved with root cause

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…aces float-only check
… (alpha=144, WD=1.0)
026 screen seed_314 float was on NE-1 local disk (lost). Use the 026 seed_42 float on JP volume instead. Two runs: A (α=96/WD=0.5 sanity check vs inline 1.06582) + B (α=144/WD=1.0 new stack). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…full-stack

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Training identical to 026 (025b arch, NUM_LOOPS=2, frozen carry). Key fixes vs 026: PHASED_TTT_ENABLED=3 (026 used =1, slow path), commit c3a99b3 (warm-start-A in TTT), seed 314 (better float). Projected post-TTT ~1.060-1.062. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Summary
Results (8×H100 80GB SXM, phased LoRA-TTT, 10-min train / 10-min eval)
Frozen Recurrent Carry
The recurrent α/β carry coefficients (first introduced in #1779) were learned end-to-end on a full training run with no validation set involvement, then quantized to 2 decimal places before this promotion run:
β = [1.56, 1.85, 2.13]
α = [[0.23, 0.04, 0.03], [0.13, −0.34, 0.01], [0.06, 0.19, −0.02]]

Full-precision learned values:
β = [1.5610, 1.8531, 2.1320]
α = [[0.2314, 0.0388, 0.0347], [0.1260, −0.3438, 0.0145], [0.0557, 0.1934, −0.0172]]

The legality of offline-learned frozen scalars was discussed in #1779 — the data-size budget provides a natural bound on this class of technique.
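The PR does not spell out how α/β enter the recurrence; the sketch below is one self-consistent reading (β scales each pass's block output, α mixes earlier carry slots) and is an assumption throughout:

```python
import torch

BETA = torch.tensor([1.56, 1.85, 2.13])
ALPHA = torch.tensor([[0.23, 0.04, 0.03],
                      [0.13, -0.34, 0.01],
                      [0.06, 0.19, -0.02]])

def recur_with_frozen_carry(block, x: torch.Tensor) -> torch.Tensor:
    # Assumed wiring: three recurrence steps, carry slots initialized to
    # the input, each step writing its output back into its own slot.
    state = [x, x, x]
    for i in range(3):
        mix = sum(ALPHA[i, j] * state[j] for j in range(3))
        x = BETA[i] * block(x) + mix
        state[i] = x
    return x
```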
What this adds over #1779
From #1787 (nprime06):
- MIN_LR=0.10 warmdown floor
- GPTQ_RESERVE_SECONDS=0.5, VAL_LOSS_EVERY=0

New in this PR:
- GatedAttn with a narrow-input sparse gate (sketch below)
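The PR text only names the component; as a sketch, a "narrow-input sparse gate" could be a gate computed from a small slice of the block input, with a ReLU so the gate emits exact zeros. Slice width, nonlinearity, and placement are all assumptions:

```python
import torch
import torch.nn as nn

class GatedAttn(nn.Module):
    def __init__(self, attn: nn.Module, dim: int, gate_in: int = 16):
        super().__init__()
        self.attn = attn
        self.gate_in = gate_in
        self.gate = nn.Linear(gate_in, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Gate reads only the first gate_in channels ("narrow input");
        # ReLU produces exact zeros, making the gate sparse.
        g = torch.relu(self.gate(x[..., :self.gate_in]))
        return g * self.attn(x)
```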
Rule Compliance

Test Plan
- train_gpt.py and env vars
- < 16,000,000 bytes in each seed log

🤖 Generated with Claude Code