Record: #1787 + Sparse Gate + Updated Frozen Carry — val_bpb 1.06287 (#1798)

Closed
leon2k2k2k wants to merge 136 commits into openai:main from leon2k2k2k:submission/036-sparse-updated-carry

Conversation


@leon2k2k2k leon2k2k2k commented Apr 24, 2026

Summary

Results (8×H100 80GB SXM, phased LoRA-TTT, 10-min train / 10-min eval)

| Seed | Steps | Post-EMA (pre-quant) | Quantized | Post-TTT | Artifact (bytes) |
|------|-------|----------------------|-----------|----------|------------------|
| 42   | 4989  | 1.06749 | 1.07678 | 1.06366 | 15,909,254 |
| 0    | 4974  | 1.06685 | 1.07608 | 1.06311 | 15,904,209 |
| 1234 | 4973  | 1.06578 | 1.07509 | 1.06183 | 15,909,401 |
| Mean | 4979  | 1.06671 | 1.07598 | 1.06287 | 15,907,621 |

Frozen Recurrent Carry

The recurrent α/β carry coefficients (first introduced in #1779) were learned end-to-end on a full training run with no validation set involvement, then quantized to 2 decimal places before this promotion run:

  • β = [1.56, 1.85, 2.13]
  • α = [[0.23, 0.04, 0.03], [0.13, −0.34, 0.01], [0.06, 0.19, −0.02]]

Full-precision learned values: β = [1.5610, 1.8531, 2.1320], α = [[0.2314, 0.0388, 0.0347], [0.1260, −0.3438, 0.0145], [0.0557, 0.1934, −0.0172]].
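The freeze step is a plain 2-decimal rounding of the learned values. A quick sanity check that the frozen constants above match the full-precision ones:

```python
beta_learned = [1.5610, 1.8531, 2.1320]
alpha_learned = [[0.2314, 0.0388, 0.0347],
                 [0.1260, -0.3438, 0.0145],
                 [0.0557, 0.1934, -0.0172]]

# Quantize to 2 decimal places, as described above
beta_frozen = [round(b, 2) for b in beta_learned]
alpha_frozen = [[round(a, 2) for a in row] for row in alpha_learned]

print(beta_frozen)   # [1.56, 1.85, 2.13]
print(alpha_frozen)  # [[0.23, 0.04, 0.03], [0.13, -0.34, 0.01], [0.06, 0.19, -0.02]]
```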

The legality of offline-learned frozen scalars was discussed in #1779 — the data-size budget provides a natural bound on this class of technique.

What this adds over #1779

From #1787 (nprime06):

  • Polar Express Newton-Schulz coefficients
  • MIN_LR=0.10 warmdown floor
  • Fused softcapped CE
  • GPTQ_RESERVE_SECONDS=0.5, VAL_LOSS_EVERY=0

New in this PR:

  • Sparse attention-output gate — replaces the dense GatedAttn with a narrow-input sparse gate
  • Updated frozen recurrent carry — α/β re-learned on the sparse-gate stack and frozen to 2 decimal places (values above)

Rule Compliance

Test Plan

  • Reviewer reproduces any single seed with the provided train_gpt.py and env vars
  • Verify artifact size < 16,000,000 bytes in each seed log
  • Verify score-first TTT ordering in code

🤖 Generated with Claude Code

leon2k2k2k and others added 30 commits April 17, 2026 22:21
Reduces num_layers default from 9 to 7. Adds RECUR_LAYERS env var (default "3,4")
that repeats specified physical layers once after their last occurrence.

With NUM_LAYERS=7, RECUR_LAYERS=3,4:
  block_schedule = [0,1,2,3,4, 3,4, 5,6]  (9 virtual passes, 7 physical blocks)
  param count: ~13.3M vs baseline ~17M — saves ~3.7M params

To run baseline (no recurrence): RECUR_LAYERS="" NUM_LAYERS=9
To run this experiment: defaults work (NUM_LAYERS=7, RECUR_LAYERS=3,4)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
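The schedule construction described above can be sketched as follows (a minimal reconstruction from this commit message; the actual train_gpt.py implementation may differ):

```python
def build_block_schedule(num_layers: int, recur_layers: str) -> list[int]:
    """Repeat the listed physical layers once, inserted immediately
    after the last occurrence of the final repeated layer."""
    schedule = list(range(num_layers))
    if not recur_layers:
        return schedule  # baseline: no recurrence
    recur = [int(x) for x in recur_layers.split(",")]
    insert_at = schedule.index(recur[-1]) + 1
    return schedule[:insert_at] + recur + schedule[insert_at:]

print(build_block_schedule(7, "3,4"))  # [0, 1, 2, 3, 4, 3, 4, 5, 6]
print(build_block_schedule(9, "3,4"))  # [0, 1, 2, 3, 4, 3, 4, 5, 6, 7, 8]
```

With NUM_LAYERS=7 this gives 9 virtual passes over 7 physical blocks, matching the counts stated above.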
Keep NUM_LAYERS=9 (same as baseline ~17M params). RECUR_LAYERS=3,4 gives
block_schedule=[0,1,2,3,4,3,4,5,6,7,8] — 11 virtual passes, 9 physical layers.

Apples-to-apples vs baseline: same param budget, more effective depth.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
4 structural changes toward SOTA:
1. LeakyReLU(0.5)² instead of ReLU² — better gradient flow
2. Parallel residuals (GPT-J style) from PARALLEL_START_LAYER onward
3. Staged recurrence via RECUR_START_STEP — train plain first, add recurrence later
4. block_schedule_plain stored for pre-recurrence phase

All controlled by env vars:
  PARALLEL_START_LAYER=7 (default, -1 to disable)
  RECUR_START_STEP=0 (default, 0 = always on)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Two stable structural improvements:

1. EMA weights (decay=0.997): keep exponential moving average of all
   parameters during training. Swap EMA weights in before save/quantize.
   Smooths out training noise. ~0.002-0.005 bpb expected improvement.

2. Partial RoPE (ROPE_DIMS=16): only apply rotary position embeddings
   to first 16 of 64 head dimensions. Remaining 48 dims are position-free,
   encoding only content. Top submissions all use this.

Config: ROPE_DIMS=16 (default), EMA_DECAY=0.997 (default)
Set ROPE_DIMS=64 to disable partial RoPE (full rotation).
Set EMA_DECAY=0 to disable EMA.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
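The parameter EMA in point 1 can be sketched as a shadow copy of the weights, updated each step (names and structure are illustrative, not taken from train_gpt.py):

```python
class ParamEMA:
    """Exponential moving average of model parameters (sketch)."""
    def __init__(self, params: dict[str, float], decay: float = 0.997):
        self.decay = decay
        self.shadow = dict(params)  # EMA copy, initialized from current weights

    def update(self, params: dict[str, float]) -> None:
        """Call once per optimizer step; swap self.shadow in before save/quantize."""
        d = self.decay
        for name, value in params.items():
            self.shadow[name] = d * self.shadow[name] + (1.0 - d) * value

params = {"w": 1.0}
ema = ParamEMA(params, decay=0.997)
params["w"] = 2.0
ema.update(params)
print(ema.shadow["w"])  # 0.997*1.0 + 0.003*2.0 ≈ 1.003
```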
Replace fixed INT8 clip percentile with per-row search over 5 candidates
[0.999, 0.9995, 0.9999, 0.99999, 1.0]. Pick the clip that minimizes
reconstruction MSE per row.

Fixes EMA quantization catastrophe (1.2570 pre-quant → 1.3485 post-quant)
by adapting to EMA's different weight distribution.

Zero training cost — only runs at save time.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
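The per-row clip search can be sketched as follows (a minimal NumPy reconstruction from the description above, assuming symmetric INT8 quantization; the actual implementation may differ):

```python
import numpy as np

CANDIDATES = [0.999, 0.9995, 0.9999, 0.99999, 1.0]

def quantize_row(row: np.ndarray, clip_q: float) -> np.ndarray:
    """Symmetric INT8 quantize/dequantize one row, clipping the scale
    at the clip_q quantile of |row|."""
    scale = np.quantile(np.abs(row), clip_q) / 127.0
    if scale == 0.0:
        return np.zeros_like(row)
    q = np.clip(np.round(row / scale), -127, 127)
    return q * scale

def best_clip_per_row(W: np.ndarray) -> list[float]:
    """Pick, per row, the clip quantile minimizing reconstruction MSE."""
    picks = []
    for row in W:
        errs = [np.mean((quantize_row(row, c) - row) ** 2) for c in CANDIDATES]
        picks.append(CANDIDATES[int(np.argmin(errs))])
    return picks

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 1024))
W[0, 0] = 50.0  # an extreme outlier shifts the per-row optimum
picks = best_clip_per_row(W)
print(picks)
```

Because the candidate set includes 1.0 (no clipping), the per-row search can never do worse than the unclipped baseline — it only adapts where clipping helps.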
EMA weights quantize catastrophically with both basic INT8 (0.09 bpb loss)
and GPTQ-lite (0.11 bpb loss). Root cause unknown — likely dtype or
torch.compile interaction. Disable until proper GPTQ is implemented.

Partial RoPE 16/64 stays enabled — training curve shows -0.003 improvement.
GPTQ-lite stays in code (helps non-EMA quantization too).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Decoded SOTA submission (1.0810 bpb) from LZMA blob into readable train_gpt_sota.py
- Added checkpoint saving at event boundaries (momentum warmup, warmdown, recurrence, EMA)
- Added temporal checkpoints via CKPT_STEPS env var
- Created run_8xh100_10m.sh for competition conditions
- Updated 2xH100 scripts with auto-log + auto-stop
- Saved baseline as train_gpt_baseline.py for reference

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- BigramHash embedding (3072×112) added to train_gpt_sota.py, disabled by default
- hotstart.py: resume/requant/reeval/reema from checkpoints
- run_8xh100_10m.sh: auto-installs brotli, saves to /workspace/runs/
- Tested on 1×H100: checkpoint saving works (8 checkpoints), resume works
- Infrastructure: US-NE-1 volume (hvpdph5i3g), parameter-golf template needs PUBLIC_KEY env var

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Flattens parameter-golf-upstream into the parameter-golf repo root and adds
the research-ops scaffold on a new research branch:

- CLAUDE.md — repo conventions, two session modes, three-phase loop
- EXECUTION.md — execution protocol (hardware ladder, interview, preflight,
  artifact shape, stop protocol)
- .claude/skills/{research,execution}.md — role activators
- research/{ideas,specs,evaluations}/ — idea/spec/eval lifecycle dirs
- research/specs/000-sota-replication.md — first spec (baseline validation)
- research/ideas/*.md — 6 Stage 1/2 candidates
- runs/ — execution artifact root (checkpoints ignored, stored on NA-1 volume)
- diary/2026-04-19-record-track-kickoff.md — session narrative
- Existing research notes (experiments.md, sota_analysis.md, ideas.md,
  roadmap.md, notes.md, annotations/, logs/) pulled into the repo
- .gitignore merged: upstream data/cache ignores + scaffold runs/ rules

Branching model:
- research (this branch): long-lived, holds scaffold + accumulated
  specs/runs/evaluations
- exp/<slug>: short-lived, one per idea, forked from research; the commit
  hash gets pinned into the corresponding spec

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Claude Code discovers skills at .claude/skills/<name>/SKILL.md, not
.claude/skills/<name>.md. Restructured research and execution accordingly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- CLAUDE.md: add "Branches & worktrees" section documenting the branch
  model (research long-lived, exp/<slug> short-lived per code-change idea)
  and the worktree layout (parameter-golf/worktrees/<slug>).
- EXECUTION.md: explicit note that execution sessions do NOT use worktrees
  — they git checkout the spec's pinned commit on a pod's own clone.
- .gitignore: ignore worktrees/ dir.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- CLAUDE.md: research sets the smoke requirement in the spec's hardware
  ladder. Required for code changes; may be skipped for hyperparam-only
  specs on already-validated commits (with citation). When in doubt, smoke.
- EXECUTION.md: execution cannot silently skip a rung. If spec marks smoke
  skipped, verify the cited prior run is still current. Otherwise ask user.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1×H100 smoke rung removed as a default. At near-parity cost with 2×H100,
a quick 2×H100 mini gives the same bug-catching signal plus a real bpb
datapoint. Historical diary/experiments entries that mention 1×H100 are
left untouched as historical record.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ndary set

Resolved during execution-session interview: pin to 01e6fcf on research,
switch checkpoint policy from "final only" to the 9-file phase-boundary set
for downstream hotstart reuse, and clarify hardware-ladder smoke waiver.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Both confirmed by decoding the SOTA submission source. Dead config
(ttt_hash_buckets, ttt_hash_embed) noted explicitly. Diff section
clarified as hyperparam-only.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The earlier cost estimate counted training only (~12 min / $3.50). The real wall is ~20 min / $6 once TTT eval (~6 min), sliding-window eval (~2 min), and EMA/quant are counted.

architecture.md: full description of current SOTA model — dimensions,
depth recurrence, parallel residuals, TTT, GPTQ, optimizer stack,
checkpointing. Confirmed faithful vs decoded SOTA source.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Execution: launched spec 000 on 8×H100 NA-1 pod, commit 01e6fcf, seed 42
with env overrides BIGRAM_VOCAB_SIZE=0 QK_GAIN_INIT=5.25 TTT_ENABLED=1.
Final post-TTT val_bpb 1.08622 — outside the accept window [1.079, 1.083]
by +0.0032. The miss is pure throughput: our pod ran at ~85% of the SOTA
pod's step rate in the same 588 s training window (3849 steps vs 4550),
and the ~0.005 bpb deficit tracked cleanly through every eval stage
(EMA → quant → sliding → TTT). Code is faithful; the gap is hardware
variance in Runpod's H100 pool.

Adds:
- Discord monitor helpers at .claude/scripts/{discord_post,discord_post_table}.sh
- EXECUTION.md Pod operations playbook — runpodctl new-form CLI, SSH
  access, setsid launch pattern, env persistence, wallclock budgets,
  throughput variance mitigations, data-path gotcha, kill-fast principle
- EXECUTION.md preflight + stop-protocol updates (brotli, real data
  path, rsync-before-stop ordering)
- runs/000-sota-replication/{final.json, train.log, launch.out,
  notes.md, checkpoints.md} — full run artifacts; 9 phase-boundary
  checkpoints remain on NA-1 volume (2.7 GB, usable as hotstart seeds)

Plus research-session work already in the working tree:
- research/evaluations/000-sota-replication.md — eval writeup
- research/ideas/{hessian-sdclip, per-group-bit-allocation}.md — new ideas
- research/ideas/{bigram-hash, progressive-recurrence}.md — refined
- research/ideas/per-group-quant.md — removed (superseded)
- experiments.md — row appended

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Hotstart-only screen on 1×H100 using spec 000's ckpt_final_pre_ema_step3849.
Initial λ values: {0.00, 0.05, 0.10} — conservative low-end probe to test
"does it change anything at all?" before filling in higher values.

Code change on exp/hessian-sdclip @ 74c8385. Hessian reuse required across
λ to halve cost. Execution keeps the pod alive after the 3 initial runs
so the user can drive follow-up λ values live without paying re-setup cost.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Covers the 1.08622-vs-1.0810 gap (pure throughput artifact, not code),
the decision to adopt 1.08622 as operating baseline, the transition into
actual research phase, and the reasoning behind choosing Hessian-SDClip
as spec 001 (cheapest screen, throughput-independent, clean A/B).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
All 6 λ values {0.00, 0.05, 0.10, 0.20, 0.40, 0.60} measured on spec-000's
ckpt_final_pre_ema_step3849.pt. Monotonic worsening: Δ from +0.00009 at
λ=0.05 to +0.00158 at λ=0.60. Signal gate not met.

Secondary finding: artifact size exceeds 16MB leaderboard limit at λ≥0.40
(16.02MB, 16.06MB). The `adj = 1 + λ(r−1)` row-scale multiplier hurts
brotli compressibility of int6 matrices.

Validity gate caveat: λ=0.00 produced 1.10518 vs spec-000's 1.10430.
Not a code bug — 1×H100 sees rank-0 calibration shard only vs spec-000's
distributed 8-rank calibration → different Hessian → different GPTQ error
correction. Intra-sweep Δ (same Hessian across all 6) remains valid.

Artifacts:
- runs/001-hessian-sdclip/{summary.md, notes.md, sweep.py, sweep.out,
  lambda_*.json, lambdas.txt}
- On NA-1 volume (not in git): hessians.pt (232 MB, reusable), 6 ×
  lambda_*.ptz (~96 MB total)

Research: evaluation + experiments.md row + promote/iterate/kill decision
is yours.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Cheap post-training screen on 1×H100 using spec-000's post-recurrence
warmdown checkpoints {1500, 2275, 3412, 3849}. 6 configs: EMA-only
control + pure SWA (all 4 / late 3) + three SWA/EMA blend ratios.
Both quant and sliding-window eval per config. Hessian reused across
configs (screening approximation).

Code on exp/swa-plus-ema @ 46c2a92. Baseline is in-sweep C0 (~1.10518
expected per spec 001's 1×H100 Hessian calibration).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
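The blend ratios (C3/C4/C5) amount to a per-parameter convex combination of the SWA and EMA weights; a minimal sketch under that assumption (names are illustrative, not from swa_sweep.py):

```python
def swa_average(snapshots: list[dict[str, float]]) -> dict[str, float]:
    """Uniform average of checkpoint snapshots (stochastic weight averaging)."""
    n = len(snapshots)
    return {k: sum(s[k] for s in snapshots) / n for k in snapshots[0]}

def blend_swa_ema(swa: dict[str, float], ema: dict[str, float],
                  ema_frac: float) -> dict[str, float]:
    """ema_frac=1.0 is the pure-EMA control; ema_frac=0.0 is pure SWA."""
    return {k: ema_frac * ema[k] + (1.0 - ema_frac) * swa[k] for k in ema}

snaps = [{"w": 1.0}, {"w": 2.0}, {"w": 3.0}, {"w": 4.0}]
swa = swa_average(snaps)                    # {'w': 2.5}
print(blend_swa_ema(swa, {"w": 3.0}, 0.75))  # 0.75*3.0 + 0.25*2.5 = 2.875
```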
Monotonic worsening across all 6 λ values (+0.00009 to +0.00158 vs
control). Kill signal, not noise. Idea shelved with updated status.

Two side-findings worth preserving:
  1. Artifact size grows with λ (row-scale multiplier reduces Brotli
     efficiency; λ≥0.40 exceeds 16MB limit).
  2. 1-GPU vs 8-GPU calibration gives ~+0.0009 bpb offset on the
     λ=0 no-op path. Cross-hardware absolute bpb is not comparable;
     only intra-sweep Δ is valid. Already accounted for in spec 002.

Cost: $1.90 (~4× over the $0.45 estimate) due to a device-mismatch
bug in the sweep.py Hessian cache reload. Correct pattern ported
to spec 002's swa_sweep.py.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Hyperparam-only spec (no code change). Two paired 2×H100 from-scratch
runs, same pod, same seed, same TRAIN_LOG_EVERY=200. Control has
BIGRAM_VOCAB_SIZE=0 (matches spec 000 baseline); variant has
BIGRAM_VOCAB_SIZE=3072, BIGRAM_DIM=112.

Screens "does BigramHash help?" via matched-step train_loss comparison
AND end-of-training pre-quant val_bpb Δ. Artifact will be oversized
(~16.2MB) — that's fine, this is a signal screen, not a submission.
Budget-fit engineering deferred to spec 004 only if this wins.

Cost: ~$8, ~90 min wall. Early-kill at step 1000 if variant clearly
hurts (saves ~$3).

Can run before/after/parallel to spec 002 — independent.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reuse Exp 24's log as the control instead of running a paired baseline.
Variant matches Exp 24's config exactly (QK_GAIN_INIT=5.0, TTT_ENABLED=0,
SEED=1337, TRAIN_LOG_EVERY=100, 40-min wallclock cap) — only difference
is BIGRAM_VOCAB_SIZE=3072.

Saves ~$4 and ~45 min of pod time. Caveat: screens BigramHash on
QK=5.0 instead of QK=5.25 (our spec 000 baseline); the two interventions
are architecturally orthogonal so the signal should transfer, but it's
not bulletproof. Spec 004 (if this promotes) is the proper full-stack
8×H100 run with the spec-000 config.

Compare train_loss at matched step milestones against Exp 24's log.
Accept: variant pre-quant ≤ 1.0847 (Δ ≤ −0.002 vs Exp 24's 1.08670).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
All 6 configs (C0 EMA-only control, C1 SWA-all-4, C2 SWA-late-3,
C3/C4/C5 SWA/EMA blends at 0.5/0.25/0.75) measured on spec-000's
ckpt_final_pre_ema_step3849.pt with cached Hessian. Quant-only eval
for C1–C5 (sliding skipped after C0's sliding took ~12 min on 1×H100,
4× the spec estimate; kept C0's sliding number as reference).

Clean monotonic worsening with SWA fraction:
  100% EMA (C0): 1.10518 (base)
  75% EMA (C4):  1.11108 (+0.006)
  50% EMA (C3):  1.12251 (+0.017)
  25% EMA (C5):  1.13532 (+0.030)
  0%  EMA (C1):  1.14694 (+0.042)

Signal gate NOT met — all Δ positive. Pure EMA beats every SWA
variant. Likely SOTA's EMA(0.9965) over ~3849 steps is already a
much richer moving average than 4-snapshot uniform SWA, and the
warmdown-era snapshots (1500/2275/3412/3849) are from very
different loss-landscape regions.

Validity gate: C0 reproduced spec-001's λ=0 result BITWISE-EXACTLY
(1.1051789806396541). Pipeline is deterministic on fixed
inputs (checkpoint + seed + calibration).

Cost: ~$3.25 (~2× spec estimate, mostly due to an aborted 8×H100
parallel test — swa_sweep.py hardcodes cuda:0, not DDP-aware, so
torchrun --nproc_per_node=8 made 8 ranks race on GPU 0. $1.60
wasted before I caught it).

Artifacts on volume (not in git):
- /workspace/runs/002-swa-plus-ema-1h-c0/hessians.pt (232 MB)
- quantized_C{0..5}.ptz (~96 MB total)

Research: evaluation + experiments.md row + promote/iterate/kill
decision is yours.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Linear monotonic worsening with SWA fraction (+0.006 at 25% SWA,
+0.042 at 100% SWA). Clean kill. EMA-only remains best config.

Two post-training candidates now dead (Hessian-SDClip + SWA+EMA),
both with clean monotonic-worsening signatures. Post-training
ceiling is very low on our stack — SOTA pipeline is near-optimal.

Spec 003 (BigramHash) now load-bearing for the record push.

Secondary findings:
  1. 1×H100 sliding eval is ~12 min/config, not 3. Recalibrated.
  2. swa_sweep.py (and sweep.py) aren't DDP-aware — multi-GPU
     sweep runs would need ~10 lines of LOCAL_RANK + rank-0 guards.
  3. C0 reproduced spec 001's λ=0 bitwise-exactly — sweep
     infra is deterministic, useful for cross-sweep fingerprinting.

Cost: $2.60 total including $1.60 on an aborted 8×H100 A/B attempt.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leon2k2k2k and others added 27 commits April 22, 2026 22:15
- Spec: switch to seed 314 (dexhunter's best), add 4xH screen rung, update
  accept criteria vs openai#1769, fix commit description (025c not 025b), fix sanity
  greps to match d70888f's actual per-pass constants
- Eval 026 seed_42: documents full three-stage gap analysis — gap vs openai#1769 is
  entirely in float (seed quality), GPTQ/TTT are equivalent or better
- Experiments: add row 026 with seed 314 queued
- Ideas: mark match-1769-baseline resolved with root cause

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
026 screen seed_314 float was on NE-1 local disk (lost). Use the
026 seed_42 float on JP volume instead. Two runs: A (α=96/WD=0.5
sanity check vs inline 1.06582) + B (α=144/WD=1.0 new stack).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…full-stack

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Training identical to 026 (025b arch, NUM_LOOPS=2, frozen carry). Key fixes
vs 026: PHASED_TTT_ENABLED=3 (026 used =1, slow path), commit c3a99b3
(warm-start-A in TTT), seed 314 (better float). Projected post-TTT ~1.060-1.062.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>