Record: #1787 + Sparse Gate + Updated Frozen Carry — val_bpb 1.06287 (#1798)

Closed
leon2k2k2k wants to merge 136 commits into openai:main from leon2k2k2k:submission/036-sparse-updated-carry

Conversation


@leon2k2k2k leon2k2k2k commented Apr 24, 2026

Summary

Results (8×H100 80GB SXM, phased LoRA-TTT, 10-min train / 10-min eval)

| Seed | Steps | Post-EMA (pre-quant) | Quantized | Post-TTT | Artifact (bytes) |
|------|-------|----------------------|-----------|----------|------------------|
| 42   | 4989  | 1.06749 | 1.07678 | 1.06366 | 15,909,254 |
| 0    | 4974  | 1.06685 | 1.07608 | 1.06311 | 15,904,209 |
| 1234 | 4973  | 1.06578 | 1.07509 | 1.06183 | 15,909,401 |
| Mean | 4979  | 1.06671 | 1.07598 | 1.06287 | 15,907,621 |

Frozen Recurrent Carry

The recurrent α/β carry coefficients (first introduced in #1779) were learned end-to-end on a full training run with no validation set involvement, then quantized to 2 decimal places before this promotion run:

  • β = [1.56, 1.85, 2.13]
  • α = [[0.23, 0.04, 0.03], [0.13, −0.34, 0.01], [0.06, 0.19, −0.02]]

Full-precision learned values: β = [1.5610, 1.8531, 2.1320], α = [[0.2314, 0.0388, 0.0347], [0.1260, −0.3438, 0.0145], [0.0557, 0.1934, −0.0172]].
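The freeze step is a plain 2-decimal rounding of the learned values. A quick sanity check that the frozen constants above match the full-precision ones:

```python
beta_learned = [1.5610, 1.8531, 2.1320]
alpha_learned = [[0.2314, 0.0388, 0.0347],
                 [0.1260, -0.3438, 0.0145],
                 [0.0557, 0.1934, -0.0172]]

# Quantize to 2 decimal places, as described above
beta_frozen = [round(b, 2) for b in beta_learned]
alpha_frozen = [[round(a, 2) for a in row] for row in alpha_learned]

print(beta_frozen)   # [1.56, 1.85, 2.13]
print(alpha_frozen)  # [[0.23, 0.04, 0.03], [0.13, -0.34, 0.01], [0.06, 0.19, -0.02]]
```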

The legality of offline-learned frozen scalars was discussed in #1779 — the data-size budget provides a natural bound on this class of technique.

What this adds over #1779

From #1787 (nprime06):

  • Polar Express Newton-Schulz coefficients
  • MIN_LR=0.10 warmdown floor
  • Fused softcapped CE
  • GPTQ_RESERVE_SECONDS=0.5, VAL_LOSS_EVERY=0

New in this PR:

  • Sparse attention-output gate — replaces the dense GatedAttn with a narrow-input sparse gate
  • Updated frozen recurrent carry — α/β re-learned on the sparse-gate stack and frozen to 2 decimal places (values above)

Rule Compliance

Test Plan

  • Reviewer reproduces any single seed with the provided train_gpt.py and env vars
  • Verify artifact size < 16,000,000 bytes in each seed log
  • Verify score-first TTT ordering in code

🤖 Generated with Claude Code

leon2k2k2k and others added 30 commits April 17, 2026 22:21
Reduces num_layers default from 9 to 7. Adds RECUR_LAYERS env var (default "3,4")
that repeats specified physical layers once after their last occurrence.

With NUM_LAYERS=7, RECUR_LAYERS=3,4:
  block_schedule = [0,1,2,3,4, 3,4, 5,6]  (9 virtual passes, 7 physical blocks)
  param count: ~13.3M vs baseline ~17M — saves ~3.7M params

To run baseline (no recurrence): RECUR_LAYERS="" NUM_LAYERS=9
To run this experiment: defaults work (NUM_LAYERS=7, RECUR_LAYERS=3,4)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
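The schedule construction described above can be sketched as follows (a minimal reconstruction from this commit message; the actual train_gpt.py implementation may differ):

```python
def build_block_schedule(num_layers: int, recur_layers: str) -> list[int]:
    """Repeat the listed physical layers once, inserted immediately
    after the last occurrence of the final repeated layer."""
    schedule = list(range(num_layers))
    if not recur_layers:
        return schedule  # baseline: no recurrence
    recur = [int(x) for x in recur_layers.split(",")]
    insert_at = schedule.index(recur[-1]) + 1
    return schedule[:insert_at] + recur + schedule[insert_at:]

print(build_block_schedule(7, "3,4"))  # [0, 1, 2, 3, 4, 3, 4, 5, 6]
print(build_block_schedule(9, "3,4"))  # [0, 1, 2, 3, 4, 3, 4, 5, 6, 7, 8]
```

With NUM_LAYERS=7 this gives 9 virtual passes over 7 physical blocks, matching the counts stated above.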
Keep NUM_LAYERS=9 (same as baseline ~17M params). RECUR_LAYERS=3,4 gives
block_schedule=[0,1,2,3,4,3,4,5,6,7,8] — 11 virtual passes, 9 physical layers.

Apples-to-apples vs baseline: same param budget, more effective depth.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
4 structural changes toward SOTA:
1. LeakyReLU(0.5)² instead of ReLU² — better gradient flow
2. Parallel residuals (GPT-J style) from PARALLEL_START_LAYER onward
3. Staged recurrence via RECUR_START_STEP — train plain first, add recurrence later
4. block_schedule_plain stored for pre-recurrence phase

All controlled by env vars:
  PARALLEL_START_LAYER=7 (default, -1 to disable)
  RECUR_START_STEP=0 (default, 0 = always on)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Two stable structural improvements:

1. EMA weights (decay=0.997): keep exponential moving average of all
   parameters during training. Swap EMA weights in before save/quantize.
   Smooths out training noise. ~0.002-0.005 bpb expected improvement.

2. Partial RoPE (ROPE_DIMS=16): only apply rotary position embeddings
   to first 16 of 64 head dimensions. Remaining 48 dims are position-free,
   encoding only content. Top submissions all use this.

Config: ROPE_DIMS=16 (default), EMA_DECAY=0.997 (default)
Set ROPE_DIMS=64 to disable partial RoPE (full rotation).
Set EMA_DECAY=0 to disable EMA.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
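The parameter EMA in point 1 can be sketched as a shadow copy of the weights, updated each step (names and structure are illustrative, not taken from train_gpt.py):

```python
class ParamEMA:
    """Exponential moving average of model parameters (sketch)."""
    def __init__(self, params: dict[str, float], decay: float = 0.997):
        self.decay = decay
        self.shadow = dict(params)  # EMA copy, initialized from current weights

    def update(self, params: dict[str, float]) -> None:
        """Call once per optimizer step; swap self.shadow in before save/quantize."""
        d = self.decay
        for name, value in params.items():
            self.shadow[name] = d * self.shadow[name] + (1.0 - d) * value

params = {"w": 1.0}
ema = ParamEMA(params, decay=0.997)
params["w"] = 2.0
ema.update(params)
print(ema.shadow["w"])  # 0.997*1.0 + 0.003*2.0 ≈ 1.003
```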
Replace fixed INT8 clip percentile with per-row search over 5 candidates
[0.999, 0.9995, 0.9999, 0.99999, 1.0]. Pick the clip that minimizes
reconstruction MSE per row.

Fixes EMA quantization catastrophe (1.2570 pre-quant → 1.3485 post-quant)
by adapting to EMA's different weight distribution.

Zero training cost — only runs at save time.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
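The per-row clip search can be sketched as follows (a minimal NumPy reconstruction from the description above, assuming symmetric INT8 quantization; the actual implementation may differ):

```python
import numpy as np

CANDIDATES = [0.999, 0.9995, 0.9999, 0.99999, 1.0]

def quantize_row(row: np.ndarray, clip_q: float) -> np.ndarray:
    """Symmetric INT8 quantize/dequantize one row, clipping the scale
    at the clip_q quantile of |row|."""
    scale = np.quantile(np.abs(row), clip_q) / 127.0
    if scale == 0.0:
        return np.zeros_like(row)
    q = np.clip(np.round(row / scale), -127, 127)
    return q * scale

def best_clip_per_row(W: np.ndarray) -> list[float]:
    """Pick, per row, the clip quantile minimizing reconstruction MSE."""
    picks = []
    for row in W:
        errs = [np.mean((quantize_row(row, c) - row) ** 2) for c in CANDIDATES]
        picks.append(CANDIDATES[int(np.argmin(errs))])
    return picks

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 1024))
W[0, 0] = 50.0  # an extreme outlier shifts the per-row optimum
picks = best_clip_per_row(W)
print(picks)
```

Because the candidate set includes 1.0 (no clipping), the per-row search can never do worse than the unclipped baseline — it only adapts where clipping helps.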
EMA weights quantize catastrophically with both basic INT8 (0.09 bpb loss)
and GPTQ-lite (0.11 bpb loss). Root cause unknown — likely dtype or
torch.compile interaction. Disable until proper GPTQ is implemented.

Partial RoPE 16/64 stays enabled — training curve shows -0.003 improvement.
GPTQ-lite stays in code (helps non-EMA quantization too).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Decoded SOTA submission (1.0810 bpb) from LZMA blob into readable train_gpt_sota.py
- Added checkpoint saving at event boundaries (momentum warmup, warmdown, recurrence, EMA)
- Added temporal checkpoints via CKPT_STEPS env var
- Created run_8xh100_10m.sh for competition conditions
- Updated 2xH100 scripts with auto-log + auto-stop
- Saved baseline as train_gpt_baseline.py for reference

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- BigramHash embedding (3072×112) added to train_gpt_sota.py, disabled by default
- hotstart.py: resume/requant/reeval/reema from checkpoints
- run_8xh100_10m.sh: auto-installs brotli, saves to /workspace/runs/
- Tested on 1×H100: checkpoint saving works (8 checkpoints), resume works
- Infrastructure: US-NE-1 volume (hvpdph5i3g), parameter-golf template needs PUBLIC_KEY env var

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Flattens parameter-golf-upstream into the parameter-golf repo root and adds
the research-ops scaffold on a new research branch:

- CLAUDE.md — repo conventions, two session modes, three-phase loop
- EXECUTION.md — execution protocol (hardware ladder, interview, preflight,
  artifact shape, stop protocol)
- .claude/skills/{research,execution}.md — role activators
- research/{ideas,specs,evaluations}/ — idea/spec/eval lifecycle dirs
- research/specs/000-sota-replication.md — first spec (baseline validation)
- research/ideas/*.md — 6 Stage 1/2 candidates
- runs/ — execution artifact root (checkpoints ignored, stored on NA-1 volume)
- diary/2026-04-19-record-track-kickoff.md — session narrative
- Existing research notes (experiments.md, sota_analysis.md, ideas.md,
  roadmap.md, notes.md, annotations/, logs/) pulled into the repo
- .gitignore merged: upstream data/cache ignores + scaffold runs/ rules

Branching model:
- research (this branch): long-lived, holds scaffold + accumulated
  specs/runs/evaluations
- exp/<slug>: short-lived, one per idea, forked from research; the commit
  hash gets pinned into the corresponding spec

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Claude Code discovers skills at .claude/skills/<name>/SKILL.md, not
.claude/skills/<name>.md. Restructured research and execution accordingly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- CLAUDE.md: add "Branches & worktrees" section documenting the branch
  model (research long-lived, exp/<slug> short-lived per code-change idea)
  and the worktree layout (parameter-golf/worktrees/<slug>).
- EXECUTION.md: explicit note that execution sessions do NOT use worktrees
  — they git checkout the spec's pinned commit on a pod's own clone.
- .gitignore: ignore worktrees/ dir.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- CLAUDE.md: research sets the smoke requirement in the spec's hardware
  ladder. Required for code changes; may be skipped for hyperparam-only
  specs on already-validated commits (with citation). When in doubt, smoke.
- EXECUTION.md: execution cannot silently skip a rung. If spec marks smoke
  skipped, verify the cited prior run is still current. Otherwise ask user.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1×H100 smoke rung removed as a default. At near-parity cost with 2×H100,
a quick 2×H100 mini gives the same bug-catching signal plus a real bpb
datapoint. Historical diary/experiments entries that mention 1×H100 are
left untouched as historical record.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ndary set

Resolved during execution-session interview: pin to 01e6fcf on research,
switch checkpoint policy from "final only" to the 9-file phase-boundary set
for downstream hotstart reuse, and clarify hardware-ladder smoke waiver.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Both confirmed by decoding the SOTA submission source. Dead config
(ttt_hash_buckets, ttt_hash_embed) noted explicitly. Diff section
clarified as hyperparam-only.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The earlier cost estimate counted training only (~12 min / $3.50). The real wall is ~20 min / $6 once TTT eval (~6 min), sliding-window eval (~2 min), and EMA/quant are counted.

architecture.md: full description of current SOTA model — dimensions,
depth recurrence, parallel residuals, TTT, GPTQ, optimizer stack,
checkpointing. Confirmed faithful vs decoded SOTA source.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Execution: launched spec 000 on 8×H100 NA-1 pod, commit 01e6fcf, seed 42
with env overrides BIGRAM_VOCAB_SIZE=0 QK_GAIN_INIT=5.25 TTT_ENABLED=1.
Final post-TTT val_bpb 1.08622 — outside the accept window [1.079, 1.083]
by +0.0032. The miss is pure throughput: our pod ran at ~85% of the SOTA
pod's step rate in the same 588 s training window (3849 steps vs 4550),
and the ~0.005 bpb deficit tracked cleanly through every eval stage
(EMA → quant → sliding → TTT). Code is faithful; the gap is hardware
variance in Runpod's H100 pool.

Adds:
- Discord monitor helpers at .claude/scripts/{discord_post,discord_post_table}.sh
- EXECUTION.md Pod operations playbook — runpodctl new-form CLI, SSH
  access, setsid launch pattern, env persistence, wallclock budgets,
  throughput variance mitigations, data-path gotcha, kill-fast principle
- EXECUTION.md preflight + stop-protocol updates (brotli, real data
  path, rsync-before-stop ordering)
- runs/000-sota-replication/{final.json, train.log, launch.out,
  notes.md, checkpoints.md} — full run artifacts; 9 phase-boundary
  checkpoints remain on NA-1 volume (2.7 GB, usable as hotstart seeds)

Plus research-session work already in the working tree:
- research/evaluations/000-sota-replication.md — eval writeup
- research/ideas/{hessian-sdclip, per-group-bit-allocation}.md — new ideas
- research/ideas/{bigram-hash, progressive-recurrence}.md — refined
- research/ideas/per-group-quant.md — removed (superseded)
- experiments.md — row appended

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Hotstart-only screen on 1×H100 using spec 000's ckpt_final_pre_ema_step3849.
Initial λ values: {0.00, 0.05, 0.10} — conservative low-end probe to test
"does it change anything at all?" before filling in higher values.

Code change on exp/hessian-sdclip @ 74c8385. Hessian reuse required across
λ to halve cost. Execution keeps the pod alive after the 3 initial runs
so the user can drive follow-up λ values live without paying re-setup cost.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Covers the 1.08622-vs-1.0810 gap (pure throughput artifact, not code),
the decision to adopt 1.08622 as operating baseline, the transition into
actual research phase, and the reasoning behind choosing Hessian-SDClip
as spec 001 (cheapest screen, throughput-independent, clean A/B).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
All 6 λ values {0.00, 0.05, 0.10, 0.20, 0.40, 0.60} measured on spec-000's
ckpt_final_pre_ema_step3849.pt. Monotonic worsening: Δ from +0.00009 at
λ=0.05 to +0.00158 at λ=0.60. Signal gate not met.

Secondary finding: artifact size exceeds 16MB leaderboard limit at λ≥0.40
(16.02MB, 16.06MB). The `adj = 1 + λ(r−1)` row-scale multiplier hurts
brotli compressibility of int6 matrices.

Validity gate caveat: λ=0.00 produced 1.10518 vs spec-000's 1.10430.
Not a code bug — 1×H100 sees rank-0 calibration shard only vs spec-000's
distributed 8-rank calibration → different Hessian → different GPTQ error
correction. Intra-sweep Δ (same Hessian across all 6) remains valid.

Artifacts:
- runs/001-hessian-sdclip/{summary.md, notes.md, sweep.py, sweep.out,
  lambda_*.json, lambdas.txt}
- On NA-1 volume (not in git): hessians.pt (232 MB, reusable), 6 ×
  lambda_*.ptz (~96 MB total)

Research: evaluation + experiments.md row + promote/iterate/kill decision
is yours.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Cheap post-training screen on 1×H100 using spec-000's post-recurrence
warmdown checkpoints {1500, 2275, 3412, 3849}. 6 configs: EMA-only
control + pure SWA (all 4 / late 3) + three SWA/EMA blend ratios.
Both quant and sliding-window eval per config. Hessian reused across
configs (screening approximation).

Code on exp/swa-plus-ema @ 46c2a92. Baseline is in-sweep C0 (~1.10518
expected per spec 001's 1×H100 Hessian calibration).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
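The blend ratios (C3/C4/C5) amount to a per-parameter convex combination of the SWA and EMA weights; a minimal sketch under that assumption (names are illustrative, not from swa_sweep.py):

```python
def swa_average(snapshots: list[dict[str, float]]) -> dict[str, float]:
    """Uniform average of checkpoint snapshots (stochastic weight averaging)."""
    n = len(snapshots)
    return {k: sum(s[k] for s in snapshots) / n for k in snapshots[0]}

def blend_swa_ema(swa: dict[str, float], ema: dict[str, float],
                  ema_frac: float) -> dict[str, float]:
    """ema_frac=1.0 is the pure-EMA control; ema_frac=0.0 is pure SWA."""
    return {k: ema_frac * ema[k] + (1.0 - ema_frac) * swa[k] for k in ema}

snaps = [{"w": 1.0}, {"w": 2.0}, {"w": 3.0}, {"w": 4.0}]
swa = swa_average(snaps)                    # {'w': 2.5}
print(blend_swa_ema(swa, {"w": 3.0}, 0.75))  # 0.75*3.0 + 0.25*2.5 = 2.875
```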
Monotonic worsening across all 6 λ values (+0.00009 to +0.00158 vs
control). Kill signal, not noise. Idea shelved with updated status.

Two side-findings worth preserving:
  1. Artifact size grows with λ (row-scale multiplier reduces Brotli
     efficiency; λ≥0.40 exceeds 16MB limit).
  2. 1-GPU vs 8-GPU calibration gives ~+0.0009 bpb offset on the
     λ=0 no-op path. Cross-hardware absolute bpb is not comparable;
     only intra-sweep Δ is valid. Already accounted for in spec 002.

Cost: $1.90 (~4× over the $0.45 estimate) due to a device-mismatch
bug in the sweep.py Hessian cache reload. Correct pattern ported
to spec 002's swa_sweep.py.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Hyperparam-only spec (no code change). Two paired 2×H100 from-scratch
runs, same pod, same seed, same TRAIN_LOG_EVERY=200. Control has
BIGRAM_VOCAB_SIZE=0 (matches spec 000 baseline); variant has
BIGRAM_VOCAB_SIZE=3072, BIGRAM_DIM=112.

Screens "does BigramHash help?" via matched-step train_loss comparison
AND end-of-training pre-quant val_bpb Δ. Artifact will be oversized
(~16.2MB) — that's fine, this is a signal screen, not a submission.
Budget-fit engineering deferred to spec 004 only if this wins.

Cost: ~$8, ~90 min wall. Early-kill at step 1000 if variant clearly
hurts (saves ~$3).

Can run before/after/parallel to spec 002 — independent.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reuse Exp 24's log as the control instead of running a paired baseline.
Variant matches Exp 24's config exactly (QK_GAIN_INIT=5.0, TTT_ENABLED=0,
SEED=1337, TRAIN_LOG_EVERY=100, 40-min wallclock cap) — only difference
is BIGRAM_VOCAB_SIZE=3072.

Saves ~$4 and ~45 min of pod time. Caveat: screens BigramHash on
QK=5.0 instead of QK=5.25 (our spec 000 baseline); the two interventions
are architecturally orthogonal so the signal should transfer, but it's
not bulletproof. Spec 004 (if this promotes) is the proper full-stack
8×H100 run with the spec-000 config.

Compare train_loss at matched step milestones against Exp 24's log.
Accept: variant pre-quant ≤ 1.0847 (Δ ≤ −0.002 vs Exp 24's 1.08670).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
All 6 configs (C0 EMA-only control, C1 SWA-all-4, C2 SWA-late-3,
C3/C4/C5 SWA/EMA blends at 0.5/0.25/0.75) measured on spec-000's
ckpt_final_pre_ema_step3849.pt with cached Hessian. Quant-only eval
for C1–C5 (sliding skipped after C0's sliding took ~12 min on 1×H100,
4× the spec estimate; kept C0's sliding number as reference).

Clean monotonic worsening with SWA fraction:
  100% EMA (C0): 1.10518 (base)
  75% EMA (C4):  1.11108 (+0.006)
  50% EMA (C3):  1.12251 (+0.017)
  25% EMA (C5):  1.13532 (+0.030)
  0%  EMA (C1):  1.14694 (+0.042)

Signal gate NOT met — all Δ positive. Pure EMA beats every SWA
variant. Likely SOTA's EMA(0.9965) over ~3849 steps is already a
much richer moving average than 4-snapshot uniform SWA, and the
warmdown-era snapshots (1500/2275/3412/3849) are from very
different loss-landscape regions.

Validity gate: C0 reproduced spec-001's λ=0 result BITWISE-EXACTLY
(1.1051789806396541). Pipeline is deterministic on fixed
inputs (checkpoint + seed + calibration).

Cost: ~$3.25 (~2× spec estimate, mostly due to an aborted 8×H100
parallel test — swa_sweep.py hardcodes cuda:0, not DDP-aware, so
torchrun --nproc_per_node=8 made 8 ranks race on GPU 0. $1.60
wasted before I caught it).

Artifacts on volume (not in git):
- /workspace/runs/002-swa-plus-ema-1h-c0/hessians.pt (232 MB)
- quantized_C{0..5}.ptz (~96 MB total)

Research: evaluation + experiments.md row + promote/iterate/kill
decision is yours.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Linear monotonic worsening with SWA fraction (+0.006 at 25% SWA,
+0.042 at 100% SWA). Clean kill. EMA-only remains best config.

Two post-training candidates now dead (Hessian-SDClip + SWA+EMA),
both with clean monotonic-worsening signatures. Post-training
ceiling is very low on our stack — SOTA pipeline is near-optimal.

Spec 003 (BigramHash) now load-bearing for the record push.

Secondary findings:
  1. 1×H100 sliding eval is ~12 min/config, not 3. Recalibrated.
  2. swa_sweep.py (and sweep.py) aren't DDP-aware — multi-GPU
     sweep runs would need ~10 lines of LOCAL_RANK + rank-0 guards.
  3. C0 reproduced spec 001's λ=0 bitwise-exactly — sweep
     infra is deterministic, useful for cross-sweep fingerprinting.

Cost: $2.60 total including $1.60 on an aborted 8×H100 A/B attempt.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leon2k2k2k and others added 27 commits April 22, 2026 22:15
- Spec: switch to seed 314 (dexhunter's best), add 4xH screen rung, update
  accept criteria vs openai#1769, fix commit description (025c not 025b), fix sanity
  greps to match d70888f's actual per-pass constants
- Eval 026 seed_42: documents full three-stage gap analysis — gap vs openai#1769 is
  entirely in float (seed quality), GPTQ/TTT are equivalent or better
- Experiments: add row 026 with seed 314 queued
- Ideas: mark match-1769-baseline resolved with root cause

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
026 screen seed_314 float was on NE-1 local disk (lost). Use the
026 seed_42 float on JP volume instead. Two runs: A (α=96/WD=0.5
sanity check vs inline 1.06582) + B (α=144/WD=1.0 new stack).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…full-stack

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Training identical to 026 (025b arch, NUM_LOOPS=2, frozen carry). Key fixes
vs 026: PHASED_TTT_ENABLED=3 (026 used =1, slow path), commit c3a99b3
(warm-start-A in TTT), seed 314 (better float). Projected post-TTT ~1.060-1.062.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>