
Record: val_bpb = 1.06128 SmearGate BOS Fix + PR #1787 Base + Smear Gate + LQER Asymmetric + Phased TTT (indirect 3 seed mean) #1851

Merged: cocohearts merged 4 commits into openai:main from aquariouseworkman:main on Apr 30, 2026

Conversation


aquariouseworkman (Contributor) commented Apr 27, 2026

Record: SmearGate BOS Fix + PR #1787 Base + Smear Gate + LQER Asymmetric + Phased TTT

val_bpb = 1.06128 | ~15.95 MB | 8xH100 SXM

Result

| Seed | Pre-TTT BPB | Post-TTT BPB | Artifact (bytes) |
|------|-------------|--------------|------------------|
| 42   | 1.07406     | 1.06128      | 15,952,086       |

Merged SOTA (PR #1493): 1.0810 BPB. Delta: -0.0197 BPB. Clears the 0.005-nat threshold.
3-seed validation (seeds 42 / 314 / 1234) is provided in PR #1868, which evaluates this exact record package. The mean across those three seeds is reported there.

Key Change: SmearGate BOS Document Boundary Fix

Builds on PR #1797 stack (PR #1787 base + SmearGate + LQER Asymmetric) but fixes the SmearGate cross-document leakage bug identified by @cocohearts in PR #1797 audit.

The bug: SmearGate 1-token causal lookback does not mask BOS positions, so the final token of document N smears into BOS of document N+1.

The fix (applied in both forward_logits and forward_ttt):

```python
# Zero the smear gate wherever the *current* token is BOS (id 1), so the
# last token of document N cannot smear into the BOS of document N+1.
bos_mask = (input_ids[:, 1:] == 1).unsqueeze(-1)  # [B, T-1, 1]
g = g.masked_fill(bos_mask, 0.0)
```
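As a minimal pure-Python sketch of the same masking logic (scalar values stand in for the tensor code; BOS id = 1 as in the PR, and the gate value 0.5 is illustrative, not the trained gate):

```python
BOS_ID = 1

def smear(values, input_ids, gate=0.5):
    """Mix each position with its predecessor, except where the current
    token is BOS (a document boundary): there the smear contribution is
    forced to zero so doc N's last token cannot leak into doc N+1's BOS."""
    out = [values[0]]  # position 0 has no predecessor
    for t in range(1, len(values)):
        g = 0.0 if input_ids[t] == BOS_ID else gate
        out.append(values[t] + g * values[t - 1])
    return out

# Two packed documents, [5, 6] and [7, 8], separated by BOS (=1).
ids = [1, 5, 6, 1, 7, 8]
vals = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]
mixed = smear(vals, ids)
# Position 3 is the second document's BOS: gate masked to 0, no leakage.
assert mixed[3] == 40.0
# Position 2 is inside a document: the normal 1-token smear applies.
assert mixed[2] == 30.0 + 0.5 * 20.0
```

Without the `input_ids[t] == BOS_ID` branch, position 3 would pick up `gate * 30.0` from the previous document, which is exactly the leakage the fix removes.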

Technique Stack

| Component | Origin |
|-----------|--------|
| CaseOps bijective case transform | PR #1729 / PR #1736 |
| SparseAttnGate | PR #1787 (nprime06) |
| SmearGate + BOS fix | PR #1797 + this submission |
| LQER asymmetric rank-4 | PR #1797 |
| Phased TTT (score-first, 3 phases) | PR #1394 / PR #1736 |
| PolarNS + MIN_LR=0.1 + FusedCE | PR #1787 |
| Full Hessian GPTQ + Brotli | PR #1019 / PR #1530 |

Architecture

11L x 512d x 8H/4KV, MLP 4x, LeakyReLU(0.5)^2, Partial RoPE (16/64 dims), layerwise LN scale, tied embeddings, logit softcap=30.0. Depth recurrence: layers 3-5 looped x2 (activated at frac=0.35). Parallel residuals from layer 8. XSA on all 11 layers. SmearGate window=12.

Compliance

  • Artifact <= 16,000,000 bytes: 15,952,086 bytes
  • train_time <= 600s: 599.6s
  • eval_time <= 600s: 519.5s
  • Issue #1017 ("A Field Guide to Valid Submissions"), Conditions 1-4: all satisfied. The SmearGate BOS mask ensures no cross-document leakage.

Credits

Credits
@nprime06 -- PR openai#1787 base stack
@romeerp -- CaseOps transform (PR openai#1729)
@dexhunter -- SmearGate + LQER (PR openai#1797)
@cocohearts -- Identifying SmearGate BOS bug
@abaybektursun -- Score-first TTT (PR openai#549)
@clarkkev -- GPTQ SDClip + SP8192 (PR openai#1394)

h1beee commented Apr 27, 2026

need results on 3 different seeds

sunnypatneedi pushed a commit to sunnypatneedi/parameter-golf that referenced this pull request Apr 27, 2026
… required; PR openai#1848 BPB risk; Day 18 plateau; Session 23

- Merged SOTA still 1.0810 (Day 18, no change since Apr 9)
- PPM-D byte mixture confirmed by dexhunter at 1.0322 (PR openai#1857, self-closed)
- SmearGate BOS bug documented: prev-token leaks at document boundaries; fix required
- PR openai#1848 (newjordan, 0.87980) flagged BPB risk: sibling PR openai#1846 closed same day
- PR openai#1858 (0.9946) only covers 8M/40.5M tokens — not leaderboard-comparable
- PR openai#1855 (codemath3000, 1.06108) and openai#1851 (aquariouseworkman, 1.06128) both clean
- PPM-D wave: PRs openai#1850, openai#1854, openai#1835 await organizer ruling
- Added Session 23 lessons to CLAUDE.md
- 3 days to deadline (Apr 30) — final GPU run window

https://claude.ai/code/session_01RmJtLYUmKNzDgDVTnWoKzU

aquariouseworkman commented Apr 27, 2026

need results on 3 different seeds

I ran out of credits on RunPod. This amazing person validated the other 2 for me!

see: #1868

Meirzhan05 added a commit to Meirzhan05/parameter-golf that referenced this pull request Apr 28, 2026
Forward-1-token residual mixer at embedding lane:
  x_t <- x_t + lambda * sigmoid(W * x_t[:12]) * x_{t-1}

The model gets a learnable bias toward bigram features without needing
attention to discover it. Tiny (13 params total: 12-wide linear + scalar lambda).
Zero-init lambda = transparent at start.

BOS-fix prevents cross-document leakage during packed training: gate is
masked to 0 at positions where input_ids == BOS_TOKEN_ID (default 1).

Both smear_gate.weight and smear_lambda match 'smear' pattern -> route to
scalar AdamW, not Muon. Both at GPT-level (not blocks), so explicitly
appended to scalar_params in Optimizers.
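The mixer in that commit can be sketched in plain Python (dimensions and weight values here are illustrative; the actual module is a 12-wide linear plus a scalar lambda, 13 parameters total):

```python
import math

def smear_mix(x_prev, x_cur, w, lam):
    """x_t <- x_t + lambda * sigmoid(w . x_t[:12]) * x_{t-1}
    w: 12 gate weights; lam: scalar. Zero-init lambda makes the mixer
    transparent at the start of training."""
    s = sum(wi * xi for wi, xi in zip(w, x_cur[:12]))
    gate = lam * (1.0 / (1.0 + math.exp(-s)))  # sigmoid gate scaled by lambda
    return [c + gate * p for c, p in zip(x_cur, x_prev)]

w = [0.1] * 12           # toy 12-wide linear
x_prev = [1.0] * 16      # previous token's embedding (toy width 16)
x_cur = [2.0] * 16       # current token's embedding

# Zero-init lambda: output is exactly the input (transparent at start).
assert smear_mix(x_prev, x_cur, w, lam=0.0) == x_cur
```

With a nonzero lambda the model gets a learnable bias toward bigram features, as the commit describes, without attention having to discover it.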
Fija pushed a commit to Fija/parameter-golf that referenced this pull request Apr 28, 2026
- Adds 2-line BOS mask in both forward_logits and forward_ttt SmearGate
  paths. Before fix, the last token of doc N smeared into the BOS of doc
  N+1 — model-quality bug, not a C1 issue. Identical fix to PR openai#1851
  @aquariouseworkman, audit by @cocohearts.

- runpod/phase_g_3seed.sh: full 3-seed driver. Sets PR openai#1797 stack env
  vars + the PR openai#1855 9-hparam greedy stack delta:
    MLP_CLIP_SIGMAS=11.5 EMBED_CLIP_SIGMAS=14.0 WARMDOWN_FRAC=0.85
    BETA2=0.99 TTT_BETA2=0.99 TTT_WEIGHT_DECAY=0.5 TTT_LORA_RANK=80
    SPARSE_ATTN_GATE_SCALE=0.5 PHASED_TTT_PREFIX_DOCS=2500
  Mixers (NGRAM/TEMP) stay OFF — pure neural baseline + bug fix +
  hparam stack. Auto-runs Welch t-test vs PR openai#1797 (1.06157±0.00066).

- TTT 4-epoch (PR openai#1812) explicitly NOT adopted: that scheme targets the
  PR openai#1493 SGD-on-whole-model TTT path, not the PR openai#1797 LoRA-phased
  per-doc-reset path we're on. No clean mapping.

Legality: all 16/16 unit tests still pass. BOS fix preserves causality
(it only zeroes a gate at positions where current token is BOS, never
references future tokens).
@aquariouseworkman aquariouseworkman changed the title Record: val_bpb = 1.06128 SmearGate BOS Fix + PR #1787 Base + Smear Gate + LQER Asymmetric + Phased TTT Record: val_bpb = 1.06128 SmearGate BOS Fix + PR #1787 Base + Smear Gate + LQER Asymmetric + Phased TTT (3 seed mean) Apr 28, 2026

aquariouseworkman commented Apr 28, 2026

Correction to the last comment (edited): see #1868 for the 3-seed mean test.

@aquariouseworkman aquariouseworkman changed the title Record: val_bpb = 1.06128 SmearGate BOS Fix + PR #1787 Base + Smear Gate + LQER Asymmetric + Phased TTT (3 seed mean) Record: val_bpb = 1.06128 SmearGate BOS Fix + PR #1787 Base + Smear Gate + LQER Asymmetric + Phased TTT (indirect 3 seed mean) Apr 28, 2026
lijuncheng16 added a commit to lijuncheng16/parameter-golf that referenced this pull request Apr 28, 2026
S9 stack extended with caseops dataset support and LQER (Low-rank
Quantization Error Rescue). 4487 lines vs train_gpt_s9.py's 4363.
This is the script used in PR openai#1851 stage 1/2 ablations (cells A0–F4
in stage 1, Z0/P*/Q*/R* in stage 2).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 28, 2026
PR openai#1902 (cocohearts) accepted openai#1851/openai#1868 over openai#1736 and excluded openai#1855
only on significance grounds (p=0.325). Our prior 050 line built on openai#1797
which is under validity-cloud per cocohearts. Re-anchor research baseline
on openai#1855's accepted chain.

Pure port — zero modifications. Files copied verbatim from
codemath3000/parameter-golf:submission/sp8192-lqer-bos-smear-fix-9hp-stack
@ 1e43966 into records/track_10min_16mb/2026-04-29_PR1855_Port_Baseline/.

Spec 060B+ will fork exp/060B-* etc. to stack quant-repair / deploy-time
levers (046B-tight SDClip, 046L deploy-time repair, 046G-tighter, etc.)
on this baseline.
lijuncheng16 added a commit to lijuncheng16/parameter-golf that referenced this pull request Apr 28, 2026
5252-line training script reproducing PR openai#1851's stack with extensive
inline annotations (CN comments). Mandatory FA3 import (no SDPA fallback)
and direct Triton kernel use. Sibling to train_gpt_s9*.py variants.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 28, 2026
PR openai#1898 (X-Abhishek-X) ran Partial SpinQuant + EMBED_BITS=6 reinvest on
the same chain and reported val_bpb 1.06614 vs their base openai#1851's 1.06128
= +0.00486 REGRESSION. Their PR framed it as -0.01486 vs the 2-week-old
merged SOTA openai#1493 (1.0810) instead of vs their actual parent.

Implications:
- 060G (Partial SpinQuant): empirically null/negative on this chain.
- 060H (EMBED_BITS=6 alone or with LQER reinvest): even riskier without
  SpinQuant's rotation protection.

Both specs marked as DEPRECATED at the top. Not deleted (kept as
documentation for if conditions change later, e.g., deploy-time repair
specifically targeting tok_emb precision).
lijuncheng16 added a commit to lijuncheng16/parameter-golf that referenced this pull request Apr 28, 2026
S9 stack extended with caseops dataset support and LQER (Low-rank
Quantization Error Rescue). 4487 lines vs train_gpt_s9.py's 4363.
This is the script used in PR openai#1851 stage 1/2 ablations (cells A0–F4
in stage 1, Z0/P*/Q*/R* in stage 2).
lijuncheng16 added a commit to lijuncheng16/parameter-golf that referenced this pull request Apr 28, 2026
5252-line training script reproducing PR openai#1851's stack with extensive
inline annotations (CN comments). Mandatory FA3 import (no SDPA fallback)
and direct Triton kernel use. Sibling to train_gpt_s9*.py variants.
aquariouseworkman and others added 2 commits April 29, 2026 02:43
…d mean)

Applies activation-aware mixed-precision GPTQ (from PR openai#1908 / romeerp) on top of codemath3000 PR openai#1855 stack.

## Results

| Seed | val_bpb (post-TTT) | artifact bytes | steps | eval time |
|------|--------------------|----------------|-------|-----------|
| 42   | 1.06118            | 15,978,503     | 4989  | 392.8s    |
| 314  | 1.06005            | 15,976,469     | 4986  | 395.8s    |
| 1234 | 1.06135            | 15,976,673     | 4977  | 395.5s    |
| **mean** | **1.06086**    | —              | —     | —         |

3-seed std: 0.00069. Beats codemath3000 PR openai#1855 (1.06108) by 0.00022 BPB.

## Technique

Training is identical to PR openai#1855. The only change is post-training quantization:

**AWQ-lite (activation-aware GPTQ):**
1. Collect per-input-channel activation RMS during GPTQ calibration
2. Score column groups: `saliency = act_rms * mean(abs(weight))`
3. Select top-1 most salient 64-column group per matrix
4. Quantize that group at int8 inside the same full-tensor GPTQ solve (rest stays int6)

Env vars: `AWQ_LITE_ENABLED=1 AWQ_LITE_BITS=8 AWQ_LITE_GROUP_TOP_K=1 AWQ_LITE_GROUP_SIZE=64`
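The group-selection step (items 1-3 above) can be sketched as follows; function and variable names are hypothetical, not the PR's actual code, and toy per-column statistics stand in for real calibration data:

```python
def pick_salient_group(act_rms, col_mean_abs_w, group_size=64):
    """AWQ-lite scoring: saliency = act_rms * mean|weight|, averaged over
    each 64-column group; return the index of the single most salient
    group (AWQ_LITE_GROUP_TOP_K=1). That group is quantized at int8
    inside the same GPTQ solve while the rest stays int6."""
    n_groups = len(act_rms) // group_size

    def score(g):
        cols = range(g * group_size, (g + 1) * group_size)
        return sum(act_rms[c] * col_mean_abs_w[c] for c in cols) / group_size

    return max(range(n_groups), key=score)

# 128 columns = 2 groups; group 1 sees larger activation RMS, so it wins.
act_rms = [1.0] * 64 + [4.0] * 64
col_mean_abs_w = [0.5] * 128
assert pick_salient_group(act_rms, col_mean_abs_w) == 1
```

The activation-aware part is that a group of modest weights feeding high-RMS activations can outrank a group of large weights feeding quiet channels.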

## Setup
1. `pip install -r requirements.txt`
2. `apt-get install -y lrzip`
3. Install FA3: `pip install --no-deps flash_attn_3 --find-links https://windreamer.github.io/flash-attention3-wheels/cu128_torch291/`
4. Run `prepare_caseops_data.py` to build the dataset
5. `AWQ_LITE_ENABLED=1 AWQ_LITE_BITS=8 AWQ_LITE_GROUP_TOP_K=1 AWQ_LITE_GROUP_SIZE=64 torchrun --standalone --nproc_per_node=8 train_gpt.py`

## Environment
- 8xH100 80GB SXM (RunPod)
- PyTorch 2.9.1+cu128
- FlashAttention 3.0.0
- Triton 3.5.1
lijuncheng16 added a commit to lijuncheng16/parameter-golf that referenced this pull request Apr 29, 2026
…t w/ GPTQ v2

3143-line condensed version of train_gpt_s0_pr1851_mod.py (no inline
annotations, GPTQ v2 path). Same mandatory FA3 + Triton dependency as
the annotated sibling.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
alertcat added a commit to alertcat/parameter-golf that referenced this pull request Apr 29, 2026
…ams)

After 4 parallel research agents reviewed 30+ open PRs and
compliance issues, two new findings:

1. PR openai#1923 (AsymLogit) flagged "empirical negative" by
   sunnypatneedi 4-29 frontier-scan, BUT only on PR openai#1855 base
   with default WD=1.0. Never tested on PR openai#1908 + WD=2.0 combo.
   V19's specific stack is NOT directly invalidated.

2. PR openai#1925 simon-marcus 1.06049 (3-seed verified, vs PR openai#1855
   base 1.06108 = -0.00059 BPB). Just 2 hparam env vars:
     MATRIX_LR 0.026 -> 0.028
     PHASED_TTT_PREFIX_DOCS 2500 -> 3500
   Orthogonal axis to AsymLogit (LR/TTT prefix vs logit head).

Adds two new scout scripts:
- run_v19c_stacked_scout.sh: PR openai#1908 + AsymLogit + simon-marcus
  + WD=2.0 (full stack, recommended first scout)
- run_v19b_simonmarcus_scout.sh: PR openai#1908 + simon-marcus + WD=2.0
  (ablation if V19c wins partially)

Decision rule (CaseOps val baseline 0.97651, community floor 0.0006):
  V19c < 0.97591 -> CLEAR WIN, run 3-seed
  V19c 0.97591-0.9755 -> borderline, ablate via V19a/V19b
  V19c > 0.9755 -> abandon stack, try Lead B (PR openai#1884)

Other research findings:
- PR openai#1898 SpinQuant flagged regression vs parent openai#1851 (skip)
- PR openai#1929 SLOT banned per openai#1722 precedent
- PR openai#1911 pre-quant TTT chain banned per openai#1735 precedent
- cocohearts 4-28 PR openai#1902 confirmed PR openai#1855 as official openai#1
- regina-openai + Alex Zhao 48h zero activity
- CaseOps de-facto legal (PR openai#1855 merged into chain)
Christopher-Lee-McClendon added a commit to Christopher-Lee-McClendon/parameter-golf that referenced this pull request Apr 29, 2026
… 1.5221

- Add full-val Path B result (151,078,222 bytes, claim_ready=true)
- Add formal mathematical description of byte-level vs token-level BPB
- Add comparison with PR openai#1905 (independent normalization invalidity discovery)
- Add Score-First Legal TTT evidence section (PRs openai#461, openai#549, openai#1735, openai#1851)
- Archive Path A as computationally intractable
- Bundle fast_score.py and full-val legality proof
- Fix trie marginalization formula to reflect continuable mass implementation
- Update submission.json with full-val fields

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
lijuncheng16 added a commit to lijuncheng16/parameter-golf that referenced this pull request Apr 29, 2026
Same 3143-line code as v2; only Hyperparameters defaults changed to
match the PR openai#1851 stack tuning observed in stage-1/2 ablations:
  SEED=42  MIN_LR=0.1  TTT_BATCH_SIZE=16  PHASED_TTT_NUM_PHASES=3
  GPTQ_RESERVE_SECONDS=16  EMBED_BITS=7  EMBED_CLIP_SIGMAS=15
  MLP_CLIP_SIGMAS=12  SMEAR_GATE_ENABLED=1  GATED_ATTN_QUANT_GATE=1
  SPARSE_ATTN_GATE_ENABLED=1

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

cocohearts (Collaborator) left a comment


Accepted on substance for the original SmearGate BOS-fix submission, but this PR is not merge-ready in its current scope. It now contains a second, later AWQ-lite submission directory, records/track_10min_16mb/2020-04-29_AWQ_lite_mixedprecision_GPTQ, with an invalid 2020 date and a separate ML change that was not part of the accepted #1851 leaderboard row. Please split that newer AWQ-lite work into a separate PR or remove it from this one. This PR should merge only the accepted 2026-04-27 BOS-fix record package, with its 3-seed support either included directly or clearly tied to #1868.


@cocohearts cocohearts merged commit afc90a1 into openai:main Apr 30, 2026
cocohearts pushed a commit that referenced this pull request Apr 30, 2026
3-seed reproduction of PR #1851 (SmearGate BOS document boundary fix).
Code is byte-identical to #1851 by @aquariouseworkman.

Results (post-TTT BPB):
  Seed 42:   1.06128  (original #1851 author)
  Seed 314:  1.06087  (this submission)
  Seed 1234: 1.06220  (this submission)
  Mean:      1.06145 ± 0.00068

All artifacts < 16,000,000 bytes. All runs < 600s.
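The reported mean and spread can be checked directly (sample standard deviation, n-1 denominator):

```python
import statistics

# Post-TTT BPB per seed, as listed above.
bpb = {42: 1.06128, 314: 1.06087, 1234: 1.06220}

mean = statistics.mean(bpb.values())
std = statistics.stdev(bpb.values())   # sample std (n-1)
assert round(mean, 5) == 1.06145
assert round(std, 5) == 0.00068
```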

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Itssshikhar added a commit to Itssshikhar/parameter-golf that referenced this pull request Apr 30, 2026
…lone openai#1851

Part 1: BOS-fixed SmearGate + per-head attn output gate ported onto PR1493
wd_strong_paired baseline (15+/-6 lines in train_pr1493.py). 5 new env vars:
SMEARGATE_{ENABLED,BOS_ID,INIT}, ATTN_GATE_{ENABLED,INIT}.

SmearGate is causal previous-token mixing with the BOS document-boundary mask
from PR openai#1851: at positions where input_ids == bos_id, the smear contribution
is forced to zero so the final token of doc N cannot leak into BOS of doc N+1.
Verified by a focused unit test. Per-head attn_gate added inside
CausalSelfAttention applied to flash_attn output before XSA.
smeargate.smear_gate is a top-level GPT parameter so it gets explicitly
appended to Optimizers.scalar_params (not picked up by the blocks-only loop).
CONTROL_TENSOR_NAME_PATTERNS extended; 100% optimizer coverage verified.

Real-run results (single seed s42, 8xH100):

  variant                       pre        q          q_sw      q_ttt     d_qttt
  baseline (wd_strong_paired)   1.08573    1.09874    1.08194   1.07971   --
  smear+attn_gate1d (sigmoid)   1.08663    1.09887    1.08220   1.08052   +0.00081
  smearonly (gate off)          1.08601    1.09834    1.08170   1.07998   +0.00027
  smear_gate2d (additive)       killed mid-train (~step 4000, val 1.1051)

The 1D per-head sigmoid gate (8 params/layer) is undercapacity vs upstream
PR openai#1667's 96 params/layer, and is +0.00090 worse pre-quant -- a real
regression in the trained model. SmearGate alone improves q (-0.00040) and
q_sw (-0.00024) but disrupts our SGD TTT lift (0.0017 vs 0.0022 baseline);
net q_ttt within seed noise. The artifact stays >16 MB (added code costs
~7 KB; still bust like baseline).

Conclusion: port is mechanically correct, just doesn't help on PR1493 base
without the rest of the top stack (LQER, phased TTT, CaseOps).

Part 2: Critical leaderboard analysis. PR openai#1855 and PR openai#1851 are both
verified-merged by maintainer cocohearts and listed on README. PR openai#1855
has an OPEN val_docs=10_000 vs canonical 50_000 dispute (jfc43, 2026-04-30,
unresolved) that affects the entire CaseOps chain (PRs 1736/1769/1787/
1851/1855/1868). If ruling lands against, all six fall and PR1493 family
returns to the top -- so building on PR1493 is a hedged investment.

Real pre/q/q_ttt comparison vs openai#1855 seed 42 log: their pre=1.06396 vs
ours 1.08573 (+0.022 BPB gap at the trained-model level), bigger than the
total 0.020 gap. The leaderboard wedge is dominated by training-level
wins (CaseOps + SparseAttnGate + 9-knob hparam stack), not LQER/phased-TTT.

Part 3: Pivot decision. Clone openai#1851's train_gpt.py (152 KB, 3,574 lines)
as the new base rather than porting their 2,500+ lines into our 553-line
file. openai#1851 picked over openai#1855 because: same q_ttt within noise (1.06128
vs 1.06108), no lrzip system dep, fewer disputes. Layer only our small
PR1493 differentiators (paired-head Muon NS, wd_schedule, gptq_all_reduce).

CaseOps shards already published at romeerp/parameter-golf-caseops-v1
(80 train + val + val_bytes sidecar + tokenizer); saves 1-2 hr CPU
retokenization. Background download in progress at session-end.

Plan for next session: reproduce openai#1851 unmodified at s42 (target q_ttt
1.06128 +/- 0.0005); if reproduced, layer paired-head Muon then wd_schedule
one-at-a-time; if not reproduced, stop and debug.

Files added:
  pr1493_smeargate_to_top_stack_session.md   full session writeup
  _top_ref/                                  cached openai#1851 reference files
                                             (train_gpt.py, lossless_caps.py,
                                              prepare_caseops_data.py, README.md)
  run_smear_*.sh                             smear experiment runners
  run_chain_smear_experiments.sh             chain runner
  run_mom97.sh                               drafted but superseded
  logs/smear_*.txt + .stdout                 full run logs

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Itssshikhar added a commit to Itssshikhar/parameter-golf that referenced this pull request Apr 30, 2026
Layers WD_SCHEDULE_ENABLED + low/high factors onto _top_ref/train_gpt.py
(PR openai#1851 SmearGate BOS Fix base). Off by default; strict no-op when
WD_SCHEDULE_ENABLED=0.

Skips paired-head Muon NS port: PR openai#1851 uses parameter banks
(qo_bank/kv_bank/mlp_*_bank stacked along dim 0) instead of per-layer
c_q/c_k weights, so the _head_pair_ns tagging approach from
train_pr1493.py does not apply without redesigning the per-bank NS path.

Surgical diff (5 hunks):
- 5 env-driven hyperparameters (WD_SCHEDULE_ENABLED, hold/ramp fracs,
  low/high factors)
- snapshot base_wd per group in Optimizers.__init__ after self.optimizers
- wd_mul(frac) helper next to lr_mul(frac), same hold/ramp shape as
  train_pr1493
- step_fn signature gains wd_scale=1.0; applies
  group["weight_decay"] = base_wd * wd_scale
- caller passes wd_mul(frac)
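One plausible shape for the `wd_mul` helper described above, assuming the same hold-then-ramp form as `lr_mul` with the low factor held first (the exact hold/ramp fracs and direction live in the commit's env vars; this is a sketch, not the actual code):

```python
def wd_mul(frac, hold_frac=0.5, low=0.5, high=1.75):
    """Hold-then-ramp weight-decay multiplier: stay at `low` for the
    first hold_frac of training, then ramp linearly up to `high`.
    Defaults mirror WD_SCHED_LOW_FACTOR=0.5 / WD_SCHED_HIGH_FACTOR=1.75."""
    if frac < hold_frac:
        return low
    t = (frac - hold_frac) / (1.0 - hold_frac)
    return low + t * (high - low)

# Applied per optimizer group, as in the diff's step_fn hunk:
#   group["weight_decay"] = base_wd * wd_mul(frac)
assert wd_mul(0.0) == 0.5
assert wd_mul(1.0) == 1.75
assert wd_mul(0.75) == 1.125  # halfway up the ramp
```

The key property is that `WD_SCHEDULE_ENABLED=0` (or `low == high == 1.0`) makes the whole path a strict no-op, matching the commit's claim.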

Run with WD_SCHEDULE_ENABLED=1 WD_SCHED_LOW_FACTOR=0.5
WD_SCHED_HIGH_FACTOR=1.75 plus the standard PR1851 env vars.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Itssshikhar added a commit to Itssshikhar/parameter-golf that referenced this pull request Apr 30, 2026
Single-seed s42 result for top_wd_strong (WD_SCHEDULE_ENABLED=1, low=0.5,
high=1.75 layered onto PR openai#1851 base): q_ttt = 1.06111. Compared to PR openai#1851's
published s42 numbers (1.06128 original / 1.06083 re-run gptq8s) the delta is
within 1/2 of the published 3-seed std (0.00068) — a no-op at single-seed
resolution. Stage decomposition shows the WD schedule slightly worsened pre
(+0.00033 vs PR1855's pre 1.06396) and widened the LQER quant gap (+0.00116
vs PR1855), with phased-LoRA TTT recovering most of the q-stage damage.
Sign-flipped from PR1493 where the same WD config gave -0.00037 pre.

Includes a critical inventory of every PR1493-stack technique cross-referenced
against PR openai#1851's stack, ranking portability by pragmatic value:

1. GPTQ Hessian all-reduce: HIGH confidence, ~10-line port, expected
   -0.0005 to -0.0009 BPB. PR openai#1851's collect_hessians (line 2037-2141) does
   NOT all-reduce across ranks — same bug PR1493 had. With PR openai#1851's default
   gptq_calibration_batches=16, AR is in the regime where it helps (saturates
   at 128).
2. wd_schedule with default factors (low=0.65, high=1.5): env-var only,
   defensive test of whether WD-schedule mechanism carries at all.
3. Paired-head Muon NS port to bank architecture: ~80-120 lines of careful
   porting around qo_bank/kv_bank reshape semantics. Bank-NS already does
   per-layer NS for free, so marginal gain expected smaller than PR1493's
   -0.00055.

Honest ceiling: even with all three layered, expected q_ttt ~1.05970 — clears
PR openai#1855 by ~0.00140 BPB but does NOT clear the 0.0024-BPB acceptance bar
(0.00140 < 0.0024). Best-case submission is a non-record entry at this stack
without something architecture-level we don't have ready.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Itssshikhar added a commit to Itssshikhar/parameter-golf that referenced this pull request Apr 30, 2026
PR openai#1851's collect_hessians (line 2037-2150 of _top_ref/train_gpt.py) computes
each rank's Hessian on its own data shard subset (ShuffledSequenceLoader splits
files by rank) and divides only by n_calibration_batches — without all-reduce,
only rank 0's Hessian is effectively used since only rank 0 writes the
quantized blob. 7/8 of calibration compute is wasted.

Fix: dist.all_reduce(SUM) each Hessian (sorted iteration to avoid deadlock if
key order ever drifts), divide by n_calibration_batches * world_size. Smoking-
gun log line "gptq:all-rank Hessian averaging across N ranks (denom=...)"
when on, "gptq:per-rank Hessian (no all-reduce, denom=...)" when off.

Gated by GPTQ_ALL_REDUCE env var (default 1, the bugfix behavior). Off path
preserves the original upstream semantics for clean A/B if needed.
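The averaging semantics can be sketched in plain Python (dicts of scalars stand in for per-layer Hessian matrices; the real code uses `dist.all_reduce` with SUM, iterating keys in sorted order as the commit notes):

```python
def allreduce_hessians(per_rank_hessians, n_calibration_batches):
    """Simulate all_reduce(SUM) over per-rank Hessian accumulators, then
    divide by n_calibration_batches * world_size so every rank holds the
    mean Hessian over all data shards, not just its own."""
    world_size = len(per_rank_hessians)
    avg = {}
    for key in sorted(per_rank_hessians[0]):  # fixed order avoids deadlock
        total = sum(h[key] for h in per_rank_hessians)
        avg[key] = total / (n_calibration_batches * world_size)
    return avg

# Two ranks, 2 calibration batches each; scalars stand in for matrices.
ranks = [{"mlp.w1": 8.0}, {"mlp.w1": 4.0}]
avg = allreduce_hessians(ranks, n_calibration_batches=2)
assert avg["mlp.w1"] == 3.0  # (8 + 4) / (2 * 2)
```

Without the reduce, rank 0 would divide only its own accumulator by `n_calibration_batches`, which is the "7/8 of calibration compute wasted" failure mode described above.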

PR1493 evidence at gptq_calibration_batches=16 (PR openai#1851's default):
  16-shard no-AR: q_ttt = 1.08060
  16-shard AR  : q_ttt = 1.07977 (delta -0.00083)
At 128 calibration batches the AR delta saturates to noise.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Itssshikhar added a commit to Itssshikhar/parameter-golf that referenced this pull request Apr 30, 2026
…1855.py

Previous session's choice of PR openai#1851 over PR openai#1855 was a mistake we
inherited. PR openai#1855 is currently openai#1 on the upstream leaderboard at 1.06108
(3-seed mean), 0.00037 BPB ahead of PR openai#1851's 1.06145. PR openai#1855 also
ships the per-group lrzip+brotli compressor (COMPRESSOR=pergroup, ~280 KB
smaller artifact than brotli) that PR openai#1851 lacks. Without that compressor,
even the 9-hparam stack on PR openai#1851 base busts the 16 MB cap (Run 4
artifact = 16,140,607 B, +140 KB over).

train_top_1855.py = PR openai#1855's train_gpt.py + same surgical patches we
applied to train_top.py: wd_schedule (5 hparams + base_wd snapshot +
wd_mul + step_fn injection + caller) and GPTQ_ALL_REDUCE=1 in
collect_hessians. 41 line additions, 3 line modifications, syntax OK.

Run 4 evidence (PR openai#1851 + 9 hparams + wd_strong + AR, single seed s42):
  pre   = 1.06331 (vs Run 0's 1.06429 — best pre of session)
  q     = 1.07239 (q_gap 0.00908 — tightest gap of session)
  artifact = 16,140,607 B (busts cap with brotli; pergroup needed)

lrzip 0.651 installed via add-apt-repository universe.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Itssshikhar added a commit to Itssshikhar/parameter-golf that referenced this pull request Apr 30, 2026
Run 4 results, single seed s42:
  pre   = 1.06331  (best pre of session, beats Run 0's 1.06429 by 0.00098)
  q     = 1.07239  (q_gap 0.00908 — tightest gap of session)
  q_ttt = 1.05950  (best q_ttt of session, beats PR openai#1855 published s42
                    1.05989 by 0.00039)
  artifact = 16,140,607 B (BUSTS 16 MB cap by 140,607 B with brotli;
                          PR openai#1855's pergroup compressor saves ~280 KB,
                          which is needed for this hparam stack to fit)

Three findings:

1. The 9 hparams transfer cleanly through to final EMA model quality.
   Contrast with paired-head Muon NS (Run 3): also gave a striking
   mid-train signal (-0.0046 at step 4000) but that gain converged out
   by pre-quant time (+0.00038 vs Run 0). Run 4's mid-train gain
   (-0.0059) carried through to pre-quant (-0.00098). Mechanism: the
   9 hparams change *what's actually being trained* (tighter clipping
   preserves outliers, longer warmdown reshapes convergence, tuned
   TTT-LoRA reshapes recovery), not just the optimizer's update
   direction.

2. Tightest quant gap of the session (0.00908). Tighter MLP/EMBED
   clipping (11.5/14.0) preserves outliers that LQER asymmetric int4
   rank-4 correction can exploit, on top of AR's narrowing.

3. Artifact busts cap with brotli alone — confirms PR openai#1855's claim
   that their pergroup compressor saves ~280 KB on this stack. With
   brotli, even PR openai#1855 itself would land ~16,180,000 B. They needed
   pergroup; we need pergroup.

This run made the case to pivot to PR openai#1855 base for Run 5. Earlier
session's choice of PR openai#1851 (yesterday's "no lrzip dispute" reasoning)
overturned by Run 4's evidence: PR openai#1855 is 0.00037 BPB ahead at
3-seed mean, ships the pergroup compressor we need to fit cap, and the
9 hparams we manually applied transfer cleanly.

Run 5 (queued, auto-launch when Run 4 GPUs free) = PR openai#1855's full env
stack + our wd_strong + AR + COMPRESSOR=pergroup. Expected q_ttt
~1.0590-1.0595 single-seed; 3-seed mean ~1.0593 ± 0.001.

Honest acceptance-bar math:
  SOTA = 1.06108 (PR openai#1855 3-seed mean)
  Bar = SOTA - 0.005 nats ≈ 1.0588
  Run 4 single = 1.05950, +0.00070 short of bar
  Run 5 predicted = 1.0590-1.0595, still 0.0002-0.0007 short

Even best-case Run 5 likely just misses the record bar by ~half a sigma.
Best plausible outcome is non-record submission with documented findings.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Itssshikhar added a commit to Itssshikhar/parameter-golf that referenced this pull request Apr 30, 2026
…_ttt

Run 6 is the pergroup-recovery run from top_run4_pergroup_recovery_runbook.md:
keep Run 4's training graph (train_top.py, PR openai#1851 base) and Run 4's hparam
stack (9 PR openai#1855 overrides + wd_strong + GPTQ AR), and replace the cap-busting
brotli serialization with PR openai#1855's pergroup compressor that we ported in
commit 0209a50.

Result, single seed s42:
  pre   = 1.06335   (Run 4 was 1.06331; +0.00005)
  q     = 1.07246   (Run 4 was 1.07239; +0.00008)
  q_ttt = 1.05957   (Run 4 was 1.05950; +0.00006)
  total = 15,901,624 B (Run 4 was 16,140,607 B brotli, INVALID +140,607 B)
                       (UNDER 16,000,000 B cap by 98,376 B — VALID)

  Pergroup saves 240,863 B on the model blob and 238,983 B on total
  vs brotli on this exact stack. That matches PR openai#1855 README's published
  "~280 KB savings" claim within tolerance — different runs have different
  quantized weight distributions so brotli/pergroup deltas aren't exactly
  transportable, but the order of magnitude lines up.

Quality drift between Run 4 and Run 6 is <=0.00008 BPB across pre/q/q_ttt,
which is below typical pod-to-pod nondeterminism (Run 4 vs PR openai#1855
published s42 differed by 0.00039 even on the "same" stack). Compressor
swap is functionally a no-op on quality.

Comparison summary:
  Run 6 vs Run 4 (best, but invalid):    +0.00006 BPB worse, but VALID
  Run 6 vs Run 5 (PR openai#1855 base recovery): -0.00053 BPB BETTER and same compressor
  Run 6 vs PR openai#1855 published s42:        -0.00033 BPB better, +4365 B
  Run 6 vs PR openai#1855 3-seed mean SOTA:     -0.00152 BPB better (~1.7sigma)
  Run 6 vs acceptance bar (~1.0588):       +0.00077 BPB SHORT

So Run 6 is the strongest single-seed valid-size submission of the session.
Not yet a record (single-seed, ~half-sigma short of acceptance bar) but a
strong non-record submission with a documented win:

  - Validates the ported pergroup compressor end-to-end (synthetic 138-tensor
    roundtrip preflight + live deserialize during phased TTT eval).
  - Confirms the runbook's hypothesis that "preserve Run 4 graph + only swap
    compressor" beats "preserve compressor + retrain on PR openai#1855 base + apply
    our patches" (Run 5 path).
  - Reproduces Run 4's quality bit-equivalent within pod noise.

Pod prep this session:
  - apt-get install -y lrzip (lrzip 0.651, required by pergroup)
  - pip install brotli python-minifier
  - snapshot_download romeerp/parameter-golf-caseops-v1 (16 GB) for the
    canonical sp8192-caseops shards + canonical
    fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model SP model.
    Layout matches train_top.py's _default_caseops_data path exactly.

Files in this commit:
  top_run6_pergroup_recovery_session.md  (full Run 6 report)
  upload_run6_to_hf.py                   (pushes artifacts to HF)
  logs/top_pr1855_hparams_s42_pergroup.stdout (torchrun stdout/stderr)
  logs/top_pr1855_hparams_s42_pergroup.txt    (per-rank training log)

Artifacts pushed to HuggingFace (shikhar007/parameter-golf-gram-ns):
  models/top_pr1855_hparams_s42_pergroup.pt          (135.4 MB FP ckpt)
  models/top_pr1855_hparams_s42_pergroup.int6.ptz    (15.9 MB pergroup blob)
  logs/top_pr1855_hparams_s42_pergroup.txt
  logs/top_pr1855_hparams_s42_pergroup.stdout

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fija pushed a commit to Fija/parameter-golf that referenced this pull request May 1, 2026
…ean 1.05831 BPB

Clears record bar (1.05914) by 0.83 milli-BPB. Welch t = -6.49 vs PR openai#1855 (1.06108),
p < 0.0001. All three seeds produce ~15.99 MB artifacts under the 16 MB cap and
complete evaluation within the 600s wallclock budget.

Per-seed:
- 42:   ttt=1.05793  art=15,986,149  eval=572.6s
- 314:  ttt=1.05852  art=15,987,257  eval=553.7s
- 1234: ttt=1.05849  art=15,989,895  eval=574.1s
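The Welch t statistic quoted above can be reproduced from per-seed BPB samples.
A minimal stdlib sketch follows; the three `ours` values are the per-seed TTT
BPBs listed here, while the `baseline` values are placeholders (PR openai#1855's
per-seed numbers are not given in this message, only its mean of 1.06108), so
the resulting t will not match -6.49.

```python
import math
import statistics as st

def welch_t(a, b):
    """Welch's unequal-variance t statistic for samples a vs b."""
    va, vb = st.variance(a), st.variance(b)  # sample variances (n-1)
    return (st.mean(a) - st.mean(b)) / math.sqrt(va / len(a) + vb / len(b))

ours = [1.05793, 1.05852, 1.05849]  # per-seed TTT BPB from this run

# Placeholder baseline seeds chosen only to average to 1.06108.
baseline = [1.06090, 1.06110, 1.06124]

t = welch_t(ours, baseline)
print(round(t, 2))  # negative: our mean BPB is lower (better)
```

A negative t with small p means the improvement is unlikely to be seed noise;
the real comparison in the report uses PR openai#1855's actual per-seed values.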

Submission directory at records/track_10min_16mb/2026-04-30_PR2014_Reproduction_1.0583/
contains PR openai#2014's verbatim train_gpt.py + tokenizer + our seed_results.csv + a
detailed README documenting the lineage (openai#1797 -> openai#1851 -> openai#1855 -> openai#1908 -> openai#1923
-> openai#1953 -> openai#2014), the new levers vs each parent, and the full 4-condition
C1-C4 legality check. submission.json author/github_id are placeholders pending
the user's choice of submitting account.

Reproduction script: runpod/phase_x_pr2014.sh — runs end-to-end on a single
8xH100 SXM pod (~2.5h wall, ~$66 cost).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cloud-777-boy added a commit to cloud-777-boy/parameter-golf that referenced this pull request May 1, 2026
Added a README for the non-record submission detailing the inhibitory layers on the PR openai#1851 stack, including architecture, mechanism, results, and reproduction steps.
cloud-777-boy added a commit to cloud-777-boy/parameter-golf that referenced this pull request May 1, 2026
Added README.md for non-record submission detailing inhibitory layers on PR openai#1851 stack, including mechanism, configuration, results, and limitations.
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request May 1, 2026
Audits every CaseOps-lineage record-track PR (merged + unmerged) since
2026-04-18 for whether val docs are also in the training set.

Working set: 34 PRs (31 from chronological seed list + 3 discovered ancestors:
openai#1908, openai#1923, openai#2007). Boundary nodes openai#1493 / openai#1626 (pre-CaseOps).

Verdicts:
  - CLEAN (8): openai#1729, openai#1851, openai#1868, openai#1908, openai#2019, openai#2027, openai#2031, openai#2068
  - LEAK (25): openai#1736 (our research baseline) → openai#1769 → openai#1787 → openai#1797 → openai#1855 → V21 family (openai#1945, openai#1923, openai#1953, openai#1967) → openai#2018 → openai#2118
    (current claimed frontier 1.04350), plus siblings.
  - INHERIT (1): openai#2050 (eval-only on frozen openai#1915)

Code-level evidence (not README claims):
  - Every shipped prepare_caseops_data.py is byte-identical:
    SHARD_TOKENS=10_000_000, default=10_000 for --val-docs
  - NO PR overrides --val-docs (searched all .sh files in all 34 PRs)
  - cached_challenge_fineweb.py downloads from romeerp/parameter-golf-caseops-v1
    HF dataset whose manifest pins docs_val=50000, docs_train=8181945,
    sums match → CLEAN by construction
  - PR openai#2018's DATASET_AUDIT.md is gold-standard explicit leak description
  - PR openai#2118's submission.json admits "--val-docs=10000 train shards + 50k val eval"
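The "NO PR overrides --val-docs" check above is mechanical: scan every shipped
.sh file for prepare_caseops_data.py invocations that lack an explicit
--val-docs=50000. A minimal sketch, with a hypothetical helper name and
self-made fixtures (this is not the auditor's actual tooling):

```python
import tempfile
from pathlib import Path

def audit_prep_invocations(repo_root):
    """Flag shell lines invoking prepare_caseops_data.py without
    --val-docs=50000 (criterion for a local, defaulting data prep)."""
    hits = []
    for sh in Path(repo_root).rglob("*.sh"):
        for lineno, line in enumerate(
                sh.read_text(errors="ignore").splitlines(), 1):
            if ("prepare_caseops_data.py" in line
                    and "--val-docs=50000" not in line):
                hits.append((sh.name, lineno, line.strip()))
    return hits

# Demo fixtures: one explicit clean invocation, one defaulting invocation.
root = tempfile.mkdtemp()
Path(root, "clean_prep.sh").write_text(
    "python prepare_caseops_data.py --val-docs=50000\n")
Path(root, "leaky_prep.sh").write_text(
    "python prepare_caseops_data.py\n")

hits = audit_prep_invocations(root)
```

On the demo fixtures only leaky_prep.sh is flagged; a real audit would also
have to handle multi-line invocations and variables, which this sketch does not.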

Three signposts:
  - Leak introduced: PR openai#1736 by @dexhunter (Apr 19) — first prepare_caseops_data.py
    default invocation
  - Leak fixed: PR openai#1851 by @aquariouseworkman (Apr 27) — switched to HF dataset
  - Leak re-introduced: PR openai#1855 by @codemath3000 (same day) — rebuilt locally

The merged-leaderboard SOTA (openai#1851/openai#1868 at 1.06128/1.06141) is CLEAN.
The unmerged frontier (openai#2118 at 1.04350) is LEAK. The 0.018 bpb gap is
inflated by val memorization; spec 301 was designed to measure how much
remains under clean data.

Files:
  caseops-memory-leakage/README.md       — overview, methodology, takeaways
  caseops-memory-leakage/verdicts.md     — 34-row master table with evidence
  caseops-memory-leakage/family-tree.md  — ASCII trees with [C]/[L] annotations
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request May 1, 2026
…ence

After user feedback that LEAK calls relied too heavily on lineage-inheritance
and path heuristics, applied stricter criterion: a LEAK verdict requires at
least one of (a) explicit shell-script invocation of prepare_caseops_data.py
without --val-docs=50000, (b) README "Data setup" matching actual train log
path, (c) audit/submission.json admission text, (d) train log path with
`_caseops/datasets/datasets/<name>` triple-nesting OR single `<root>/datasets/<name>`
(which only local prep produces; HF always gives double-nesting).

Records that previously got LEAK by lineage-inheritance alone are now AMBIGUOUS
unless they meet at least one of those tests.

Changes:
  - openai#1945 LEAK → CLEAN  (finalize_v18.sh has snapshot_download from HF;
    actual run path matches HF target; README's prepare_caseops_data.py
    section is stale documentation)
  - openai#1953 LEAK → AMBIGUOUS  (PR ships only train_gpt.py + logs; no prep
    evidence; path matches HF target; parent openai#1945 confirmed CLEAN —
    leans CLEAN but no direct PR evidence)
  - openai#2041 LEAK → AMBIGUOUS  (no prep invocation; double-nested path
    consistent with EITHER HF or local prep)
  - openai#2075 LEAK → AMBIGUOUS  (ships prep file but no explicit invocation;
    path matches HF target)

Updated tally: CLEAN 9, LEAK 21, AMBIGUOUS 3, INHERIT 1 (was 8/25/0/1).

Headline impact: realistic clean SOTA is at most ~0.012 bpb below the
claimed frontier openai#2118 (1.04350). Best clean BPB candidates in order:
  openai#2019 1.05847 (HF, confirmed)
  openai#1953 1.05855 (AMBIGUOUS, leans CLEAN)
  openai#1945 1.05943 (HF, confirmed via re-audit)
  openai#2031 1.05985 (HF, confirmed)
  openai#1908 1.06081 (HF, confirmed)
  openai#1851 1.06128 (HF, MERGED SOTA)