
Experiment: SmearGate BOS Fix + train-only logit calibration#1884

Open
someone114514 wants to merge 6 commits into openai:main from
someone114514:smeargate-calibration-1868

Conversation

@someone114514

Summary

An experimental variant of the SmearGate BOS Fix (#1868 / #1851) that adds a fixed post-GPTQ logit calibration pass.

The added calibration is deliberately small and train-only:

  • global logit temperature
  • coarse token-group bias buckets: byte length, starts-with-space, newline, digit, punctuation, alpha/case
  • no validation-derived fitting state
  • no frequency buckets by default
  • frozen after fitting, then applied before softmax in quantized diagnostic eval and phased score-first TTT

This is intended as a direct test of whether the post-GPTQ calibration signal observed locally transfers to the stronger #1868 stack.
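The bullets above can be sketched as a minimal train-only fit. Everything here is illustrative: the function names (`token_group`, `fit_calibration`, `apply_calibration`), the reduced four-bucket grouping, and the plain gradient loop are assumptions, not this PR's actual code; only the affine form (global temperature plus per-group bias) and the freeze-after-fit behavior follow the description.

```python
import numpy as np

NUM_GROUPS = 4  # simplified; the PR uses richer buckets (byte length, case, ...)

def token_group(tok: str) -> int:
    """Coarse bucket for a token string (illustrative assignment)."""
    if tok.startswith(" "):
        return 0
    if tok.isdigit():
        return 1
    if not tok.isalnum():
        return 2
    return 3

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def fit_calibration(logits, targets, groups, lr=3e-3, l2=1e-2, steps=200):
    """Fit a global log-temperature and per-group biases on train tokens only.

    logits: (n, v) train logits; targets: (n,) token ids;
    groups: (v,) bucket id per vocab entry.
    """
    n, v = logits.shape
    log_t, bias = 0.0, np.zeros(NUM_GROUPS)
    onehot = np.zeros((n, v))
    onehot[np.arange(n), targets] = 1.0
    for _ in range(steps):
        z = logits * np.exp(-log_t) + bias[groups]   # affine correction
        err = softmax(z) - onehot                    # dCE/dz (pre-averaging)
        # dz/dlog_t = -logits * exp(-log_t)
        g_t = (err * (-logits * np.exp(-log_t))).sum() / n + 2 * l2 * log_t
        g_b = np.array([err[:, groups == g].sum()
                        for g in range(NUM_GROUPS)]) / n + 2 * l2 * bias
        log_t -= lr * g_t
        bias -= lr * g_b
    return log_t, bias                               # frozen after fitting

def apply_calibration(logits, log_t, bias, groups):
    """Fixed affine transform applied to logits before the usual softmax."""
    return logits * np.exp(-log_t) + bias[groups]
```

Because the fitted correction is frozen, validation-time application is a pure function of the logits and the (train-derived) parameters.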

Controls

Defaults added in this branch:

LOGIT_CALIB_ENABLED=1
LOGIT_CALIB_TOKENS=100000
LOGIT_CALIB_STRIDE=64
LOGIT_CALIB_BATCH_SEQS=8
LOGIT_CALIB_LR=0.003
LOGIT_CALIB_L2=0.01
LOGIT_CALIB_EPOCHS=1
LOGIT_CALIB_APPLY_TTT_UPDATE=1

Set LOGIT_CALIB_ENABLED=0 to recover the original #1868 behavior.
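For reference, the controls above could be read at startup roughly like this. The variable names and default values come from this branch; the parsing code itself is an assumed sketch, not the PR's implementation.

```python
import os

# Assumed startup parsing for the branch's env-var controls.
calib_enabled = os.environ.get("LOGIT_CALIB_ENABLED", "1") != "0"
calib_tokens  = int(os.environ.get("LOGIT_CALIB_TOKENS", "100000"))
calib_stride  = int(os.environ.get("LOGIT_CALIB_STRIDE", "64"))
calib_batch   = int(os.environ.get("LOGIT_CALIB_BATCH_SEQS", "8"))
calib_lr      = float(os.environ.get("LOGIT_CALIB_LR", "0.003"))
calib_l2      = float(os.environ.get("LOGIT_CALIB_L2", "0.01"))
calib_epochs  = int(os.environ.get("LOGIT_CALIB_EPOCHS", "1"))
apply_in_ttt  = os.environ.get("LOGIT_CALIB_APPLY_TTT_UPDATE", "1") != "0"
```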

Legality / causality

Calibration is fitted only from training-token shards after GPTQ. It does not read validation targets or build validation-time state. At validation time the correction is a fixed affine transformation of logits before normal softmax, so the distribution remains normalized.
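A quick numeric check of the normalization claim: softmax renormalizes whatever it receives, so a fixed affine correction of the logits still yields a proper distribution. The values below are arbitrary stand-ins, not fitted parameters.

```python
import numpy as np

logits = np.array([2.0, -1.0, 0.5, 0.0])   # arbitrary example logits
T = 1.3                                     # illustrative global temperature
b = np.array([0.1, -0.2, 0.0, 0.05])        # illustrative per-token group biases

z = logits / T + b                          # fixed affine correction
p = np.exp(z - z.max())
p = p / p.sum()                             # ordinary softmax

print(p.sum())                              # ~1.0: still a normalized distribution
```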

Status

No new 8xH100 score yet. This branch is prepared for a direct single-seed run against the #1868 reproduction command.

3-seed reproduction of PR openai#1851 (SmearGate BOS document boundary fix).
Code is byte-identical to openai#1851 by @aquariouseworkman.

Results (post-TTT BPB):
  Seed 42:   1.06128  (original openai#1851 author)
  Seed 314:  1.06087  (this submission)
  Seed 1234: 1.06220  (this submission)
  Mean:      1.06145 ± 0.00068

All artifacts < 16,000,000 bytes. All runs < 600s.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
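The reported mean and ± spread above are consistent with the three seed scores; the ± value is the sample (n−1) standard deviation:

```python
import statistics

scores = [1.06128, 1.06087, 1.06220]   # seeds 42, 314, 1234

mean = statistics.fmean(scores)
spread = statistics.stdev(scores)      # sample standard deviation

print(round(mean, 5))    # 1.06145
print(round(spread, 5))  # 0.00068
```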
alertcat added a commit to alertcat/parameter-golf that referenced this pull request Apr 29, 2026
…ams)

After 4 parallel research agents reviewed 30+ open PRs and
compliance issues, two new findings:

1. PR openai#1923 (AsymLogit) flagged "empirical negative" by
   sunnypatneedi 4-29 frontier-scan, BUT only on PR openai#1855 base
   with default WD=1.0. Never tested on PR openai#1908 + WD=2.0 combo.
   V19's specific stack is NOT directly invalidated.

2. PR openai#1925 simon-marcus 1.06049 (3-seed verified, vs PR openai#1855
   base 1.06108 = -0.00059 BPB). Just 2 hparam env vars:
     MATRIX_LR 0.026 -> 0.028
     PHASED_TTT_PREFIX_DOCS 2500 -> 3500
   Orthogonal axis to AsymLogit (LR/TTT prefix vs logit head).

Adds two new scout scripts:
- run_v19c_stacked_scout.sh: PR openai#1908 + AsymLogit + simon-marcus
  + WD=2.0 (full stack, recommended first scout)
- run_v19b_simonmarcus_scout.sh: PR openai#1908 + simon-marcus + WD=2.0
  (ablation if V19c wins partially)

Decision rule (CaseOps val baseline 0.97651, community floor 0.0006):
  V19c < 0.97591 -> CLEAR WIN, run 3-seed
  V19c 0.97591-0.97651 -> borderline, ablate via V19a/V19b
  V19c > 0.97651 -> abandon stack, try Lead B (PR openai#1884)

Other research findings:
- PR openai#1898 SpinQuant flagged regression vs parent openai#1851 (skip)
- PR openai#1929 SLOT banned per openai#1722 precedent
- PR openai#1911 pre-quant TTT chain banned per openai#1735 precedent
- cocohearts 4-28 PR openai#1902 confirmed PR openai#1855 as official openai#1
- regina-openai + Alex Zhao 48h zero activity
- CaseOps de-facto legal (PR openai#1855 merged into chain)
