# Record: SmearGate BOS Fix — 3-Seed Reproduction of PR #1851

**val_bpb = 1.06145** (3-seed mean, std 0.00068) | **~15.95 MB** | 8×H100 SXM 80GB

## Summary

This is a **pure reproduction study** of [PR #1851](https://github.com/openai/parameter-golf/pull/1851) by @aquariouseworkman. The training script is byte-identical to the code in PR #1851. No new techniques or modifications are introduced.

PR #1851 submitted a single-seed result (seed 42, val_bpb = 1.06128). We extend this to a **3-seed evaluation** (seeds 42, 314, 1234) to confirm the result is robust and reproducible.

## 3-Seed Results

| Seed | Pre-Quant BPB | Quant BPB | **Post-TTT BPB** | Artifact (bytes) | Train Time | Eval Time |
|------|---------------|-----------|-------------------|-------------------|------------|-----------|
| 42* | 1.06490240 | 1.07405660 | **1.06128183** | 15,952,086 | 599.6s | 519.5s |
| 314 | 1.06467893 | 1.07358634 | **1.06086831** | 15,952,419 | 599.6s | 525.6s |
| 1234 | 1.06593114 | 1.07503808 | **1.06220261** | 15,952,690 | 599.5s | 479.6s |
| **Mean ± Std** | | | **1.06145 ± 0.00068** | | | |

\* Seed 42 result is from the original PR #1851 author @aquariouseworkman. Seeds 314 and 1234 are independent runs by @Christopher-Lee-McClendon.
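
The aggregate row uses the sample standard deviation (ddof = 1), which the per-seed numbers reproduce exactly:

```python
import statistics

post_ttt_bpb = [1.06128183, 1.06086831, 1.06220261]  # seeds 42, 314, 1234
print(round(statistics.mean(post_ttt_bpb), 5))   # 1.06145
print(round(statistics.stdev(post_ttt_bpb), 5))  # 0.00068 (sample std, ddof=1)
```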

## Key Change: SmearGate BOS Document Boundary Fix

PR #1851 identified and fixed a bug in the SmearGate mechanism's handling of beginning-of-sequence (BOS) document boundaries. The fix ensures SmearGate correctly resets at document boundaries instead of bleeding attention across documents.

This was a targeted one-line fix on top of the PR #1787 codebase. Credit for identifying the BOS bug goes to @cocohearts; the fix implementation is by @aquariouseworkman.
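
To make the boundary behavior concrete, here is a minimal sketch of a smear-style gate with the reset. It is illustrative only: the function and tensor names are hypothetical, and whether the gate smears keys, values, or hidden states is a detail of PR #1797; the boundary logic is the point.

```python
import torch

def smeared_hidden(x: torch.Tensor, gate: torch.Tensor, bos_mask: torch.Tensor) -> torch.Tensor:
    """Blend each position with its predecessor: x_t <- x_t + g_t * x_{t-1}.

    x:        (B, T, D) hidden states
    gate:     (B, T, 1) learned smear gate in [0, 1]
    bos_mask: (B, T)    True at beginning-of-document positions
    """
    prev = torch.roll(x, shifts=1, dims=1)
    prev[:, 0] = 0.0  # the first position has no predecessor
    # The BOS fix: zero the gate at document starts so the smear never bleeds
    # the last token of one document into the first token of the next.
    gate = gate * (~bos_mask).unsqueeze(-1)
    return x + gate * prev
```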

## Technique Stack

All techniques below are inherited from PR #1851 (and its lineage). No new techniques are introduced in this reproduction.

| Technique | Source | Author |
|-----------|--------|--------|
| Base architecture (11L, MLP 4x, MuonEq-R) | PR #1787 | @nprime06 |
| SmearGate attention | PR #1797 | @dexhunter |
| SmearGate BOS fix | PR #1851 | @aquariouseworkman |
| LQER Asymmetric quantization | PR #1797 | @dexhunter |
| CaseOps SP8192 | PR #1729 | @romeerp |
| GPTQ + SP8192 | PR #1394 | @clarkkev |
| Score-first TTT (3 phases) | PR #549 | @abaybektursun |
| BOS bug identification | Issue | @cocohearts |

## Architecture

Same as PR #1851 / PR #1787:
- 11 transformer layers, MLP multiplier 4x
- SmearGate attention with BOS boundary fix
- LQER asymmetric quantization
- CaseOps with SP8192 tokenization
- GPTQ post-training quantization
- Phased test-time training (3 phases)
- Embed clipping (15.0σ), MLP clipping (12.0σ)
- Embed bits: 7
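
The clipping values are thresholds in units of the tensor's standard deviation. As a rough sketch of the assumed semantics of σ-clipping before quantization (not the repo's exact code):

```python
import torch

def sigma_clip_(w: torch.Tensor, k: float) -> torch.Tensor:
    """Clamp a tensor to mean ± k·std in place, taming outliers before low-bit quantization."""
    mu, sigma = w.mean(), w.std()
    return w.clamp_(mu - k * sigma, mu + k * sigma)

# Assumed usage matching the record's settings:
# sigma_clip_(embed.weight.data, 15.0)      # EMBED_CLIP_SIGMAS
# sigma_clip_(mlp_proj.weight.data, 12.0)   # MLP_CLIP_SIGMAS
```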

## Compliance

| Budget | Limit | Worst-Case (across seeds) | Status |
|--------|-------|--------------------------|--------|
| Artifact size | 16,000,000 bytes | 15,952,690 bytes | ✅ |
| Training time | 600s | 599.6s | ✅ |
| Eval time | 600s | 525.6s | ✅ |

## Reproduction

The training script is byte-identical to PR #1851. To reproduce:

```bash
# 1. Install dependencies
pip install brotli python-minifier

# 2. Prepare CaseOps SP8192 data
# Download the already-tokenized public CaseOps dataset from Hugging Face.
# Do not run prepare_caseops_data.py unless rebuilding from docs_selected.jsonl.
python3 records/track_10min_16mb/2026-04-27_SmearGateBOSFix_3Seed_1.06145/download_caseops_data.py

# 3. Run training (replace SEED with 42, 314, or 1234)
SEED=42 \
CASEOPS_ENABLED=1 \
EMBED_BITS=7 \
SMEAR_GATE_ENABLED=1 \
SPARSE_ATTN_GATE_ENABLED=1 \
MIN_LR=0.1 \
EMBED_CLIP_SIGMAS=15.0 \
MLP_CLIP_SIGMAS=12.0 \
GPTQ_RESERVE_SECONDS=0.5 \
PHASED_TTT_NUM_PHASES=3 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

**Environment variables (all required for exact reproduction):**

| Variable | Value | Purpose |
|----------|-------|---------|
| `CASEOPS_ENABLED` | `1` | Enable CaseOps SP8192 tokenization |
| `EMBED_BITS` | `7` | Embedding quantization bits |
| `SMEAR_GATE_ENABLED` | `1` | Enable SmearGate attention |
| `SPARSE_ATTN_GATE_ENABLED` | `1` | Enable sparse attention gating |
| `MIN_LR` | `0.1` | Minimum learning rate |
| `EMBED_CLIP_SIGMAS` | `15.0` | Embedding clipping threshold (σ) |
| `MLP_CLIP_SIGMAS` | `12.0` | MLP clipping threshold (σ) |
| `GPTQ_RESERVE_SECONDS` | `0.5` | Seconds reserved for GPTQ |
| `PHASED_TTT_NUM_PHASES` | `3` | Number of TTT phases |
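
For context, `train_gpt.py` consumes these from the process environment. A minimal sketch of the assumed parsing pattern (the defaults shown are invented for illustration, not the repo's actual defaults):

```python
import os

def env_flag(name: str, default: str = "0") -> bool:
    return os.environ.get(name, default) == "1"

CASEOPS_ENABLED = env_flag("CASEOPS_ENABLED")
SMEAR_GATE_ENABLED = env_flag("SMEAR_GATE_ENABLED")
EMBED_BITS = int(os.environ.get("EMBED_BITS", "8"))
MIN_LR = float(os.environ.get("MIN_LR", "0.0"))
EMBED_CLIP_SIGMAS = float(os.environ.get("EMBED_CLIP_SIGMAS", "15.0"))
PHASED_TTT_NUM_PHASES = int(os.environ.get("PHASED_TTT_NUM_PHASES", "1"))
```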

**Hardware:** 8×H100 SXM 80GB (RunPod)

## Credits

- **@aquariouseworkman** — PR #1851 author (SmearGate BOS fix, seed 42 result)
- **@nprime06** — PR #1787 (base architecture)
- **@romeerp** — PR #1729 (CaseOps)
- **@dexhunter** — PR #1797 (SmearGate + LQER asymmetric quantization)
- **@cocohearts** — BOS document boundary bug identification
- **@abaybektursun** — PR #549 (score-first TTT)
- **@clarkkev** — PR #1394 (GPTQ + SP8192)

### Experimental train-only logit calibration variant

This branch adds optional train-only post-training experiments on top of the reproduced #1851/#1868 stack. Logit calibration fits a fixed global temperature plus a coarse token-group bias using only training tokens, then applies the frozen affine correction before the softmax in both the quantized diagnostic eval and the phased score-first TTT loss (a sketch follows the controls below).

Default controls:

```bash
LOGIT_CALIB_ENABLED=1
LOGIT_CALIB_TOKENS=100000
LOGIT_CALIB_STRIDE=64
LOGIT_CALIB_BATCH_SEQS=8
LOGIT_CALIB_LR=0.003
LOGIT_CALIB_L2=0.01
LOGIT_CALIB_TOKEN_BIAS=0
LOGIT_CALIB_TOKEN_L2=0.05
LOGIT_CALIB_BIAS_CLAMP=0.5
LOGIT_CALIB_EPOCHS=1
LOGIT_CALIB_APPLY_TTT_UPDATE=1
```

Set `LOGIT_CALIB_ENABLED=0` to recover the byte-identical #1868 behavior. The calibration pass does not read validation targets or build validation-derived state; rank 0 fits on train shard tokens and broadcasts the frozen scale/bias to all ranks before eval.
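
For orientation, here is a minimal sketch of what such a fit could look like. The `lr`, `l2`, and `bias_clamp` defaults mirror the `LOGIT_CALIB_*` controls above; the grouping scheme, step count, and function name are invented for illustration and are not the branch's actual code.

```python
import torch
import torch.nn.functional as F

def fit_logit_calibration(logits: torch.Tensor, targets: torch.Tensor,
                          num_groups: int = 64, lr: float = 3e-3,
                          l2: float = 0.01, bias_clamp: float = 0.5,
                          steps: int = 200):
    """Fit a frozen affine correction z' = z * exp(s) + b[group] on TRAIN tokens only.

    logits:  (N, V) model outputs on training tokens
    targets: (N,)   gold next-token ids for those positions
    """
    V = logits.size(-1)
    group_of = torch.arange(V, device=logits.device) * num_groups // V  # coarse vocab buckets
    log_scale = torch.zeros((), device=logits.device, requires_grad=True)   # global temperature
    bias = torch.zeros(num_groups, device=logits.device, requires_grad=True)
    opt = torch.optim.Adam([log_scale, bias], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        z = logits * log_scale.exp() + bias.clamp(-bias_clamp, bias_clamp)[group_of]
        loss = F.cross_entropy(z, targets) + l2 * bias.pow(2).mean()
        loss.backward()
        opt.step()
    # Frozen after fitting: the same scale/bias is applied before softmax at eval/TTT time.
    return log_scale.detach(), bias.detach().clamp(-bias_clamp, bias_clamp)
```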

## download_caseops_data.py

#!/usr/bin/env python3
"""Download the public CaseOps SP8192 dataset used by this record.

This script materializes the exact directory layout expected by train_gpt.py
when CASEOPS_ENABLED=1:

    data/datasets/fineweb10B_sp8192_caseops/datasets/tokenizers/...
    data/datasets/fineweb10B_sp8192_caseops/datasets/datasets/fineweb10B_.../

It downloads pre-tokenized CaseOps shards from romeerp/parameter-golf-caseops-v1.
Do not run prepare_caseops_data.py unless rebuilding from docs_selected.jsonl.
"""

from __future__ import annotations

import argparse
import os
import sys
import types
from pathlib import Path

import torch
from huggingface_hub import snapshot_download

REPO_ID = "romeerp/parameter-golf-caseops-v1"
REMOTE_ROOT = "datasets"
DATASET_NAME = "fineweb10B_sp8192_lossless_caps_caseops_v1_reserved"
TOKENIZER_NAME = "fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model"


def build_patterns(train_shards: int) -> list[str]:
    """Return snapshot_download allow-patterns: the manifest, tokenizers,
    all validation shards, and the first `train_shards` training shards."""
    if train_shards < 0:
        raise ValueError("--train-shards must be non-negative")
    patterns = [
        f"{REMOTE_ROOT}/manifest.json",
        f"{REMOTE_ROOT}/tokenizers/*",
        f"{REMOTE_ROOT}/datasets/{DATASET_NAME}/fineweb_val_*.bin",
        f"{REMOTE_ROOT}/datasets/{DATASET_NAME}/fineweb_val_bytes_*.bin",
    ]
    patterns.extend(
        f"{REMOTE_ROOT}/datasets/{DATASET_NAME}/fineweb_train_{i:06d}.bin"
        for i in range(train_shards)
    )
    return patterns


def preflight(data_dir: Path, train_shards: int) -> None:
    """Verify the downloaded layout, then import train_gpt.py and load the
    validation data on CPU to confirm the files are readable end to end."""
    root = data_dir / "datasets" / "fineweb10B_sp8192_caseops" / "datasets"
    paths = [
        root / "tokenizers" / TOKENIZER_NAME,
        root / "datasets" / DATASET_NAME / "fineweb_val_000000.bin",
        root / "datasets" / DATASET_NAME / "fineweb_val_bytes_000000.bin",
    ]
    if train_shards > 0:
        paths.append(root / "datasets" / DATASET_NAME / "fineweb_train_000000.bin")
    missing = [p for p in paths if not p.is_file()]
    for p in paths:
        print(("OK   " if p.is_file() else "MISS ") + str(p))
    if missing:
        raise FileNotFoundError("missing required CaseOps files")

    os.environ["DATA_DIR"] = str(data_dir)
    os.environ["CASEOPS_ENABLED"] = "1"
    os.environ["VOCAB_SIZE"] = "8192"

    # train_gpt.py imports flash_attn_interface at module scope; stub it out so
    # the preflight can run on machines without flash attention installed.
    if "flash_attn_interface" not in sys.modules:
        mod = types.ModuleType("flash_attn_interface")

        def _unused_flash_attn(*args, **kwargs):
            raise RuntimeError("flash attention is not used during data preflight")

        mod.flash_attn_func = _unused_flash_attn
        mod.flash_attn_varlen_func = _unused_flash_attn
        sys.modules["flash_attn_interface"] = mod

    import importlib.util

    train_gpt = Path(__file__).with_name("train_gpt.py")
    spec = importlib.util.spec_from_file_location("caseops_train_gpt", train_gpt)
    module = importlib.util.module_from_spec(spec)
    assert spec.loader is not None
    spec.loader.exec_module(module)
    h = module.Hyperparameters()
    val_data = module.ValidationData(h, torch.device("cpu"))
    print(f"datasets_dir={h.datasets_dir}")
    print(f"tokenizer_path={h.tokenizer_path}")
    print(f"train_files={h.train_files}")
    print(f"val_files={h.val_files}")
    print(f"val_bytes_files={h.val_bytes_files}")
    print(f"val_tokens={val_data.val_tokens.numel()}")
    print(f"val_bytes={val_data.val_bytes.numel() if val_data.val_bytes is not None else None}")
    print(f"sp_vocab={val_data.sp.vocab_size()}")


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--data-dir", default="data", type=Path)
    parser.add_argument("--train-shards", default=80, type=int)
    parser.add_argument("--check-only", action="store_true")
    args = parser.parse_args()

    if not args.check_only:
        local_dir = args.data_dir / "datasets" / "fineweb10B_sp8192_caseops"
        snapshot_download(
            repo_id=REPO_ID,
            repo_type="dataset",
            local_dir=str(local_dir),
            allow_patterns=build_patterns(args.train_shards),
        )
    preflight(args.data_dir, args.train_shards)


if __name__ == "__main__":
    main()