# Record: SmearGate BOS Fix (3-Seed Reproduction of PR #1851)

**val_bpb = 1.06145** (3-seed mean, std 0.00068) | **~15.95 MB** | 8×H100 SXM 80GB

## Summary

This is a **pure reproduction study** of [PR #1851](https://github.com/openai/parameter-golf/pull/1851) by @aquariouseworkman. The training script is byte-identical to the code in PR #1851. No new techniques or modifications are introduced.

PR #1851 reported a single-seed result (seed 42, val_bpb = 1.06128). We add two independent runs (seeds 314 and 1234) and report a **3-seed evaluation** (seeds 42, 314, 1234) to check that the result is robust and reproducible. The seed 42 figures are taken from the original PR rather than re-run.

## 3-Seed Results

| Seed | Pre-Quant BPB | Quant BPB | **Post-TTT BPB** | Artifact (bytes) | Train Time | Eval Time |
|------|---------------|-----------|------------------|------------------|------------|-----------|
| 42* | 1.06490240 | 1.07405660 | **1.06128183** | 15,952,086 | 599.6s | 519.5s |
| 314 | 1.06467893 | 1.07358634 | **1.06086831** | 15,952,419 | 599.6s | 525.6s |
| 1234 | 1.06593114 | 1.07503808 | **1.06220261** | 15,952,690 | 599.5s | 479.6s |
| **Mean ± Std** | | | **1.06145 ± 0.00068** | | | |

\* Seed 42 result is from the original PR #1851 author @aquariouseworkman. Seeds 314 and 1234 are independent runs by @Christopher-Lee-McClendon.

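
The summary row can be recomputed from the three Post-TTT values in the table. The snippet below is a small check using only the Python standard library; the variable names are illustrative and not part of the training script.

```python
import statistics

# Post-TTT BPB per seed, copied from the results table.
POST_TTT_BPB = {42: 1.06128183, 314: 1.06086831, 1234: 1.06220261}

values = list(POST_TTT_BPB.values())
mean_bpb = statistics.mean(values)
sample_std = statistics.stdev(values)       # n - 1 denominator
population_std = statistics.pstdev(values)  # n denominator

print(f"mean           = {mean_bpb:.5f}")
print(f"sample std     = {sample_std:.5f}")
print(f"population std = {population_std:.5f}")
```

The reported 1.06145 ± 0.00068 corresponds to the sample standard deviation (n − 1 denominator); the population form gives a smaller value.
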
## Key Change: SmearGate BOS Document Boundary Fix

PR #1851 identified and fixed a bug in the SmearGate mechanism's handling of beginning-of-sequence (BOS) document boundaries. The fix ensures SmearGate correctly resets at document boundaries instead of bleeding attention across documents.

This was a targeted one-line fix on top of the PR #1787 codebase. Credit for identifying the BOS bug goes to @cocohearts; the fix implementation is by @aquariouseworkman.

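
To make the boundary reset concrete, here is a minimal sketch. It is not the code from PR #1851. It assumes that SmearGate mixes a gated copy of the previous position's features into the current position; the function name `smear`, the scalar features, the `reset_at_bos` flag, and the BOS id of `1` are all hypothetical and chosen only for illustration.

```python
from typing import List


def smear(
    x: List[float],
    gate: List[float],
    tokens: List[int],
    bos_id: int,
    reset_at_bos: bool = True,
) -> List[float]:
    """Toy smear: out[t] = x[t] + gate[t] * x[t - 1].

    Assumption: with reset_at_bos=True the gate is forced to zero wherever
    the current token is BOS, so nothing from the previous document leaks
    into the first position of the next one.
    """
    out = []
    for t, value in enumerate(x):
        if t == 0:
            out.append(value)  # no predecessor to blend with
            continue
        g = gate[t]
        if reset_at_bos and tokens[t] == bos_id:
            g = 0.0  # document boundary: drop the previous-token contribution
        out.append(value + g * x[t - 1])
    return out


BOS = 1  # hypothetical BOS token id
tokens = [BOS, 5, 6, BOS, 7]  # two packed documents
x = [1.0, 2.0, 3.0, 4.0, 5.0]
gate = [0.5] * len(x)

leaky = smear(x, gate, tokens, BOS, reset_at_bos=False)
fixed = smear(x, gate, tokens, BOS, reset_at_bos=True)
print("without reset:", leaky)
print("with reset:   ", fixed)
```

In this toy example the two outputs differ only at index 3, the BOS token that starts the second document: without the reset it picks up half of the preceding document's last value, with the reset it is left unchanged.
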
## Technique Stack

All techniques below are inherited from PR #1851 (and its lineage). No new techniques are introduced in this reproduction.

| Technique | Source | Author |
|-----------|--------|--------|
| Base architecture (11L, MLP 4x, MuonEq-R) | PR #1787 | @nprime06 |
| SmearGate attention | PR #1797 | @dexhunter |
| SmearGate BOS fix | PR #1851 | @aquariouseworkman |
| LQER Asymmetric quantization | PR #1797 | @dexhunter |
| CaseOps SP8192 | PR #1729 | @romeerp |
| GPTQ + SP8192 | PR #1394 | @clarkkev |
| Score-first TTT (3 phases) | PR #549 | @abaybektursun |
| BOS bug identification | Issue | @cocohearts |

## Architecture

Same as PR #1851 / PR #1787:
- 11 transformer layers, MLP multiplier 4x
- SmearGate attention with BOS boundary fix
- LQER asymmetric quantization
- CaseOps with SP8192 tokenization
- GPTQ post-training quantization
- Phased test-time training (3 phases)
- Embed clipping (15.0σ), MLP clipping (12.0σ)
- Embed bits: 7

## Compliance

| Budget | Limit | Worst-Case (across seeds) | Status |
|--------|-------|---------------------------|--------|
| Artifact size | 16,000,000 bytes | 15,952,690 bytes | ✅ |
| Training time | 600s | 599.6s | ✅ |
| Eval time | 600s | 525.6s | ✅ |

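
The worst-case column can be derived mechanically from the per-seed results. The sketch below copies the numbers from the two tables; the dictionary layout and function names are illustrative, and treating each limit as inclusive (`<=`) is an assumption rather than something the challenge rules state here.

```python
# Budgets from the compliance table.
LIMITS = {"artifact_bytes": 16_000_000, "train_seconds": 600.0, "eval_seconds": 600.0}

# Per-seed measurements from the 3-seed results table.
RUNS = {
    42: {"artifact_bytes": 15_952_086, "train_seconds": 599.6, "eval_seconds": 519.5},
    314: {"artifact_bytes": 15_952_419, "train_seconds": 599.6, "eval_seconds": 525.6},
    1234: {"artifact_bytes": 15_952_690, "train_seconds": 599.5, "eval_seconds": 479.6},
}


def worst_case(runs):
    """Largest observed value of each budgeted quantity across seeds."""
    return {key: max(run[key] for run in runs.values()) for key in LIMITS}


def within_budget(runs, limits):
    """Map each budget to True if the worst case stays within the limit.

    Assumption: limits are inclusive (<=).
    """
    worst = worst_case(runs)
    return {key: worst[key] <= limits[key] for key in limits}


worst = worst_case(RUNS)
status = within_budget(RUNS, LIMITS)
for key in LIMITS:
    print(f"{key}: worst={worst[key]} limit={LIMITS[key]} ok={status[key]}")
```

The tightest margins are on artifact size (47,310 bytes of headroom) and training time (about 0.4 s).
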
## Reproduction

The training script is byte-identical to PR #1851. To reproduce:

```bash
# 1. Install dependencies
pip install brotli python-minifier

# 2. Prepare CaseOps SP8192 data
# Option A: Download pre-tokenized CaseOps data from HuggingFace
python3 prepare_caseops_data.py  # downloads from romeerp/parameter-golf-caseops-v1
# Option B: Or use the standard data script
MATCHED_FINEWEB_REPO_ID=kevclark/parameter-golf python3 data/cached_challenge_fineweb.py --variant sp8192 --skip-manifest
# Then apply CaseOps transform:
python3 lossless_caps.py  # transforms shards with CaseOps encoding

# 3. Run training (replace SEED with 42, 314, or 1234)
SEED=42 \
CASEOPS_ENABLED=1 \
EMBED_BITS=7 \
SMEAR_GATE_ENABLED=1 \
SPARSE_ATTN_GATE_ENABLED=1 \
MIN_LR=0.1 \
EMBED_CLIP_SIGMAS=15.0 \
MLP_CLIP_SIGMAS=12.0 \
GPTQ_RESERVE_SECONDS=0.5 \
PHASED_TTT_NUM_PHASES=3 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

**Environment variables (all required for exact reproduction):**

| Variable | Value | Purpose |
|----------|-------|---------|
| `CASEOPS_ENABLED` | `1` | Enable CaseOps SP8192 tokenization |
| `EMBED_BITS` | `7` | Embedding quantization bits |
| `SMEAR_GATE_ENABLED` | `1` | Enable SmearGate attention |
| `SPARSE_ATTN_GATE_ENABLED` | `1` | Enable sparse attention gating |
| `MIN_LR` | `0.1` | Minimum learning rate |
| `EMBED_CLIP_SIGMAS` | `15.0` | Embedding clipping threshold (σ) |
| `MLP_CLIP_SIGMAS` | `12.0` | MLP clipping threshold (σ) |
| `GPTQ_RESERVE_SECONDS` | `0.5` | Seconds reserved for GPTQ |
| `PHASED_TTT_NUM_PHASES` | `3` | Number of TTT phases |

**Hardware:** 8×H100 SXM 80GB (RunPod)

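
For running all three seeds with the same settings, the environment table can be kept in one place and expanded per seed. The following is a sketch of one way to do that; the helper names (`build_env`, `shell_line`) are hypothetical and not part of the repository. It only prints the command lines, leaving the actual launch to the user.

```python
import os
import shlex

# Values copied from the environment-variable table.
REPRO_ENV = {
    "CASEOPS_ENABLED": "1",
    "EMBED_BITS": "7",
    "SMEAR_GATE_ENABLED": "1",
    "SPARSE_ATTN_GATE_ENABLED": "1",
    "MIN_LR": "0.1",
    "EMBED_CLIP_SIGMAS": "15.0",
    "MLP_CLIP_SIGMAS": "12.0",
    "GPTQ_RESERVE_SECONDS": "0.5",
    "PHASED_TTT_NUM_PHASES": "3",
}
SEEDS = (42, 314, 1234)
TORCHRUN_ARGV = ["torchrun", "--standalone", "--nproc_per_node=8", "train_gpt.py"]


def build_env(seed, base=None):
    """Return a process environment with the reproduction settings applied."""
    env = dict(os.environ if base is None else base)
    env.update(REPRO_ENV)
    env["SEED"] = str(seed)
    return env


def shell_line(seed):
    """Render the equivalent one-line shell command for a given seed."""
    pairs = {"SEED": str(seed), **REPRO_ENV}
    assignments = " ".join(f"{k}={shlex.quote(v)}" for k, v in pairs.items())
    return f"{assignments} {shlex.join(TORCHRUN_ARGV)}"


for seed in SEEDS:
    print(shell_line(seed))
    # To launch instead of printing:
    #   subprocess.run(TORCHRUN_ARGV, env=build_env(seed), check=True)
```

Keeping the settings in a single dictionary reduces the chance that one seed is run with a different configuration from the others.
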
## Credits

- **@aquariouseworkman**: PR #1851 author (SmearGate BOS fix, seed 42 result)
- **@nprime06**: PR #1787 (base architecture)
- **@romeerp**: PR #1729 (CaseOps)
- **@dexhunter**: PR #1797 (SmearGate + LQER asymmetric quantization)
- **@cocohearts**: BOS document boundary bug identification
- **@abaybektursun**: PR #549 (score-first TTT)
- **@clarkkev**: PR #1394 (GPTQ + SP8192)