# Record: PR #1854 neural stack, budget-compliant 3-seed reproduction — val_bpb 1.06777 (3-seed mean)

**val_bpb: 1.06777** (3-seed mean, std 0.00106) | **15,951,074 bytes** (mean) | 8×H100 SXM, ≤600s train / ≤600s eval

This submission is a **3-seed validated, eval-budget-compliant reproduction** of the neural-stack portion of [PR #1854](https://github.com/openai/parameter-golf/pull/1854) (@ndokutovich), with `PHASED_TTT_PREFIX_DOCS` reduced from 2000 to 1500 to fit the 600s evaluation budget cleanly. **Beats merged-leaderboard SOTA PR #1493 (@bigbag, 1.0810) by 0.01323 BPB** at ~12.5σ statistical significance (relative to the 3-seed std).

The reported `val_bpb` is the **standard token-level NLL → byte conversion** (`val_loss / log(2) · tokens/bytes`) for the post-quantization, post-Phased-TTT model. **No byte-level mixture is claimed** — see "Note on byte-PPM mixture" below.
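
A minimal sketch of that conversion, for clarity (variable names and counts below are illustrative, not the actual quantities logged by `train_gpt.py`):

```python
import math

def nll_to_bpb(mean_token_nll_nats: float, n_tokens: int, n_bytes: int) -> float:
    """Convert a mean per-token NLL in nats into bits per byte of the val stream."""
    total_bits = mean_token_nll_nats * n_tokens / math.log(2)  # nats -> bits
    return total_bits / n_bytes

# Illustrative magnitudes only (not this run's actual counts):
# ~3.05 nats/token at ~4.12 bytes/token comes out near 1.068 bpb.
print(nll_to_bpb(3.05, n_tokens=1_000_000, n_bytes=4_120_000))
```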

## Result (3 seeds, 8×H100 80GB SXM)

| Seed | val_bpb | Total bytes | Eval time |
|------|------:|------:|------:|
| 42 | 1.06686 | 15,952,086 | 374.6s |
| 1337 | 1.06893 | 15,949,941 | 371.0s |
| 314 | 1.06752 | 15,951,195 | 327.7s |
| **Mean** | **1.06777** | **15,951,074** | **357.8s** |
| **Std** | **0.00106** | | |

- **vs merged leaderboard PR #1493 @bigbag (1.0810)**: **−0.01323 BPB** (12.5× the 3-seed std, comfortably clearing the 0.005-nat significance threshold; p ≪ 0.0001)
- All 3 artifacts under 16,000,000 bytes (max 15,952,086, margin 47,914)
- All 3 eval times under the 600s wallclock cap (max 374.6s, margin 225.4s)
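
The significance figures follow directly from the table above; a quick check of the arithmetic:

```python
import statistics

per_seed = {42: 1.06686, 1337: 1.06893, 314: 1.06752}
baseline = 1.0810  # merged leaderboard, PR #1493

mean = statistics.fmean(per_seed.values())
std = statistics.stdev(per_seed.values())   # sample std, n=3
delta = baseline - mean
print(f"mean={mean:.5f} std={std:.5f} delta={delta:.5f} sigma={delta / std:.1f}")
# -> mean=1.06777 std=0.00106 delta=0.01323 sigma=12.5
```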

## What's new vs PR #1854

PR #1854's reported eval wallclock is **~700s** (per its own log breakdown: ttt_phased 516s + ppm_mix 116s + diagnostics 67s), which is over the 600s evaluation budget. This submission demonstrates that the same neural stack reaches the **same post-TTT val_bpb (~1.067)** while fitting cleanly under the 600s eval budget, by reducing `PHASED_TTT_PREFIX_DOCS` from 2000 to 1500. The 3-seed std of 0.00106 indicates that val_bpb is stable at this trimmed setting.

This decoupling matters because the 600s eval budget is an explicit contest constraint (closed PRs in Issue #677 cite eval over-budget as grounds for rejection — see PR #503's closure). A budget-compliant 1.067 is a more defensible record candidate than a slightly lower over-budget run.

## Note on byte-PPM mixture (not claimed)

`train_gpt.py` includes our exploratory multibin-λ refinement of the byte-level PPM-D mixer (a 4-tier graduated gate: `[(0.95, 0.02), (0.85, 0.10), (0.75, 0.40), (0.0, 1.0)]`). When run with `PPM_ENABLED=1`, it produces `mix_bpb ≈ 0.861` on the val byte stream, roughly 0.21 BPB below the `quantized_ttt_phased val_bpb`.
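
For concreteness, here is one plausible reading of the gate (an illustrative sketch, not the code path in `train_gpt.py`; we assume the first element of each tier is a threshold on the NN's top-token probability and the second is the PPM weight λ, with the first matching tier winning):

```python
import math

# Assumed tier semantics: (min_nn_top_prob, lambda_ppm); the last tier is the catch-all.
MULTIBIN_GATE = [(0.95, 0.02), (0.85, 0.10), (0.75, 0.40), (0.0, 1.0)]

def mixed_byte_logp(nn_top_prob: float, nn_byte_logp: float, ppm_byte_logp: float) -> float:
    """Gate the convex NN/PPM-D byte mixture on how confident the NN is at this position."""
    lam = next(lam for thresh, lam in MULTIBIN_GATE if nn_top_prob >= thresh)
    mixed = (1.0 - lam) * math.exp(nn_byte_logp) + lam * math.exp(ppm_byte_logp)
    return math.log(mixed)
```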

We **do not claim `mix_bpb`** in this submission. The byte-PPM mixture relies on a per-byte spreading approximation `per_byte_logp = token_logp / n_bytes` whose normalization properties over the 256-byte alphabet are an open community question. Specifically: under the assumption that the NN's byte-level probability is the geometric mean implied by spreading the token's log-probability over its bytes, the convex combination with PPM-D's normalized byte distribution does not unambiguously yield a normalized byte distribution, and therefore not unambiguously a Kraft-compliant codelength. Until that interpretation is settled by the maintainers, we report only the standard `val_bpb` derived from token-level NLL, which has no such ambiguity.
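
Spelled out (notation ours, not the maintainers'): the spread assigns each of a token $t$'s $n$ bytes

$$\log p_{\mathrm{NN}}(b_i) \;=\; \tfrac{1}{n}\,\log p_{\mathrm{NN}}(t), \qquad q(b_i) \;=\; (1-\lambda)\,p_{\mathrm{NN}}(b_i) \;+\; \lambda\,p_{\mathrm{PPM}}(b_i),$$

where $p_{\mathrm{PPM}}$ is normalized over the 256-byte alphabet by construction, but $p_{\mathrm{NN}}(b_i)$ is only defined at the bytes actually observed; nothing forces $\sum_{b=0}^{255} p_{\mathrm{NN}}(b) = 1$, so $q$ need not be a proper distribution and $-\log_2 q$ need not be a valid codelength.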

The multibin mixer code is left in `train_gpt.py` so this submission is a single self-contained reproducible artifact. Setting `PPM_ENABLED=0` in the reproduction command produces only the standard `val_bpb` line and skips the mixer entirely.

## Issue #1017 four-condition compliance (for the standard val_bpb path)

| Condition | How this submission satisfies it |
|---|---|
| **C1 Causality** | Standard sliding-window eval; each token scored from prefix only. Phased TTT is score-first per PR #1413's protocol. |
| **C2 Normalized** | The model's softmax over the SP8192 token vocabulary is a proper normalized distribution. The reported `val_bpb` is `(total token NLL) / log(2) / total bytes`, the standard token-level codelength normalized by byte count. |
| **C3 Score-before-update** | Phased TTT scores each chunk under `torch.no_grad()` before any optimizer step. |
| **C4 Single pass** | One left-to-right pass; no rescoring or oracle selection. |

Additional compliance:
- **No SLOT, no token-level n-gram cache, no logit bias.** Inherited from PR #1854's stack.
- **No pre-quant TTT on val data.** Score-first phased TTT only (PR #1413 lineage); a minimal sketch of the score-first loop follows this list.
- **No external network access** at eval time. Tokenizer unchanged from PR #1854's CaseOps SP8192.
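
The sketch below illustrates the score-before-update contract (C3/C4) in its simplest form; it is not the phased schedule in `train_gpt.py`, and `compute_nll(model, chunk)` is a hypothetical helper returning the summed token NLL of a chunk:

```python
import torch

def score_first_ttt_pass(model, optimizer, val_chunks, compute_nll):
    """One left-to-right pass: every chunk is scored before any update that saw it."""
    total_nll = 0.0
    for chunk in val_chunks:
        with torch.no_grad():                    # C3: score first, weights frozen
            total_nll += compute_nll(model, chunk).item()
        loss = compute_nll(model, chunk) / chunk.numel()
        loss.backward()                          # adapt only after the chunk is scored
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)
    return total_nll                             # C4: earlier chunks are never rescored
```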

## Compute & artifact compliance

| Item | Value | Limit | Margin |
|---|--:|--:|--:|
| Training wallclock | 600s (cap-bound, all 3 seeds) | 600s | 0 (by cap) |
| Evaluation wallclock (max seed) | 374.6s | 600s | 225.4s |
| Artifact total bytes (max seed) | 15,952,086 | 16,000,000 | 47,914 |
| Code (uncompressed) | 161,565 bytes | — | — |
| Code (pyminify + lzma) | 33,305 bytes | — | — |
| Quantized model (int6 + brotli) | 15,918,781 bytes | — | — |
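
The artifact total is the sum of the last two rows (33,305 + 15,918,781 = 15,952,086). A rough sketch of how that tally can be re-checked from the shipped files (the minified-code filename is hypothetical; minification and packing are done inside the training run):

```python
import lzma
import os

def artifact_total_bytes(minified_code="train_gpt.min.py", model="final_model.int6.ptz"):
    """Budgeted bytes = lzma-compressed minified code + quantized-model file size."""
    code_bytes = len(lzma.compress(open(minified_code, "rb").read(), preset=9))
    model_bytes = os.path.getsize(model)   # already int6-quantized and brotli-packed
    return code_bytes + model_bytes        # must stay under 16,000,000
```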

## Lineage and credits

- **PR #549** (@abaybektursun) — score-first TTT framework
- **PR #1394** (@clarkkev) — SP8192 + multi-phase score-first TTT + GPTQ embeddings + SDClip
- **PR #1413** (@dexhunter) — legal score-first TTT with QK-Gain
- **PR #1493** (@bigbag) — 3-layer recurrence + parallel residuals + QK-Gain 5.25 (current merged leaderboard)
- **PR #1729** (@romeerp) — CaseOps bijective case transform
- **PR #1787** (@nprime06) — SparseAttnGate + PolarNS + MIN_LR + FusedCE
- **PR #1797** (@dexhunter) — Smear gate + LQER asymmetric (this submission's neural base)
- **PR #1854** (@ndokutovich) — PR #1797 base + PR #1835 PPM-D port (the direct predecessor whose neural stack this submission reproduces)
- **PR #1835** (@anmarhindi) — original byte-level PPM-D mixture (we include a multibin-gate refinement in code, but do not claim its score)

This submission's contribution is twofold:
1. **Eval-budget-compliant 3-seed reproduction** of PR #1854's neural stack (val_bpb 1.06777 mean, std 0.00106) with `PHASED_TTT_PREFIX_DOCS=1500`, fitting cleanly under the 600s eval cap.
2. **Multibin-λ refinement** of the PR #1835 PPM-D mixer (included in code, runs at `PPM_ENABLED=1`). We document its measured `mix_bpb` of ~0.861 on the val byte stream but do not claim it as the headline `val_bpb` due to the byte-spread normalization question.

## Reproduction

### Data prep (run once, ~30 min on CPU pod)

```bash
python3 -c "from huggingface_hub import hf_hub_download; \
  hf_hub_download(repo_id='willdepueoai/parameter-golf', \
  filename='datasets/docs_selected.jsonl', \
  repo_type='dataset', local_dir='./hf_cache')"

python3 prepare_caseops_data.py \
  --docs ./hf_cache/datasets/docs_selected.jsonl \
  --out ./data/datasets/fineweb10B_sp8192_caseops/datasets \
  --sp ./tokenizers/fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model \
  --workers 16 --max-docs 5000000
```

### 3-seed training + eval (~$54 on RunPod 8×H100 SXM)

```bash
DATA_PATH=./data/datasets/fineweb10B_sp8192_caseops/datasets/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved
TOKENIZER_PATH=./tokenizers/fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model

for SEED in 42 1337 314; do
  CASEOPS_ENABLED=1 \
  PHASED_TTT_PREFIX_DOCS=1500 PHASED_TTT_NUM_PHASES=3 \
  MATRIX_CLIP_SIGMAS=12.85 ATTN_CLIP_SIGMAS=13.0 MLP_CLIP_SIGMAS=12.0 \
  EMBED_BITS=7 EMBED_CLIP_SIGMAS=15.0 \
  MATRIX_LR=0.026 MIN_LR=0.1 \
  FUSED_CE_ENABLED=1 SPARSE_ATTN_GATE_ENABLED=1 \
  SMEAR_GATE_ENABLED=1 GATE_WINDOW=12 \
  LQER_ENABLED=1 LQER_RANK=4 LQER_TOP_K=3 LQER_FACTOR_BITS=4 \
  LQER_ASYM_ENABLED=1 LQER_ASYM_GROUP=64 \
  TTT_WARM_START_A=1 \
  GPTQ_RESERVE_SECONDS=0.5 GPTQ_CALIBRATION_BATCHES=16 \
  PPM_ENABLED=1 PPM_ORDER=5 PPM_SUBSET_TOKENS=4000000 \
  DATA_PATH="$DATA_PATH" TOKENIZER_PATH="$TOKENIZER_PATH" \
  SEED=$SEED \
  torchrun --standalone --nproc_per_node=8 train_gpt.py > train_seed${SEED}.log 2>&1
done
```

The headline `val_bpb` for each run is logged as the `quantized_ttt_phased val_bpb:` field. The same logs also include the exploratory `mix_bpb` from the multibin mixer; that is not claimed.
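
To pull the per-seed numbers straight out of the logs, a small sketch (it assumes the `quantized_ttt_phased val_bpb:` line format named above; adjust the regex if the field is formatted differently):

```python
import re
import statistics

vals = []
for seed in (42, 1337, 314):
    log_text = open(f"train_seed{seed}.log").read()
    # Take the last occurrence in case the field is printed more than once per run.
    vals.append(float(re.findall(r"quantized_ttt_phased val_bpb:\s*([0-9.]+)", log_text)[-1]))

print(f"mean={statistics.fmean(vals):.5f}  std={statistics.stdev(vals):.5f}")
```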

To reproduce **only** the headline `val_bpb` (skipping the mixer entirely), set `PPM_ENABLED=0` in the env block above.

## Files

- `README.md` — this file
- `submission.json` — metadata
- `train_gpt.py` — PR #1854's neural stack with the multibin mixer addition (the claimed `val_bpb` is the standard, mixer-independent path)
- `lossless_caps.py` — verbatim from PR #1854
- `prepare_caseops_data.py` — verbatim from PR #1854
- `tokenizers/fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model` — verbatim from PR #1854
- `train_seed{42,1337,314}.log` — per-seed train+eval logs
- `final_model.int6.ptz` — quantized model artifact (best seed)