Commit 1583e60

Record: PR openai#1854 neural stack, budget-compliant 3-seed reproduction — val_bpb 1.06777 (3-seed mean)
3-seed validated reproduction of PR openai#1854's neural stack with PHASED_TTT_PREFIX_DOCS=1500 to fit the 600s eval budget. Beats merged SOTA PR openai#1493 (bigbag, 1.0810) by 0.01323 BPB at ~13σ statistical significance. Reported val_bpb is the standard token-level NLL → byte conversion (no byte-PPM mixture claimed). The exploratory multibin-λ refinement of PR openai#1835's mixer is included in train_gpt.py for completeness but its mix_bpb is not the headline claim, due to an open community question on byte-spread normalization vs Kraft compliance.
1 parent 7427de2 commit 1583e60

10 files changed

Lines changed: 6216 additions & 0 deletions

Lines changed: 133 additions & 0 deletions
@@ -0,0 +1,133 @@
# Record: PR #1854 neural stack, budget-compliant 3-seed reproduction — val_bpb 1.06777 (3-seed mean)

**val_bpb: 1.06777** (3-seed mean, std 0.00106) | **15,951,074 bytes** (mean) | 8×H100 SXM, ≤600s train / ≤600s eval

This submission is a **3-seed validated, eval-budget-compliant reproduction** of the neural-stack portion of [PR #1854](https://github.com/openai/parameter-golf/pull/1854) (@ndokutovich), with `PHASED_TTT_PREFIX_DOCS` reduced from 2000 to 1500 so that evaluation fits cleanly inside the 600s budget. It **beats merged-leaderboard SOTA PR #1493 (@bigbag, 1.0810) by 0.01323 BPB** at ~13σ statistical significance.

The reported `val_bpb` is the **standard token-level NLL → byte conversion** (`val_loss / ln(2) × tokens / bytes`, with `val_loss` in nats per token) for the post-quantization, post-Phased-TTT model. **No byte-level mixture is claimed** — see "Note on byte-PPM mixture" below.
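For concreteness, the conversion can be sketched in a couple of lines (a minimal illustration; the variable names are ours, not `train_gpt.py`'s):

```python
import math

# Token-level NLL -> bits-per-byte, as described above:
# total NLL in nats over the val stream, converted to bits, divided by the byte count.
def to_val_bpb(total_token_nll_nats: float, total_bytes: int) -> float:
    return total_token_nll_nats / math.log(2) / total_bytes

# Equivalently, with a mean val_loss (nats/token) over n_tokens tokens covering n_bytes bytes:
#   to_val_bpb(val_loss * n_tokens, n_bytes) == val_loss / math.log(2) * (n_tokens / n_bytes)
```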
## Result (3 seeds, 8×H100 80GB SXM)

| Seed | val_bpb | Total bytes | Eval time |
|------|--------:|------------:|----------:|
| 42 | 1.06686 | 15,952,086 | 374.6s |
| 1337 | 1.06893 | 15,949,941 | 371.0s |
| 314 | 1.06752 | 15,951,195 | 327.7s |
| **Mean** | **1.06777** | **15,951,074** | **357.8s** |
| **Std** | **0.00106** | | |

- **vs merged leaderboard PR #1493 @bigbag (1.0810)**: **−0.01323 BPB**; the margin is 12.5× the 3-seed std and is well clear of the 0.005-nat record threshold (p ≪ 0.0001)
- All 3 artifacts under 16,000,000 bytes (max 15,952,086, margin 47,914)
- All 3 eval times under 600s wallclock (max 374.6s, margin 225.4s)
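The summary rows follow directly from the per-seed values; a quick check in Python (numbers copied from the table above):

```python
import statistics

seed_bpb = {42: 1.06686, 1337: 1.06893, 314: 1.06752}   # per-seed val_bpb

mean = statistics.mean(seed_bpb.values())    # 1.06777
std = statistics.stdev(seed_bpb.values())    # 0.00106 (sample std)
margin = 1.0810 - mean                       # 0.01323 vs merged PR #1493
print(f"{mean:.5f}  {std:.5f}  margin={margin:.5f} ({margin / std:.1f}x std)")
```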
## What's new vs PR #1854

PR #1854's reported eval wallclock is **~700s** (per its own log breakdown: ttt_phased 516s + ppm_mix 116s + diagnostics 67s), which exceeds the 600s evaluation budget. This submission shows that the same neural stack reaches **essentially the same post-TTT val_bpb (~1.067)** while fitting cleanly under the 600s eval budget, by reducing `PHASED_TTT_PREFIX_DOCS` from 2000 to 1500. The 3-seed std of 0.00106 indicates the val_bpb is stable at this trim.

This matters because the 600s eval budget is an explicit contest constraint (closed PRs in Issue #677 cite eval over-budget as grounds for rejection — see PR #503's closure). A budget-compliant 1.067 is a more defensible record candidate than a slightly lower but over-budget one.
## Note on byte-PPM mixture (not claimed)

`train_gpt.py` includes our exploratory multibin-λ refinement of the byte-level PPM-D mixer (a 4-tier graduated gate: `[(0.95, 0.02), (0.85, 0.10), (0.75, 0.40), (0.0, 1.0)]`). When run with `PPM_ENABLED=1`, it produces `mix_bpb ≈ 0.861` on the val byte stream — roughly 0.21 BPB below the `quantized_ttt_phased val_bpb`.
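One way to read the 4-tier gate is that the weight handed to PPM-D grows as the NN's per-byte confidence drops. A simplified sketch of that reading (illustrative only, not the `train_gpt.py` code path; the NN per-byte probability uses the spreading approximation discussed next):

```python
import math

# 4-tier graduated gate from the text above:
# (threshold on the NN's per-byte probability, PPM-D weight λ).
GATE = [(0.95, 0.02), (0.85, 0.10), (0.75, 0.40), (0.0, 1.0)]

def mixed_byte_prob(token_logp_nats: float, n_bytes: int, ppm_prob: float) -> float:
    # Per-byte spread of the token log-prob; note this is a single scalar,
    # not a normalized 256-way byte distribution (the open question below).
    nn_prob = math.exp(token_logp_nats / n_bytes)
    lam = next(weight for threshold, weight in GATE if nn_prob >= threshold)
    return (1.0 - lam) * nn_prob + lam * ppm_prob
```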
We **do not claim `mix_bpb`** in this submission. The byte-PPM mixture relies on a per-byte spreading approximation, `per_byte_logp = token_logp / n_bytes`, whose normalization properties over the 256-byte alphabet are an open community question. Specifically: under the interpretation that the NN's per-byte probability is the geometric mean implied by spreading the token log-probability evenly over its bytes, the convex combination with PPM-D's normalized byte distribution is not unambiguously a Kraft-compliant codelength. Until the maintainers settle that interpretation, we report only the standard `val_bpb` derived from token-level NLL, which has no such ambiguity.

The multibin mixer code is left in `train_gpt.py` so that this submission is a single self-contained reproducible artifact. Setting `PPM_ENABLED=0` in the reproduction command produces only the standard `val_bpb` line and skips the mixer entirely.
## Issue #1017 four-condition compliance (for the standard val_bpb path)

| Condition | How this submission satisfies it |
|---|---|
| **C1 Causality** | Standard sliding-window eval; each token is scored from its prefix only. Phased TTT is score-first, per PR #1413's protocol. |
| **C2 Normalized** | The model's softmax over the SP8192 token vocabulary is a properly normalized distribution. The reported `val_bpb` is `(total token NLL in nats) / ln(2) / total bytes`, the standard token-level codelength normalized by byte count. |
| **C3 Score-before-update** | Phased TTT scores each chunk under `torch.no_grad()` before any optimizer step. |
| **C4 Single pass** | One left-to-right pass; no rescoring or oracle selection. |
Additional compliance:

- **No SLOT, no token-level n-gram cache, no logit bias.** Inherited from PR #1854's stack.
- **No pre-quant TTT on val data.** Score-first phased TTT only (PR #1413 lineage).
- **No external network access** at eval time. Tokenizer unchanged from PR #1854's CaseOps SP8192.
## Compute & artifact compliance

| Item | Value | Limit | Margin |
|---|--:|--:|--:|
| Training wallclock | 600s (cap-bound, all 3 seeds) | 600s | 0 (by cap) |
| Evaluation wallclock (max seed) | 374.6s | 600s | 225.4s |
| Artifact total bytes (max seed) | 15,952,086 | 16,000,000 | 47,914 |
| Code, uncompressed (bytes) | 161,565 | | |
| Code, pyminify + lzma (bytes) | 33,305 | | |
| Quantized model, int6 + brotli (bytes) | 15,918,781 | | |
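The artifact total can be sanity-checked by summing the compressed pieces (a hedged sketch; the minified-code filename is an assumption, the model filename is from the Files list below):

```python
import lzma, os

# Mirror of the table above: compressed code bytes + quantized model bytes
# must stay under the 16,000,000-byte artifact limit.
code_bytes = len(lzma.compress(open("train_gpt.min.py", "rb").read()))  # pyminify output, name assumed
model_bytes = os.path.getsize("final_model.int6.ptz")
total = code_bytes + model_bytes              # 33,305 + 15,918,781 = 15,952,086 on the max seed
print(total, "margin:", 16_000_000 - total)   # margin 47,914
```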
## Lineage and credits

- **PR #549** (@abaybektursun) — score-first TTT framework
- **PR #1394** (@clarkkev) — SP8192 + multi-phase score-first TTT + GPTQ embeddings + SDClip
- **PR #1413** (@dexhunter) — legal score-first TTT with QK-Gain
- **PR #1493** (@bigbag) — 3-layer recurrence + parallel residuals + QK-Gain 5.25 (current merged leaderboard record)
- **PR #1729** (@romeerp) — CaseOps bijective case transform
- **PR #1787** (@nprime06) — SparseAttnGate + PolarNS + MIN_LR + FusedCE
- **PR #1797** (@dexhunter) — Smear gate + LQER asymmetric (this submission's neural base)
- **PR #1854** (@ndokutovich) — PR #1797 base + PR #1835 PPM-D port (the neural stack this submission reproduces)
- **PR #1835** (@anmarhindi) — original byte-level PPM-D mixture (we include a multibin-gate refinement in code, but do not claim its score)
This submission's contribution is twofold:

1. **Eval-budget-compliant 3-seed reproduction** of PR #1854's neural stack (val_bpb 1.06777 mean, std 0.00106) with `PHASED_TTT_PREFIX_DOCS=1500`, fitting cleanly under the 600s eval cap.
2. **Multibin-λ refinement** of the PR #1835 PPM-D mixer (included in code, runs when `PPM_ENABLED=1`). We document its measured `mix_bpb` of ~0.861 on the val byte stream but do not claim it as the headline `val_bpb`, due to the byte-spread normalization question.
## Reproduction

### Data prep (run once, ~30 min on CPU pod)

```bash
python3 -c "from huggingface_hub import hf_hub_download; \
  hf_hub_download(repo_id='willdepueoai/parameter-golf', \
                  filename='datasets/docs_selected.jsonl', \
                  repo_type='dataset', local_dir='./hf_cache')"

python3 prepare_caseops_data.py \
  --docs ./hf_cache/datasets/docs_selected.jsonl \
  --out ./data/datasets/fineweb10B_sp8192_caseops/datasets \
  --sp ./tokenizers/fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model \
  --workers 16 --max-docs 5000000
```
### 3-seed training + eval (~$54 on RunPod 8×H100 SXM)

```bash
DATA_PATH=./data/datasets/fineweb10B_sp8192_caseops/datasets/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved
TOKENIZER_PATH=./tokenizers/fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model

for SEED in 42 1337 314; do
  CASEOPS_ENABLED=1 \
  PHASED_TTT_PREFIX_DOCS=1500 PHASED_TTT_NUM_PHASES=3 \
  MATRIX_CLIP_SIGMAS=12.85 ATTN_CLIP_SIGMAS=13.0 MLP_CLIP_SIGMAS=12.0 \
  EMBED_BITS=7 EMBED_CLIP_SIGMAS=15.0 \
  MATRIX_LR=0.026 MIN_LR=0.1 \
  FUSED_CE_ENABLED=1 SPARSE_ATTN_GATE_ENABLED=1 \
  SMEAR_GATE_ENABLED=1 GATE_WINDOW=12 \
  LQER_ENABLED=1 LQER_RANK=4 LQER_TOP_K=3 LQER_FACTOR_BITS=4 \
  LQER_ASYM_ENABLED=1 LQER_ASYM_GROUP=64 \
  TTT_WARM_START_A=1 \
  GPTQ_RESERVE_SECONDS=0.5 GPTQ_CALIBRATION_BATCHES=16 \
  PPM_ENABLED=1 PPM_ORDER=5 PPM_SUBSET_TOKENS=4000000 \
  DATA_PATH="$DATA_PATH" TOKENIZER_PATH="$TOKENIZER_PATH" \
  SEED=$SEED \
  torchrun --standalone --nproc_per_node=8 train_gpt.py > train_seed${SEED}.log 2>&1
done
```
The headline `val_bpb` for each run is logged as the `quantized_ttt_phased val_bpb:` field. The same logs also include the exploratory `mix_bpb` from the multibin mixer; that value is not claimed.

To reproduce **only** the headline `val_bpb` (skipping the mixer entirely), set `PPM_ENABLED=0` in the env block above.
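To aggregate the three runs, the headline field can be pulled straight out of the logs; a small helper (the exact log-line format is assumed from the field name above):

```python
import glob, re, statistics

# Collect the headline metric from each seed log and summarize.
vals = {}
for path in sorted(glob.glob("train_seed*.log")):
    m = re.search(r"quantized_ttt_phased val_bpb:\s*([0-9.]+)", open(path).read())
    if m:
        vals[path] = float(m.group(1))

for path, v in vals.items():
    print(f"{path}: {v:.5f}")
print("mean", round(statistics.mean(vals.values()), 5),
      "std", round(statistics.stdev(vals.values()), 5))
```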
## Files

- `README.md` — this file
- `submission.json` — metadata
- `train_gpt.py` — PR #1854's neural stack with the multibin mixer addition (the claimed `val_bpb` comes from the standard, mixer-independent path)
- `lossless_caps.py` — verbatim from PR #1854
- `prepare_caseops_data.py` — verbatim from PR #1854
- `tokenizers/fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model` — verbatim from PR #1854
- `train_seed{42,1337,314}.log` — per-seed train+eval logs
- `final_model.int6.ptz` — quantized model artifact (best seed)
