Non-record: Audited Byte-Level Neural/PPM-D Mixture BPB = 1.5221 (Full Validation) — Framework for Legal Score-First PPM-D Mixtures #1916

Open

Christopher-Lee-McClendon wants to merge 8 commits into openai:main from Christopher-Lee-McClendon:submission/non-record-ppmd-framework

Conversation

Christopher-Lee-McClendon (Contributor) commented Apr 29, 2026

Non-record: Audited Byte-Level Neural/PPM-D Mixture BPB = 1.5221 (Full Validation) — Framework for Legal Score-First PPM-D Mixtures

Key Results

| Metric | Value | Status |
| --- | --- | --- |
| Full-val Path B mixture BPB | 1.5221 (151,078,222 bytes) | ✅ Audited, claim_ready=true |
| Full-val Path B neural-only BPB | 1.5430 | ✅ Audited |
| Full-val Path B PPM-D-only BPB | 2.0319 | ✅ Audited |
| Neural-only sliding BPB (token-level) | 1.0830 | ✅ Valid |
| Neural-only TTT BPB (token-level) | 1.0812 | ✅ Valid |
| Invalid in-source PPM mixture | 0.9949 | ❌ Invalid — see formal proof |

PPM-D mixture gain: −0.021 BPB over neural-only byte-level baseline (1.5430 → 1.5221). Note: This result is sliding-window only; no corrected Path B TTT result exists yet.

This is a non-record methodology submission documenting a rigorous framework for computing valid byte-level BPB with neural + PPM-D mixtures. The headline result is a full-validation, provably normalized, audited byte-level mixture BPB of 1.5221 — the first such audited number in the contest.


What's new in this update

  1. Full-validation Path B result (previously reported only on an 8M-token subset):

    • mixture_bpb = 1.5221 (was 1.5459 on 8M subset)
    • neural_only_bpb = 1.5430 (was 1.5619 on 8M subset)
    • PPM-D improves by −0.021 BPB on full val (vs −0.016 on subset)
    • Runtime: ~9.4 hours offline CPU postpass
  2. Comparison with PR #1905 ("Report: PPM-D byte-level scoring is not a valid probability distribution, and why it appears to gain") by @leon2k2k2k

  3. Formal mathematical description of byte-level vs token-level BPB metrics

  4. Path A archived as computationally intractable (O(V=8192) per position)

  5. Score-First Legal TTT evidence section citing PRs:

    • #461: Non-record: 11L Depth Recurrence + High-Yield Legal TTT (1.14458 BPB)
    • #549: Record: LeakyReLU² + Legal Score-First TTT + Parallel Muon — val_bpb 1.1194 (3-seed mean)
    • #1735: Record: SP8192 + Parallel Pre-Quant TTT — val_bpb 1.0429 (3-seed mean)
    • #1851: Record: val_bpb = 1.06128 SmearGate BOS Fix + PR #1787 Base + Smear Gate + LQER Asymmetric + Phased TTT (indirect 3-seed mean)
    • #1868: Record: SmearGate BOS Fix 3-Seed Compliance Re-run — val_bpb 1.06141 (3-seed mean)
    • #1876: Record: Coprime-Stride Loader + Full GPTQ + Score-First TTT — val_bpb 1.08008 (3-seed mean)


Formal Math

Token-Level BPB (standard contest metric)

$$\text{BPB}_{\text{token}} = \frac{\sum_{t=1}^{T} -\log p_{\text{NN}}(v_t \mid v_{\lt t})}{\log 2 \cdot B}$$

Byte-Level Neural BPB (Path B — trie marginalization over continuable mass)

$$p_{\text{NN}}^{\text{byte}}(b_k \mid \pi_{\lt k}) = \frac{\sum_{v \in C(\pi_{\lt k} b_k)} p_{\text{NN}}(v \mid v_{\lt t})}{\sum_{v \in C(\pi_{\lt k})} p_{\text{NN}}(v \mid v_{\lt t})}$$

where $C(\pi)$ denotes tokens whose byte sequences strictly extend $\pi$ (excluding tokens terminating exactly at $\pi$). This is a proper conditional distribution by construction.
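
For concreteness, here is a minimal, unoptimized Python sketch of this conditional. It is illustrative only, not the bundled fast_score.py: `vocab` (token id → bytes) and `p_token` (next-token probabilities) are hypothetical names, and a real implementation would walk a byte trie once per position instead of scanning the vocabulary per byte.

```python
def byte_conditional(p_token, vocab, prefix: bytes, next_byte: int) -> float:
    """P(next_byte | prefix): token mass strictly extending prefix+byte,
    normalized by token mass strictly extending prefix."""
    def mass(pi: bytes) -> float:
        # C(pi): tokens whose byte sequence strictly extends pi
        return sum(p for tok, p in zip(vocab, p_token)
                   if tok.startswith(pi) and len(tok) > len(pi))

    denom = mass(prefix)
    if denom <= 0.0:
        raise ValueError("no continuable mass under this prefix")
    return mass(prefix + bytes([next_byte])) / denom
```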

Mixture

$$p_{\text{mix}}(b) = (1 - \lambda) \cdot p_{\text{NN}}^{\text{byte}}(b) + \lambda \cdot p_{\text{PPM}}(b)$$

$$\lambda = \begin{cases} 0.90 & \text{if } \max_b p_{\text{PPM}}(b) \geq 0.90 \\ 0.05 & \text{otherwise} \end{cases}$$

Full-val BPB

$$\text{BPB}_{\text{byte}} = \frac{\sum_{i=1}^{B} -\log_2 p_{\text{mix}}(b_i \mid h_{\lt i})}{B} = \frac{\text{total NLL bits}}{151{,}078{,}222} = 1.5221$$
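
A compact sketch of how the gated mixture and the byte-level accumulation fit together (again illustrative; `p_nn_byte` and `p_ppm` are hypothetical callables returning 256-way byte distributions for the history seen so far):

```python
import math

def mixture_bpb(data: bytes, p_nn_byte, p_ppm) -> float:
    """Accumulate -log2 of the gated mixture over every byte, divide by B."""
    total_bits = 0.0
    for i, b in enumerate(data):
        nn = p_nn_byte(data[:i])                   # trie-marginalized neural dist
        ppm = p_ppm(data[:i])                      # PPM-D (with exclusion) dist
        lam = 0.90 if max(ppm) >= 0.90 else 0.05   # confidence gate on PPM-D
        total_bits -= math.log2((1 - lam) * nn[b] + lam * ppm[b])
    return total_bits / len(data)
```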

Why neural_only_bpb (1.5430) ≠ token-level sliding BPB (1.0830)

Token-level BPB distributes each token's cross-entropy uniformly across its bytes. Byte-level BPB asks "given the byte prefix emitted so far, what is the conditional probability of the next byte?" — a fundamentally harder task requiring the model to resolve within-token byte ambiguity.
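
A toy illustration with made-up numbers (not measured figures from either run):

```python
import math

# A 4-byte token predicted with p = 0.5 costs 1 bit at token level, i.e.
# 0.25 bits/byte after uniform spreading. At byte level the model must price
# each byte given the prefix; hypothetical per-byte conditionals of
# (0.6, 0.9, 0.9, 0.9) for the same token already cost noticeably more.
token_level = -math.log2(0.5) / 4                               # 0.25 bits/byte
byte_level = sum(-math.log2(p) for p in (0.6, 0.9, 0.9, 0.9)) / 4
print(token_level, byte_level)                                  # 0.25 vs ~0.30
```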


Comparison with PR #1905

Both this submission and PR #1905 (by @leon2k2k2k) independently discovered the same normalization invalidity in geometric-mean byte decomposition. Both implement correct trie-based conditional byte distributions. Yet the mixture effect diverges:

| Aspect | This submission (Path B) | PR #1905 |
| --- | --- | --- |
| PPM-D variant | With exclusion (order 5) | Without exclusion |
| Confidence gating | PPM-D confidence-based (both use PPM-D side) | PPM-D confidence-based |
| Mixture effect on own byte-level baseline | −0.021 BPB (improvement) | +0.038 BPB (degradation) |
| PPM-D helps? | Yes | No (hurts) |

Note: The raw neural baselines (1.5430 vs 1.08335) are not directly comparable — ours is byte-level, theirs appears to be token-level. The meaningful comparison is the direction of the mixture effect on each submission's own baseline.

Both independently confirmed: uniform-spread (geometric mean) byte decomposition is NOT a valid probability distribution (sums > 1). The key difference is PPM-D with exclusion (ours) vs without (theirs) — exclusion produces sharper predictions and provably normalizes.
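
A tiny counterexample makes the invalidity concrete, using a hypothetical two-token vocabulary:

```python
# Hypothetical vocabulary {"ab": 0.5, "ac": 0.5}. Uniform spreading assigns
# each byte of a token the value p ** (1/len). The mass this implies for the
# single first byte 'a' is sqrt(0.5) + sqrt(0.5) ~= 1.414 > 1, so the implied
# per-position byte "distribution" cannot sum to 1.
tokens = {b"ab": 0.5, b"ac": 0.5}
mass_a = sum(p ** (1.0 / len(t)) for t, p in tokens.items() if t[0] == ord("a"))
print(mass_a)  # 1.4142... -> not a valid probability distribution
```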


Score-First Legal TTT

| PR | Contribution |
| --- | --- |
| #461 (MERGED) | Introduced score-first TTT, proved legal under Issue #1017 C1–C4 |
| #549 (MERGED) | Extended score-first TTT |
| #1735 | Parallel TTT, 21 epochs in eval budget |
| #1851 | SmearGate BOS + score-first TTT, post-TTT BPB 1.06128 |
| #1868 | Clean neural baseline |
| #1876 | Coprime-Stride + Full GPTQ + Score-First TTT, BPB 1.08008 |
| #1881 | PPM-D mixture 0.9019 BPB (invalid uniform-spread) |

Path A: Computationally Intractable

Path A (token-normalized PPM mixture) required O(V=8192) PPM-D evaluations per token position — projected at ~38 days CPU-only. Even with C++ backend (17-50× speedup), it exceeds practical budgets. Archived with full materials.
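
A back-of-envelope check of that projection; the position count and per-evaluation cost below are assumptions for illustration, not measured figures:

```python
V = 8192                   # one PPM-D evaluation per candidate token
n_positions = 38_000_000   # assumed full-val token count
per_eval_s = 1e-5          # assumed CPU cost of one PPM-D token walk
days = V * n_positions * per_eval_s / 86_400
print(f"~{days:.0f} days") # ~36 days on these assumptions, same order as ~38
```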


Red-team investigation

The full formal legality proof (docs/legality/ppmd-legality-proof.md) includes 6 theorems:

  • Theorems 1, 4, 5, 6: ✅ Verified (PPM-D normalization, score-before-update, denominators, coverage)
  • Theorems 2, 3: ❌ Disproved (geometric-mean neural bytes, mixture — NOT valid distributions)

Full-val legality proof: docs/legality/ppmd-legality-proof-fullval-result.md
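
For intuition on the Theorem 1 direction, here is a minimal numeric check of method-D normalization for a single context, assuming a uniform zero-order fallback in place of the full lower-order recursion:

```python
# PPM method D for one context: p(s) = (c(s) - 0.5) / n for seen symbols,
# p(escape) = d / (2n), with n = total count and d = distinct seen symbols.
counts = {ord("a"): 3, ord("b"): 1}                   # hypothetical context stats
n, d = sum(counts.values()), len(counts)
seen = {s: (c - 0.5) / n for s, c in counts.items()}
escape = d / (2 * n)
unseen = [b for b in range(256) if b not in counts]
fallback = {b: escape / len(unseen) for b in unseen}  # uniform stand-in for lower orders
print(sum(seen.values()) + sum(fallback.values()))    # 1.0 (up to float rounding)
```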


Files changed

  • README.md — Major update: headline full-val result, formal math, PR #1905 comparison, score-first TTT evidence
  • submission.json — Added full-val Path B fields
  • results/exp_1876_ppmd/path_b_prod_8gpu_fullval_local_score/path_b_sliding_full.json — Full-val result JSON
  • docs/legality/ppmd-legality-proof-fullval-result.md — Full-val legality proof
  • scripts/fast_score.py — Fast scoring utility
  • docs/path_a_archive/ — Archived Path A materials with intractability note

Artifact: 15,975,706 bytes (model) + 20,220 bytes (code) = 15,995,926 bytes (under 16 MB cap)

3-seed reproduction of PR openai#1851 (SmearGate BOS document boundary fix).
Code is byte-identical to openai#1851 by @aquariouseworkman.

Results (post-TTT BPB):
  Seed 42:   1.06128  (original openai#1851 author)
  Seed 314:  1.06087  (this submission)
  Seed 1234: 1.06220  (this submission)
  Mean:      1.06145 ± 0.00068

All artifacts < 16,000,000 bytes. All runs < 600s.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…76 model

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add non-record submission documenting the formal legality framework for
score-first PPM-D mixtures with neural language models.

Key contributions:
- Formal 6-theorem legality proof: geometric-mean byte decomposition
  of token probabilities does NOT yield a proper distribution; PPM-D
  requires exclusion normalization; prior _ppm_mixture_bpb is invalid
- Audited Path B first-8M-token subset result:
    mixture_bpb=1.5459  neural_only_bpb=1.5619  claim_ready=true
- Path A (token-normalized, C++/CUDA backend) and Path B (byte-trie
  marginalization) as two constructive correction approaches
- Full evidence bundle: 35 files incl. evaluator scripts, audit JSONs,
  machine-checkable test suite, plan docs, and production training log
- No overclaim: val_bpb=null; full-val CPU postpass projects ~6–9 hr
- 8xH100 training artifact: exp_1876, 11-layer 512d SP8192 transformer
  with depth recurrence, 4590 steps, 15.99MB artifact (within 16MB cap)

Lineage: openai#1851 -> openai#1868 -> openai#1873 -> openai#1876 / openai#1877 / Issue openai#1872
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… 1.5221

- Add full-val Path B result (151,078,222 bytes, claim_ready=true)
- Add formal mathematical description of byte-level vs token-level BPB
- Add comparison with PR openai#1905 (independent normalization invalidity discovery)
- Add Score-First Legal TTT evidence section (PRs openai#461, openai#549, openai#1735, openai#1851)
- Archive Path A as computationally intractable
- Bundle fast_score.py and full-val legality proof
- Fix trie marginalization formula to reflect continuable mass implementation
- Update submission.json with full-val fields

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@Christopher-Lee-McClendon Christopher-Lee-McClendon changed the title Non-record: Framework for Legal Score-First PPM-D Mixtures Non-record: Audited Byte-Level Neural/PPM-D Mixture BPB = 1.5221 (Full Validation) — Framework for Legal Score-First PPM-D Mixtures Apr 29, 2026
GitHub markdown parser treats <k, <t, <i in LaTeX subscripts as HTML
tags. Replace with \lt to render correctly.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…an 1.06141

Re-ran all 3 seeds (42, 314, 1234) with GPTQ_RESERVE_SECONDS=8.0 (was 0.5)
to ensure GPTQ hessian collection completes within the 600s training budget.

Code changes:
- Serialize artifact immediately after training (before diagnostic eval)
- Added timing instrumentation (serialize_wallclock, GPTQ sub-timings)

Results (all seeds fresh re-run on RunPod 8×H100 SXM):
  Seed 42:   post-TTT BPB = 1.06083, artifact = 15,949,701, eval = 525.5s
  Seed 314:  post-TTT BPB = 1.06091, artifact = 15,951,777, eval = 429.5s
  Seed 1234: post-TTT BPB = 1.06249, artifact = 15,951,968, eval = 481.2s
  3-seed mean: 1.06141 ± 0.00093

Compliance: training loop ends at ~592s, GPTQ hessians end at ~595.5s (<600s).
RunPod cost: ~$31.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>