Non-record: Audited Byte-Level Neural/PPM-D Mixture BPB = 1.5221 (Full Validation) — Framework for Legal Score-First PPM-D Mixtures #1916

Open

Christopher-Lee-McClendon wants to merge 8 commits into openai:main from Christopher-Lee-McClendon:submission/non-record-ppmd-framework

Conversation

Christopher-Lee-McClendon (Contributor) commented Apr 29, 2026

Non-record: Audited Byte-Level Neural/PPM-D Mixture BPB = 1.5221 (Full Validation) — Framework for Legal Score-First PPM-D Mixtures

Key Results

| Metric | Value | Status |
| --- | --- | --- |
| Full-val Path B mixture BPB | 1.5221 (151,078,222 bytes) | ✅ Audited, claim_ready=true |
| Full-val Path B neural-only BPB | 1.5430 | ✅ Audited |
| Full-val Path B PPM-D-only BPB | 2.0319 | ✅ Audited |
| Neural-only sliding BPB (token-level) | 1.0830 | ✅ Valid |
| Neural-only TTT BPB (token-level) | 1.0812 | ✅ Valid |
| Invalid in-source PPM mixture | 0.9949 | ❌ Invalid — see formal proof |

PPM-D mixture gain: −0.021 BPB over neural-only byte-level baseline (1.5430 → 1.5221). Note: This result is sliding-window only; no corrected Path B TTT result exists yet.

This is a non-record methodology submission documenting a rigorous framework for computing valid byte-level BPB with neural + PPM-D mixtures. The headline result is a full-validation, provably normalized, audited byte-level mixture BPB of 1.5221 — the first such audited number in the contest.


What's new in this update

  1. Full-validation Path B result (previously reported only on an 8M-token subset):

    • mixture_bpb = 1.5221 (was 1.5459 on 8M subset)
    • neural_only_bpb = 1.5430 (was 1.5619 on 8M subset)
    • PPM-D improves by −0.021 BPB on full val (vs −0.016 on subset)
    • Runtime: ~9.4 hours offline CPU postpass
  2. Comparison with PR #1905 ("Report: PPM-D byte-level scoring is not a valid probability distribution, and why it appears to gain") by @leon2k2k2k

  3. Formal mathematical description of byte-level vs token-level BPB metrics

  4. Path A archived as computationally intractable (O(V=8192) per position)

  5. Score-First Legal TTT evidence section citing PRs:

    • #461: Non-record: 11L Depth Recurrence + High-Yield Legal TTT (1.14458 BPB)
    • #549: Record: LeakyReLU² + Legal Score-First TTT + Parallel Muon — val_bpb 1.1194 (3-seed mean)
    • #1735: Record: SP8192 + Parallel Pre-Quant TTT — val_bpb 1.0429 (3-seed mean)
    • #1851: Record: val_bpb = 1.06128 SmearGate BOS Fix + PR #1787 Base + Smear Gate + LQER Asymmetric + Phased TTT (indirect 3-seed mean)
    • #1868: Record: SmearGate BOS Fix 3-Seed Compliance Re-run — val_bpb 1.06141 (3-seed mean)
    • #1876: Record: Coprime-Stride Loader + Full GPTQ + Score-First TTT — val_bpb 1.08008 (3-seed mean)


Formal Math

Token-Level BPB (standard contest metric)

$$\text{BPB}_{\text{token}} = \frac{\sum_{t=1}^{T} -\log p_{\text{NN}}(v_t \mid v_{\lt t})}{\log 2 \cdot B}$$

Byte-Level Neural BPB (Path B — trie marginalization over continuable mass)

$$p_{\text{NN}}^{\text{byte}}(b_k \mid \pi_{\lt k}) = \frac{\sum_{v \in C(\pi_{\lt k} b_k)} p_{\text{NN}}(v \mid v_{\lt t})}{\sum_{v \in C(\pi_{\lt k})} p_{\text{NN}}(v \mid v_{\lt t})}$$

where $C(\pi)$ denotes tokens whose byte sequences strictly extend $\pi$ (excluding tokens terminating exactly at $\pi$). This is a proper conditional distribution by construction.
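
For concreteness, here is a minimal, unoptimized Python sketch of this conditional. It is illustrative only, not the bundled fast_score.py: `vocab` (token id → bytes) and `p_token` (next-token probabilities) are hypothetical names, and a real implementation would walk a byte trie once per position instead of scanning the vocabulary per byte.

```python
def byte_conditional(p_token, vocab, prefix: bytes, next_byte: int) -> float:
    """P(next_byte | prefix): token mass strictly extending prefix+byte,
    normalized by token mass strictly extending prefix."""
    def mass(pi: bytes) -> float:
        # C(pi): tokens whose byte sequence strictly extends pi
        return sum(p for tok, p in zip(vocab, p_token)
                   if tok.startswith(pi) and len(tok) > len(pi))

    denom = mass(prefix)
    if denom <= 0.0:
        raise ValueError("no continuable mass under this prefix")
    return mass(prefix + bytes([next_byte])) / denom
```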

Mixture

$$p_{\text{mix}}(b) = (1 - \lambda) \cdot p_{\text{NN}}^{\text{byte}}(b) + \lambda \cdot p_{\text{PPM}}(b)$$

$$\lambda = \begin{cases} 0.90 & \text{if } \max_b p_{\text{PPM}}(b) \geq 0.90 \\ 0.05 & \text{otherwise} \end{cases}$$

Full-val BPB

$$\text{BPB}_{\text{byte}} = \frac{\sum_{i=1}^{B} -\log_2 p_{\text{mix}}(b_i \mid h_{\lt i})}{B} = \frac{\text{total NLL bits}}{151{,}078{,}222} = 1.5221$$
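
A compact sketch of how the gated mixture and the byte-level accumulation fit together (again illustrative; `p_nn_byte` and `p_ppm` are hypothetical callables returning 256-way byte distributions for the history seen so far):

```python
import math

def mixture_bpb(data: bytes, p_nn_byte, p_ppm) -> float:
    """Accumulate -log2 of the gated mixture over every byte, divide by B."""
    total_bits = 0.0
    for i, b in enumerate(data):
        nn = p_nn_byte(data[:i])                   # trie-marginalized neural dist
        ppm = p_ppm(data[:i])                      # PPM-D (with exclusion) dist
        lam = 0.90 if max(ppm) >= 0.90 else 0.05   # confidence gate on PPM-D
        total_bits -= math.log2((1 - lam) * nn[b] + lam * ppm[b])
    return total_bits / len(data)
```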

Why neural_only_bpb (1.5430) ≠ token-level sliding BPB (1.0830)

Token-level BPB distributes each token's cross-entropy uniformly across its bytes. Byte-level BPB asks "given the byte prefix emitted so far, what is the conditional probability of the next byte?" — a fundamentally harder task requiring the model to resolve within-token byte ambiguity.
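
A toy illustration with made-up numbers (not measured figures from either run):

```python
import math

# A 4-byte token predicted with p = 0.5 costs 1 bit at token level, i.e.
# 0.25 bits/byte after uniform spreading. At byte level the model must price
# each byte given the prefix; hypothetical per-byte conditionals of
# (0.6, 0.9, 0.9, 0.9) for the same token already cost noticeably more.
token_level = -math.log2(0.5) / 4                               # 0.25 bits/byte
byte_level = sum(-math.log2(p) for p in (0.6, 0.9, 0.9, 0.9)) / 4
print(token_level, byte_level)                                  # 0.25 vs ~0.30
```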


Comparison with PR #1905

Both this submission and PR #1905 (by @leon2k2k2k) independently discovered the same normalization invalidity in geometric-mean byte decomposition. Both implement correct trie-based conditional byte distributions. Yet the mixture effect diverges:

| Aspect | This submission (Path B) | PR #1905 |
| --- | --- | --- |
| PPM-D variant | With exclusion (order 5) | Without exclusion |
| Confidence gating | PPM-D confidence-based (both use PPM-D side) | PPM-D confidence-based |
| Mixture effect on own byte-level baseline | −0.021 BPB (improvement) | +0.038 BPB (degradation) |
| PPM-D helps? | Yes | No (hurts) |

Note: The raw neural baselines (1.5430 vs 1.08335) are not directly comparable — ours is byte-level, theirs appears to be token-level. The meaningful comparison is the direction of the mixture effect on each submission's own baseline.

Both independently confirmed: uniform-spread (geometric mean) byte decomposition is NOT a valid probability distribution (sums > 1). The key difference is PPM-D with exclusion (ours) vs without (theirs) — exclusion produces sharper predictions and provably normalizes.
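
A tiny counterexample makes the invalidity concrete, using a hypothetical two-token vocabulary:

```python
# Hypothetical vocabulary {"ab": 0.5, "ac": 0.5}. Uniform spreading assigns
# each byte of a token the value p ** (1/len). The mass this implies for the
# single first byte 'a' is sqrt(0.5) + sqrt(0.5) ~= 1.414 > 1, so the implied
# per-position byte "distribution" cannot sum to 1.
tokens = {b"ab": 0.5, b"ac": 0.5}
mass_a = sum(p ** (1.0 / len(t)) for t, p in tokens.items() if t[0] == ord("a"))
print(mass_a)  # 1.4142... -> not a valid probability distribution
```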


Score-First Legal TTT

| PR | Contribution |
| --- | --- |
| #461 (MERGED) | Introduced score-first TTT, proved legal under Issue #1017 C1–C4 |
| #549 (MERGED) | Extended score-first TTT |
| #1735 | Parallel TTT, 21 epochs in eval budget |
| #1851 | SmearGate BOS + score-first TTT, post-TTT BPB 1.06128 |
| #1868 | Clean neural baseline |
| #1876 | Coprime-Stride + Full GPTQ + Score-First TTT, BPB 1.08008 |
| #1881 | PPM-D mixture 0.9019 BPB (invalid uniform-spread) |

Path A: Computationally Intractable

Path A (token-normalized PPM mixture) required O(V=8192) PPM-D evaluations per token position — projected at ~38 days CPU-only. Even with C++ backend (17-50× speedup), it exceeds practical budgets. Archived with full materials.
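
A back-of-envelope check of that projection; the position count and per-evaluation cost below are assumptions for illustration, not measured figures:

```python
V = 8192                   # one PPM-D evaluation per candidate token
n_positions = 38_000_000   # assumed full-val token count
per_eval_s = 1e-5          # assumed CPU cost of one PPM-D token walk
days = V * n_positions * per_eval_s / 86_400
print(f"~{days:.0f} days") # ~36 days on these assumptions, same order as ~38
```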


Red-team investigation

The full formal legality proof (docs/legality/ppmd-legality-proof.md) includes 6 theorems:

  • Theorems 1, 4, 5, 6: ✅ Verified (PPM-D normalization, score-before-update, denominators, coverage)
  • Theorems 2, 3: ❌ Disproved (geometric-mean neural bytes, mixture — NOT valid distributions)

Full-val legality proof: docs/legality/ppmd-legality-proof-fullval-result.md
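
For intuition on the Theorem 1 direction, here is a minimal numeric check of method-D normalization for a single context, assuming a uniform zero-order fallback in place of the full lower-order recursion:

```python
# PPM method D for one context: p(s) = (c(s) - 0.5) / n for seen symbols,
# p(escape) = d / (2n), with n = total count and d = distinct seen symbols.
counts = {ord("a"): 3, ord("b"): 1}                   # hypothetical context stats
n, d = sum(counts.values()), len(counts)
seen = {s: (c - 0.5) / n for s, c in counts.items()}
escape = d / (2 * n)
unseen = [b for b in range(256) if b not in counts]
fallback = {b: escape / len(unseen) for b in unseen}  # uniform stand-in for lower orders
print(sum(seen.values()) + sum(fallback.values()))    # 1.0 (up to float rounding)
```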


Files changed

  • README.md — Major update: headline full-val result, formal math, PR #1905 comparison, score-first TTT evidence
  • submission.json — Added full-val Path B fields
  • results/exp_1876_ppmd/path_b_prod_8gpu_fullval_local_score/path_b_sliding_full.json — Full-val result JSON
  • docs/legality/ppmd-legality-proof-fullval-result.md — Full-val legality proof
  • scripts/fast_score.py — Fast scoring utility
  • docs/path_a_archive/ — Archived Path A materials with intractability note

Artifact: 15,975,706 bytes (model) + 20,220 bytes (code) = 15,995,926 bytes (under 16 MB cap)

3-seed reproduction of PR openai#1851 (SmearGate BOS document boundary fix).
Code is byte-identical to openai#1851 by @aquariouseworkman.

Results (post-TTT BPB):
  Seed 42:   1.06128  (original openai#1851 author)
  Seed 314:  1.06087  (this submission)
  Seed 1234: 1.06220  (this submission)
  Mean:      1.06145 ± 0.00068

All artifacts < 16,000,000 bytes. All runs < 600s.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…76 model

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add non-record submission documenting the formal legality framework for
score-first PPM-D mixtures with neural language models.

Key contributions:
- Formal 6-theorem legality proof: geometric-mean byte decomposition
  of token probabilities does NOT yield a proper distribution; PPM-D
  requires exclusion normalization; prior _ppm_mixture_bpb is invalid
- Audited Path B first-8M-token subset result:
    mixture_bpb=1.5459  neural_only_bpb=1.5619  claim_ready=true
- Path A (token-normalized, C++/CUDA backend) and Path B (byte-trie
  marginalization) as two constructive correction approaches
- Full evidence bundle: 35 files incl. evaluator scripts, audit JSONs,
  machine-checkable test suite, plan docs, and production training log
- No overclaim: val_bpb=null; full-val CPU postpass projects ~6–9 hr
- 8xH100 training artifact: exp_1876, 11-layer 512d SP8192 transformer
  with depth recurrence, 4590 steps, 15.99MB artifact (within 16MB cap)

Lineage: openai#1851 -> openai#1868 -> openai#1873 -> openai#1876 / openai#1877 / Issue openai#1872
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… 1.5221

- Add full-val Path B result (151,078,222 bytes, claim_ready=true)
- Add formal mathematical description of byte-level vs token-level BPB
- Add comparison with PR openai#1905 (independent normalization invalidity discovery)
- Add Score-First Legal TTT evidence section (PRs openai#461, openai#549, openai#1735, openai#1851)
- Archive Path A as computationally intractable
- Bundle fast_score.py and full-val legality proof
- Fix trie marginalization formula to reflect continuable mass implementation
- Update submission.json with full-val fields

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@Christopher-Lee-McClendon Christopher-Lee-McClendon changed the title Non-record: Framework for Legal Score-First PPM-D Mixtures Non-record: Audited Byte-Level Neural/PPM-D Mixture BPB = 1.5221 (Full Validation) — Framework for Legal Score-First PPM-D Mixtures Apr 29, 2026
GitHub markdown parser treats <k, <t, <i in LaTeX subscripts as HTML
tags. Replace with \lt to render correctly.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…an 1.06141

Re-ran all 3 seeds (42, 314, 1234) with GPTQ_RESERVE_SECONDS=8.0 (was 0.5)
to ensure GPTQ hessian collection completes within the 600s training budget.

Code changes:
- Serialize artifact immediately after training (before diagnostic eval)
- Added timing instrumentation (serialize_wallclock, GPTQ sub-timings)

Results (all seeds fresh re-run on RunPod 8×H100 SXM):
  Seed 42:   post-TTT BPB = 1.06083, artifact = 15,949,701, eval = 525.5s
  Seed 314:  post-TTT BPB = 1.06091, artifact = 15,951,777, eval = 429.5s
  Seed 1234: post-TTT BPB = 1.06249, artifact = 15,951,968, eval = 481.2s
  3-seed mean: 1.06141 ± 0.00093

Compliance: training loop ends at ~592s, GPTQ hessians end at ~595.5s (<600s).
RunPod cost: ~$31.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>