Non-record: Audited Byte-Level Neural/PPM-D Mixture BPB = 1.5221 (Full Validation) — Framework for Legal Score-First PPM-D Mixtures#1916
Open
Christopher-Lee-McClendon wants to merge 8 commits into openai:main from
3-seed reproduction of PR openai#1851 (SmearGate BOS document boundary fix). Code is byte-identical to openai#1851 by @aquariouseworkman.

Results (post-TTT BPB):
- Seed 42: 1.06128 (original openai#1851 author)
- Seed 314: 1.06087 (this submission)
- Seed 1234: 1.06220 (this submission)
- Mean: 1.06145 ± 0.00068

All artifacts < 16,000,000 bytes. All runs < 600s.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…76 model

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add non-record submission documenting the formal legality framework for
score-first PPM-D mixtures with neural language models.
Key contributions:
- Formal 6-theorem legality proof: geometric-mean byte decomposition
of token probabilities does NOT yield a proper distribution; PPM-D
requires exclusion normalization; prior _ppm_mixture_bpb is invalid
- Audited Path B first-8M-token subset result:
mixture_bpb=1.5459 neural_only_bpb=1.5619 claim_ready=true
- Path A (token-normalized, C++/CUDA backend) and Path B (byte-trie
marginalization) as two constructive correction approaches
- Full evidence bundle: 35 files incl. evaluator scripts, audit JSONs,
machine-checkable test suite, plan docs, and production training log
- No overclaim: val_bpb=null; full-val CPU postpass projects ~6–9 hr
- 8xH100 training artifact: exp_1876, 11-layer 512d SP8192 transformer
with depth recurrence, 4590 steps, 15.99MB artifact (within 16MB cap)
Lineage: openai#1851 -> openai#1868 -> openai#1873 -> openai#1876 / openai#1877 / Issue openai#1872
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… 1.5221
- Add full-val Path B result (151,078,222 bytes, claim_ready=true)
- Add formal mathematical description of byte-level vs token-level BPB
- Add comparison with PR openai#1905 (independent normalization invalidity discovery)
- Add Score-First Legal TTT evidence section (PRs openai#461, openai#549, openai#1735, openai#1851)
- Archive Path A as computationally intractable
- Bundle fast_score.py and full-val legality proof
- Fix trie marginalization formula to reflect continuable mass implementation
- Update submission.json with full-val fields

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
GitHub markdown parser treats <k, <t, <i in LaTeX subscripts as HTML tags. Replace with \lt to render correctly. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…an 1.06141

Re-ran all 3 seeds (42, 314, 1234) with GPTQ_RESERVE_SECONDS=8.0 (was 0.5) to ensure GPTQ hessian collection completes within the 600s training budget.

Code changes:
- Serialize artifact immediately after training (before diagnostic eval)
- Added timing instrumentation (serialize_wallclock, GPTQ sub-timings)

Results (all seeds fresh re-run on RunPod 8×H100 SXM):
- Seed 42: post-TTT BPB = 1.06083, artifact = 15,949,701, eval = 525.5s
- Seed 314: post-TTT BPB = 1.06091, artifact = 15,951,777, eval = 429.5s
- Seed 1234: post-TTT BPB = 1.06249, artifact = 15,951,968, eval = 481.2s
- 3-seed mean: 1.06141 ± 0.00093

Compliance: training loop ends at ~592s, GPTQ hessians end at ~595.5s (<600s). RunPod cost: ~$31.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Key Results
- claim_ready=true
- PPM-D mixture gain: −0.021 BPB over the neural-only byte-level baseline (1.5430 → 1.5221)

Note: this result is sliding-window only; no corrected Path B TTT result exists yet.
This is a non-record methodology submission documenting a rigorous framework for computing valid byte-level BPB with neural + PPM-D mixtures. The headline result is a full-validation, provably normalized, audited byte-level mixture BPB of 1.5221 — the first such audited number in the contest.
What's new in this update
- Full-validation Path B result (previously only the 8M-token subset):
  - mixture_bpb = 1.5221 (was 1.5459 on 8M subset)
  - neural_only_bpb = 1.5430 (was 1.5619 on 8M subset)
- Comparison with PR #1905 (Report: PPM-D byte-level scoring is not a valid probability distribution, and why it appears to gain) by @leon2k2k2k
- Formal mathematical description of byte-level vs token-level BPB metrics
- Path A archived as computationally intractable (O(V=8192) per position)
- Score-First Legal TTT evidence section citing:
  - PR #461: Non-record: 11L Depth Recurrence + High-Yield Legal TTT (1.14458 BPB)
  - PR #549: Record: LeakyReLU² + Legal Score-First TTT + Parallel Muon — val_bpb 1.1194 (3-seed mean)
  - PR #1735: Record: SP8192 + Parallel Pre-Quant TTT — val_bpb 1.0429 (3-seed mean)
  - PR #1851: Record: val_bpb = 1.06128 SmearGate BOS Fix + PR #1787 Base + Smear Gate + LQER Asymmetric + Phased TTT (indirect 3-seed mean)
  - PR #1868: Record: SmearGate BOS Fix 3-Seed Compliance Re-run — val_bpb 1.06141 (3-seed mean)
  - PR #1876: Record: Coprime-Stride Loader + Full GPTQ + Score-First TTT — val_bpb 1.08008 (3-seed mean)
Formal Math
Token-Level BPB (standard contest metric)
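In symbols, with $t_i$ the $i$-th token and $N_{\mathrm{bytes}}$ the total validation byte count (notation assumed here; the original display equation did not survive), the standard metric is:

```math
\mathrm{BPB}_{\mathrm{token}} = -\frac{1}{N_{\mathrm{bytes}}} \sum_{i} \log_2 p\left(t_i \mid t_{\lt i}\right)
```

Each token's cross-entropy is charged once but normalized by the total byte count, i.e. spread uniformly across the token's bytes.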
Byte-Level Neural BPB (Path B — trie marginalization over continuable mass)
where $C(\pi)$ denotes tokens whose byte sequences strictly extend $\pi$ (excluding tokens terminating exactly at $\pi$). This is a proper conditional distribution by construction.
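With $C(\pi)$ as defined above, writing $t_{|\pi|+1}$ for the byte of token $t$ immediately after the prefix and $\mathrm{ctx}$ for the token-level context, one consistent way to state this conditional (a reconstruction, not copied from the PR) is:

```math
p(b \mid \pi, \mathrm{ctx}) = \frac{\sum_{t \in C(\pi),\ t_{|\pi|+1} = b}\; p(t \mid \mathrm{ctx})}{\sum_{t \in C(\pi)}\; p(t \mid \mathrm{ctx})}
```

Summing the numerator over all bytes $b$ recovers the denominator exactly, which is why normalization holds by construction.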
Mixture
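The natural reading is a convex combination of the two byte-level conditionals (the weight symbol $\lambda$ is an assumed placeholder; the PR body omits the formula here):

```math
p_{\mathrm{mix}}(b \mid \pi) = \lambda\, p_{\mathrm{neural}}(b \mid \pi) + (1 - \lambda)\, p_{\mathrm{ppmd}}(b \mid \pi)
```

Since both components are proper conditional distributions over byte values, any $\lambda \in [0, 1]$ keeps the mixture proper.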
Full-val BPB
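Averaging the mixture's per-byte log loss over the full validation set ($N_{\mathrm{bytes}} = 151{,}078{,}222$ per the commit message):

```math
\mathrm{BPB} = -\frac{1}{N_{\mathrm{bytes}}} \sum_{i=1}^{N_{\mathrm{bytes}}} \log_2 p_{\mathrm{mix}}\left(b_i \mid b_{\lt i}\right) = 1.5221
```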
Why neural_only_bpb (1.5430) ≠ token-level sliding BPB (1.0830)
Token-level BPB distributes each token's cross-entropy uniformly across its bytes. Byte-level BPB asks "given the byte prefix emitted so far, what is the conditional probability of the next byte?" — a fundamentally harder task requiring the model to resolve within-token byte ambiguity.
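The within-token ambiguity can be made concrete with a toy sketch of the trie marginalization (the token set, probabilities, and function name are invented for illustration; the real evaluator walks a byte trie over the 8192-entry vocabulary):

```python
from collections import defaultdict

def byte_conditional(token_probs, prefix):
    """Next-byte distribution given byte prefix, marginalizing the neural
    next-token distribution over the 'continuable mass': tokens whose byte
    strings strictly extend the prefix."""
    cont = {t: p for t, p in token_probs.items()
            if t.startswith(prefix) and len(t) > len(prefix)}
    z = sum(cont.values())  # continuable mass (tokens ending exactly at prefix excluded)
    dist = defaultdict(float)
    for t, p in cont.items():
        dist[t[len(prefix)]] += p / z  # charge the byte right after the prefix
    return dict(dist)

# toy next-token distribution over byte-string tokens
probs = {"the": 0.5, "them": 0.2, "they": 0.2, "to": 0.1}
print(byte_conditional(probs, "the"))  # {'m': 0.5, 'y': 0.5}
print(sum(byte_conditional(probs, "th").values()))  # sums to 1 up to float rounding
```

Note how the token "the" drops out of the conditional at prefix "the": it terminates exactly there, so only the continuable mass ("them", "they") is renormalized.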
Comparison with PR #1905
Both this submission and PR #1905 (by @leon2k2k2k) independently discovered the same normalization invalidity in geometric-mean byte decomposition. Both implement correct trie-based conditional byte distributions. Yet the mixture effect diverges:
Both independently confirmed: uniform-spread (geometric mean) byte decomposition is NOT a valid probability distribution (sums > 1). The key difference is PPM-D with exclusion (ours) vs without (theirs) — exclusion produces sharper predictions and provably normalizes.
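The normalization failure is easy to reproduce on a toy vocabulary: uniform-spread decomposition charges each byte of token $t$ a mass of $p(t)^{1/|t|}$, and since $p^{1/n} \geq p$ for $p \leq 1$, the implied first-byte masses across the vocabulary sum past 1 (toy numbers, assumed for illustration):

```python
# Uniform-spread (geometric-mean) byte decomposition: each byte of token t
# is assigned p(t) ** (1 / len(t)). Summed over the vocabulary, the implied
# first-byte masses exceed 1 -- not a valid probability distribution.
probs = {"the": 0.5, "them": 0.2, "they": 0.2, "to": 0.1}  # sums to 1.0
per_byte = {t: p ** (1.0 / len(t)) for t, p in probs.items()}
total = sum(per_byte.values())
print(f"{total:.3f}")  # 2.447 -- the 'distribution' leaks mass
```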
Score-First Legal TTT
Path A: Computationally Intractable
Path A (token-normalized PPM mixture) required O(V=8192) PPM-D evaluations per token position — projected at ~38 days CPU-only. Even with the C++ backend (17–50× speedup), it exceeds practical budgets. Archived with full materials.
Red-team investigation
The full formal legality proof (docs/legality/ppmd-legality-proof.md) includes 6 theorems.

Full-val legality proof: docs/legality/ppmd-legality-proof-fullval-result.md

Files changed
- README.md — Major update: headline full-val result, formal math, PR #1905 comparison, score-first TTT evidence
- submission.json — Added full-val Path B fields
- results/exp_1876_ppmd/path_b_prod_8gpu_fullval_local_score/path_b_sliding_full.json — Full-val result JSON
- docs/legality/ppmd-legality-proof-fullval-result.md — Full-val legality proof
- scripts/fast_score.py — Fast scoring utility
- docs/path_a_archive/ — Archived Path A materials with intractability note

Artifact: 15,975,706 bytes (model) + 20,220 bytes (code) = 15,995,926 bytes (under 16 MB cap)