
Record: SP8192 + Order-6 Strict Full-Val Byte PPM — 0.96255 BPB (3-seed mean)#1877

Open
someone114514 wants to merge 1 commit into openai:main from someone114514:record-sp8192-order6-strict-byte-ppm-0427

Conversation

@someone114514

SP8192 + Order-6 Strict Full-Val Byte PPM

val_bpb = 0.96255 (3-seed mean, std 0.00047) | 15.997 MB mean artifact | 8xH100 SXM

This submission keeps the SP8192 recurrence / parallel-residual / QK-gain base stack and replaces the prior order-4 PPM setting with a strict full-validation order-6 byte-level PPM mixture at eval time. The PPM state is built online from the already-scored byte prefix, then updated only after each byte is scored.

Results

Seed   Post-EMA BPB   PPM BPB      Artifact bytes   Eval time
42     1.08754884     0.96261595   15,996,904       474.016s
7      1.08763287     0.96298648   15,999,992       464.055s
1337   1.08663175     0.96205812   15,994,492       463.261s
Mean   1.08727115     0.96255352   15,997,129       467.111s
Std    0.00055533     0.00046732   2,757            5.993s

The best seed is 1337 at 0.96205812 BPB. The largest observed total submission size is 15,999,992 bytes, still under the 16,000,000 byte cap.

Method

The eval path first computes the normal sliding-window neural-network NLLs with stride 64. It then converts the scored token stream into byte contributions and mixes the NN byte probability with an order-6 byte PPM-D probability:

p_mix = lambda * p_nn + (1 - lambda) * p_ppm

The gate is binary and prefix-only: when the PPM's longest-context top-symbol confidence is at least PPM_CONF_THRESHOLD = 0.9, lambda drops to PPM_LAMBDA_LO = 0.05 and the PPM term dominates; otherwise lambda = PPM_LAMBDA_HI = 0.9 and the NN dominates.
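The gated mixture can be sketched in a few lines of Python. This is a minimal illustration, assuming lambda is the NN weight in p_mix and that the confident branch takes PPM_LAMBDA_LO; the submitted native implementation is not shown in this PR.

```python
import math

PPM_LAMBDA_HI = 0.9        # NN weight when the PPM context is not confident
PPM_LAMBDA_LO = 0.05       # NN weight when PPM's top symbol is confident
PPM_CONF_THRESHOLD = 0.9

def mixed_logprob(p_nn, p_ppm, ppm_top_conf):
    """Binary-gated probability-space mixture for one byte.

    p_nn         -- NN probability of the observed byte
    p_ppm        -- order-6 PPM probability of the observed byte
    ppm_top_conf -- longest-context top-symbol probability, computed
                    from the prefix only (it never peeks at the byte)
    """
    lam = PPM_LAMBDA_LO if ppm_top_conf >= PPM_CONF_THRESHOLD else PPM_LAMBDA_HI
    return math.log(lam * p_nn + (1.0 - lam) * p_ppm)
```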

Setting               Value
PPM_ORDER             6
PPM_LAMBDA_HI         0.9
PPM_LAMBDA_LO         0.05
PPM_CONF_THRESHOLD    0.9
PPM_LOG_CACHE_SIZE    1048576
SKIP_QUANTIZED_EVAL   1
SLIDING_BATCH_SEQS    32

Order 6 was selected after full-val checks. Order 7 and order 8 were slower and worse on seed 42, so they are not part of the submitted result.
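For reference, a byte PPM-D estimator of the kind described can be sketched as a minimal, unoptimized Python class with method-D escapes and exclusions. The class name and prob/update interface are my own illustrative choices, not the submitted native code.

```python
class BytePPMD:
    """Order-k byte PPM with method-D escapes and exclusions.

    Method D: in a context with n total counts and d distinct symbols,
    a symbol seen c times gets p = (2c - 1) / (2n) and the escape gets
    p = d / (2n); escaping falls through to shorter contexts, ending in
    a uniform order minus-one model over the not-yet-excluded bytes.
    """

    def __init__(self, order=6):
        self.order = order
        self.counts = {}                      # context bytes -> {byte: count}

    def prob(self, history, sym):
        """Probability of byte `sym` given only the already-scored prefix."""
        esc = 1.0
        excluded = set()
        for k in range(min(self.order, len(history)), -1, -1):
            table = self.counts.get(bytes(history[len(history) - k:]))
            if not table:
                continue                      # unseen context: free escape
            avail = {s: c for s, c in table.items() if s not in excluded}
            if not avail:
                continue
            n = sum(avail.values())
            if sym in avail:
                return esc * (2 * avail[sym] - 1) / (2 * n)
            esc *= len(avail) / (2 * n)       # escape to the next-shorter context
            excluded.update(avail)
        return esc / (256 - len(excluded))    # order -1: uniform over the rest

    def update(self, history, sym):
        """Called strictly AFTER `sym`'s log-probability is recorded."""
        for k in range(min(self.order, len(history)) + 1):
            table = self.counts.setdefault(bytes(history[len(history) - k:]), {})
            table[sym] = table.get(sym, 0) + 1
```

With exclusions handled this way, the probabilities over all 256 byte values sum to 1 at every position, which is what the normalization bullet in the compliance list requires of the PPM side.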

Compliance

  • Causal scoring: both NN scoring and PPM scoring use only the prefix available before the current byte.
  • Score before update: PPM counts are updated after the byte's mixed log-probability is recorded.
  • Single pass: validation bytes are scored once in order; there is no rescoring or best-of-run selection.
  • Normalized distribution: PPM-D produces a valid byte distribution and the mixture is performed in probability space.
  • Full validation: submitted scores use the full validation stream, not a subset.
  • No SLOT, no TTT, no ETLB, and no n-gram cache in the submitted packed artifact.
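The causal-scoring and score-before-update bullets reduce to an eval loop of the following shape. This is a schematic sketch with stand-in probability callables and a fixed lambda (eliding the binary gate), not the submitted eval path.

```python
import math

def score_stream(data, nn_prob, ppm_prob, ppm_update=lambda prefix, sym: None,
                 lam=0.9):
    """Single causal pass over `data` (bytes), returning bits per byte.

    nn_prob / ppm_prob map (prefix, sym) -> probability of `sym`;
    ppm_update runs strictly after each byte is scored, so the PPM
    state never sees a byte before it has been charged for it.
    """
    total_bits = 0.0
    for i, sym in enumerate(data):
        prefix = data[:i]                                 # already-scored bytes
        p = lam * nn_prob(prefix, sym) + (1.0 - lam) * ppm_prob(prefix, sym)
        total_bits -= math.log2(p)                        # score first ...
        ppm_update(prefix, sym)                           # ... update after
    return total_bits / len(data)
```

With both callables uniform over bytes, the loop reproduces the expected 8.0 BPB baseline, which is a quick sanity check on the accounting.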

Reproduce

RUN_ID=strict_ppm_order6_seed42 \
SEED=42 \
PPM_ENABLED=1 \
PPM_NATIVE_ENABLED=1 \
PPM_ORDER=6 \
PPM_LAMBDA_HI=0.9 \
PPM_LAMBDA_LO=0.05 \
PPM_CONF_THRESHOLD=0.9 \
PPM_LOG_CACHE_SIZE=1048576 \
SKIP_QUANTIZED_EVAL=1 \
SLIDING_BATCH_SEQS=32 \
torchrun --standalone --nproc_per_node=8 \
  records/track_10min_16mb/2026-04-27_SP8192_Order6StrictBytePPM/train_gpt.py

Change SEED and RUN_ID to reproduce the other two logs.

@sharpobject

If you score all token ids at a given token-wise position in the document, do the probabilities for all of these token ids given by the mix of the byte-wise PPM and the token-wise NN sum to 1? (hint: no)
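A toy numeric example (hypothetical two-token alphabet, not from the submission) makes the objection concrete: the token-level NN normalizes over the token alphabet, but the byte model's mass restricted to byte strings that spell valid tokens does not, so the mixture is sub-normalized.

```python
tokens = ["a", "bb"]                  # toy token alphabet
p_nn = {"a": 0.6, "bb": 0.4}          # token NN: sums to 1 over tokens
p_byte = {"a": 0.5, "b": 0.5}         # i.i.d. byte model for simplicity

def p_ppm(tok):
    """Byte-model probability of the token's byte spelling."""
    p = 1.0
    for ch in tok:
        p *= p_byte[ch]
    return p

lam = 0.5
total = sum(lam * p_nn[t] + (1 - lam) * p_ppm(t) for t in tokens)
# total is about 0.875, not 1: byte-model mass on strings like "ba"
# that spell no token is simply lost from the token-wise mixture
```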

GodlyDonuts added a commit to GodlyDonuts/parameter-golf that referenced this pull request Apr 28, 2026
…olar Express NS + MIN_LR + LQER)

Triage of 5 new PRs the user surfaced (1858, 1852, 1855, 1874, 1877):
- openai#1852: hard rule violation (pre-quant TTT on validation data).
- openai#1858: eval subset (8M of 40.5M tokens), reviewer caught and author admitted.
- openai#1877: broken normalization (byte PPM × token NN doesn't sum to 1 over
  token alphabet), reviewer @sharpobject caught.
- openai#1855: techniques mostly legit but apt-get install lrzip violates Issue
  openai#1017 Rule 3 (artifact must be self-contained).
- openai#1874: LEGITIMATE - 3-seed mean 1.06766, std 0.00076, three orthogonal
  training-time techniques citing prior validated PRs. If it merges,
  our submission threshold shifts from 1.0760 to ~1.0627.

PR openai#1874's three techniques:
1. Polar Express NS coefficients (PR openai#1344) - 5 minimax-tuned tuples
   replace the fixed (3.4445, -4.775, 2.0315) at MUON_BACKEND_STEPS=5.
2. MIN_LR=0.10 warmdown floor (PR openai#1787) - LR floors at 10% of max
   instead of decaying to 0. Already wired in our v1+; just env-var
   opt-in.
3. LQER asymmetric int4 rank-4 quantization correction (PR openai#1797) -
   SVD on top-K=3 highest-error GPTQ residuals, packed as int4
   per-group-64 asymmetric. ~200-400 LOC; deferred to v4.

train_gpt_v3.py implements (1) and exposes (2):
- POLAR_EXPRESS_NS=0 default (byte-for-byte SOTA when off).
- _PE_COEFFS module-level constant + _POLAR_EXPRESS_NS flag read at
  import time so torch.compile sees them as constants.
- zeropower_via_newtonschulz5 branches on _POLAR_EXPRESS_NS to use
  per-iteration coefficients instead of fixed.
- MIN_LR was already an env var; setting MIN_LR=0.10 at runtime opts in.

Sizes: v3 raw 54,977 lzma 15,128 (+272 vs v2, +1,880 vs SOTA).
Worst-seed artifact slack: ~4,888 bytes under cap. Tight but workable.

AST-validated on Python 3.13 (macOS) and 3.12 (Vultr Linux).

Stacking projection (single-seed):
- Phase 0 baseline:       1.08038
- + LR=0.010 (Stage 2):   1.08021
- + Polar Express NS:     1.0787-1.0797
- + MIN_LR=0.10:          1.0777-1.0794
- + ConfTTT (PR openai#1879):   1.0772-1.0793
- + LQER (v4 work):       1.0742-1.0783
- + Phase 2 architecture: 1.0712-1.0773
- + Newton-Muon Stage E:  1.066-1.075

Path B (absorb-and-stack) recommended over Path A (race-to-merge-with-current-stack)
since current stack alone doesn't clear 1.0760.

Race awareness: openai#1874, openai#1855 (lrzip-stripped), and openai#1797 are all open.
Whichever merges first becomes new SOTA and our threshold tightens.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
