Record: DualClock PPM token-context mixture — 1.0803 BPB seed 42 #1879
okezue wants to merge 12 commits into openai:main from …
Conversation
@cocohearts @willdepue could you take a look when you get a chance? This is a record submission at 1.0803 BPB seed 42, beating the prior 1.0810 leader. The full normalized-distribution + score-before-update legality is preserved per the rules (details in the PR description). Happy to run additional seeds if useful.
…ee submissions

Reaction to okezue's PR openai#1879 (DualClockPPM 1.0803 single-seed). Corrects two things in Claude's prior take:

1. PR openai#1879 contains FOUR submissions, not one. The 6,817 LOC is spread across DualClockPPM (624), NGM-Hedge (526), ConfTTT (485), and PSO Persistent Spectral (1,609), plus READMEs and a courtesy decompressed dump of bigbag's PR openai#1493 source. The largest single algo addition is 1,609 LOC.
2. DualClockPPM's own README explicitly discloses neural-only = 1.080301 in the same run vs mixed = 1.080326. The PPM mixture is slightly *negative*, not "essentially zero." Confirms the hard skip on DualClock and the related NGM-Hedge.

Free calibration data: okezue's neural-only seed-42 = 1.080301 is an independent reproduction of PR openai#1493 with default TTT_LR=0.005, within 0.00008 of our Phase 0 reproduction (1.080382). Validates our setup and quantifies our LR=0.010 Stage 2 win as a real -0.0008 nat improvement over the SOTA-default reproduction floor.

Cherry-pick recommendation: ConfTTT - confidence-weighted TTT that weights per-token cross-entropy by score-pass NLL (focal-loss / hard-example mining for the TTT inner loop; a sketch follows this message). ~25 LOC, eval-time only, score-first compliant. Stacks with our LR=0.010 winner. Smoke test cost: $3 for one eval-only run on the saved Phase 0 artifact.

Skip list:
- DualClockPPM and NGM-Hedge: same family; the neural model dominates the mixture, so weights collapse to neural-only.
- PSO Persistent Spectral: same class as our Newton-Muon. Newton-Muon has Modded-NanoGPT empirical validation at our scale; PSO has only theorems. Keep our horse.

Threat model: PR openai#1879 doesn't break the 0.005 record threshold itself, but if DualClock 3-seed validates and merges, our submission threshold shifts to ~1.0753 (-0.001 worse for us). okezue is iterating fast (4 submissions in one PR, today). Move quickly; if we land first the race is moot.

Includes the exact 5-chunk patch (Hyperparameters + 4 sites in eval_val_ttt) ready to drop into Claude/train_gpt.py.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
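For reference, a minimal sketch of the ConfTTT idea as described above — not okezue's actual code from the PR; the function name, the weight normalization, and the epsilon are illustrative assumptions:

```
import torch
import torch.nn.functional as F

def conf_weighted_ttt_loss(logits, targets, score_nll):
    """Confidence-weighted TTT inner-loop loss (sketch).

    Weights each token's cross-entropy by its NLL from the no-grad
    score pass, so the inner-loop update focuses on hard examples
    (focal-loss-style hard-example mining).
    """
    per_tok_ce = F.cross_entropy(
        logits.view(-1, logits.size(-1)), targets.view(-1), reduction="none"
    )
    # score_nll comes from the legality-preserving score pass (already
    # detached); normalize so the weighting keeps the loss scale.
    w = score_nll.view(-1)
    w = w / (w.mean() + 1e-8)
    return (w * per_tok_ce).mean()
```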
…olar Express NS + MIN_LR + LQER)

Triage of 5 new PRs the user surfaced (1858, 1852, 1855, 1874, 1877):
- openai#1852: hard rule violation (pre-quant TTT on validation data).
- openai#1858: eval subset (8M of 40.5M tokens); reviewer caught it and the author admitted.
- openai#1877: broken normalization (byte PPM × token NN doesn't sum to 1 over the token alphabet); reviewer @sharpobject caught it.
- openai#1855: techniques mostly legit, but apt-get install lrzip violates Issue openai#1017 Rule 3 (artifact must be self-contained).
- openai#1874: LEGITIMATE - 3-seed mean 1.06766, std 0.00076, three orthogonal training-time techniques citing prior validated PRs. If it merges, our submission threshold shifts from 1.0760 to ~1.0627.

PR openai#1874's three techniques:
1. Polar Express NS coefficients (PR openai#1344) - 5 minimax-tuned tuples replace the fixed (3.4445, -4.775, 2.0315) at MUON_BACKEND_STEPS=5.
2. MIN_LR=0.10 warmdown floor (PR openai#1787) - LR floors at 10% of max instead of decaying to 0. Already wired in our v1+; just env-var opt-in.
3. LQER asymmetric int4 rank-4 quantization correction (PR openai#1797) - SVD on the top-K=3 highest-error GPTQ residuals, packed as int4 per-group-64 asymmetric. ~200-400 LOC; deferred to v4.

train_gpt_v3.py implements (1) and exposes (2); a sketch of the NS branch follows this message:
- POLAR_EXPRESS_NS=0 default (byte-for-byte SOTA when off).
- _PE_COEFFS module-level constant + _POLAR_EXPRESS_NS flag read at import time so torch.compile sees them as constants.
- zeropower_via_newtonschulz5 branches on _POLAR_EXPRESS_NS to use per-iteration coefficients instead of fixed ones.
- MIN_LR was already an env var; setting MIN_LR=0.10 at runtime opts in.

Sizes: v3 raw 54,977, lzma 15,128 (+272 vs v2, +1,880 vs SOTA). Worst-seed artifact slack: ~4,888 bytes under cap. Tight but workable. AST-validated on Python 3.13 (macOS) and 3.12 (Vultr Linux).

Stacking projection (single-seed):
- Phase 0 baseline: 1.08038
- + LR=0.010 (Stage 2): 1.08021
- + Polar Express NS: 1.0787-1.0797
- + MIN_LR=0.10: 1.0777-1.0794
- + ConfTTT (PR openai#1879): 1.0772-1.0793
- + LQER (v4 work): 1.0742-1.0783
- + Phase 2 architecture: 1.0712-1.0773
- + Newton-Muon Stage E: 1.066-1.075

Path B (absorb-and-stack) recommended over Path A (race-to-merge-with-current-stack), since the current stack alone doesn't clear 1.0760.

Race awareness: openai#1874, openai#1855 (lrzip-stripped), and openai#1797 are all open. Whichever merges first becomes the new SOTA and our threshold tightens.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
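A sketch of the zeropower_via_newtonschulz5 branch described above, assuming the standard quintic Muon iteration. The five tuned Polar Express tuples live in PR openai#1344 and aren't reproduced here, so _PE_COEFFS below is a placeholder; the real v3 reads a _POLAR_EXPRESS_NS flag at import time rather than taking a kwarg:

```
import torch

# Fixed quintic coefficients quoted in the triage above; the Polar Express
# replacement supplies one tuned (a, b, c) tuple per iteration.
_FIXED = (3.4445, -4.775, 2.0315)
_PE_COEFFS = [_FIXED] * 5  # placeholder: swap in the minimax tuples from openai#1344

def newtonschulz5(G: torch.Tensor, steps: int = 5, use_pe: bool = False,
                  eps: float = 1e-7) -> torch.Tensor:
    """Quintic Newton-Schulz orthogonalization; per-iteration coefficients
    when use_pe is set, the fixed tuple otherwise (illustrative sketch)."""
    X = G / (G.norm() + eps)            # normalize so the iteration converges
    transposed = X.size(0) > X.size(1)  # keep rows <= cols, as in Muon
    if transposed:
        X = X.mT
    for i in range(steps):
        a, b, c = _PE_COEFFS[i] if use_pe else _FIXED
        A = X @ X.mT
        B = b * A + c * (A @ A)
        X = a * X + B @ X
    return X.mT if transposed else X
```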
You need to be lower than 1.076 bpb (https://github.com/openai/parameter-golf#submission-process)
Update — broader context on PR #1879

Beyond the DualClock submission in this PR, I separately ran a study reproducing the open SOTA stack from PR #1855. Posting here just for context — the DualClock contribution and the reproduction work are independent.

Independent reproduction of PR #1855 on a fresh 8×H100 SXM pod (cu129 / torch 2.9.1 / FA3 cu129_torch291) with the env vars from #1855's hparam table plus …: 3-seed brotli mean 1.06043 vs PR #1855's reported 1.06108 — within 1σ. Posted the same data on #1855 as a reproduction confirmation.

The DualClock submission in this PR (1.0803 BPB seed 42) remains a separate contribution. To be fully transparent: the Bayesian mixer auto-collapses to ~99.95% on the neural model after the first few chunks, so DualClock is empirically neutral on a strong neural baseline — the safety guarantee holds, but the small-vocab PPM experts don't add information the neural model is missing. Documented as-is.
Summary
Score-first eval-time mixture over the neural model and two causal PPM experts (global stream + document-local) using a fixed-share Bayesian mixer. GPU-vectorized via FNV rolling hashes, hash-bucketed count tables, and a prefix-rank counter for chunk-local causal scoring.
Final val_bpb on seed 42: 1.080326 (8×H100 SXM, brotli artifact 15,977,914 bytes).
Folder: `records/track_10min_16mb/2026-04-27_DualClockPPM_TokenContextMixture/`
Result breakdown
Final mixture weights converged to `[0.99950, 0.00035, 0.00015]` over (neural, global PPM, local PPM).
Implementation
At evaluation time, every chunk is scored under `torch.no_grad()` before any TTT update on that chunk. As each chunk is scored, the per-token neural NLL is gathered across DDP ranks via `all_reduce(MIN)` on a buffer initialized to `+∞`: each rank scores a disjoint window subset, so the elementwise minimum reconstructs the full per-token vector, and the assembled score buffer is then consumed by a single deterministic mixture pass that runs identically on every rank.
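For concreteness, a minimal sketch of the gather trick, assuming `torch.distributed` is initialized and a model that returns `(1, T-1, V)` logits; the function name, `my_slice`, and the buffer shape are illustrative, not the submitted code:

```
import torch
import torch.distributed as dist

@torch.no_grad()
def gather_chunk_nll(model, chunk_tokens, my_slice):
    """Score-before-update gather (sketch).

    Each DDP rank fills only its disjoint window slice; everything else
    stays at +inf, so an elementwise MIN all_reduce reconstructs the full
    per-token NLL vector on every rank.
    """
    nll_buf = torch.full((chunk_tokens.numel() - 1,), float("inf"),
                         device=chunk_tokens.device)
    logits = model(chunk_tokens[None, :-1])  # assumed (1, T-1, V) output
    nll = torch.nn.functional.cross_entropy(
        logits[0], chunk_tokens[1:], reduction="none")
    nll_buf[my_slice] = nll[my_slice]
    dist.all_reduce(nll_buf, op=dist.ReduceOp.MIN)  # disjoint slices -> exact
    return nll_buf
```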
Per chunk, we construct two causal PPM experts. Each expert maintains, for every order `k = 1..K`, a hash-bucketed count table `cnt[k]` of shape `(B·V,)` (int32) holding continuation counts and a context-mass table `tot[k]` of shape `(B,)`. The order-`k` context for a target at chunk position `i` is the FNV-rolling hash of the last `k` source tokens, computed in a vectorized loop over `k`. The hash modulo `B` gives the bucket id; the resulting `(ctx_hash, target_token)` pair indexes into `cnt[k]`. Predictions are produced by the recursive Dirichlet-backoff smoothing chain `q_k(a) = (cnt_k(a) + α·q_{k-1}(a)) / (tot_k + α)` where `q_0` is a causal running unigram prior over the full SP8192 vocabulary, so each expert is by construction a fully normalized distribution over the official token alphabet.
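A minimal single-step sketch of one expert, using the shapes and the backoff chain from the paragraph above. The submitted kernel vectorizes hashing, scoring, and updates over whole chunks; the class name, the FNV-1a variant, and the +1 smoothing inside the unigram base `q_0` are assumptions for illustration:

```
import torch

FNV_PRIME, FNV_OFFSET = 0x100000001B3, 0xCBF29CE484222325  # standard 64-bit FNV-1a

class PPMExpert:
    """Hash-bucketed order-K PPM with Dirichlet backoff (sketch)."""

    def __init__(self, K, buckets, vocab, alpha, device="cuda"):
        self.K, self.B, self.V, self.alpha = K, buckets, vocab, alpha
        self.cnt = [torch.zeros(buckets * vocab, dtype=torch.int32, device=device)
                    for _ in range(K + 1)]           # cnt[k], k = 1..K used
        self.tot = [torch.zeros(buckets, dtype=torch.int32, device=device)
                    for _ in range(K + 1)]           # context mass per bucket
        self.uni = torch.zeros(vocab, device=device)  # running unigram counts

    def ctx_hash(self, src, k):
        # FNV-1a over the last k source tokens, mod B; the real code rolls
        # this hash across every chunk position at once.
        h = FNV_OFFSET
        for t in src[-k:].tolist():
            h = ((h ^ t) * FNV_PRIME) & 0xFFFFFFFFFFFFFFFF
        return h % self.B

    def predict(self, src):
        # q_0: causal running unigram over the full vocab (+1 smoothing assumed)
        q = (self.uni + 1.0) / (self.uni.sum() + self.V)
        for k in range(1, self.K + 1):     # backoff chain, low to high order
            b = self.ctx_hash(src, k)
            c = self.cnt[k][b * self.V:(b + 1) * self.V].float()
            q = (c + self.alpha * q) / (self.tot[k][b].float() + self.alpha)
        return q                            # sums to 1 over the token alphabet

    def update(self, src, tgt):
        # post-score update; the real code batches this with GPU index_add_
        self.uni[tgt] += 1
        for k in range(1, self.K + 1):
            b = self.ctx_hash(src, k)
            self.cnt[k][b * self.V + tgt] += 1
            self.tot[k][b] += 1
```

Note that each backoff step preserves normalization: the bucket's counts sum to `tot_k`, so `q_k` sums to `(tot_k + α) / (tot_k + α) = 1` whenever `q_{k-1}` does.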
Within a chunk, the strictly causal contribution from already-seen positions is enforced by a vectorized prefix-rank counter (`dc_prank`): for the `n` positions in the chunk it returns, for each position, the count of identical earlier-position keys, which is the chunk-local correction that lets us score the entire chunk in parallel without ever leaking a token's identity into its own predicted probability. After the chunk's score is finalized, the count tables and the unigram base prior are updated with that chunk's tokens via GPU `index_add`. The global expert never resets; the document-local expert resets at chunk boundaries as a doc-boundary proxy.
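A sketch of the `dc_prank` semantics, assuming each `(ctx_hash, target_token)` pair is packed into one int64 key per position. This O(n²) version just states the contract; the submitted kernel presumably uses a sort-based formulation:

```
import torch

def dc_prank(keys: torch.Tensor) -> torch.Tensor:
    """Prefix-rank counter (sketch): for each position i, the number of
    strictly earlier positions j < i with keys[j] == keys[i]. Adding this
    to the pre-chunk counts gives each position its causal within-chunk
    evidence without its own token contributing to its own prediction."""
    eq = (keys[:, None] == keys[None, :]).long()   # (n, n) equality matrix
    return torch.tril(eq, diagonal=-1).sum(dim=1)  # count earlier matches only
```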
The three experts (neural, global PPM, local PPM) are blended through a fixed-share Bayesian mixer. Each chunk yields a per-expert mean log-loss `L_i`; the chunk-end posterior is `post_i ∝ w_i · exp(-(L_i - L_min))`, then `w_new = (1 - share) · post + share · prior` with prior `(0.90, 0.07, 0.03)` and `share = 0.005`. This bounds cumulative log-loss against the best switching expert across chunks while keeping a small floor of probability on the slower-moving experts so they can re-engage if the dominant expert ever changes. Mixture weights for chunk `c+1` are a deterministic function of chunks `0..c`'s losses, which is the same legality property the neural TTT update path relies on.
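The mixer update, transcribed directly from the formulas above into a short sketch (variable names are illustrative):

```
import torch

def fixed_share_update(w, L, prior, share=0.005):
    """One chunk of the fixed-share Bayesian mixer. w: current weights over
    (neural, global PPM, local PPM); L: per-expert mean log-loss on the
    chunk just scored. Shifting by L.min() is the usual stability trick
    and cancels in the normalization."""
    post = w * torch.exp(-(L - L.min()))
    post = post / post.sum()
    return (1.0 - share) * post + share * prior

w = prior = torch.tensor([0.90, 0.07, 0.03])
# after scoring each chunk:  w = fixed_share_update(w, chunk_losses, prior)
```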
Hyperparameters
```
DC_ENABLED=1 DC_ORDER_G=6 DC_ORDER_L=8 DC_BUCKETS_G=2048 DC_BUCKETS_L=2048
DC_ALPHA_G=1.0 DC_ALPHA_L=0.5 DC_EPS_UNI=0.25 DC_SHARE=0.005
DC_PRIOR='0.90,0.07,0.03' DC_DOC_RESET=1
```
Test plan