Record: DualClock PPM token-context mixture — 1.0803 BPB seed 42#1879

Open
okezue wants to merge 12 commits into openai:main from okezue:pso-submission

Conversation

@okezue okezue commented Apr 28, 2026

Summary

Score-first eval-time mixture over the neural model and two causal PPM experts (global stream + document-local) using a fixed-share Bayesian mixer. GPU-vectorized via FNV rolling hashes, hash-bucketed count tables, and a prefix-rank counter for chunk-local causal scoring.

Final val_bpb on seed 42: 1.080326 (8×H100 SXM, brotli artifact 15,977,914 bytes).

Folder: `records/track_10min_16mb/2026-04-27_DualClockPPM_TokenContextMixture/`

Result breakdown

| component | val_bpb | eval_time |
| --- | --- | --- |
| pre-quantization post-EMA | 1.087045 | 6.8s |
| quantized | 1.098230 | 8.6s |
| quantized + sliding window | 1.081633 | 91.1s |
| quantized + TTT + DualClock mixture | 1.080326 | 361.9s |

Final mixture weights converged to `[0.99950, 0.00035, 0.00015]` over (neural, global PPM, local PPM).

Implementation

At evaluation time, every chunk is scored under `torch.no_grad()` before any TTT update on that chunk. As each chunk is scored, the per-token neural NLL is gathered across DDP ranks via `all_reduce(MIN)` on a buffer initialized to `+∞`: each rank scores a disjoint window subset, so the elementwise minimum reconstructs the full score vector. The merged buffer is then consumed by a single deterministic mixture pass that runs identically on every rank.
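The MIN-merge can be illustrated with a small NumPy sketch (`merge_rank_scores` is a hypothetical name; the actual code path uses `torch.distributed.all_reduce` with `ReduceOp.MIN` on GPU tensors):

```python
import numpy as np

def merge_rank_scores(per_rank_buffers):
    # Each rank writes its own windows' per-token NLLs into a buffer
    # initialized to +inf. Because the window subsets are disjoint,
    # an elementwise minimum recovers the full score vector.
    out = np.full_like(per_rank_buffers[0], np.inf)
    for buf in per_rank_buffers:
        out = np.minimum(out, buf)
    return out
```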

Per chunk, we construct two causal PPM experts. Each expert maintains, for every order `k = 1..K`, a hash-bucketed count table `cnt[k]` of shape `(B·V,)` (int32) holding continuation counts, and a context-mass table `tot[k]` of shape `(B,)`. The order-`k` context for a target at chunk position `i` is the FNV rolling hash of the last `k` source tokens, computed in a vectorized loop over `k`. The hash modulo `B` gives the bucket id, and the resulting `(bucket_id, target_token)` pair indexes into `cnt[k]`. Predictions are produced by the recursive Dirichlet-backoff smoothing chain `q_k(a) = (cnt_k(a) + α·q_{k-1}(a)) / (tot_k + α)`, where `q_0` is a causal running unigram prior over the full SP8192 vocabulary; since `tot_k` is the total count mass of the context at order `k`, each expert is by construction a fully normalized distribution over the official token alphabet.
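A scalar sketch of the backoff chain (plain Python with hypothetical names; the submission evaluates this vectorized over the whole chunk on GPU):

```python
def dirichlet_backoff(counts, totals, alpha, q0):
    """Recursive Dirichlet backoff:
        q_k(a) = (cnt_k(a) + alpha * q_{k-1}(a)) / (tot_k + alpha)

    counts: per-order dicts token -> continuation count for one context chain
    totals: per-order context masses, tot_k = sum of cnt_k over the vocab
    q0:     normalized base distribution (token -> prob)
    """
    q = dict(q0)
    for cnt_k, tot_k in zip(counts, totals):
        q = {a: (cnt_k.get(a, 0) + alpha * q[a]) / (tot_k + alpha) for a in q}
    return q
```

Because `tot_k` equals the sum of `cnt_k` over the vocabulary and `q0` sums to 1, each `q_k` sums to 1 by induction, which is the normalization property claimed above.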

Within a chunk, the strictly causal contribution from already-seen positions is enforced by a vectorized prefix-rank counter (`dc_prank`): for each of the `n` positions in the chunk, it returns the count of identical earlier-position keys. This chunk-local correction lets us score the entire chunk in parallel without ever leaking a token's identity into its own predicted probability. After the chunk's score is finalized, the count tables and the unigram base prior are updated with that chunk's tokens via GPU `index_add`. The global expert never resets; the document-local expert resets at chunk boundaries as a doc-boundary proxy.
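The prefix-rank correction can be sketched with a stable argsort (NumPy; `dc_prank` is the name used in this description, but the kernel itself is not shown here, so this is an illustrative O(n log n) version):

```python
import numpy as np

def dc_prank(keys):
    """For each position i, return the number of j < i with keys[j] == keys[i].

    A stable sort groups equal keys while preserving original order, so the
    offset of each element from the start of its run of equal keys is exactly
    its earlier-occurrence count.
    """
    keys = np.asarray(keys)
    order = np.argsort(keys, kind="stable")
    s = keys[order]
    is_new = np.ones(len(s), dtype=bool)
    is_new[1:] = s[1:] != s[:-1]
    run_id = np.cumsum(is_new) - 1        # which run of equal keys
    run_start = np.flatnonzero(is_new)    # sorted index where each run begins
    rank_sorted = np.arange(len(s)) - run_start[run_id]
    out = np.empty(len(s), dtype=np.int64)
    out[order] = rank_sorted              # scatter back to chunk order
    return out
```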

The three experts (neural, global PPM, local PPM) are blended through a fixed-share Bayesian mixer. Each chunk yields a per-expert mean log-loss `L_i`; the chunk-end posterior is `post_i ∝ w_i · exp(-(L_i - L_min))`, then `w_new = (1 - share) · post + share · prior` with prior `(0.90, 0.07, 0.03)` and `share = 0.005`. This bounds cumulative log-loss against the best switching expert across chunks while keeping a small floor of probability on the slower-moving experts so they can re-engage if the dominant expert ever changes. Mixture weights for chunk `c+1` are a deterministic function of chunks `0..c`'s losses, which is the same legality property the neural TTT update path relies on.
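A minimal sketch of one chunk-end update of the fixed-share mixer (plain Python; the constants match the hyperparameters in this PR, and the function name is illustrative):

```python
import math

PRIOR = (0.90, 0.07, 0.03)   # (neural, global PPM, local PPM)
SHARE = 0.005

def fixed_share_update(w, losses, prior=PRIOR, share=SHARE):
    """w: current weights summing to 1; losses: per-expert mean log-loss L_i
    on the chunk just scored. Returns the weights for the next chunk."""
    lmin = min(losses)
    # Bayesian posterior: post_i proportional to w_i * exp(-(L_i - L_min))
    post = [wi * math.exp(-(L - lmin)) for wi, L in zip(w, losses)]
    z = sum(post)
    post = [p / z for p in post]
    # fixed share: mix a sliver of prior mass back in so no expert's
    # weight can vanish and it can re-engage if the leader changes
    return tuple((1 - share) * p + share * pr for p, pr in zip(post, prior))
```

With `share = 0.005`, every expert always retains at least `share · prior_i` weight, which is the probability floor described above.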

Hyperparameters

```
DC_ENABLED=1 DC_ORDER_G=6 DC_ORDER_L=8 DC_BUCKETS_G=2048 DC_BUCKETS_L=2048
DC_ALPHA_G=1.0 DC_ALPHA_L=0.5 DC_EPS_UNI=0.25 DC_SHARE=0.005
DC_PRIOR='0.90,0.07,0.03' DC_DOC_RESET=1
```

Test plan

  • seed 42 on 8×H100 SXM, brotli artifact 15,977,914 bytes
  • full normalized distribution over the official SP8192 vocabulary verified by construction
  • score-before-update legality verified: counts and mixer weights update only after the chunk is scored

@okezue okezue changed the title Non-record: DualClock PPM token-context mixture (1.0803 BPB) Record: DualClock PPM token-context mixture — 1.0803 BPB seed 42 Apr 28, 2026

okezue commented Apr 28, 2026

@cocohearts @willdepue could you take a look when you get a chance? This is a record submission at 1.0803 BPB seed 42, beating the prior 1.0810 leader. The full normalized-distribution + score-before-update legality is preserved per the rules (details in the PR description). Happy to run additional seeds if useful.

GodlyDonuts added a commit to GodlyDonuts/parameter-golf that referenced this pull request Apr 28, 2026
…ee submissions

Reaction to okezue's PR openai#1879 (DualClockPPM 1.0803 single-seed). Corrects
two things in Claude's prior take:

1. PR openai#1879 contains FOUR submissions, not one. The 6,817 LOC is spread
   across DualClockPPM (624), NGM-Hedge (526), ConfTTT (485), and
   PSO Persistent Spectral (1,609), plus READMEs and a courtesy
   decompressed dump of bigbag's PR openai#1493 source. Largest single algo
   addition is 1,609 LOC.

2. DualClockPPM's own README explicitly discloses neural-only = 1.080301
   in the same run vs mixed = 1.080326. The PPM mixture is slightly
   *negative*, not "essentially zero." Confirms hard skip on DualClock
   and the related NGM-Hedge.

Free calibration data: okezue's neural-only seed-42 = 1.080301 is an
independent reproduction of PR openai#1493 with default TTT_LR=0.005, within
0.00008 of our Phase 0 reproduction (1.080382). Validates our setup and
quantifies our LR=0.010 Stage 2 win as a real -0.0008 nat improvement
over the SOTA-default reproduction floor.

Cherry-pick recommendation: ConfTTT - confidence-weighted TTT that
weights per-token cross-entropy by score-pass NLL (focal-loss / hard-
example mining for the TTT inner loop). ~25 LOC, eval-time only,
score-first compliant. Stacks with our LR=0.010 winner. Smoke test
cost: $3 for one eval-only run on the saved Phase 0 artifact.

Skip list:
- DualClockPPM and NGM-Hedge: same family, neural model dominates the
  mixture so weights collapse to neural-only.
- PSO Persistent Spectral: same class as our Newton-Muon. Newton-Muon
  has Modded-NanoGPT empirical validation at our scale; PSO has only
  theorems. Keep our horse.

Threat model: PR openai#1879 doesn't break the 0.005 record threshold itself,
but if DualClock 3-seed validates and merges, our submission threshold
shifts to ~1.0753 (-0.001 worse for us). okezue is iterating fast (4
submissions in one PR, today). Move quickly; if we land first the
race is moot.

Includes the exact 5-chunk patch (Hyperparameters + 4 sites in
eval_val_ttt) ready to drop into Claude/train_gpt.py.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
GodlyDonuts added a commit to GodlyDonuts/parameter-golf that referenced this pull request Apr 28, 2026
…olar Express NS + MIN_LR + LQER)

Triage of 5 new PRs the user surfaced (1858, 1852, 1855, 1874, 1877):
- openai#1852: hard rule violation (pre-quant TTT on validation data).
- openai#1858: eval subset (8M of 40.5M tokens), reviewer caught and author admitted.
- openai#1877: broken normalization (byte PPM × token NN doesn't sum to 1 over
  token alphabet), reviewer @sharpobject caught.
- openai#1855: techniques mostly legit but apt-get install lrzip violates Issue
  openai#1017 Rule 3 (artifact must be self-contained).
- openai#1874: LEGITIMATE - 3-seed mean 1.06766, std 0.00076, three orthogonal
  training-time techniques citing prior validated PRs. If it merges,
  our submission threshold shifts from 1.0760 to ~1.0627.

PR openai#1874's three techniques:
1. Polar Express NS coefficients (PR openai#1344) - 5 minimax-tuned tuples
   replace the fixed (3.4445, -4.775, 2.0315) at MUON_BACKEND_STEPS=5.
2. MIN_LR=0.10 warmdown floor (PR openai#1787) - LR floors at 10% of max
   instead of decaying to 0. Already wired in our v1+; just env-var
   opt-in.
3. LQER asymmetric int4 rank-4 quantization correction (PR openai#1797) -
   SVD on top-K=3 highest-error GPTQ residuals, packed as int4
   per-group-64 asymmetric. ~200-400 LOC; deferred to v4.

train_gpt_v3.py implements (1) and exposes (2):
- POLAR_EXPRESS_NS=0 default (byte-for-byte SOTA when off).
- _PE_COEFFS module-level constant + _POLAR_EXPRESS_NS flag read at
  import time so torch.compile sees them as constants.
- zeropower_via_newtonschulz5 branches on _POLAR_EXPRESS_NS to use
  per-iteration coefficients instead of fixed.
- MIN_LR was already an env var; setting MIN_LR=0.10 at runtime opts in.

Sizes: v3 raw 54,977 lzma 15,128 (+272 vs v2, +1,880 vs SOTA). Worst-
seed artifact slack: ~4,888 bytes under cap. Tight but workable.

AST-validated on Python 3.13 (macOS) and 3.12 (Vultr Linux).

Stacking projection (single-seed):
- Phase 0 baseline:       1.08038
- + LR=0.010 (Stage 2):   1.08021
- + Polar Express NS:     1.0787-1.0797
- + MIN_LR=0.10:          1.0777-1.0794
- + ConfTTT (PR openai#1879):   1.0772-1.0793
- + LQER (v4 work):       1.0742-1.0783
- + Phase 2 architecture: 1.0712-1.0773
- + Newton-Muon Stage E:  1.066-1.075

Path B (absorb-and-stack) recommended over Path A (race-to-merge-with-
current-stack) since current stack alone doesn't clear 1.0760.

Race awareness: openai#1874, openai#1855 (lrzip-stripped), and openai#1797 are all open.
Whichever merges first becomes new SOTA and our threshold tightens.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

h1beee commented Apr 28, 2026

You need to be lower than 1.076 bpb

https://github.com/openai/parameter-golf#submission-process


okezue commented Apr 28, 2026

Update — broader context on PR #1879

Beyond the DualClock submission in this PR, I separately ran a study reproducing the open SOTA stack from PR #1855. Posting here just for context — the DualClock contribution and the reproduction work are independent.

Independent reproduction of PR #1855 on a fresh 8×H100 SXM pod (cu129/torch 2.9.1/FA3 cu129_torch291) with the env vars from #1855's hparam table plus SMEAR_GATE_ENABLED=1 SPARSE_ATTN_GATE_ENABLED=1 EMBED_BITS=7 MIN_LR=0.1 GPTQ_RESERVE_SECONDS=0.5 PHASED_TTT_NUM_PHASES=3, 600 s wallclock:

| seed | post-TTT | compressor | artifact |
| --- | --- | --- | --- |
| 42 | 1.05965 | brotli | 16,112,007 (over cap) |
| 314 | 1.06041 | brotli | ~16.11 M (over cap) |
| 999 | 1.06124 | brotli | ~16.11 M (over cap) |
| mean | 1.06043 | | |
| 42 | 1.06052 | pergroup | 15,902,285 ✅ |

3-seed brotli mean 1.06043 vs PR #1855's reported 1.06108 — within 1σ. Posted the same data on #1855 as a reproduction confirmation.

The DualClock submission in this PR (1.0803 BPB seed 42) remains a separate contribution. To be fully transparent: the Bayesian mixer auto-collapses to ~99.95% on the neural model after the first few chunks, so DualClock is empirically neutral on a strong neural baseline — the safety guarantee holds, but the small-vocab PPM experts don't add information the neural model is missing. Documented as-is.
