Record: DualClock PPM token-context mixture — 1.0803 BPB seed 42 #1879
okezue wants to merge 12 commits into openai:main from …
Conversation
@cocohearts @willdepue could you take a look when you get a chance? This is a record submission at 1.0803 BPB seed 42, beating the prior 1.0810 leader. The full normalized-distribution + score-before-update legality is preserved per the rules (details in the PR description). Happy to run additional seeds if useful.
…ee submissions

Reaction to okezue's PR openai#1879 (DualClockPPM 1.0803 single-seed). Corrects two things in Claude's prior take:

1. PR openai#1879 contains FOUR submissions, not one. The 6,817 LOC is spread across DualClockPPM (624), NGM-Hedge (526), ConfTTT (485), and PSO Persistent Spectral (1,609), plus READMEs and a courtesy decompressed dump of bigbag's PR openai#1493 source. The largest single algo addition is 1,609 LOC.
2. DualClockPPM's own README explicitly discloses neural-only = 1.080301 in the same run vs mixed = 1.080326. The PPM mixture is slightly *negative*, not "essentially zero." Confirms the hard skip on DualClock and the related NGM-Hedge.

Free calibration data: okezue's neural-only seed-42 = 1.080301 is an independent reproduction of PR openai#1493 with default TTT_LR=0.005, within 0.00008 of our Phase 0 reproduction (1.080382). Validates our setup and quantifies our LR=0.010 Stage 2 win as a real -0.0008 nat improvement over the SOTA-default reproduction floor.

Cherry-pick recommendation: ConfTTT - confidence-weighted TTT that weights per-token cross-entropy by score-pass NLL (focal-loss / hard-example mining for the TTT inner loop; a sketch follows this message). ~25 LOC, eval-time only, score-first compliant. Stacks with our LR=0.010 winner. Smoke test cost: $3 for one eval-only run on the saved Phase 0 artifact.

Skip list:
- DualClockPPM and NGM-Hedge: same family; the neural model dominates the mixture, so weights collapse to neural-only.
- PSO Persistent Spectral: same class as our Newton-Muon. Newton-Muon has Modded-NanoGPT empirical validation at our scale; PSO has only theorems. Keep our horse.

Threat model: PR openai#1879 doesn't break the 0.005 record threshold itself, but if DualClock 3-seed validates and merges, our submission threshold shifts to ~1.0753 (-0.001 worse for us). okezue is iterating fast (4 submissions in one PR, today). Move quickly; if we land first the race is moot.

Includes the exact 5-chunk patch (Hyperparameters + 4 sites in eval_val_ttt) ready to drop into Claude/train_gpt.py.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
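For reference, a minimal sketch of the ConfTTT idea as described above — not okezue's actual code from the PR; the function name, the weight normalization, and the epsilon are illustrative assumptions:

```
import torch
import torch.nn.functional as F

def conf_weighted_ttt_loss(logits, targets, score_nll):
    """Confidence-weighted TTT inner-loop loss (sketch).

    Weights each token's cross-entropy by its NLL from the no-grad
    score pass, so the inner-loop update focuses on hard examples
    (focal-loss-style hard-example mining).
    """
    per_tok_ce = F.cross_entropy(
        logits.view(-1, logits.size(-1)), targets.view(-1), reduction="none"
    )
    # score_nll comes from the legality-preserving score pass (already
    # detached); normalize so the weighting keeps the loss scale.
    w = score_nll.view(-1)
    w = w / (w.mean() + 1e-8)
    return (w * per_tok_ce).mean()
```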
…olar Express NS + MIN_LR + LQER)

Triage of 5 new PRs the user surfaced (1858, 1852, 1855, 1874, 1877):
- openai#1852: hard rule violation (pre-quant TTT on validation data).
- openai#1858: eval subset (8M of 40.5M tokens); reviewer caught it and the author admitted.
- openai#1877: broken normalization (byte PPM × token NN doesn't sum to 1 over the token alphabet); reviewer @sharpobject caught it.
- openai#1855: techniques mostly legit, but apt-get install lrzip violates Issue openai#1017 Rule 3 (artifact must be self-contained).
- openai#1874: LEGITIMATE - 3-seed mean 1.06766, std 0.00076, three orthogonal training-time techniques citing prior validated PRs. If it merges, our submission threshold shifts from 1.0760 to ~1.0627.

PR openai#1874's three techniques:
1. Polar Express NS coefficients (PR openai#1344) - 5 minimax-tuned tuples replace the fixed (3.4445, -4.775, 2.0315) at MUON_BACKEND_STEPS=5.
2. MIN_LR=0.10 warmdown floor (PR openai#1787) - LR floors at 10% of max instead of decaying to 0. Already wired in our v1+; just env-var opt-in.
3. LQER asymmetric int4 rank-4 quantization correction (PR openai#1797) - SVD on the top-K=3 highest-error GPTQ residuals, packed as int4 per-group-64 asymmetric. ~200-400 LOC; deferred to v4.

train_gpt_v3.py implements (1) and exposes (2); a sketch of the NS branch follows this message:
- POLAR_EXPRESS_NS=0 default (byte-for-byte SOTA when off).
- _PE_COEFFS module-level constant + _POLAR_EXPRESS_NS flag read at import time so torch.compile sees them as constants.
- zeropower_via_newtonschulz5 branches on _POLAR_EXPRESS_NS to use per-iteration coefficients instead of fixed ones.
- MIN_LR was already an env var; setting MIN_LR=0.10 at runtime opts in.

Sizes: v3 raw 54,977, lzma 15,128 (+272 vs v2, +1,880 vs SOTA). Worst-seed artifact slack: ~4,888 bytes under cap. Tight but workable. AST-validated on Python 3.13 (macOS) and 3.12 (Vultr Linux).

Stacking projection (single-seed):
- Phase 0 baseline: 1.08038
- + LR=0.010 (Stage 2): 1.08021
- + Polar Express NS: 1.0787-1.0797
- + MIN_LR=0.10: 1.0777-1.0794
- + ConfTTT (PR openai#1879): 1.0772-1.0793
- + LQER (v4 work): 1.0742-1.0783
- + Phase 2 architecture: 1.0712-1.0773
- + Newton-Muon Stage E: 1.066-1.075

Path B (absorb-and-stack) recommended over Path A (race-to-merge-with-current-stack), since the current stack alone doesn't clear 1.0760.

Race awareness: openai#1874, openai#1855 (lrzip-stripped), and openai#1797 are all open. Whichever merges first becomes the new SOTA and our threshold tightens.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
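A sketch of the zeropower_via_newtonschulz5 branch described above, assuming the standard quintic Muon iteration. The five tuned Polar Express tuples live in PR openai#1344 and aren't reproduced here, so _PE_COEFFS below is a placeholder; the real v3 reads a _POLAR_EXPRESS_NS flag at import time rather than taking a kwarg:

```
import torch

# Fixed quintic coefficients quoted in the triage above; the Polar Express
# replacement supplies one tuned (a, b, c) tuple per iteration.
_FIXED = (3.4445, -4.775, 2.0315)
_PE_COEFFS = [_FIXED] * 5  # placeholder: swap in the minimax tuples from openai#1344

def newtonschulz5(G: torch.Tensor, steps: int = 5, use_pe: bool = False,
                  eps: float = 1e-7) -> torch.Tensor:
    """Quintic Newton-Schulz orthogonalization; per-iteration coefficients
    when use_pe is set, the fixed tuple otherwise (illustrative sketch)."""
    X = G / (G.norm() + eps)            # normalize so the iteration converges
    transposed = X.size(0) > X.size(1)  # keep rows <= cols, as in Muon
    if transposed:
        X = X.mT
    for i in range(steps):
        a, b, c = _PE_COEFFS[i] if use_pe else _FIXED
        A = X @ X.mT
        B = b * A + c * (A @ A)
        X = a * X + B @ X
    return X.mT if transposed else X
```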
You need to be lower than 1.076 bpb (https://github.com/openai/parameter-golf#submission-process)
Update — broader context on PR #1879

Beyond the DualClock submission in this PR, I separately ran a study reproducing the open SOTA stack from PR #1855. Posting here just for context — the DualClock contribution and the reproduction work are independent.

Independent reproduction of PR #1855 on a fresh 8×H100 SXM pod (cu129 / torch 2.9.1 / FA3 cu129_torch291) with the env vars from #1855's hparam table plus …: 3-seed brotli mean 1.06043 vs PR #1855's reported 1.06108 — within 1σ. Posted the same data on #1855 as a reproduction confirmation.

The DualClock submission in this PR (1.0803 BPB seed 42) remains a separate contribution. To be fully transparent: the Bayesian mixer auto-collapses to ~99.95% on the neural model after the first few chunks, so DualClock is empirically neutral on a strong neural baseline — the safety guarantee holds, but the small-vocab PPM experts don't add information the neural model is missing. Documented as-is.
Summary
Score-first eval-time mixture over the neural model and two causal PPM experts (global stream + document-local) using a fixed-share Bayesian mixer. GPU-vectorized via FNV rolling hashes, hash-bucketed count tables, and a prefix-rank counter for chunk-local causal scoring.
Final val_bpb on seed 42: 1.080326 (8×H100 SXM, brotli artifact 15,977,914 bytes).
Folder: `records/track_10min_16mb/2026-04-27_DualClockPPM_TokenContextMixture/`
Result breakdown
Final mixture weights converged to `[0.99950, 0.00035, 0.00015]` over (neural, global PPM, local PPM).
Implementation
At evaluation time, every chunk is scored under `torch.no_grad()` before any TTT update on that chunk. As each chunk is scored, the per-token neural NLL is gathered across DDP ranks via `all_reduce(MIN)` on a buffer initialized to `+∞`: each rank scores a disjoint window subset, so the elementwise minimum reconstructs the full per-token vector, and the assembled score buffer is then consumed by a single deterministic mixture pass that runs identically on every rank.
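For concreteness, a minimal sketch of the gather trick, assuming `torch.distributed` is initialized and a model that returns `(1, T-1, V)` logits; the function name, `my_slice`, and the buffer shape are illustrative, not the submitted code:

```
import torch
import torch.distributed as dist

@torch.no_grad()
def gather_chunk_nll(model, chunk_tokens, my_slice):
    """Score-before-update gather (sketch).

    Each DDP rank fills only its disjoint window slice; everything else
    stays at +inf, so an elementwise MIN all_reduce reconstructs the full
    per-token NLL vector on every rank.
    """
    nll_buf = torch.full((chunk_tokens.numel() - 1,), float("inf"),
                         device=chunk_tokens.device)
    logits = model(chunk_tokens[None, :-1])  # assumed (1, T-1, V) output
    nll = torch.nn.functional.cross_entropy(
        logits[0], chunk_tokens[1:], reduction="none")
    nll_buf[my_slice] = nll[my_slice]
    dist.all_reduce(nll_buf, op=dist.ReduceOp.MIN)  # disjoint slices -> exact
    return nll_buf
```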
Per chunk, we construct two causal PPM experts. Each expert maintains, for every order `k = 1..K`, a hash-bucketed count table `cnt[k]` of shape `(B·V,)` (int32) holding continuation counts and a context-mass table `tot[k]` of shape `(B,)`. The order-`k` context for a target at chunk position `i` is the FNV-rolling hash of the last `k` source tokens, computed in a vectorized loop over `k`. The hash modulo `B` gives the bucket id; the resulting `(ctx_hash, target_token)` pair indexes into `cnt[k]`. Predictions are produced by the recursive Dirichlet-backoff smoothing chain `q_k(a) = (cnt_k(a) + α·q_{k-1}(a)) / (tot_k + α)` where `q_0` is a causal running unigram prior over the full SP8192 vocabulary, so each expert is by construction a fully normalized distribution over the official token alphabet.
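A minimal single-step sketch of one expert, using the shapes and the backoff chain from the paragraph above. The submitted kernel vectorizes hashing, scoring, and updates over whole chunks; the class name, the FNV-1a variant, and the +1 smoothing inside the unigram base `q_0` are assumptions for illustration:

```
import torch

FNV_PRIME, FNV_OFFSET = 0x100000001B3, 0xCBF29CE484222325  # standard 64-bit FNV-1a

class PPMExpert:
    """Hash-bucketed order-K PPM with Dirichlet backoff (sketch)."""

    def __init__(self, K, buckets, vocab, alpha, device="cuda"):
        self.K, self.B, self.V, self.alpha = K, buckets, vocab, alpha
        self.cnt = [torch.zeros(buckets * vocab, dtype=torch.int32, device=device)
                    for _ in range(K + 1)]           # cnt[k], k = 1..K used
        self.tot = [torch.zeros(buckets, dtype=torch.int32, device=device)
                    for _ in range(K + 1)]           # context mass per bucket
        self.uni = torch.zeros(vocab, device=device)  # running unigram counts

    def ctx_hash(self, src, k):
        # FNV-1a over the last k source tokens, mod B; the real code rolls
        # this hash across every chunk position at once.
        h = FNV_OFFSET
        for t in src[-k:].tolist():
            h = ((h ^ t) * FNV_PRIME) & 0xFFFFFFFFFFFFFFFF
        return h % self.B

    def predict(self, src):
        # q_0: causal running unigram over the full vocab (+1 smoothing assumed)
        q = (self.uni + 1.0) / (self.uni.sum() + self.V)
        for k in range(1, self.K + 1):     # backoff chain, low to high order
            b = self.ctx_hash(src, k)
            c = self.cnt[k][b * self.V:(b + 1) * self.V].float()
            q = (c + self.alpha * q) / (self.tot[k][b].float() + self.alpha)
        return q                            # sums to 1 over the token alphabet

    def update(self, src, tgt):
        # post-score update; the real code batches this with GPU index_add_
        self.uni[tgt] += 1
        for k in range(1, self.K + 1):
            b = self.ctx_hash(src, k)
            self.cnt[k][b * self.V + tgt] += 1
            self.tot[k][b] += 1
```

Note that each backoff step preserves normalization: the bucket's counts sum to `tot_k`, so `q_k` sums to `(tot_k + α) / (tot_k + α) = 1` whenever `q_{k-1}` does.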
Within a chunk, the strictly causal contribution from already-seen positions is enforced by a vectorized prefix-rank counter (`dc_prank`): for the `n` positions in the chunk it returns, for each position, the count of identical earlier-position keys, which is the chunk-local correction that lets us score the entire chunk in parallel without ever leaking a token's identity into its own predicted probability. After the chunk's score is finalized, the count tables and the unigram base prior are updated with that chunk's tokens via GPU `index_add`. The global expert never resets; the document-local expert resets at chunk boundaries as a doc-boundary proxy.
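A sketch of the `dc_prank` semantics, assuming each `(ctx_hash, target_token)` pair is packed into one int64 key per position. This O(n²) version just states the contract; the submitted kernel presumably uses a sort-based formulation:

```
import torch

def dc_prank(keys: torch.Tensor) -> torch.Tensor:
    """Prefix-rank counter (sketch): for each position i, the number of
    strictly earlier positions j < i with keys[j] == keys[i]. Adding this
    to the pre-chunk counts gives each position its causal within-chunk
    evidence without its own token contributing to its own prediction."""
    eq = (keys[:, None] == keys[None, :]).long()   # (n, n) equality matrix
    return torch.tril(eq, diagonal=-1).sum(dim=1)  # count earlier matches only
```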
The three experts (neural, global PPM, local PPM) are blended through a fixed-share Bayesian mixer. Each chunk yields a per-expert mean log-loss `L_i`; the chunk-end posterior is `post_i ∝ w_i · exp(-(L_i - L_min))`, then `w_new = (1 - share) · post + share · prior` with prior `(0.90, 0.07, 0.03)` and `share = 0.005`. This bounds cumulative log-loss against the best switching expert across chunks while keeping a small floor of probability on the slower-moving experts so they can re-engage if the dominant expert ever changes. Mixture weights for chunk `c+1` are a deterministic function of chunks `0..c`'s losses, which is the same legality property the neural TTT update path relies on.
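The mixer update, transcribed directly from the formulas above into a short sketch (variable names are illustrative):

```
import torch

def fixed_share_update(w, L, prior, share=0.005):
    """One chunk of the fixed-share Bayesian mixer. w: current weights over
    (neural, global PPM, local PPM); L: per-expert mean log-loss on the
    chunk just scored. Shifting by L.min() is the usual stability trick
    and cancels in the normalization."""
    post = w * torch.exp(-(L - L.min()))
    post = post / post.sum()
    return (1.0 - share) * post + share * prior

w = prior = torch.tensor([0.90, 0.07, 0.03])
# after scoring each chunk:  w = fixed_share_update(w, chunk_losses, prior)
```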
Hyperparameters
```
DC_ENABLED=1 DC_ORDER_G=6 DC_ORDER_L=8 DC_BUCKETS_G=2048 DC_BUCKETS_L=2048
DC_ALPHA_G=1.0 DC_ALPHA_L=0.5 DC_EPS_UNI=0.25 DC_SHARE=0.005
DC_PRIOR='0.90,0.07,0.03' DC_DOC_RESET=1
```
Test plan