Record: 0.4416 BPB -- Complementary Training + Backoff N-gram Mixer #803
pentxayc wants to merge 1 commit into openai:main from …
Conversation
3-seed mean: 0.4416 (seeds 42, 1337, 2024, std 0.0001) Key innovation: complementary training (COMPLEMENT_ALPHA=0.5) trains the model to specialize on tokens that n-gram caches can't predict, enabling higher eval-time alpha (n-gram gets 20-75% weight via entropy-adaptive blending). Stack: 11L VRL + LeakyReLU² + XSA-4 + BackoffNgramMixer (orders 2-10) + AdamW TTT (4 epochs, Polyak 0.998) + int6 lzma quantization. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
54 adaptive multipliers (order × entropy_bin × count_bin). Tracks beat rates per (order, low/mid/high entropy, low/mid/high count). Orders bumped from 2-7 to 2-9 (closer to openai#803's 2-10). Based on xwing_fast with safe speed boosts. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Complementary training (PR openai#803): downweight tokens bigrams can predict; the model specializes on what n-grams can't handle
- 3D cubic: 54 adaptive multipliers (order × entropy × count), sketched below
- Orders 2-9 (was 2-7)
- Alpha range 0.20-0.75 (was 0.05-0.70), enabled by complementary training making model/n-gram non-redundant
- Safe speed boosts from xwing_fast
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
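A hypothetical sketch of that 3D multiplier table. Only the 54-bucket shape (order × entropy_bin × count_bin) and the beat-rate bookkeeping come from the commit; the bucket edges and the update rule below are invented for illustration.

```python
# Hypothetical sketch of the 3D adaptive-multiplier table: one multiplier per
# (order, entropy_bin, count_bin) bucket, adapted online from how often the
# n-gram beats the neural model in that bucket. Bucket edges and the update
# rule are invented for illustration.
import numpy as np

N_ORDERS, N_ENT_BINS, N_CNT_BINS = 6, 3, 3          # 6 * 3 * 3 = 54 buckets
mult = np.ones((N_ORDERS, N_ENT_BINS, N_CNT_BINS))  # per-bucket alpha multipliers
wins = np.zeros_like(mult)                          # times n-gram beat the model
trials = np.zeros_like(mult)

def bin3(x, lo, hi):
    """Map a scalar into a low/mid/high bin (edges are illustrative)."""
    return 0 if x < lo else (1 if x < hi else 2)

def record(order_idx, entropy, count, p_ngram_true, p_neural_true):
    """After scoring a token, update this bucket's beat rate and multiplier."""
    e = bin3(entropy, 2.0, 4.0)   # low/mid/high model entropy
    c = bin3(count, 2, 16)        # low/mid/high context count
    trials[order_idx, e, c] += 1
    wins[order_idx, e, c] += float(p_ngram_true > p_neural_true)
    beat = wins[order_idx, e, c] / trials[order_idx, e, c]
    # Scale this bucket's n-gram weight up where it wins, down where it loses;
    # the blended alpha would be re-clamped to the 0.20-0.75 range at eval time.
    mult[order_idx, e, c] = 0.5 + beat            # multiplier in [0.5, 1.5]
```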
Reproduction of PR openai#803's complementary training approach on 8x L20Z (H100). Two-seed validation: 0.4377 (seed=1337), 0.4380 (seed=42). Key: bigram-weighted loss reweighting (COMPLEMENT_ALPHA=0.5) trains the neural model to specialize on tokens n-gram caches can't predict, combined with BackoffNgramMixer (orders 2-10) and legal score-first AdamW TTT. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Today (2026-03-26) the leaderboard was transformed by the eval-time n-gram backoff cache technique. Add comprehensive context for agents:
- URGENT_ngram_backoff_breakthrough.md: full implementation guide with NgramEvalCache code, entropy-adaptive alpha, complementary training, and a priority order for implementation
- latest_sota_snapshot.md: updated with the new PR landscape
- 3 reference code files from top PRs (openai#809 0.295, openai#803 0.442, openai#813 0.667)
The n-gram backoff is purely eval-time; adding it to our existing best checkpoint should immediately jump from 1.119 to ~0.67 BPB. Implementing it is now the single highest-priority task.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…(legality review)
- SOTA target is now PR openai#803: Complementary Training + Backoff N-gram + TTT
- PR openai#809 (0.2952) excluded pending legality review
- research_memory.md: fix the Working SOTA Anchor section (the agent had written it to explicitly ignore the URGENT file and stick to 1.1194; removed that)
- All PR openai#809 references updated to PR openai#803/openai#813
- Dashboard: SOTA now 0.4416, gap 0.681
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
TrainNgramTracker maintains online bigram counts from training data. Per-token loss weight = 1 - alpha * P(y|x), clamped at 0.1. Model focuses capacity on hard-to-predict tokens, complementing the eval-time n-gram cache. PR openai#803 showed -0.258 BPB from this technique (0.700 → 0.442). Enabled via COMPLEMENT_ALPHA=0.5 (default 0, disabled). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
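A minimal sketch of this reweighting, assuming a dense tensor-backed bigram table; the class and method names are illustrative, not the PR's actual identifiers. Only the weight formula, the 0.1 clamp, and COMPLEMENT_ALPHA=0.5 come from the commit above.

```python
# Minimal sketch of the complementary-training reweighting, assuming a dense
# bigram table; class and method names are illustrative, not the PR's.
import torch

class TrainNgramTracker:
    """Online bigram counts accumulated over the training stream."""
    def __init__(self, vocab_size: int):
        self.counts = torch.zeros(vocab_size, vocab_size)  # counts[x, y]
        self.totals = torch.zeros(vocab_size)              # sum over y for each x

    def update(self, tokens: torch.Tensor):
        """Add bigram counts from a 1D LongTensor of token ids."""
        x, y = tokens[:-1], tokens[1:]
        ones = torch.ones_like(x, dtype=torch.float)
        self.counts.index_put_((x, y), ones, accumulate=True)
        self.totals.index_put_((x,), ones, accumulate=True)

    def prob(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        """P(y | x) under the running bigram model (0 for unseen contexts)."""
        return self.counts[x, y] / self.totals[x].clamp(min=1.0)

def complementary_weights(tracker, tokens, alpha=0.5, floor=0.1):
    """Per-token loss weight = 1 - alpha * P(y|x), clamped at 0.1:
    tokens the bigram model already predicts well are downweighted."""
    p = tracker.prob(tokens[:-1], tokens[1:])
    return (1.0 - alpha * p).clamp(min=floor)

# In the training loop (weight with the current counts, then fold the
# new tokens into the tracker):
#   w = complementary_weights(tracker, seq, alpha=COMPLEMENT_ALPHA)
#   loss = (w * per_token_nll[1:]).mean()
#   tracker.update(seq)
```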
3-seed mean 0.4027 BPB (std 0.0015): 1337=0.4024, 42=0.4044, 2024=0.4014 All artifacts under 16MB. Beats openai#803 (0.4416) by 0.0389 BPB. Causal sequential chunk eval with BackoffNgramMixer (orders 2-10). Swarm-guided training with KG-conditioned embedding init. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Community Review: Record 0.4416 BPB -- Complementary Training + Backoff N-gram Mixer
BPB: 0.4416 | Compliance: LOOKS CLEAN (score-first-per-chunk TTT, the legal #1416/#1423 pattern)

What I found in the code (head SHA …): The TTT path at line 1029 implements the score-first-per-chunk pattern: each chunk is scored before the adapter updates on it. Per Issue #402 and Issue #677, TTT is legal when each token is scored before the adapter updates on it, and that's what the code does here.

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.25s, dim=512, layers=11, vocab=1024, code=94053 B, SMOKE_TEST_PASS.

Verdict: LOOKS CLEAN. Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending standard checks (3-seed validation, 16MB artifact cap, 10-min wallclock on 8×H100 SXM). The compliance picture matches the legal reference frontier and no flags were raised by the classification pass.

Auto-classification caveat: this review was drafted by the deterministic AST-based classifier against a template derived from manually-reviewed cluster PRs (#1420, #1450, #1487, #1541, #1529, #1533, #1518). If I've misread a subtlety in your eval path (e.g., multi-epoch TTT that I mistook for single-pass, or a target-in-key lookup I missed in a helper function), please flag it and I'll re-run the audit manually.

Reviewed by @MatoTeziTanka, The Agora.
Pre-answers the "where does the 0.0458 improvement come from" question using exact log excerpts from the three archived runs that produced submission.json:
- seed 7: neural 1.1481 -> +mixer 0.3948 (delta 0.7533)
- seed 1337: neural 1.1480 -> +mixer 0.3957 (delta 0.7523)
- seed 2024: neural 1.1492 -> +mixer 0.3969 (delta 0.7523)
- mean: neural 1.1484 -> +mixer 0.3958 (delta 0.7526)
Includes the mixer convergence curve for seed 7 (1.176 -> 0.395 as counts accumulate in strict score-first order, sketched below) and positions the submission as an eval-stage refinement of the already-merged openai#779 and openai#803 rather than a novel training method.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
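A minimal sketch of an eval-time cache in the spirit of BackoffNgramMixer (orders 2-10). Only the order range and the strict score-first ordering come from this thread; the backoff rule and data structures are illustrative assumptions.

```python
# Minimal sketch of an eval-time backoff n-gram cache, orders 2-10. Every
# token is scored from counts accumulated over earlier tokens only, then
# folded into the counts -- the strict score-first ordering the log excerpts
# describe. The backoff rule and data structures are assumptions.
from collections import defaultdict

class BackoffNgramCache:
    def __init__(self, vocab_size, orders=range(2, 11)):
        self.vocab_size = vocab_size
        self.orders = list(orders)
        # counts[n][context][token] and per-context totals, one table per order
        self.counts = {n: defaultdict(lambda: defaultdict(int)) for n in self.orders}
        self.totals = {n: defaultdict(int) for n in self.orders}

    def prob(self, history, token):
        """Back off from the longest seen context to the shortest;
        uniform fallback if no context of any order has been seen."""
        for n in sorted(self.orders, reverse=True):
            ctx = tuple(history[-(n - 1):])
            if len(ctx) == n - 1 and self.totals[n][ctx] > 0:
                return self.counts[n][ctx][token] / self.totals[n][ctx]
        return 1.0 / self.vocab_size

    def update(self, history, token):
        """Fold the just-scored token into every order's counts."""
        for n in self.orders:
            if len(history) >= n - 1:
                ctx = tuple(history[-(n - 1):])
                self.counts[n][ctx][token] += 1
                self.totals[n][ctx] += 1

# Score-first loop over an eval sequence:
# cache = BackoffNgramCache(vocab_size=1024)
# for t, tok in enumerate(seq):
#     p = cache.prob(seq[:t], tok)   # uses only tokens already seen
#     cache.update(seq[:t], tok)     # counts accumulate as eval proceeds
```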
…g + TTT. Results on our hardware:
- PR openai#834 (11L, learned routing head + n-gram orders 2-7 + TTT 4ep): 0.1591 BPB
- Their reported: 0.1663 (we got slightly better)
- Eval time: 675s (over the 600s budget; torch.compile is slower on our hardware)
- PR openai#803 baseline: 0.4377 (complementary training + n-gram order 10)
- PR openai#803 + 14L: 0.4356 (slight improvement from depth)
N-gram progression on our 14L model:
- Order=5, alpha=0.40: 0.9870
- Order=7, alpha=0.55: 0.8977
- Complementary + order=10, alpha=0.75: 0.8264
Next: implement PPM/CTW to replace n-gram backoff, add 14L to the PR834 script
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
Key Innovation: Complementary Training
During training, tokens predictable by bigram statistics receive lower loss weight (COMPLEMENT_ALPHA=0.5). The model specializes on tokens that n-gram caches can't predict: novel word choices, long-range dependencies, semantic surprises. This enables a higher eval-time n-gram alpha (20-75% vs the standard 5-60%) because the model is deliberately weak where n-grams are strong. The synergy: model and n-gram become non-redundant, so the mixture gains more than either component alone.
Eval Stack
Entropy-adaptive alpha: 0.20 + 0.55 * sigmoid(2*(H - 3.0)), per-token blending based on model uncertainty.
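A minimal sketch of this blend, assuming H is the per-token entropy (in nats) of the neural distribution and both distributions arrive as [T, vocab] probability tensors; the constants are the ones quoted in this PR, the function name is illustrative.

```python
# Minimal sketch of the entropy-adaptive blend; constants are the ones
# quoted in this PR, tensor shapes are illustrative.
import torch

def blended_probs(p_neural: torch.Tensor, p_ngram: torch.Tensor) -> torch.Tensor:
    # Per-token entropy of the neural distribution, shape [T, 1]
    H = -(p_neural * p_neural.clamp_min(1e-12).log()).sum(dim=-1, keepdim=True)
    # alpha ranges over (0.20, 0.75): lean on the n-gram cache where the
    # model is uncertain, on the model where it is confident
    alpha = 0.20 + 0.55 * torch.sigmoid(2.0 * (H - 3.0))
    # Proper mixture: every token keeps nonzero probability wherever
    # p_neural does, so the BPB stays finite
    return (1.0 - alpha) * p_neural + alpha * p_ngram
```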
Legality
Final distribution is (1-α)·P_neural + α·P_ngram, a proper mixture; all tokens have nonzero probability.

Credits & Acknowledgments
This submission builds on techniques from several prior PRs:
The novel contribution is complementary training: reweighting the training loss by bigram predictability so the neural model specializes on tokens the n-gram cache can't handle, enabling significantly higher eval-time n-gram weight.
Test plan
🤖 Generated with Claude Code