
Record: 0.4416 BPB -- Complementary Training + Backoff N-gram Mixer #803

Open
pentxayc wants to merge 1 commit into openai:main from pentxayc:submission/v11-complementary-backoff-0.4416

Conversation


@pentxayc pentxayc commented Mar 26, 2026

Summary

  • 0.4416 BPB (3-seed mean, std 0.0001)
  • Seeds: 42 (0.4416), 1337 (0.4416), 2024 (0.4417)
  • 11L transformer (26.99M params) with VRL, LeakyReLU(0.5)², XSA-4
  • Artifact: 15,875,857 bytes (under 16MB)
  • Training: 4648 steps in 600s on 8xH100 SXM
  • Eval: 458s / 600s budget

Key Innovation: Complementary Training

During training, tokens predictable by bigram statistics receive lower loss weight (COMPLEMENT_ALPHA=0.5). The model specializes on tokens that n-gram caches can't predict — novel word choices, long-range dependencies, semantic surprises.
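A minimal sketch of the reweighting, assuming a standard PyTorch training loop; the weight form 1 - alpha * P(y|x) clamped at 0.1 is taken from a referencing commit further down this thread, and the `bigram_prob` name is illustrative:

```python
import torch.nn.functional as F

COMPLEMENT_ALPHA = 0.5  # down-weighting strength from the PR config

def complementary_loss(logits, targets, bigram_prob):
    # Per-token cross-entropy reweighted by bigram predictability.
    # bigram_prob[i] ~ P(targets[i] | targets[i-1]), estimated from
    # training-data bigram counts only; no validation data is touched.
    per_token = F.cross_entropy(logits, targets, reduction="none")
    # Tokens the bigram table already predicts get lower weight; the
    # clamp keeps every token contributing at least a little loss.
    weights = (1.0 - COMPLEMENT_ALPHA * bigram_prob).clamp(min=0.1)
    return (per_token * weights).sum() / weights.sum()
```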

This enables higher eval-time n-gram alpha (20-75% vs standard 5-60%) because the model is deliberately weak where n-grams are strong. The synergy:

Config                                  BPB
Base model only                         1.139
+ Standard backoff (alpha=0.05)         0.700
+ Complementary training + alpha=0.20   0.442

Eval Stack

  • BackoffNgramMixer: orders 2-10, 4M flat hash buckets, greedy cascade (highest matching order wins); a sketch follows this list
  • Entropy-adaptive alpha: 0.20 + 0.55 * sigmoid(2*(H - 3.0)) — per-token blending based on model uncertainty
  • AdamW TTT: lr=5e-4, 4 epochs/chunk, Polyak EMA 0.998, freeze first 9/11 blocks
  • Sliding window: stride=64
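A minimal sketch of the mixer and the adaptive blend referenced in the list above, using a dict-based stand-in for the 4M-bucket flat hash table (collision handling and smoothing in the actual train_gpt.py may differ):

```python
import math
from collections import defaultdict

NUM_BUCKETS = 4_000_000  # 4M flat hash buckets, as in the PR

class BackoffNgramMixer:
    """Greedy backoff cascade: the highest order (up to 10) whose
    context has been seen among already-scored tokens wins."""

    def __init__(self, orders=range(2, 11)):
        self.orders = sorted(orders, reverse=True)   # try order 10 first
        self.counts = defaultdict(lambda: defaultdict(int))

    def _key(self, order, ctx):
        return (order, hash(ctx) % NUM_BUCKETS)

    def update(self, tokens, pos):
        # Called only after tokens[pos] has been scored (score-first).
        for n in self.orders:
            if pos >= n - 1:
                ctx = tuple(tokens[pos - n + 1:pos])
                self.counts[self._key(n, ctx)][tokens[pos]] += 1

    def predict(self, tokens, pos):
        for n in self.orders:                        # highest match wins
            if pos >= n - 1:
                ctx = tuple(tokens[pos - n + 1:pos])
                entry = self.counts.get(self._key(n, ctx))
                if entry:
                    total = sum(entry.values())
                    return {t: c / total for t, c in entry.items()}
        return None                                  # no order matched

def adaptive_alpha(entropy_nats):
    # 0.20 + 0.55 * sigmoid(2*(H - 3.0)): spans roughly 0.20-0.75, so
    # n-grams get more weight exactly where the model is most uncertain.
    return 0.20 + 0.55 / (1.0 + math.exp(-2.0 * (entropy_nats - 3.0)))
```

The committed distribution is then (1-alpha)·P_neural + alpha·P_ngram when predict returns a match, and the pure neural distribution otherwise, matching item 5 of the Legality list below.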

Legality

  1. Complementary training uses only training-data bigram statistics. No validation data during training.
  2. N-gram cache built from already-scored tokens only (backward-looking, score-first).
  3. Alpha formula is a fixed function of model entropy — target-independent, committed before scoring.
  4. TTT is standard score-first legal TTT (see the sketch after this list).
  5. Committed distribution: (1-α)·P_neural + α·P_ngram — proper mixture, all tokens have nonzero probability.
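A minimal sketch of the score-first-per-chunk TTT loop described in item 4 above and in the community review below, with one adaptation pass per chunk standing in for the PR's 4 epochs at lr=5e-4, Polyak EMA 0.998, and frozen first 9/11 blocks:

```python
import torch
import torch.nn.functional as F

def eval_with_ttt(model, optimizer, chunks):
    # Chunk i is scored under weights adapted only on chunks 0..i-1;
    # only afterwards does the model adapt on chunk i.
    total_nll = 0.0
    for i, (x, y) in enumerate(chunks):
        model.eval()
        with torch.no_grad():                  # score first
            logits = model(x)
            total_nll += F.cross_entropy(
                logits.view(-1, logits.size(-1)), y.view(-1),
                reduction="sum").item()
        if i < len(chunks) - 1:                # last chunk: no adaptation
            model.train()
            optimizer.zero_grad()
            loss = F.cross_entropy(
                model(x).view(-1, logits.size(-1)), y.view(-1))
            loss.backward()
            optimizer.step()
    return total_nll  # nats; divide by (total bytes * ln 2) for BPB
```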

Credits & Acknowledgments

This submission builds on techniques from several prior PRs.

The novel contribution is complementary training: reweighting the training loss by bigram predictability so the neural model specializes on tokens the n-gram cache can't handle, enabling significantly higher eval-time n-gram weight.

Test plan

  • Seed 42: 0.4416 BPB
  • Seed 1337: 0.4416 BPB
  • Seed 2024: 0.4417 BPB

🤖 Generated with Claude Code

3-seed mean: 0.4416 (seeds 42, 1337, 2024, std 0.0001)

Key innovation: complementary training (COMPLEMENT_ALPHA=0.5) trains the model
to specialize on tokens that n-gram caches can't predict, enabling higher
eval-time alpha (n-gram gets 20-75% weight via entropy-adaptive blending).

Stack: 11L VRL + LeakyReLU² + XSA-4 + BackoffNgramMixer (orders 2-10) +
AdamW TTT (4 epochs, Polyak 0.998) + int6 lzma quantization.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
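The int6 lzma quantization named in the commit message, as a minimal per-tensor sketch; symmetric rounding and the helper names are assumptions, and the actual packing and scale layout in train_gpt.py may differ:

```python
import lzma
import numpy as np

def quantize_int6_lzma(weights: np.ndarray):
    # Symmetric per-tensor int6: codes land in [-31, 31] plus one scale.
    scale = max(float(np.abs(weights).max()) / 31.0, 1e-12)
    q = np.clip(np.round(weights / scale), -31, 31).astype(np.int8)
    # lzma squeezes the low-entropy codes under the 16MB artifact cap.
    return lzma.compress(q.tobytes()), scale

def dequantize_int6_lzma(blob: bytes, scale: float, shape):
    q = np.frombuffer(lzma.decompress(blob), dtype=np.int8)
    return q.astype(np.float32).reshape(shape) * scale
```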
newjordan pushed a commit to newjordan/parameter-golf-1 that referenced this pull request Mar 26, 2026
54 adaptive multipliers (order × entropy_bin × count_bin).
Tracks beat rates per (order, low/mid/high entropy, low/mid/high count).
Orders bumped from 2-7 to 2-9 (closer to openai#803's 2-10).
Based on xwing_fast with safe speed boosts.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
newjordan pushed a commit to newjordan/parameter-golf-1 that referenced this pull request Mar 26, 2026
- Complementary training (PR openai#803): downweight tokens bigrams can
  predict, model specializes on what n-grams can't handle
- 3D cubric: 54 adaptive multipliers (order × entropy × count)
- Orders 2-9 (was 2-7)
- Alpha range 0.20-0.75 (was 0.05-0.70) — enabled by complementary
  training making model/n-gram non-redundant
- Safe speed boosts from xwing_fast

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
quietsmile added a commit to quietsmile/parameter-golf that referenced this pull request Mar 26, 2026
Reproduction of PR openai#803's complementary training approach on 8x L20Z (H100).
Two-seed validation: 0.4377 (seed=1337), 0.4380 (seed=42).

Key: bigram-weighted loss reweighting (COMPLEMENT_ALPHA=0.5) trains the
neural model to specialize on tokens n-gram caches can't predict, combined
with BackoffNgramMixer (orders 2-10) and legal score-first AdamW TTT.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
XinghanLi66 added a commit to XinghanLi66/parameter-golf that referenced this pull request Mar 26, 2026
Today (2026-03-26) the leaderboard was transformed by the eval-time n-gram
backoff cache technique. Add comprehensive context for agents:

- URGENT_ngram_backoff_breakthrough.md: full implementation guide with
  NgramEvalCache code, entropy-adaptive alpha, complementary training,
  priority order for implementation
- latest_sota_snapshot.md: updated with new PR landscape
- 3 reference code files from top PRs (openai#809 0.295, openai#803 0.442, openai#813 0.667)

The n-gram backoff is purely eval-time — adding it to our existing best
checkpoint should immediately jump from 1.119 to ~0.67 BPB.
Implementing it is now the single highest-priority task.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
XinghanLi66 added a commit to XinghanLi66/parameter-golf that referenced this pull request Mar 26, 2026
…(legality review)

- SOTA target is now PR openai#803: Complementary Training + Backoff N-gram + TTT
- PR openai#809 (0.2952) excluded pending legality review
- research_memory.md: fix Working SOTA Anchor section (agent had written it
  to explicitly ignore the URGENT file and stick to 1.1194 — removed that)
- All PR openai#809 references updated to PR openai#803/openai#813
- Dashboard: SOTA now 0.4416, gap 0.681

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
RoyiRa added a commit to RoyiRa/parameter-golf that referenced this pull request Mar 26, 2026
TrainNgramTracker maintains online bigram counts from training data.
Per-token loss weight = 1 - alpha * P(y|x), clamped at 0.1.
Model focuses capacity on hard-to-predict tokens, complementing the
eval-time n-gram cache.

PR openai#803 showed -0.258 BPB from this technique (0.700 → 0.442).
Enabled via COMPLEMENT_ALPHA=0.5 (default 0, disabled).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
michaelwinczuk added a commit to michaelwinczuk/parameter-golf that referenced this pull request Mar 29, 2026
3-seed mean 0.4027 BPB (std 0.0015): 1337=0.4024, 42=0.4044, 2024=0.4014
All artifacts under 16MB. Beats openai#803 (0.4416) by 0.0389 BPB.

Causal sequential chunk eval with BackoffNgramMixer (orders 2-10).
Swarm-guided training with KG-conditioned embedding init.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@MatoTeziTanka
Copy link
Copy Markdown

Community Review — Record: 0.4416 BPB -- Complementary Training + Backoff N-gram Mixer

BPB: 0.4416 | Compliance: LOOKS CLEAN — score-first-per-chunk TTT (legal #1416/#1423 pattern)

What I found in the code (head SHA d9cf80d9d26e, file records/track_10min_16mb/2026-03-26_ComplementaryBackoff_0.4416/train_gpt.py):

The TTT path at line 1029 implements the score-first-per-chunk pattern: each chunk is scored under torch.no_grad() / inference_mode() before the base_model.train() + SGD adaptation runs on that same chunk, with an is_last_chunk guard so the final chunk gets no adaptation pass. This is the structural shape the legal frontier uses (PRs #1416 erichroepke, #1423 aryanbhosale).

Per Issue #402 and Issue #677, TTT is legal when each token is scored before the adapter updates on it, and that's what the code does here — chunk ci is scored under weights adapted only on chunks 0..ci-1. No prequant_ttt_adapt_adamw(val_tokens, ...) multi-epoch fine-tune, no scored-region SLOT, no target-in-key n-gram cache.

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.25s, dim=512, layers=11, vocab=1024, code=94053 B, SMOKE_TEST_PASS

Verdict: LOOKS CLEAN.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending standard checks (3-seed validation, 16MB artifact cap, 10-min wallclock on 8×H100 SXM). The compliance picture matches the legal reference frontier and no flags were raised by the classification pass.

Auto-classification caveat: this review was drafted by the AST-based classifier against a template derived from manually-reviewed cluster PRs (#1420, #1450, #1487, #1541, #1529, #1533, #1518). If I've misread a subtlety in your eval path — e.g., multi-epoch TTT that I mistook for single-pass, or a target-in-key lookup I missed in a helper function — please flag it and I'll re-run the audit manually.


Reviewed by @MatoTeziTanka, The Agora. Classification via deterministic AST-based classify_prs.py (pattern bank derived from ~65 manually-reviewed PRs earlier in the 2026-04-11 sweep). This review was auto-drafted from a template and spot-checked before posting — if the template misread your code, please call it out so I can iterate the classifier.

michaelwinczuk added a commit to michaelwinczuk/parameter-golf that referenced this pull request Apr 11, 2026
Pre-answers the "where does the 0.0458 improvement come from" question
using exact log excerpts from the three archived runs that produced
submission.json:

  seed 7:    neural 1.1481 -> +mixer 0.3948  (delta 0.7533)
  seed 1337: neural 1.1480 -> +mixer 0.3957  (delta 0.7523)
  seed 2024: neural 1.1492 -> +mixer 0.3969  (delta 0.7523)
  mean:      neural 1.1484 -> +mixer 0.3958  (delta 0.7526)

Includes the mixer convergence curve for seed 7 (1.176 -> 0.395 as counts
accumulate in strict score-first order) and positions the submission as
an eval-stage refinement of already-merged openai#779 and openai#803 rather than a
novel training method.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
himanalot added a commit to himanalot/parameter-golf that referenced this pull request Apr 16, 2026
…g + TTT.

Results on our hardware:
- PR openai#834 (11L, learned routing head + n-gram orders 2-7 + TTT 4ep): 0.1591 BPB
- Their reported: 0.1663 (we got slightly better)
- Eval time: 675s (over 600s budget — torch.compile slower on our hardware)
- PR openai#803 baseline: 0.4377 (complementary training + n-gram order 10)
- PR openai#803 + 14L: 0.4356 (slight improvement from depth)

N-gram progression on our 14L model:
- Order=5 alpha=0.40: 0.9870
- Order=7 alpha=0.55: 0.8977
- Complementary + order=10 alpha=0.75: 0.8264

Next: implement PPM/CTW to replace n-gram backoff, add 14L to PR834 script

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
