
Record: BackoffNgramMixer (mean val_bpb=0.6671) #813

Open

hypery11 wants to merge 1 commit into openai:main from hypery11:submission/2026-03-26_champion_final

Conversation

@hypery11

Results

| Seed | val_bpb | Eval time |
|------|---------|-----------|
| 42   | 0.6672  | 512s      |
| 1337 | 0.6673  | ~512s     |
| 2024 | 0.6667  | ~512s     |
| Mean | 0.6671  |           |
| Std  | 0.0003  |           |
  • Artifact: ~16.0 MB
  • Train: 600s on 8xH100 SXM
  • Eval: ~512s (under 600s limit)

Method

An 11-layer transformer (512d, 8/8 full MHA, XSA-all, LeakyReLU(0.5)^2 activation, 3.5x MLP) combined with a BackoffNgramMixer using entropy-adaptive alpha over n-gram orders 2-7. N-gram scoring is score-first and backward-looking.
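
For readers, a minimal illustrative sketch of the backoff-and-mix idea (plain-dict count tables, a simple entropy-based alpha schedule, and all names here are stand-ins, not the actual BackoffNgramMixer implementation):

```python
# Illustrative sketch only: dict-based count tables and a simple entropy-driven
# alpha, standing in for a hashed BackoffNgramMixer over orders 2-7.
import math
from collections import defaultdict

ORDERS = range(7, 1, -1)             # highest order first, back off down to 2
full_counts = defaultdict(int)       # (context, target) -> count
ctx_counts = defaultdict(int)        # context -> count

def update(tokens):
    """Accumulate counts for every order from an already-seen token stream."""
    for i in range(1, len(tokens)):
        tgt = tokens[i]
        for n in ORDERS:
            if i < n - 1:
                continue
            ctx = tuple(tokens[i - (n - 1):i])
            full_counts[(ctx, tgt)] += 1
            ctx_counts[ctx] += 1

def mixed_prob(context, tgt, model_probs):
    """Blend the model probability with the highest-order n-gram estimate available.

    Assumed schedule: alpha grows with the model's predictive entropy, so the
    n-gram estimate gets more weight exactly when the model is uncertain.
    """
    entropy = -sum(p * math.log(p + 1e-12) for p in model_probs)
    alpha = min(1.0, entropy / math.log(len(model_probs)))
    for n in ORDERS:
        if len(context) < n - 1:
            continue
        ctx = tuple(context[-(n - 1):])
        denom = ctx_counts.get(ctx, 0)
        if denom == 0:
            continue                  # unseen context at this order: back off
        p_ngram = full_counts[(ctx, tgt)] / denom
        return (1.0 - alpha) * model_probs[tgt] + alpha * p_ngram
    return model_probs[tgt]           # no order matched: pure model probability
```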

  • 8xH100 SXM, train <=600s
  • Eval <=600s (512s)
  • Artifact <=16MB
  • 3-seed validation (std 0.0003)

Seeds: 0.6672 / 0.6673 / 0.6667 (std 0.0003).
11L XSA-all 8/8 MHA, BackoffNgramMixer orders 2-7.
~16MB artifact. Train 600s, eval 512s.
XinghanLi66 added a commit to XinghanLi66/parameter-golf that referenced this pull request Mar 26, 2026
Today (2026-03-26) the leaderboard was transformed by the eval-time n-gram
backoff cache technique. Add comprehensive context for agents:

- URGENT_ngram_backoff_breakthrough.md: full implementation guide with
  NgramEvalCache code, entropy-adaptive alpha, complementary training,
  priority order for implementation
- latest_sota_snapshot.md: updated with new PR landscape
- 3 reference code files from top PRs (openai#809 0.295, openai#803 0.442, openai#813 0.667)

The n-gram backoff is purely eval-time — adding it to our existing best
checkpoint should immediately drop our score from 1.119 to ~0.67 BPB.
Implementing it is now the single highest-priority task.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
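
To illustrate the "purely eval-time" property that commit message describes, here is a minimal sketch in which each validation token is scored before its counts enter the cache, so the cache stays strictly backward-looking; the bigram-only cache and fixed 0.5 blend weight are simplifying assumptions, not the referenced NgramEvalCache code:

```python
# Hypothetical eval-time loop: score first, then update the cache, so the
# counts never include the token currently being scored.
import math
from collections import defaultdict

def eval_bpb_eval_time_cache(model_probs_at, val_tokens):
    """model_probs_at(i) returns the frozen checkpoint's distribution at position i."""
    pair_counts = defaultdict(int)        # (prev_token, token) -> count (bigram cache)
    prev_counts = defaultdict(int)        # prev_token -> count
    nll_bits = 0.0
    for i in range(1, len(val_tokens)):
        prev, tgt = val_tokens[i - 1], val_tokens[i]
        probs = model_probs_at(i)         # checkpoint parameters never change
        p_model = probs[tgt]
        # Score first, using only counts gathered from positions < i.
        if prev_counts[prev] > 0:
            p_ngram = pair_counts[(prev, tgt)] / prev_counts[prev]
            p = 0.5 * p_model + 0.5 * p_ngram   # fixed blend weight, for simplicity
        else:
            p = p_model
        nll_bits += -math.log2(max(p, 1e-12))
        # Only after scoring does token i enter the cache.
        pair_counts[(prev, tgt)] += 1
        prev_counts[prev] += 1
    return nll_bits / (len(val_tokens) - 1)     # mean bits per token
```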
XinghanLi66 added a commit to XinghanLi66/parameter-golf that referenced this pull request Mar 26, 2026
…(legality review)

- SOTA target is now PR openai#803: Complementary Training + Backoff N-gram + TTT
- PR openai#809 (0.2952) excluded pending legality review
- research_memory.md: fix Working SOTA Anchor section (agent had written it
  to explicitly ignore the URGENT file and stick to 1.1194 — removed that)
- All PR openai#809 references updated to PR openai#803/openai#813
- Dashboard: SOTA now 0.4416, gap 0.681

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@MatoTeziTanka

Community Review — Record: BackoffNgramMixer (mean val_bpb=0.6671)

Compliance: LOOKS CLEAN — legal score-first-per-chunk TTT (PR #1413 pattern)

PR #813 — BackoffNgramMixer — Audit

Head SHA: 9681865
File audited: records/track_10min_16mb/2026-03-26_BackoffNgramMixer/train_gpt.py

---

### Check 1: ILLEGAL n-gram family bug (target XOR'd into hash key, NOT BigramHash)

PASS — no bug. BackoffNgramMixer.update() (lines 66–77) builds full_key = ((ctx_hash ^ (tgt * self.primes[cw])) & mask) — the target IS XOR'd into the combined key. This is the correct n-gram counting pattern: full_counts[ctx+target] and ctx_counts[context] are maintained separately. At query time in mix_and_score() (line 113), the same formula is used to look up full_counts[ctx+target] / ctx_counts[context], which correctly computes P(target|context). No future-token leakage: the lookup uses the actual y_np target, which is the token being scored — this is legal n-gram scoring, not a hash-key contamination bug. BigramHashEmbedding (lines 607–630) XORs only adjacent context tokens (t[..., 1:] and t[..., :-1]), no target tokens. Clean.

---

### Check 2: ILLEGAL Pre-Quant TTT (multi-epoch on val_tokens without score-first)

PASS — not present. Training on val_tokens only occurs in Phase 2 of eval_val_sliding_ttt (lines 1083–1143), which executes strictly after Phase 1 scoring for the same chunk. There is no standalone multi-epoch loop over val_tokens outside this function.

---

### Check 3: LEGAL score-first TTT (PR #1413 pattern, is_last_chunk guard)

PRESENT — legal. eval_val_sliding_ttt (lines 895–1166) implements the required pattern:

- Phase 1 (score): lines 1010–1070 — chunk windows scored under torch.inference_mode(), losses accumulated into loss_sum.
- Phase 2 (train): lines 1083–1143 — training on the same chunk's tokens, gated by is_last_chunk = (ci == num_chunks - 1) (line 1084). The last chunk is scored...
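
For reference, the counting and lookup pattern examined in Check 1 can be sketched as follows (the table size, prime, and names are placeholders, not the PR's actual values or self.primes):

```python
# Sketch of the hashed count pattern from Check 1: the target is folded into
# the joint hash key, while the context hash alone keys the denominator.
import numpy as np

MASK = (1 << 22) - 1                                    # hashed table size (placeholder)
PRIME = 1_000_003                                       # per-order prime (placeholder)

full_counts = np.zeros(MASK + 1, dtype=np.int64)        # counts of (context, target)
ctx_counts = np.zeros(MASK + 1, dtype=np.int64)         # counts of context alone

def ngram_update(ctx_hash: int, tgt: int) -> None:
    """Record one (context, target) observation."""
    full_key = (ctx_hash ^ (tgt * PRIME)) & MASK        # target XOR'd into the joint key
    full_counts[full_key] += 1
    ctx_counts[ctx_hash & MASK] += 1

def ngram_prob(ctx_hash: int, tgt: int) -> float:
    """P(target | context) from the two tables; 0.0 when the context is unseen."""
    denom = ctx_counts[ctx_hash & MASK]
    if denom == 0:
        return 0.0
    return float(full_counts[(ctx_hash ^ (tgt * PRIME)) & MASK]) / float(denom)
```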

Verdict: LOOKS CLEAN — legal TTT implementation matching the PR #1413 (dexhunter) pattern: each chunk scored under torch.no_grad() before optimizer.step(), with is_last_chunk guard preventing adaptation on the final scored chunk.
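
A minimal sketch of that score-first-per-chunk discipline, with placeholder model/optimizer/chunk objects (the real eval_val_sliding_ttt is considerably more involved):

```python
# Minimal sketch of score-first-per-chunk TTT with an is_last_chunk guard
# (placeholder objects; not the PR's eval_val_sliding_ttt).
import torch

def score_first_ttt(model, optimizer, chunks, compute_loss):
    loss_sum, tok_count = 0.0, 0
    num_chunks = len(chunks)
    for ci, chunk in enumerate(chunks):
        # Phase 1 (score): the chunk is scored before any parameter update
        # that has seen its tokens, so scoring never benefits from its own data.
        model.eval()
        with torch.inference_mode():
            loss_sum += compute_loss(model, chunk).item() * chunk.numel()
        tok_count += chunk.numel()

        # Phase 2 (adapt): train on the chunk just scored. The guard skips
        # adaptation on the final chunk, since nothing remains to be scored.
        is_last_chunk = (ci == num_chunks - 1)
        if not is_last_chunk:
            model.train()
            compute_loss(model, chunk).backward()
            optimizer.step()
            optimizer.zero_grad(set_to_none=True)
    return loss_sum / tok_count          # mean per-token loss over all chunks
```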

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). TTT implementation follows the legal score-first discipline.


Reviewed by @MatoTeziTanka (The Agora). Compliance audit via an LLM agent (Sonnet) reviewing the full train_gpt.py source, cross-checked against a deterministic AST classifier. If this review misread your code, please call it out so I can re-audit manually.

