
Record: BackoffNgramMixer (mean val_bpb=0.6671) #813

Open

hypery11 wants to merge 1 commit into openai:main from hypery11:submission/2026-03-26_champion_final

Conversation

@hypery11

Results

| Seed | val_bpb | Eval time |
|------|---------|-----------|
| 42   | 0.6672  | 512s      |
| 1337 | 0.6673  | ~512s     |
| 2024 | 0.6667  | ~512s     |
| Mean | 0.6671  |           |
| Std  | 0.0003  |           |
  • Artifact: ~16.0 MB
  • Train: 600s on 8xH100 SXM
  • Eval: ~512s (under 600s limit)

Method

An 11-layer transformer (512d, 8/8 full MHA, XSA-all, LeakyReLU(0.5)^2 activation, 3.5x MLP) combined with a BackoffNgramMixer using entropy-adaptive alpha over n-gram orders 2-7. N-gram scoring is score-first and backward-looking.
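
For readers, a minimal illustrative sketch of the backoff-and-mix idea (plain-dict count tables, a simple entropy-based alpha schedule, and all names here are stand-ins, not the actual BackoffNgramMixer implementation):

```python
# Illustrative sketch only: dict-based count tables and a simple entropy-driven
# alpha, standing in for a hashed BackoffNgramMixer over orders 2-7.
import math
from collections import defaultdict

ORDERS = range(7, 1, -1)             # highest order first, back off down to 2
full_counts = defaultdict(int)       # (context, target) -> count
ctx_counts = defaultdict(int)        # context -> count

def update(tokens):
    """Accumulate counts for every order from an already-seen token stream."""
    for i in range(1, len(tokens)):
        tgt = tokens[i]
        for n in ORDERS:
            if i < n - 1:
                continue
            ctx = tuple(tokens[i - (n - 1):i])
            full_counts[(ctx, tgt)] += 1
            ctx_counts[ctx] += 1

def mixed_prob(context, tgt, model_probs):
    """Blend the model probability with the highest-order n-gram estimate available.

    Assumed schedule: alpha grows with the model's predictive entropy, so the
    n-gram estimate gets more weight exactly when the model is uncertain.
    """
    entropy = -sum(p * math.log(p + 1e-12) for p in model_probs)
    alpha = min(1.0, entropy / math.log(len(model_probs)))
    for n in ORDERS:
        if len(context) < n - 1:
            continue
        ctx = tuple(context[-(n - 1):])
        denom = ctx_counts.get(ctx, 0)
        if denom == 0:
            continue                  # unseen context at this order: back off
        p_ngram = full_counts[(ctx, tgt)] / denom
        return (1.0 - alpha) * model_probs[tgt] + alpha * p_ngram
    return model_probs[tgt]           # no order matched: pure model probability
```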

  • 8xH100 SXM, train <=600s
  • Eval <=600s (512s)
  • Artifact <=16MB
  • 3-seed validation (std 0.0003)

Seeds: 0.6672 / 0.6673 / 0.6667 (std 0.0003).
11L XSA-all 8/8 MHA, BackoffNgramMixer orders 2-7.
~16MB artifact. Train 600s, eval 512s.
XinghanLi66 added a commit to XinghanLi66/parameter-golf that referenced this pull request Mar 26, 2026
Today (2026-03-26) the leaderboard was transformed by the eval-time n-gram
backoff cache technique. Add comprehensive context for agents:

- URGENT_ngram_backoff_breakthrough.md: full implementation guide with
  NgramEvalCache code, entropy-adaptive alpha, complementary training,
  priority order for implementation
- latest_sota_snapshot.md: updated with new PR landscape
- 3 reference code files from top PRs (openai#809 0.295, openai#803 0.442, openai#813 0.667)

The n-gram backoff is purely eval-time — adding it to our existing best
checkpoint should immediately drop our score from 1.119 to ~0.67 BPB.
Implementing it is now the single highest-priority task.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
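
To illustrate the "purely eval-time" property that commit message describes, here is a minimal sketch in which each validation token is scored before its counts enter the cache, so the cache stays strictly backward-looking; the bigram-only cache and fixed 0.5 blend weight are simplifying assumptions, not the referenced NgramEvalCache code:

```python
# Hypothetical eval-time loop: score first, then update the cache, so the
# counts never include the token currently being scored.
import math
from collections import defaultdict

def eval_bpb_eval_time_cache(model_probs_at, val_tokens):
    """model_probs_at(i) returns the frozen checkpoint's distribution at position i."""
    pair_counts = defaultdict(int)        # (prev_token, token) -> count (bigram cache)
    prev_counts = defaultdict(int)        # prev_token -> count
    nll_bits = 0.0
    for i in range(1, len(val_tokens)):
        prev, tgt = val_tokens[i - 1], val_tokens[i]
        probs = model_probs_at(i)         # checkpoint parameters never change
        p_model = probs[tgt]
        # Score first, using only counts gathered from positions < i.
        if prev_counts[prev] > 0:
            p_ngram = pair_counts[(prev, tgt)] / prev_counts[prev]
            p = 0.5 * p_model + 0.5 * p_ngram   # fixed blend weight, for simplicity
        else:
            p = p_model
        nll_bits += -math.log2(max(p, 1e-12))
        # Only after scoring does token i enter the cache.
        pair_counts[(prev, tgt)] += 1
        prev_counts[prev] += 1
    return nll_bits / (len(val_tokens) - 1)     # mean bits per token
```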
XinghanLi66 added a commit to XinghanLi66/parameter-golf that referenced this pull request Mar 26, 2026
…(legality review)

- SOTA target is now PR openai#803: Complementary Training + Backoff N-gram + TTT
- PR openai#809 (0.2952) excluded pending legality review
- research_memory.md: fix Working SOTA Anchor section (agent had written it
  to explicitly ignore the URGENT file and stick to 1.1194 — removed that)
- All PR openai#809 references updated to PR openai#803/openai#813
- Dashboard: SOTA now 0.4416, gap 0.681

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@MatoTeziTanka

Community Review — Record: BackoffNgramMixer (mean val_bpb=0.6671)

Compliance: LOOKS CLEAN — legal score-first-per-chunk TTT (PR #1413 pattern)

PR #813 — BackoffNgramMixer — Audit

Head SHA: 9681865
File audited: records/track_10min_16mb/2026-03-26_BackoffNgramMixer/train_gpt.py

---

### Check 1: ILLEGAL n-gram family bug (target XOR'd into hash key, NOT BigramHash)

PASS — no bug. BackoffNgramMixer.update() (lines 66–77) builds full_key = ((ctx_hash ^ (tgt * self.primes[cw])) & mask) — the target IS XOR'd into the combined key. This is the correct n-gram counting pattern: full_counts[ctx+target] and ctx_counts[context] are maintained separately. At query time in mix_and_score() (line 113), the same formula is used to look up full_counts[ctx+target] / ctx_counts[context], which correctly computes P(target|context). No future-token leakage: the lookup uses the actual y_np target, which is the token being scored — this is legal n-gram scoring, not a hash-key contamination bug. BigramHashEmbedding (lines 607–630) XORs only adjacent context tokens (t[..., 1:] and t[..., :-1]), no target tokens. Clean.

---

### Check 2: ILLEGAL Pre-Quant TTT (multi-epoch on val_tokens without score-first)

PASS — not present. Training on val_tokens only occurs in Phase 2 of eval_val_sliding_ttt (lines 1083–1143), which executes strictly after Phase 1 scoring for the same chunk. There is no standalone multi-epoch loop over val_tokens outside this function.

---

### Check 3: LEGAL score-first TTT (PR #1413 pattern, is_last_chunk guard)

PRESENT — legal. eval_val_sliding_ttt (lines 895–1166) implements the required pattern:

- Phase 1 (score): lines 1010–1070 — chunk windows scored under torch.inference_mode(), losses accumulated into loss_sum.
- Phase 2 (train): lines 1083–1143 — training on the same chunk's tokens, gated by is_last_chunk = (ci == num_chunks - 1) (line 1084). The last chunk is scored...
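
For reference, the counting and lookup pattern examined in Check 1 can be sketched as follows (the table size, prime, and names are placeholders, not the PR's actual values or self.primes):

```python
# Sketch of the hashed count pattern from Check 1: the target is folded into
# the joint hash key, while the context hash alone keys the denominator.
import numpy as np

MASK = (1 << 22) - 1                                    # hashed table size (placeholder)
PRIME = 1_000_003                                       # per-order prime (placeholder)

full_counts = np.zeros(MASK + 1, dtype=np.int64)        # counts of (context, target)
ctx_counts = np.zeros(MASK + 1, dtype=np.int64)         # counts of context alone

def ngram_update(ctx_hash: int, tgt: int) -> None:
    """Record one (context, target) observation."""
    full_key = (ctx_hash ^ (tgt * PRIME)) & MASK        # target XOR'd into the joint key
    full_counts[full_key] += 1
    ctx_counts[ctx_hash & MASK] += 1

def ngram_prob(ctx_hash: int, tgt: int) -> float:
    """P(target | context) from the two tables; 0.0 when the context is unseen."""
    denom = ctx_counts[ctx_hash & MASK]
    if denom == 0:
        return 0.0
    return float(full_counts[(ctx_hash ^ (tgt * PRIME)) & MASK]) / float(denom)
```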

Verdict: LOOKS CLEAN — legal TTT implementation matching the PR #1413 (dexhunter) pattern: each chunk scored under torch.no_grad() before optimizer.step(), with is_last_chunk guard preventing adaptation on the final scored chunk.
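
A minimal sketch of that score-first-per-chunk discipline, with placeholder model/optimizer/chunk objects (the real eval_val_sliding_ttt is considerably more involved):

```python
# Minimal sketch of score-first-per-chunk TTT with an is_last_chunk guard
# (placeholder objects; not the PR's eval_val_sliding_ttt).
import torch

def score_first_ttt(model, optimizer, chunks, compute_loss):
    loss_sum, tok_count = 0.0, 0
    num_chunks = len(chunks)
    for ci, chunk in enumerate(chunks):
        # Phase 1 (score): the chunk is scored before any parameter update
        # that has seen its tokens, so scoring never benefits from its own data.
        model.eval()
        with torch.inference_mode():
            loss_sum += compute_loss(model, chunk).item() * chunk.numel()
        tok_count += chunk.numel()

        # Phase 2 (adapt): train on the chunk just scored. The guard skips
        # adaptation on the final chunk, since nothing remains to be scored.
        is_last_chunk = (ci == num_chunks - 1)
        if not is_last_chunk:
            model.train()
            compute_loss(model, chunk).backward()
            optimizer.step()
            optimizer.zero_grad(set_to_none=True)
    return loss_sum / tok_count          # mean per-token loss over all chunks
```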

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). TTT implementation follows the legal score-first discipline.


Reviewed by @MatoTeziTanka (The Agora). Compliance audit via an LLM agent (Sonnet) reviewing the full train_gpt.py source, cross-checked against a deterministic AST classifier. If this review misread your code, please call it out so I can re-audit manually.

