Record: NgramRes + Sliding-Window Attention + Legal Score-First TTT. … by ghrua · Pull Request #1834 · openai/parameter-golf

ghrua · 2026-04-26T14:17:26Z

Record: NgramRes + Sliding-Window Attention + Legal Score-First TTT

val_bpb = 1.08034 (3-seed mean, std 0.00034) | ~15.99 MB | 8xH100 SXM

3-Seed Results

Seed	Sliding BPB	TTT BPB	Code stub (bytes)	Model (bytes)	Total (bytes)
42	1.08173	1.08039	19,940	15,966,406	15,986,346
314	1.08211	1.08064	19,940	15,967,359	15,987,299
999	1.08137	1.07997	19,940	15,971,159	15,991,099
Mean	1.08174	1.08034	19,940	15,968,308	15,988,248
Std	0.00037	0.00034	—	2,515	2,515
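
As a sanity check on the table's arithmetic, the summary rows can be recomputed from the per-seed rows. This is a minimal sketch: the numbers are copied from the table, `SIZE_CAP_BYTES` assumes the 16 MB limit is decimal (16,000,000 bytes), and the sample standard deviation (n − 1 denominator) is assumed because it reproduces the reported values. Because the per-seed BPB values are already rounded, the recomputed mean may differ from the table in the last digit.

```python
from statistics import mean, stdev

# Per-seed values copied from the 3-seed results table.
ttt_bpb = {42: 1.08039, 314: 1.08064, 999: 1.07997}
model_bytes = {42: 15_966_406, 314: 15_967_359, 999: 15_971_159}
CODE_STUB_BYTES = 19_940

# Assumption: the 16 MB cap is decimal, i.e. 16,000,000 bytes.
SIZE_CAP_BYTES = 16_000_000

ttt_mean = mean(ttt_bpb.values())
ttt_std = stdev(ttt_bpb.values())        # sample std (n - 1 denominator)
model_std = stdev(model_bytes.values())  # sample std of the model sizes

# Total artifact size per seed = compressed model + code stub.
totals = {seed: size + CODE_STUB_BYTES for seed, size in model_bytes.items()}

print(f"TTT BPB: mean={ttt_mean:.5f} std={ttt_std:.5f}")
print(f"Model size std: {model_std:.0f} bytes")
for seed, total in totals.items():
    print(f"seed {seed}: total={total:,} bytes, headroom={SIZE_CAP_BYTES - total:,}")
```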

Apr-9 record reference (track_10min_16mb/2026-04-09_SP8192_3LayerRecur_ParResid_QK525_LegalTTT): val_bpb = 1.0810.

Key Techniques

We decompose language modeling into two components: an n-gram model for local context and a residual model for long-range context:

logits(y | x) = α(t) · ngram(x)  +  (1 − α(t)) · residual(x)
loss          = cross_entropy( logits, y )

The basic idea is from [1]: spending model capacity on knowledge that an n-gram model can capture cheaply (the local context) seems wasteful.
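
The mixing formula can be illustrated with a small standard-library sketch. It is not the submission's code: `mix_logits` and `cross_entropy` are hypothetical helper names, the logit values are made up, and α is held at a constant 0.3 even though the formula writes α(t).

```python
import math

def mix_logits(ngram_logits, residual_logits, alpha):
    """Convex combination of the two logit vectors: alpha * ngram + (1 - alpha) * residual."""
    return [alpha * n + (1.0 - alpha) * r
            for n, r in zip(ngram_logits, residual_logits)]

def cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits), computed stably."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[target]

# Hypothetical logits over a 4-token vocabulary for one position.
ngram_logits = [2.0, 0.5, -1.0, 0.0]     # local-context (n-gram) branch
residual_logits = [0.0, 3.0, -0.5, 0.5]  # long-context (residual) branch
alpha = 0.3                              # assumption: constant mixing weight

mixed = mix_logits(ngram_logits, residual_logits, alpha)
loss = cross_entropy(mixed, target=1)
print(mixed, loss)
```

Because the loss is taken on the mixed logits, the gradient reaching the residual branch is computed after the n-gram contribution has been added, so the residual branch is only pushed to explain what the n-gram branch does not.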

Neural Ngram Model — a small 3-gram MLP (2 layers, d_hidden = 64, d_embed = 64) reads the same input embeddings, produces logits via the tied output projection, and is mixed into the main logits with α = 0.3. To save memory, the head shares both the input embedding (NGRAM_SHARE_EMB=1) and a per-position pad embedding, and ties its output to tok_emb.weight (NGRAM_TIE_OUTPUT=1). Adds ~0.6 M params (~4 % of the model); int6-quantized identically to the rest of the matrices.
Sliding-Window Attention on layers 0-3 — flash_attn_3_func(window_size=(512, 0)) on the first four blocks; layers 4-10 keep full causal attention (SWA_LAST_EARLY_LAYER=4, SWA_WINDOW=512). Frees attention compute on the early layers that handle local syntax, leaving more wallclock for the rest of the stack.
Legal Score-First TTT — SGD (lr = 0.005, momentum = 0.9), 3 epochs per 32K-token chunk, gradient clip 1.0. Score-before-update ordering is preserved. This is the same legal framework as the Apr-9 record (PR #1493: SP8192 + 3-Layer Recurrence + Parallel Residuals + QK-Gain 5.25 + Legal TTT, val_bpb 1.0810, 3-seed mean).
GPTQ int6 + int8 embeddings + Brotli — full-Hessian GPTQ with SDClip (k = 12.85 for matrices, k = 20.0 for embeddings). The NgramRes head matrices are included in the int6 quantize set. Code shipped as a 2-line LZMA + base85 stub (~20 KB).
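
To make the sliding-window item above concrete, the sketch below builds the boolean attention mask that a causal window of `(left, 0)` corresponds to, and picks the window per layer. The helper names are hypothetical, and the inclusive bound (query `i` sees keys `i - left` through `i`) follows the convention documented for flash-attn's `window_size` argument; that `flash_attn_3_func` behaves identically is an assumption here.

```python
def sliding_window_causal_mask(seq_len, window_left):
    """mask[i][j] is True when query i may attend to key j.

    Assumed convention: keys j with i - window_left <= j <= i are visible.
    """
    return [[(i - window_left) <= j <= i for j in range(seq_len)]
            for i in range(seq_len)]

def layer_window(layer_idx, swa_last_early_layer=4, swa_window=512):
    """Left window for a layer, or None for full causal attention.

    Assumption: SWA_LAST_EARLY_LAYER=4 means layers 0..3 are windowed,
    matching the "layers 0-3" description in the text.
    """
    return swa_window if layer_idx < swa_last_early_layer else None

# Tiny example: 6 positions, each query sees itself plus 2 previous keys.
mask = sliding_window_causal_mask(seq_len=6, window_left=2)
for row in mask:
    print("".join("x" if visible else "." for visible in row))

windowed_layers = [i for i in range(11) if layer_window(i) is not None]
print("windowed layers:", windowed_layers)
```

Under this convention a window of 512 lets each query see at most 513 keys, so attention cost in the early layers grows linearly rather than quadratically with sequence length.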

…3-seed Mean 1.08034 (std 0.00034), seeds 42/314/999. All artifacts under 16MB, training under 600s, eval under 600s. Score-first TTT (SGD 3ep, cosine decay), no SLOT, no pre-quant TTT.
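
The score-first ordering can be shown with a toy loop: each chunk is scored with weights that have not yet been trained on it, and only afterwards is the model adapted on that chunk. Everything here is illustrative rather than the submission's implementation. The "model" is a bare unigram logit vector, the chunks hold 4 tokens instead of 32K, and the per-chunk cosine schedule is an assumption, since the source says only "cosine decay".

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]

def mean_nll_and_grad(logits, tokens):
    """Mean negative log-likelihood of `tokens` under softmax(logits), and its gradient."""
    probs = softmax(logits)
    nll = -sum(math.log(probs[t]) for t in tokens) / len(tokens)
    grad = list(probs)                      # d(nll)/d(logit_k) = p_k - freq_k
    for t in tokens:
        grad[t] -= 1.0 / len(tokens)
    return nll, grad

def score_first_ttt(chunks, vocab_size, lr=0.005, momentum=0.9, epochs=3, clip=1.0):
    logits = [0.0] * vocab_size             # toy stand-in for model weights
    velocity = [0.0] * vocab_size
    scores = []
    for i, chunk in enumerate(chunks):
        # 1) Score first: these weights have never been updated on this chunk.
        nll, _ = mean_nll_and_grad(logits, chunk)
        scores.append(nll)
        # 2) Then adapt on the already-scored chunk (assumed cosine decay over chunks).
        chunk_lr = lr * 0.5 * (1.0 + math.cos(math.pi * i / len(chunks)))
        for _ in range(epochs):
            _, grad = mean_nll_and_grad(logits, chunk)
            norm = math.sqrt(sum(g * g for g in grad))
            scale = min(1.0, clip / norm) if norm > 0 else 1.0  # gradient-norm clip
            for k in range(vocab_size):
                velocity[k] = momentum * velocity[k] + scale * grad[k]
                logits[k] -= chunk_lr * velocity[k]
    return scores

chunks = [[0, 0, 1, 0], [0, 1, 0, 0], [0, 0, 0, 2]]  # hypothetical token stream
scores = score_first_ttt(chunks, vocab_size=4)
print(scores)
```

The first chunk's score equals that of the untouched model (log 4 for uniform logits), which is the property the ordering is meant to guarantee: no token contributes to the reported loss after the model has trained on it.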

…1835 PPM-D 1.00136 new watch; NgramRes stackable; Day 17 plateau; Session 22
- Upstream commit 7427de2 (Alex Zhao, OpenAI Apr 26): Scylla 0.9485 (PR openai#1184) removed as invalid record; PR openai#1813 (djeidy Scylla 0.94166) effectively dead by proxy
- PR openai#1835 (anmarhindi, 1.00136): PPM-D order-5 byte mixture, binary-λ gate, score-first, 15,993,020 bytes; most credible extraordinary claim yet; wait 24h for community BPB check
- PR openai#1834 (ghrua, 1.08034): NgramRes 3-gram MLP +0.6M params + sliding-window attn layers 0-3; modest, stackable
- PR openai#731 (Hedge Mixer): still OPEN, 2 seeds pending, no merge
- Merged SOTA 1.0810 definitively confirmed; target ≤1.0760; 4 days to deadline
https://claude.ai/code/session_01XbdTRT7zPHoGp3LfQV4yXF

h1beee · 2026-04-27T04:37:10Z

Your 1.08034 does not beat the current leader (2026-04-09, bpb 1.0810) by 0.005 nats.

See: https://github.com/openai/parameter-golf#submission-process

Record: NgramRes + Sliding-Window Attention + Legal Score-First TTT. …

a962a4d

ghrua closed this Apr 27, 2026

ghrua wants to merge 1 commit into openai:main from ghrua:main