Submission: DominationV2 + BOS-Reset Bigram Cache + TTT (val_bpb=1.1382, 3-seed mean) by shouryamaanjain · Pull Request #958 · openai/parameter-golf

shouryamaanjain · 2026-03-27T14:07:54Z

DominationV2 + BOS-Reset Bigram Cache + TTT

val_bpb: 1.1382 (3-seed mean, std 0.0010) | ~15.5 MB | 8xH100 SXM

Results (8xH100 80GB SXM, PyTorch 2.9.1+cu128)

Seed	step_avg	steps	val_bpb	Artifact
1337	69.7ms	8,611	1.1371	15,504,722
42	69.8ms	8,605	1.1385	15,579,418
2025	69.7ms	8,621	1.1389	15,505,762
Mean	69.7ms	8,612	1.1382

Timing Budget

Phase	Time
Training (8,611 steps @ 69.7ms)	600s
TTT (3 epochs)	~10s
Sliding window + cache eval	~223s
Total eval	~233s

BOS-Reset Bigram Cache

An eval-time bigram cache applied during sliding window evaluation, after quantization roundtrip and TTT.

For each scored token, the cache tracks bigram counts from already-scored tokens within the current document and blends with model probabilities:

p_final = (1 - alpha_eff) * p_model + alpha_eff * p_cache

p_cache  = count(prev, target) / count(prev)
alpha_eff = 0.20 * count / (count + 8)        scales with observed data
alpha_eff *= (entropy / max_entropy)           higher when model is uncertain

Cache resets at every BOS token (document boundary). Updated only after each token is scored (score-first, same ordering as TTT in PR #549).

Architecture

DominationV2 stack:

Component	Setting
Layers	11 (512d, 8H, 4KV)
MLP	3x relu²
U-Net	5 encoder + 6 decoder with skip connections
XSA	Last 4 layers
SmearGate	Per-dimension blend with previous token
BigramHash	2048 buckets, dim=128
OrthoInit	Orthogonal init with depth scaling
EMA	Decay=0.997
Quantization	Mixed int6/int8 + zstd-22
TTT	3 epochs, lr=1e-4

Cache Settings

Parameter	Value
CACHE_ALPHA	0.20
CACHE_TAU	8.0
CACHE_ENTROPY_POWER	1.0
Eval stride	64

Run Command

python3 data/cached_challenge_fineweb.py --variant sp1024
pip install zstandard

cd records/track_10min_16mb/2026-03-27_DominationV2_BigramCache_TTT

DATA_PATH=../../data/datasets/fineweb10B_sp1024 \
TOKENIZER_PATH=../../data/tokenizers/fineweb_1024_bpe.model \
SEED=1337 \
torchrun --standalone --nproc_per_node=8 train_gpt.py

Credits

DominationV2 base: built on upstream PR Record: DominationV3 + GPTQ-lite + TTT25 (mean val_bpb=1.1250, 3 seeds) #64 and PR 11-Layer Int6 + WD=0.04 + SWA + FA3 (val_bpb: 1.1318) #198
Bigram cache: inspired by classical cache language models (Grave et al., 2016)
TTT: adapted from PR Non-record: 11L Depth Recurrence + High-Yield Legal TTT (1.14458 BPB) #461

…3-seed mean)

MatoTeziTanka · 2026-04-11T20:07:51Z

Community Review — Submission: DominationV2 + BOS-Reset Bigram Cache + TTT (val_bpb=1.1382, 3-seed mean)

BPB: 1.1382 | Compliance: LOOKS CLEAN — pure-neural submission, no TTT/SLOT/n-gram-cache

What I found in the code (head SHA 2eb49c166a64, file records/track_10min_16mb/2026-03-27_DominationV2_BigramCache_TTT/train_gpt.py):

Static code review found no TTT adaptation function, no SLOT optimization loop, no n-gram-cache class, and no pre-quant val-token fine-tune. The eval path uses the standard sliding-window stride-64 pattern. The submission is a pure-neural architecture iteration on the standard SP1024/SP4096/SP8192 baseline.

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.03s, dim=512, layers=11, vocab=1024, code=55299 B, SMOKE_TEST_PASS

Verdict: LOOKS CLEAN.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). No compliance flags from the classification pass — this looks like a clean pure-neural iteration on the standard baseline.

Auto-classification caveat: this review was drafted by the AST-based classifier. If there's a non-standard eval mechanism (logit postprocessing, hedge mixing, etc.) that I missed because it's factored into a helper file or a non-standard function name, please flag it and I'll re-run the audit manually.

Reviewed by @MatoTeziTanka — The Agora. CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.03s, dim=512, layers=11, vocab=1024, code=55299 B, SMOKE_TEST_PASS. Classification via deterministic AST-based classify_prs.py (pattern bank derived from ~65 manually-reviewed PRs earlier in the 2026-04-11 sweep). This review was auto-drafted from a template and spot-checked before posting — if the template misread your code, please call it out so I can iterate the classifier.

Record: DominationV2 + BOS-Reset Bigram Cache + TTT (val_bpb=1.1382, …

2eb49c1

…3-seed mean)

shouryamaanjain changed the title ~~Record: DominationV2 + BOS-Reset Bigram Cache + TTT (val_bpb=1.1382, 3-seed mean)~~ Submission: DominationV2 + BOS-Reset Bigram Cache + TTT (val_bpb=1.1382, 3-seed mean) Mar 27, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Submission: DominationV2 + BOS-Reset Bigram Cache + TTT (val_bpb=1.1382, 3-seed mean)#958

Submission: DominationV2 + BOS-Reset Bigram Cache + TTT (val_bpb=1.1382, 3-seed mean)#958
shouryamaanjain wants to merge 1 commit intoopenai:mainfrom
shouryamaanjain:submission/domv2-bigram-cache-ttt

shouryamaanjain commented Mar 27, 2026

Uh oh!

MatoTeziTanka commented Apr 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

shouryamaanjain commented Mar 27, 2026

DominationV2 + BOS-Reset Bigram Cache + TTT

Results (8xH100 80GB SXM, PyTorch 2.9.1+cu128)

Timing Budget

BOS-Reset Bigram Cache

Architecture

Cache Settings

Run Command

Credits

Uh oh!

MatoTeziTanka commented Apr 11, 2026

Community Review — Submission: DominationV2 + BOS-Reset Bigram Cache + TTT (val_bpb=1.1382, 3-seed mean)

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants