Non-record: Causal Bigram Blending — eval-time BPB improvement (1×H200, 3-seed) #2088

Open

MaxIv25 wants to merge 1 commit into openai:main from MaxIv25:bigram-blend-nonrecord
Conversation

MaxIv25 commented May 1, 2026

Non-record: Causal Bigram Blending — eval-time BPB improvement (1×H200, 3-seed)

Summary

This submission introduces Causal Bigram Blending — a zero-cost eval-time technique that blends the neural model's log-probabilities with an online causal bigram prior, yielding a consistent ~0.011 BPB improvement with no additional training cost or artifact-size increase.

Architecture

The base architecture follows the PR #1855 / #1868 lineage:

  • 11-layer, 512-dim recurrent transformer with U-Net skips and parallel residuals (layers 8+)
  • Partial RoPE (16 dims), logit softcap=30, tied embeddings
  • Polar Express Newton-Schulz Muon optimizer (5-step backend)
  • GPTQ int6 (attn+mlp) + int7 (embeddings) + LQER asymmetric int4 rank-4
  • SmearGate + Sparse Attention Gate
  • Brotli compression

Novel Technique: Causal Bigram Blending

At eval time, we maintain a running bigram count matrix approximating P(next_token | prev_token), updated only after each batch has been scored (score-before-update, which keeps the prior strictly causal and compliant with competition rules).

For each token position, the model's log-probabilities are blended with the bigram prior as a probability-space mixture, computed in log space:

blended = logaddexp(log(1 − λ·c) + model_logprobs, log(λ·c) + bigram_logprobs)

Where:

  • λ = 0.03 — blend strength
  • c = count / (count + 10) — adaptive confidence (0→1 as observations grow)
  • Bigram log-probs use Laplace smoothing: log((count + 1) / (total + V)), where total is the count-row sum for the previous token and V is the vocabulary size
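
A minimal sketch of these formulas in PyTorch. This is illustrative, not the PR's actual code: the class name `BigramBlender` and tensor shapes are assumptions; `lam` and the pseudo-count 10 match the constants quoted above.

```python
import torch

class BigramBlender:
    """Eval-time blend of model log-probs with a running causal bigram prior.

    Illustrative sketch of the formulas above, not the PR's actual code.
    """

    def __init__(self, vocab_size: int, lam: float = 0.03, pseudo: float = 10.0):
        self.V = vocab_size
        self.lam = lam
        self.pseudo = pseudo
        # counts[p, n] = times token n has followed token p in batches scored so far
        self.counts = torch.zeros(vocab_size, vocab_size)

    def blend(self, model_logprobs: torch.Tensor, prev_tokens: torch.Tensor) -> torch.Tensor:
        rows = self.counts[prev_tokens]                 # (N, V) counts for each prev token
        totals = rows.sum(-1, keepdim=True)             # (N, 1) row totals
        # Laplace smoothing: log((count + 1) / (total + V))
        bigram_logprobs = torch.log(rows + 1.0) - torch.log(totals + self.V)
        w = self.lam * totals / (totals + self.pseudo)  # λ·c, with c = count / (count + 10)
        # Mixture log((1 − λ·c)·p_model + λ·c·p_bigram), computed stably in log space;
        # log(w) = −inf for unseen rows, so logaddexp falls back to the model alone
        return torch.logaddexp(torch.log1p(-w) + model_logprobs,
                               torch.log(w) + bigram_logprobs)

    def update(self, prev_tokens: torch.Tensor, next_tokens: torch.Tensor) -> None:
        # Score-before-update: call only after the batch's loss has been computed
        flat = prev_tokens * self.V + next_tokens
        self.counts.view(-1).index_add_(0, flat, torch.ones_like(flat, dtype=self.counts.dtype))
```

With the SP8192 vocabulary this dense count matrix is 8192² entries (~256 MB in fp32); it lives only in eval memory, which is why the artifact size is unchanged.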

Key properties:

  • ✅ Causal: counts updated only after scoring each batch (see the loop sketch after this list)
  • ✅ Zero training cost: applied only during evaluation
  • ✅ No artifact size increase: no additional model parameters
  • ✅ Deterministic: no stochastic sampling or randomness
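
The causality guarantee reduces to an ordering constraint in the validation loop. A hypothetical sketch, reusing the `BigramBlender` above; `model`, `val_loader`, and the per-batch byte counts are stand-ins, not names from the PR:

```python
import math
import torch

# Hypothetical harness: model and val_loader are placeholders for the script's own objects.
blender = BigramBlender(vocab_size=8192, lam=0.03)

total_nll, total_bytes = 0.0, 0
for inputs, targets, n_bytes in val_loader:           # prev tokens, next tokens, raw byte count
    with torch.no_grad():
        logprobs = model(inputs).log_softmax(dim=-1)  # (N, V) model log-probs
    blended = blender.blend(logprobs, inputs)         # 1) score with counts seen so far
    total_nll -= blended.gather(-1, targets[:, None]).sum().item()
    total_bytes += n_bytes
    blender.update(inputs, targets)                   # 2) only then fold this batch in

val_bpb = total_nll / (total_bytes * math.log(2))     # nats → bits per byte
```

Because the counts are a pure function of the batch order, the whole procedure is deterministic, matching the property list above.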

Results

1×H200, 5000 steps, 3-seed

Trained without CaseOps, using 5/128 training shards (a subset of FineWeb-10B SP8192).

| Seed | val_bpb (raw) | post-EMA | quantized | quantized+TTT |
|------|---------------|----------|-----------|---------------|
| 0    | 1.0788 | 1.0652 | 1.0750 | 1.0744 |
| 42   | 1.0791 | 1.0654 | 1.0750 | 1.0744 |
| 314  | 1.0787 | 1.0650 | 1.0749 | 1.0744 |
| mean | 1.0789 | 1.0652 | 1.0750 | 1.0744 |
| std  | 0.0002 | 0.0002 | 0.0001 | 0.0000 |

TTT eval time: ~3150s per seed (eager mode, no torch.compile). Artifact size: ~16,148–16,151 KB.

Ablation: Bigram Blend ON vs OFF (1×H200, 3000 steps)

Controlled comparison with identical training — no TTT, no CaseOps, 5/128 training shards (subset of FineWeb-10B SP8192).

| Metric | Baseline (no bigram) | + Bigram Blend (λ=0.03) | Δ |
|--------|----------------------|-------------------------|---|
| val_bpb (raw) | 1.1014 | 1.0899 | −0.0115 |
| post-EMA val_bpb | 1.0932 | 1.0818 | −0.0114 |
| quantized val_bpb | 1.1015 | 1.0901 | −0.0114 |

Both runs use the same architecture, optimizer, and hyperparameters. The only difference is BIGRAM_BLEND_ENABLED=1 at eval time. Training is unaffected — bigram blending is applied exclusively during validation scoring.
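
A plausible wiring for that eval-time gate (hypothetical; the actual hook lives at lines 2792–2864 of train_gpt_sota_exp.py, and only the env-var names come from the reproduction commands below):

```python
import os

# Hypothetical gate: applied only where validation log-probs are produced.
BIGRAM_BLEND_ENABLED = os.environ.get("BIGRAM_BLEND_ENABLED", "0") == "1"
BIGRAM_BLEND_LAMBDA = float(os.environ.get("BIGRAM_BLEND_LAMBDA", "0.03"))

if BIGRAM_BLEND_ENABLED:
    logprobs = blender.blend(logprobs, inputs)  # training path never touches this
```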

8×H100 RunPod, 600s wallclock (1 seed, no TTT)

| Metric | Value |
|--------|-------|
| Steps completed | 4718 / 20000 |
| val_bpb (raw + bigram) | 1.0841 |
| post-EMA val_bpb (+ bigram) | 1.0741 |
| quantized val_bpb (+ bigram) | 1.0827 |
| Artifact size | 16,141,590 B |

Reproduction

# Single GPU (H200/H100/A100)
BIGRAM_BLEND_ENABLED=1 \
BIGRAM_BLEND_LAMBDA=0.03 \
ITERATIONS=5000 \
SEED=42 \
TTT_ENABLED=1 \
SMEAR_GATE_ENABLED=1 \
SPARSE_ATTN_GATE_ENABLED=1 \
MIN_LR=0.1 \
EMBED_BITS=7 \
MLP_CLIP_SIGMAS=11.5 \
EMBED_CLIP_SIGMAS=14.0 \
QK_GAIN_INIT=5.25 \
COMPRESSOR=brotli \
GPTQ_CALIBRATION_BATCHES=16 \
python train_gpt_sota_exp.py

# 8×H100 cluster
BIGRAM_BLEND_ENABLED=1 \
BIGRAM_BLEND_LAMBDA=0.03 \
MAX_WALLCLOCK_SECONDS=600 \
torchrun --standalone --nproc_per_node=8 train_gpt_sota_exp.py

Files

  • train_gpt_sota_exp.py — training script with Causal Bigram Blending (eval-time, lines 2792–2864)
  • train_h200_seed{42,314,0}.log — 3-seed training logs on 1×H200

Compliance Notes

  • Score-before-update: Bigram counts are updated after each batch is scored, ensuring causal compliance
  • No external data: Bigram statistics are computed from validation data during eval
  • Deterministic: No stochastic elements in the blending
  • Self-contained: No additional files or dependencies required

Built Upon

This work builds on the PR #1855 / #1868 lineage described in the Architecture section above.
