Non-record: Causal Bigram Blending — eval-time BPB improvement (1×H200, 3-seed) #2088

Open

MaxIv25 wants to merge 1 commit into openai:main from MaxIv25:bigram-blend-nonrecord
Conversation

MaxIv25 commented May 1, 2026

Non-record: Causal Bigram Blending — eval-time BPB improvement (1×H200, 3-seed)

Summary

This submission introduces Causal Bigram Blending — a zero-cost eval-time technique that blends the neural model's log-probabilities with an online causal bigram prior, yielding a consistent ~0.011 BPB improvement with no additional training cost or artifact-size increase.

Architecture

The base architecture follows the PR #1855 / #1868 lineage:

  • 11-layer, 512-dim recurrent transformer with U-Net skips and parallel residuals (layers 8+)
  • Partial RoPE (16 dims), logit softcap=30, tied embeddings
  • Polar Express Newton-Schulz Muon optimizer (5-step backend)
  • GPTQ int6 (attn+mlp) + int7 (embeddings) + LQER asymmetric int4 rank-4
  • SmearGate + Sparse Attention Gate
  • Brotli compression

Novel Technique: Causal Bigram Blending

At eval time, we maintain a running bigram count matrix approximating P(next_token | prev_token), updated only after each batch has been scored (score-before-update, which keeps the prior strictly causal and compliant with competition rules).

For each token position, the model's log-probabilities are blended with the bigram prior as a probability-space mixture, computed in log space:

blended = logaddexp(log(1 − λ·c) + model_logprobs, log(λ·c) + bigram_logprobs)

Where:

  • λ = 0.03 — blend strength
  • c = count / (count + 10) — adaptive confidence (0→1 as observations grow)
  • Bigram log-probs use Laplace smoothing: log((count + 1) / (total + V)), where total is the count-row sum for the previous token and V is the vocabulary size
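
A minimal sketch of these formulas in PyTorch. This is illustrative, not the PR's actual code: the class name `BigramBlender` and tensor shapes are assumptions; `lam` and the pseudo-count 10 match the constants quoted above.

```python
import torch

class BigramBlender:
    """Eval-time blend of model log-probs with a running causal bigram prior.

    Illustrative sketch of the formulas above, not the PR's actual code.
    """

    def __init__(self, vocab_size: int, lam: float = 0.03, pseudo: float = 10.0):
        self.V = vocab_size
        self.lam = lam
        self.pseudo = pseudo
        # counts[p, n] = times token n has followed token p in batches scored so far
        self.counts = torch.zeros(vocab_size, vocab_size)

    def blend(self, model_logprobs: torch.Tensor, prev_tokens: torch.Tensor) -> torch.Tensor:
        rows = self.counts[prev_tokens]                 # (N, V) counts for each prev token
        totals = rows.sum(-1, keepdim=True)             # (N, 1) row totals
        # Laplace smoothing: log((count + 1) / (total + V))
        bigram_logprobs = torch.log(rows + 1.0) - torch.log(totals + self.V)
        w = self.lam * totals / (totals + self.pseudo)  # λ·c, with c = count / (count + 10)
        # Mixture log((1 − λ·c)·p_model + λ·c·p_bigram), computed stably in log space;
        # log(w) = −inf for unseen rows, so logaddexp falls back to the model alone
        return torch.logaddexp(torch.log1p(-w) + model_logprobs,
                               torch.log(w) + bigram_logprobs)

    def update(self, prev_tokens: torch.Tensor, next_tokens: torch.Tensor) -> None:
        # Score-before-update: call only after the batch's loss has been computed
        flat = prev_tokens * self.V + next_tokens
        self.counts.view(-1).index_add_(0, flat, torch.ones_like(flat, dtype=self.counts.dtype))
```

With the SP8192 vocabulary this dense count matrix is 8192² entries (~256 MB in fp32); it lives only in eval memory, which is why the artifact size is unchanged.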

Key properties:

  • ✅ Causal: counts updated only after scoring each batch (see the loop sketch after this list)
  • ✅ Zero training cost: applied only during evaluation
  • ✅ No artifact size increase: no additional model parameters
  • ✅ Deterministic: no stochastic sampling or randomness
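
The causality guarantee reduces to an ordering constraint in the validation loop. A hypothetical sketch, reusing the `BigramBlender` above; `model`, `val_loader`, and the per-batch byte counts are stand-ins, not names from the PR:

```python
import math
import torch

# Hypothetical harness: model and val_loader are placeholders for the script's own objects.
blender = BigramBlender(vocab_size=8192, lam=0.03)

total_nll, total_bytes = 0.0, 0
for inputs, targets, n_bytes in val_loader:           # prev tokens, next tokens, raw byte count
    with torch.no_grad():
        logprobs = model(inputs).log_softmax(dim=-1)  # (N, V) model log-probs
    blended = blender.blend(logprobs, inputs)         # 1) score with counts seen so far
    total_nll -= blended.gather(-1, targets[:, None]).sum().item()
    total_bytes += n_bytes
    blender.update(inputs, targets)                   # 2) only then fold this batch in

val_bpb = total_nll / (total_bytes * math.log(2))     # nats → bits per byte
```

Because the counts are a pure function of the batch order, the whole procedure is deterministic, matching the property list above.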

Results

1×H200, 5000 steps, 3-seed

Trained without CaseOps, using 5/128 training shards (a subset of FineWeb-10B SP8192).

| Seed | val_bpb (raw) | post-EMA | quantized | quantized+TTT |
|------|---------------|----------|-----------|---------------|
| 0    | 1.0788 | 1.0652 | 1.0750 | 1.0744 |
| 42   | 1.0791 | 1.0654 | 1.0750 | 1.0744 |
| 314  | 1.0787 | 1.0650 | 1.0749 | 1.0744 |
| mean | 1.0789 | 1.0652 | 1.0750 | 1.0744 |
| std  | 0.0002 | 0.0002 | 0.0001 | 0.0000 |

TTT eval time: ~3150s per seed (eager mode, no torch.compile). Artifact size: ~16,148–16,151 KB.

Ablation: Bigram Blend ON vs OFF (1×H200, 3000 steps)

Controlled comparison with identical training — no TTT, no CaseOps, 5/128 training shards (subset of FineWeb-10B SP8192).

| Metric | Baseline (no bigram) | + Bigram Blend (λ=0.03) | Δ |
|--------|----------------------|-------------------------|---|
| val_bpb (raw) | 1.1014 | 1.0899 | −0.0115 |
| post-EMA val_bpb | 1.0932 | 1.0818 | −0.0114 |
| quantized val_bpb | 1.1015 | 1.0901 | −0.0114 |

Both runs use the same architecture, optimizer, and hyperparameters. The only difference is BIGRAM_BLEND_ENABLED=1 at eval time. Training is unaffected — bigram blending is applied exclusively during validation scoring.
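
A plausible wiring for that eval-time gate (hypothetical; the actual hook lives at lines 2792–2864 of train_gpt_sota_exp.py, and only the env-var names come from the reproduction commands below):

```python
import os

# Hypothetical gate: applied only where validation log-probs are produced.
BIGRAM_BLEND_ENABLED = os.environ.get("BIGRAM_BLEND_ENABLED", "0") == "1"
BIGRAM_BLEND_LAMBDA = float(os.environ.get("BIGRAM_BLEND_LAMBDA", "0.03"))

if BIGRAM_BLEND_ENABLED:
    logprobs = blender.blend(logprobs, inputs)  # training path never touches this
```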

8×H100 RunPod, 600s wallclock (1 seed, no TTT)

| Metric | Value |
|--------|-------|
| Steps completed | 4718 / 20000 |
| val_bpb (raw + bigram) | 1.0841 |
| post-EMA val_bpb (+ bigram) | 1.0741 |
| quantized val_bpb (+ bigram) | 1.0827 |
| Artifact size | 16,141,590 B |

Reproduction

# Single GPU (H200/H100/A100)
BIGRAM_BLEND_ENABLED=1 \
BIGRAM_BLEND_LAMBDA=0.03 \
ITERATIONS=5000 \
SEED=42 \
TTT_ENABLED=1 \
SMEAR_GATE_ENABLED=1 \
SPARSE_ATTN_GATE_ENABLED=1 \
MIN_LR=0.1 \
EMBED_BITS=7 \
MLP_CLIP_SIGMAS=11.5 \
EMBED_CLIP_SIGMAS=14.0 \
QK_GAIN_INIT=5.25 \
COMPRESSOR=brotli \
GPTQ_CALIBRATION_BATCHES=16 \
python train_gpt_sota_exp.py

# 8×H100 cluster
BIGRAM_BLEND_ENABLED=1 \
BIGRAM_BLEND_LAMBDA=0.03 \
MAX_WALLCLOCK_SECONDS=600 \
torchrun --standalone --nproc_per_node=8 train_gpt_sota_exp.py

Files

  • train_gpt_sota_exp.py — training script with Causal Bigram Blending (eval-time, lines 2792–2864)
  • train_h200_seed{42,314,0}.log — 3-seed training logs on 1×H200

Compliance Notes

  • Score-before-update: Bigram counts are updated after each batch is scored, ensuring causal compliance
  • No external data: Bigram statistics are computed from validation data during eval
  • Deterministic: No stochastic elements in the blending
  • Self-contained: No additional files or dependencies required

Built Upon

This work builds on the PR #1855 / #1868 lineage described in the Architecture section above.
