FP8 + Arithmetic Coding + SWA (1.1511 BPB) #538

Open

cruz-andr wants to merge 8 commits into openai:main from cruz-andr:fp8-arithcoding-ttt

Conversation

@cruz-andr

Summary

  • val_bpb: 1.1511 on 8xH100, 10-minute wallclock
  • FP8 training via TransformerEngine (E4M3 fwd / E5M2 bwd) for ~1.3-1.5x throughput
  • Custom arithmetic coder replacing zstd-22, using per-tensor empirical histograms approaching Shannon entropy
  • Early SWA start (step 4500) for more weight averaging during warmdown
  • TF32 matmul precision for remaining non-FP8 ops
  • Architecture: 10L, 512dim, MLP 3x, SmearGate + BigramHash(10240)
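
A minimal sketch of how the FP8 recipe and TF32 settings above can be enabled, assuming NVIDIA TransformerEngine is installed (te.Linear stands in for the PR's MuonSafeLinear wrapper, which is not shown here):

import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

# TF32 for the matmuls that stay outside FP8.
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

# HYBRID = E4M3 for the forward pass, E5M2 for gradients in the backward pass.
fp8_recipe = DelayedScaling(fp8_format=Format.HYBRID, amax_history_len=16)

linear = te.Linear(512, 1536, bias=False, params_dtype=torch.bfloat16).cuda()
x = torch.randn(8, 1024, 512, device="cuda", dtype=torch.bfloat16)

with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = linear(x)                 # FP8 GEMM; dims must satisfy TE's divisibility constraints
y.float().sum().backward()        # gradients of the FP8 region flow through E5M2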

Run Command

NUM_LAYERS=10 MODEL_DIM=512 MLP_MULT=3 VAL_LOSS_EVERY=0 MAX_WALLCLOCK_SECONDS=600 \
TTT_ENABLED=0 BIGRAM_VOCAB_SIZE=10240 WEIGHT_DECAY=0.04 \
DATA_PATH=/dev/shm/fineweb10B_sp1024/ \
torchrun --standalone --nproc_per_node=8 train_gpt.py

Built On

  • SOTA submission by thwu1 (1.1428 BPB): Int5-MLP + BigramHash(10240) + SWA
  • LoRA TTT submission by samacqua (1.1928 BPB)

Three optimizations on top of the current SOTA (1.1428 BPB):

1. FP8 training via TransformerEngine with MuonSafeLinear wrapper
   for safe Muon optimizer interaction with TE's internal state

2. Custom arithmetic coder replacing zstd-22, using per-tensor
   empirical histograms for near-Shannon-entropy compression
   (~1 MB saved; decode parallelized via multiprocessing)

3. LoRA test-time training ported from the TTT submission with
   per-document adaptation at evaluation time
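
A back-of-envelope sketch of the histogram idea in optimization 2: the per-tensor empirical histogram fixes a symbol distribution, and an arithmetic coder's output approaches the Shannon bound computed from it. The numbers below are illustrative, not the PR's:

import numpy as np

def shannon_bits(symbols: np.ndarray) -> float:
    # Total bits implied by the empirical symbol distribution (entropy per symbol * count).
    counts = np.bincount(symbols)
    probs = counts[counts > 0] / symbols.size
    return float(-(probs * np.log2(probs)).sum() * symbols.size)

# Example: an int6-quantized tensor with a peaked (non-uniform) value distribution.
rng = np.random.default_rng(0)
q = np.clip(rng.normal(32, 6, size=1_000_000).round(), 0, 63).astype(np.int64)

raw_bits = 6 * q.size          # fixed-width int6 packing
coded_bits = shannon_bits(q)   # floor a histogram-driven arithmetic coder approaches
print(f"fixed-width: {raw_bits / 8 / 1e6:.2f} MB, entropy floor: {coded_bits / 8 / 1e6:.2f} MB")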
Commits

Hardcoded blocks.8.attn.c_k breaks when NUM_LAYERS is changed via env var. Now uses an f-string with Hyperparameters.num_layers - 2.

Decoded arithmetic-coded tensors were stored as quant_result[name], but dequantize_mixed_int6 expects quant_result[name + ".q"].

…l initial_model_state, initial_optimizer_states + gc.collect() at lines 1623-1624, right after warmup completes.

Start SWA at step 4500 instead of relying solely on the LR scale fraction, nearly doubling the number of averaged checkpoints (sketched below):
- TF32 matmul precision for faster non-FP8 ops
- Early SWA start (step 4500) for more weight averaging
- Free warmup ghost allocations after restore
- Train log added
- Replace ReLU^2 with LeakyReLU(0.5)^2 to preserve negative gradient flow
- Scale residuals by 1/sqrt(layer+1) for deeper layer stability
- Add per-document cosine LR decay for TTT (0.001 -> 1e-5)

Replace uniform int5/int6 quantization with per-tensor k-means codebook quantization (16 clusters for MLP, 32 for attention). Eliminates per-row FP16 scales in favor of tiny codebooks. Increase magnitude pruning from 3% to 5% for better compression.
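
Hedged sketches of two of the commit-level changes above; every name here is illustrative, not the PR's code. First, the early-SWA running average plus the per-document cosine LR decay and the activation/residual tweaks:

import math
import torch
import torch.nn.functional as F

SWA_START_STEP = 4500              # begin averaging here instead of waiting for the LR warmdown
swa_state, swa_count = None, 0

def maybe_update_swa(model: torch.nn.Module, step: int) -> None:
    # Running mean of the weights over every step from SWA_START_STEP onward.
    global swa_state, swa_count
    if step < SWA_START_STEP:
        return
    swa_count += 1
    if swa_state is None:
        swa_state = {k: v.detach().float().clone() for k, v in model.state_dict().items()}
    else:
        for k, v in model.state_dict().items():
            swa_state[k] += (v.detach().float() - swa_state[k]) / swa_count

def ttt_lr(chunk_idx: int, num_chunks: int, lr_max: float = 1e-3, lr_min: float = 1e-5) -> float:
    # Per-document cosine decay from 0.001 down to 1e-5 across the document's chunks.
    t = chunk_idx / max(num_chunks - 1, 1)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * t))

def leaky_relu_squared(x: torch.Tensor) -> torch.Tensor:
    # LeakyReLU(0.5) then square; negative pre-activations keep a nonzero gradient
    # (the exact form of the square is an assumption, the commit only names the activation).
    return F.leaky_relu(x, negative_slope=0.5).pow(2)

def scaled_residual(x: torch.Tensor, sublayer_out: torch.Tensor, layer_index: int) -> torch.Tensor:
    # Residual branch scaled by 1/sqrt(layer+1) for stability in deeper stacks.
    return x + sublayer_out / math.sqrt(layer_index + 1)

Second, per-tensor k-means codebook quantization via plain Lloyd iterations (quantile init and iteration count are arbitrary choices for this sketch):

import torch

def kmeans_codebook_quantize(w: torch.Tensor, k: int = 16, iters: int = 25):
    # Quantize a weight tensor to k centroids; store uint8 indices plus a k-entry codebook.
    flat = w.detach().float().flatten()
    codebook = torch.quantile(flat, torch.linspace(0, 1, k))   # spread centroids over the weight range
    for _ in range(iters):
        idx = torch.argmin((flat[:, None] - codebook[None, :]).abs(), dim=1)
        for c in range(k):
            sel = flat[idx == c]
            if sel.numel() > 0:
                codebook[c] = sel.mean()
    idx = torch.argmin((flat[:, None] - codebook[None, :]).abs(), dim=1)   # final assignment
    return idx.view(w.shape).to(torch.uint8), codebook

def dequantize(idx: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    return codebook[idx.long()]
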
sisegod added a commit to sisegod/parameter-golf that referenced this pull request Apr 8, 2026
A gh pr list search for 'rANS' + 'arithmetic coding' on 2026-04-08
turned up one other rANS-based PR chain in the competition:

  turbo-indubitable openai#1215 (opened 2026-04-01):
    12L LeakyReLU(0.95)^2 + Soft XSA + per-tensor adaptive rANS (int5/int6)
    val_bpb 1.1601, artifact 15,912,601 bytes

and one arithmetic-coding chain (a related but distinct entropy coder):

  cruz-andr openai#538: FP8 + Arithmetic Coding + SWA, val_bpb 1.1511

So the previous claim 'the only submission in the competition using rANS'
is factually wrong. Replace it with what IS actually defensible:

  - 'First rANS entropy codec for mixed-precision NN weights in the
    competition' (our parent openai#1123 was opened 2026-03-30, openai#1215 was
    opened 2026-04-01 -- two days later).
  - 'One of only two rANS-based PR chains' (this chain + openai#1215).
  - 'Pentanary MLP-up alphabet (2.32 bits/weight) is the distinctive
    contribution' -- openai#1215 uses int5/int6-only rANS which cannot go
    below ~3.0 bits/weight even with optimal frequency tables, while
    our Pentanary alphabet packs MLP-up at 2.32 bits/weight on 23% of
    the artifact, which is why 32.8M params fit in 15.56 MB on our
    side vs 15.91 MB for openai#1215.
  - 'Phase 1A int6 tied-embedding quant is new in this PR' (replaces
    the unverifiable 'nobody else quantizes tied lm_head below FP16'
    claim with a narrower claim we can actually defend: the parent
    chain stored tied embed as FP16 passthrough, the int6 operating
    point was established in THIS PR's Phase 1A sweep).
  - 'Shannon-floor empirical check is the first on the HybridQuant /
    Pentanary rANS pipeline' (qualified with 'to our knowledge', and
    the openai#1215 PR does not run a delta-vs-raw entropy comparison -- we
    checked).

All the actual bpb numbers and trick enumeration are unchanged -- this
is purely a 'do not overclaim originality' honesty pass. The timeline
evidence (openai#1123 opened 2026-03-30 vs openai#1215 opened 2026-04-01) still
gives us a clean chronological-first claim, and the Pentanary +
HybridQuant mixed-alphabet stack is still a clean technical
distinction from openai#1215's int5/int6-only approach.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@MatoTeziTanka

Community Review — FP8 + Arithmetic Coding + SWA (1.1511 BPB)

Compliance: LOOKS CLEAN — legal score-first-per-chunk TTT (PR #1413 pattern)

PR #538 — FP8 + Arithmetic Coding + TTT (1.1511 BPB)

Author: cruz-andr
Head SHA: a0b5478
File: records/track_10min_16mb/2026-03-20_FP8_ArithCoding_TTT/train_gpt.py (1887 lines)


Check 1: N-gram family bug (CLOSE trigger)

Result: CLEAN

The n-gram component is BigramHashEmbedding (class at line 678). The hash function (line 695):

out[..., 1:] = torch.bitwise_xor(36313 * t[..., 1:], 27191 * t[..., :-1]) % mod

This mixes t[i] (current input position) and t[i-1] (previous input position) — both are context tokens. Critically, bigram_hash is called inside forward() with input_ids only (line 786), not target_ids. The target is never present in the hash key. This matches the legal BigramHash pattern explicitly exempted from the CLOSE rule.
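
For illustration, a self-contained version of that hash (the position-0 fallback is an assumption; the review only quotes the line above). The indices are a function of the input ids alone, so targets cannot leak in:

import torch

def bigram_hash(t: torch.Tensor, mod: int = 10240) -> torch.Tensor:
    # Mixes token i with token i-1 of the input sequence; target ids never appear.
    out = t % mod                                          # position 0: unigram fallback (assumed)
    out[..., 1:] = torch.bitwise_xor(36313 * t[..., 1:], 27191 * t[..., :-1]) % mod
    return out

ids = torch.randint(0, 50_000, (2, 16))
print(bigram_hash(ids).shape)   # torch.Size([2, 16])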


Check 2: Pre-Quant TTT bug (CLOSE trigger)

Result: CLEAN — Legal score-first TTT

The TTT evaluation is in eval_val_ttt_lora (line 1311). The per-chunk loop (lines 1364–1409) follows this exact order:

  1. Forward pass on chunk ci → produces ptl (per-token loss tensor)
  2. Score accumulation via _accumulate_bpb (lines 1393–1400) — reads current chunk loss into loss_sum/byte_sum before any gradient update
  3. Train step (lines 1402–1409) — only executes when needs_train is True (i.e., not the final chunk), using ptl from the same chunk
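
Schematically, the order the review verified looks like this (doc_chunks, lora_optimizer, and the attribute names are placeholders, not the PR's identifiers):

import math

def eval_document_with_ttt(model, lora_optimizer, doc_chunks):
    # Score-first-per-chunk: record each chunk's loss before any weight update.
    loss_sum, byte_sum = 0.0, 0
    for ci, chunk in enumerate(doc_chunks):
        ptl = model(chunk.input_ids, chunk.target_ids)    # 1. forward: per-token loss
        loss_sum += float(ptl.sum())                      # 2. accumulate the score BEFORE training
        byte_sum += chunk.num_bytes
        if ci < len(doc_chunks) - 1:                      # 3. adapt only when another chunk follows
            ptl.mean().backward()
            lora_optimizer.step()
            lora_optimizer.zero_grad(set_to_none=True)
    return loss_sum / (byte_sum * math.log(2))            # nats per byte -> bits per byte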

Verdict: LOOKS CLEAN — legal TTT implementation.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks.


Reviewed by @MatoTeziTanka (The Agora). Compliance audit via LLM agent (Sonnet) reviewing full train_gpt.py source. If this review misread your code, please call it out so I can re-audit manually.
