FP8 + Arithmetic Coding + SWA (1.1511 BPB) #538

Open

cruz-andr wants to merge 8 commits into openai:main from cruz-andr:fp8-arithcoding-ttt

Conversation

@cruz-andr

Summary

  • val_bpb: 1.1511 on 8xH100, 10-minute wallclock
  • FP8 training via TransformerEngine (E4M3 fwd / E5M2 bwd) for ~1.3-1.5x throughput
  • Custom arithmetic coder replacing zstd-22, using per-tensor empirical histograms approaching Shannon entropy
  • Early SWA start (step 4500) for more weight averaging during warmdown
  • TF32 matmul precision for remaining non-FP8 ops
  • Architecture: 10L, 512dim, MLP 3x, SmearGate + BigramHash(10240)
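
A minimal sketch of how the FP8 recipe and TF32 settings above can be enabled, assuming NVIDIA TransformerEngine is installed (te.Linear stands in for the PR's MuonSafeLinear wrapper, which is not shown here):

import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

# TF32 for the matmuls that stay outside FP8.
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

# HYBRID = E4M3 for the forward pass, E5M2 for gradients in the backward pass.
fp8_recipe = DelayedScaling(fp8_format=Format.HYBRID, amax_history_len=16)

linear = te.Linear(512, 1536, bias=False, params_dtype=torch.bfloat16).cuda()
x = torch.randn(8, 1024, 512, device="cuda", dtype=torch.bfloat16)

with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = linear(x)                 # FP8 GEMM; dims must satisfy TE's divisibility constraints
y.float().sum().backward()        # gradients of the FP8 region flow through E5M2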

Run Command

NUM_LAYERS=10 MODEL_DIM=512 MLP_MULT=3 VAL_LOSS_EVERY=0 MAX_WALLCLOCK_SECONDS=600 \
TTT_ENABLED=0 BIGRAM_VOCAB_SIZE=10240 WEIGHT_DECAY=0.04 \
DATA_PATH=/dev/shm/fineweb10B_sp1024/ \
torchrun --standalone --nproc_per_node=8 train_gpt.py

Built On

  • SOTA submission by thwu1 (1.1428 BPB): Int5-MLP + BigramHash(10240) + SWA
  • LoRA TTT submission by samacqua (1.1928 BPB)

Three optimizations on top of the current SOTA (1.1428 BPB):

1. FP8 training via TransformerEngine with MuonSafeLinear wrapper
   for safe Muon optimizer interaction with TE's internal state

2. Custom arithmetic coder replacing zstd-22, using per-tensor
   empirical histograms for near-Shannon-entropy compression
   (~1 MB saved; decode parallelized via multiprocessing)

3. LoRA test-time training ported from the TTT submission with
   per-document adaptation at evaluation time
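
A back-of-envelope sketch of the histogram idea in optimization 2: the per-tensor empirical histogram fixes a symbol distribution, and an arithmetic coder's output approaches the Shannon bound computed from it. The numbers below are illustrative, not the PR's:

import numpy as np

def shannon_bits(symbols: np.ndarray) -> float:
    # Total bits implied by the empirical symbol distribution (entropy per symbol * count).
    counts = np.bincount(symbols)
    probs = counts[counts > 0] / symbols.size
    return float(-(probs * np.log2(probs)).sum() * symbols.size)

# Example: an int6-quantized tensor with a peaked (non-uniform) value distribution.
rng = np.random.default_rng(0)
q = np.clip(rng.normal(32, 6, size=1_000_000).round(), 0, 63).astype(np.int64)

raw_bits = 6 * q.size          # fixed-width int6 packing
coded_bits = shannon_bits(q)   # floor a histogram-driven arithmetic coder approaches
print(f"fixed-width: {raw_bits / 8 / 1e6:.2f} MB, entropy floor: {coded_bits / 8 / 1e6:.2f} MB")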
Commits

Hardcoded blocks.8.attn.c_k breaks when NUM_LAYERS is changed via env var. Now uses an f-string with Hyperparameters.num_layers - 2.

Decoded arithmetic-coded tensors were stored as quant_result[name], but dequantize_mixed_int6 expects quant_result[name + ".q"].

…l initial_model_state, initial_optimizer_states + gc.collect() at lines 1623-1624, right after warmup completes.

Start SWA at step 4500 instead of relying solely on the LR scale fraction, nearly doubling the number of averaged checkpoints (sketched below):
- TF32 matmul precision for faster non-FP8 ops
- Early SWA start (step 4500) for more weight averaging
- Free warmup ghost allocations after restore
- Train log added
- Replace ReLU^2 with LeakyReLU(0.5)^2 to preserve negative gradient flow
- Scale residuals by 1/sqrt(layer+1) for deeper layer stability
- Add per-document cosine LR decay for TTT (0.001 -> 1e-5)

Replace uniform int5/int6 quantization with per-tensor k-means codebook quantization (16 clusters for MLP, 32 for attention). Eliminates per-row FP16 scales in favor of tiny codebooks. Increase magnitude pruning from 3% to 5% for better compression.
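
Hedged sketches of two of the commit-level changes above; every name here is illustrative, not the PR's code. First, the early-SWA running average plus the per-document cosine LR decay and the activation/residual tweaks:

import math
import torch
import torch.nn.functional as F

SWA_START_STEP = 4500              # begin averaging here instead of waiting for the LR warmdown
swa_state, swa_count = None, 0

def maybe_update_swa(model: torch.nn.Module, step: int) -> None:
    # Running mean of the weights over every step from SWA_START_STEP onward.
    global swa_state, swa_count
    if step < SWA_START_STEP:
        return
    swa_count += 1
    if swa_state is None:
        swa_state = {k: v.detach().float().clone() for k, v in model.state_dict().items()}
    else:
        for k, v in model.state_dict().items():
            swa_state[k] += (v.detach().float() - swa_state[k]) / swa_count

def ttt_lr(chunk_idx: int, num_chunks: int, lr_max: float = 1e-3, lr_min: float = 1e-5) -> float:
    # Per-document cosine decay from 0.001 down to 1e-5 across the document's chunks.
    t = chunk_idx / max(num_chunks - 1, 1)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * t))

def leaky_relu_squared(x: torch.Tensor) -> torch.Tensor:
    # LeakyReLU(0.5) then square; negative pre-activations keep a nonzero gradient
    # (the exact form of the square is an assumption, the commit only names the activation).
    return F.leaky_relu(x, negative_slope=0.5).pow(2)

def scaled_residual(x: torch.Tensor, sublayer_out: torch.Tensor, layer_index: int) -> torch.Tensor:
    # Residual branch scaled by 1/sqrt(layer+1) for stability in deeper stacks.
    return x + sublayer_out / math.sqrt(layer_index + 1)

Second, per-tensor k-means codebook quantization via plain Lloyd iterations (quantile init and iteration count are arbitrary choices for this sketch):

import torch

def kmeans_codebook_quantize(w: torch.Tensor, k: int = 16, iters: int = 25):
    # Quantize a weight tensor to k centroids; store uint8 indices plus a k-entry codebook.
    flat = w.detach().float().flatten()
    codebook = torch.quantile(flat, torch.linspace(0, 1, k))   # spread centroids over the weight range
    for _ in range(iters):
        idx = torch.argmin((flat[:, None] - codebook[None, :]).abs(), dim=1)
        for c in range(k):
            sel = flat[idx == c]
            if sel.numel() > 0:
                codebook[c] = sel.mean()
    idx = torch.argmin((flat[:, None] - codebook[None, :]).abs(), dim=1)   # final assignment
    return idx.view(w.shape).to(torch.uint8), codebook

def dequantize(idx: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    return codebook[idx.long()]
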
sisegod added a commit to sisegod/parameter-golf that referenced this pull request Apr 8, 2026
A gh pr list search for 'rANS' + 'arithmetic coding' on 2026-04-08
turned up one other rANS-based PR chain in the competition:

  turbo-indubitable openai#1215 (opened 2026-04-01):
    12L LeakyReLU(0.95)^2 + Soft XSA + per-tensor adaptive rANS (int5/int6)
    val_bpb 1.1601, artifact 15,912,601 bytes

and one arithmetic-coding chain (a related but distinct entropy coder):

  cruz-andr openai#538: FP8 + Arithmetic Coding + SWA, val_bpb 1.1511

So the previous claim 'the only submission in the competition using rANS'
is factually wrong. Replace it with what IS actually defensible:

  - 'First rANS entropy codec for mixed-precision NN weights in the
    competition' (our parent openai#1123 was opened 2026-03-30, openai#1215 was
    opened 2026-04-01 -- two days later).
  - 'One of only two rANS-based PR chains' (this chain + openai#1215).
  - 'Pentanary MLP-up alphabet (2.32 bits/weight) is the distinctive
    contribution' -- openai#1215 uses int5/int6-only rANS which cannot go
    below ~3.0 bits/weight even with optimal frequency tables, while
    our Pentanary alphabet packs MLP-up at 2.32 bits/weight on 23% of
    the artifact, which is why 32.8M params fit in 15.56 MB on our
    side vs 15.91 MB for openai#1215.
  - 'Phase 1A int6 tied-embedding quant is new in this PR' (replaces
    the unverifiable 'nobody else quantizes tied lm_head below FP16'
    claim with a narrower claim we can actually defend: the parent
    chain stored tied embed as FP16 passthrough, the int6 operating
    point was established in THIS PR's Phase 1A sweep).
  - 'Shannon-floor empirical check is the first on the HybridQuant /
    Pentanary rANS pipeline' (qualified with 'to our knowledge', and
    the openai#1215 PR does not run a delta-vs-raw entropy comparison -- we
    checked).

All the actual bpb numbers and trick enumeration are unchanged -- this
is purely a 'do not overclaim originality' honesty pass. The timeline
evidence (openai#1123 opened 2026-03-30 vs openai#1215 opened 2026-04-01) still
gives us a clean chronological-first claim, and the Pentanary +
HybridQuant mixed-alphabet stack is still a clean technical
distinction from openai#1215's int5/int6-only approach.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@MatoTeziTanka

Community Review — FP8 + Arithmetic Coding + SWA (1.1511 BPB)

Compliance: LOOKS CLEAN — legal score-first-per-chunk TTT (PR #1413 pattern)

PR #538 — FP8 + Arithmetic Coding + TTT (1.1511 BPB)

Author: cruz-andr
Head SHA: a0b5478
File: records/track_10min_16mb/2026-03-20_FP8_ArithCoding_TTT/train_gpt.py (1887 lines)


Check 1: N-gram family bug (CLOSE trigger)

Result: CLEAN

The n-gram component is BigramHashEmbedding (class at line 678). The hash function (line 695):

out[..., 1:] = torch.bitwise_xor(36313 * t[..., 1:], 27191 * t[..., :-1]) % mod

This mixes t[i] (current input position) and t[i-1] (previous input position) — both are context tokens. Critically, bigram_hash is called inside forward() with input_ids only (line 786), not target_ids. The target is never present in the hash key. This matches the legal BigramHash pattern explicitly exempted from the CLOSE rule.
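
For illustration, a self-contained version of that hash (the position-0 fallback is an assumption; the review only quotes the line above). The indices are a function of the input ids alone, so targets cannot leak in:

import torch

def bigram_hash(t: torch.Tensor, mod: int = 10240) -> torch.Tensor:
    # Mixes token i with token i-1 of the input sequence; target ids never appear.
    out = t % mod                                          # position 0: unigram fallback (assumed)
    out[..., 1:] = torch.bitwise_xor(36313 * t[..., 1:], 27191 * t[..., :-1]) % mod
    return out

ids = torch.randint(0, 50_000, (2, 16))
print(bigram_hash(ids).shape)   # torch.Size([2, 16])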


Check 2: Pre-Quant TTT bug (CLOSE trigger)

Result: CLEAN — Legal score-first TTT

The TTT evaluation is in eval_val_ttt_lora (line 1311). The per-chunk loop (lines 1364–1409) follows this exact order:

  1. Forward pass on chunk ci → produces ptl (per-token loss tensor)
  2. Score accumulation via _accumulate_bpb (lines 1393–1400) — reads current chunk loss into loss_sum/byte_sum before any gradient update
  3. Train step (lines 1402–1409) — only executes when needs_train is True (i.e., not the final chunk), using ptl from the same chunk
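
Schematically, the order the review verified looks like this (doc_chunks, lora_optimizer, and the attribute names are placeholders, not the PR's identifiers):

import math

def eval_document_with_ttt(model, lora_optimizer, doc_chunks):
    # Score-first-per-chunk: record each chunk's loss before any weight update.
    loss_sum, byte_sum = 0.0, 0
    for ci, chunk in enumerate(doc_chunks):
        ptl = model(chunk.input_ids, chunk.target_ids)    # 1. forward: per-token loss
        loss_sum += float(ptl.sum())                      # 2. accumulate the score BEFORE training
        byte_sum += chunk.num_bytes
        if ci < len(doc_chunks) - 1:                      # 3. adapt only when another chunk follows
            ptl.mean().backward()
            lora_optimizer.step()
            lora_optimizer.zero_grad(set_to_none=True)
    return loss_sum / (byte_sum * math.log(2))            # nats per byte -> bits per byte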

Verdict: LOOKS CLEAN — legal TTT implementation.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks.


Reviewed by @MatoTeziTanka (The Agora). Compliance audit via LLM agent (Sonnet) reviewing full train_gpt.py source. If this review misread your code, please call it out so I can re-audit manually.
