Record(?): WARP (Word-Aware Representation Priors) — val_bpb 1.0713 | 1xH100 10min | 13.65 MB #1252
ahmetdenizyilmaz wants to merge 1 commit into openai:main
… | 1xH100 10min | 13.65 MB

Add WARP system: word length embedding, word position attention bias, and word type logit bias. 183,820 extra params (0.7% overhead). Trained on single H100, 600s wallclock, with legal TTT + context-only SLOT.
Closing this PR. Upon further analysis, I discovered that WARP-Len (word length embedding) contains a causality violation: it computes the total number of tokens in each word using scatter_add over the full sequence, which means tokens at position t receive information about future tokens at positions t+1, t+2, etc. (specifically, whether those future tokens start a new word or continue the current one). While the information leakage is mild (word length, not token identity) and the improvement is modest (-0.06 BPB), it is technically non-causal at the embedding level and therefore the reported results are not valid for a fair comparison. I am working on a causal version of WARP-Len that replaces total word length with "tokens seen so far in current word" (position_in_word + 1), which only depends on past/current positions. I will resubmit once I have verified results with the fully causal version. WARP-Pos (word position in attention Q/K) and WARP-Type (word type logit bias) remain fully causal and are not affected by this issue. Thank you for hosting this challenge -- it has been an incredible learning experience.
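For concreteness, a minimal sketch of the leak and of the proposed causal replacement (illustrative code, not the PR's actual implementation; it assumes a 1-D `word_starts` bool tensor with position 0 marking a word start):

```python
import torch

def word_lengths_noncausal(word_starts: torch.Tensor) -> torch.Tensor:
    # word_starts: (T,) bool, True where a token begins a new word.
    word_id = torch.cumsum(word_starts.long(), dim=0) - 1   # word index per token
    counts = torch.zeros(int(word_id[-1]) + 1, dtype=torch.long)
    counts.scatter_add_(0, word_id, torch.ones_like(word_id))
    # LEAK: counts[word_id[t]] is the TOTAL word length, which includes
    # tokens at positions > t that continue the same word.
    return counts[word_id]

def position_in_word_causal(word_starts: torch.Tensor) -> torch.Tensor:
    # "Tokens seen so far in the current word": depends only on positions <= t.
    idx = torch.arange(word_starts.numel())
    last_start = torch.cummax(torch.where(word_starts, idx, torch.zeros_like(idx)), dim=0).values
    return idx - last_start  # the PR's proposed causal feature is this value + 1
```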
Approaches revamped (old eval-only approaches removed):
- 01: Low-Rank Factored MLP (18 layers in 16MB via rank-128 MLP factors)
- 02: Reptile Meta-Learning Warmdown (meta-optimize for TTT adaptability)
- 03: SVD + Quantized Factors (13 layers via spectral compression)
- 04: Multi-Token Prediction + BPB-Weighted Loss (training loss innovation)
- 05: Gram-Newton-Schulz + FP8 Training (30% more steps in 10 min)

Unmerged PR research saved to unmerged_runs/:
- PR openai#1263: SLOT (0.9354 BPB, legality contested)
- PR openai#1246: Trinity Ternary (0.9650 BPB)
- PR openai#1241: MDLM Diffusion (0.9901 BPB)
- PR openai#1252: WARP (1.0713 BPB)
- PR openai#1257: Complement Training (1.0855 BPB)
- PR openai#1274: Parallel Residuals + Depth Recurrence (1.0876 BPB)
- PR openai#1260: MuonEq-R + Depth Recurrence (1.0929 BPB)
- PR openai#1254: XSA + LoRA TTT (1.1070 BPB)

Key finding: without eval tricks, the frontier is ~1.09 BPB (PR openai#1260).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
WARP (Word-Aware Representation Priors) — val_bpb 1.0713
val_bpb = 1.0713 | 13.65 MB | 1xH100 SXM, 600s wallclock
Note: I am not sure if this qualifies as a record since it was trained on a single H100 rather than the standard 8xH100 track. I would appreciate guidance from maintainers on whether this is eligible or if 8xH100 verification is required.
This method is part of an in-progress paper (WARP: Word-Aware Representation Priors for Subword Language Models) currently being written up. The results exceeded my expectations significantly and I honestly suspect there may be an error somewhere — I would welcome the community's help in verifying or finding issues.
Results
Key Innovation: WARP (Word-Aware Representation Priors)
BPE tokenization destroys word boundary information that the model must re-learn through attention. WARP restores this information at three injection points with only 183,820 additional parameters (0.7% overhead):
WARP-Len (6,657 params) — Word length embedding at layer 0. Each token receives an embedding based on how many BPE tokens its word contains. Injected before RMSNorm, before any attention.
WARP-Pos (1,035 params) — Word position bias in Q and K. Learned embeddings based on within-word position (0-7), applied to both queries and keys before RoPE; the embedding table is shared across all 11 layers, with per-layer learned scales.
WARP-Type (176,128 params) — Word type logit bias at output. A 2-layer classifier (512->192->64 types) produces soft type probabilities. These are multiplied with a learned type-vocabulary bias matrix (64x1024) and added directly to logits before softcap. No auxiliary loss — gradient flows from cross-entropy through the bias to the classifier.
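A minimal sketch of the WARP-Type path, assuming the stated shapes (512->192->64 classifier, 64x1024 bias matrix); module names, the activation, and the exact wiring are illustrative, not the PR's code:

```python
import torch
import torch.nn as nn

class WARPTypeBias(nn.Module):
    def __init__(self, d_model: int = 512, n_types: int = 64, vocab: int = 1024):
        super().__init__()
        # 2-layer classifier: 512 -> 192 -> 64 soft word types (activation assumed)
        self.classifier = nn.Sequential(
            nn.Linear(d_model, 192), nn.ReLU(), nn.Linear(192, n_types)
        )
        # learned type-vocabulary bias matrix (64 x 1024 as stated above)
        self.type_vocab_bias = nn.Parameter(torch.zeros(n_types, vocab))

    def forward(self, h: torch.Tensor, logits: torch.Tensor) -> torch.Tensor:
        # h: (B, T, d_model) hidden states; logits: (B, T, vocab)
        type_probs = self.classifier(h).softmax(dim=-1)        # (B, T, n_types)
        # .float() on both operands mirrors the bf16/fp32 fix noted under
        # Additional Findings; there is no auxiliary loss, so the cross-entropy
        # gradient flows through this bias back into the classifier.
        bias = type_probs.float() @ self.type_vocab_bias.float()
        return logits + bias                                   # softcap applied by caller
```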
All three modules share `compute_word_boundary_maps()`, which detects word starts from SentencePiece's leading-space convention using only token IDs. Fully `torch.compile` compatible.
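A sketch of how this kind of boundary detection can be done from token IDs alone (the interface is an assumption; the PR's actual `compute_word_boundary_maps()` is not shown in the description). A boolean table precomputed once from the tokenizer makes the per-batch work pure tensor indexing, which keeps it `torch.compile` friendly:

```python
import torch

def build_word_start_table(sp) -> torch.Tensor:
    # sp: a sentencepiece.SentencePieceProcessor. Pieces beginning with the
    # SentencePiece whitespace marker "\u2581" start a new word.
    starts = [sp.id_to_piece(i).startswith("\u2581") for i in range(sp.get_piece_size())]
    return torch.tensor(starts, dtype=torch.bool)

def compute_word_boundary_maps(token_ids: torch.Tensor, start_table: torch.Tensor):
    # token_ids: (B, T) int64. Returns a word-start mask and the 0-based
    # position of each token within its word (WARP-Pos would clamp this to 0-7).
    start_table = start_table.to(token_ids.device)
    word_starts = start_table[token_ids]                       # (B, T) bool
    idx = torch.arange(token_ids.shape[1], device=token_ids.device).expand_as(token_ids)
    last_start = torch.cummax(torch.where(word_starts, idx, torch.zeros_like(idx)), dim=1).values
    return word_starts, idx - last_start
```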
Additional Findings

`type_vocab_bias.float()` is required to prevent a bf16/fp32 dtype mismatch during SLOT's `compute_logits` call.

Architecture
Built on the stack from the LeakyReLU-squared + Legal TTT + Parallel Muon submission (by @abaybektursun), with the WARP modifications described above.
Legality
Training: Standard transformer training with architectural modifications (WARP modules). No validation data accessed. GPTQ calibrates on training data.
TTT: Score-first protocol — each chunk is scored under `torch.inference_mode()` before any weight update. Same legal pattern as the LeakyReLU-squared + Legal TTT submission by @abaybektursun.

SLOT: Context-only variant — the delta is optimized on positions 0-1983 (already-scored context); new tokens (1984-2047) are excluded from the loss and contribute zero gradient (see the sketch after this list). Based on the MuonEq-R + Context-Only SLOT submission by @abaybektursun.
WARP: Purely architectural. Word boundaries derived from token IDs via SentencePiece vocabulary lookup table. No external data, tools, or target token access.
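As referenced in the SLOT item above, a minimal sketch of the context-only loss mask (shapes and the helper name are assumptions, not the PR's code):

```python
import torch
import torch.nn.functional as F

def context_only_slot_loss(logits: torch.Tensor, targets: torch.Tensor,
                           context_len: int = 1984) -> torch.Tensor:
    # logits: (T, V) for one 2048-token chunk; targets: (T,) token IDs.
    losses = F.cross_entropy(logits, targets, reduction="none")   # (T,)
    mask = torch.zeros_like(losses)
    mask[:context_len] = 1.0   # positions 0-1983: already-scored context
    # positions 1984-2047 stay masked: excluded from the loss, so the
    # SLOT delta receives zero gradient from the yet-unscored tokens.
    return (losses * mask).sum() / mask.sum()
```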
Note on Hardware
This result was obtained on a single H100 (1xH100, 600s, ~1260 steps). On 8xH100 (the standard track hardware), the model would complete ~7185 steps in the same wallclock time, which should yield significantly better results. The single-GPU result therefore represents a conservative lower bound.
I also ran a longer 50-minute training (6436 steps, matching 8xH100 compute) which achieved 0.9766 post-SLOT, but the artifact was 16.81 MB (over the 16 MB limit). This suggests the method scales well with more training but needs artifact size optimization for the full compute budget.
Credits
Reproduction
```bash
# Single H100
PYTHONUNBUFFERED=1 USE_COMPILE=1 MAX_WALLCLOCK_SECONDS=600 \
ITERATIONS=20000 SEED=1337 TRAIN_BATCH_TOKENS=524288 \
TRAIN_SEQ_LEN=2048 VAL_LOSS_EVERY=250 WARMUP_STEPS=20 \
WARMDOWN_ITERS=250 SWA_EVERY=50 USE_GPTQ=1 EMA_DECAY=0.0 \
TTT_ENABLED=1 TTT_LR=0.002 TTT_EPOCHS=2 TTT_FREEZE_BLOCKS=2 \
TTT_MUON=1 SLOT_ENABLED=1 SLOT_LR=0.005 SLOT_STEPS=8 \
python -u train_gpt.py
```

Test Plan