Record(?): WARP (Word-Aware Representation Priors) — val_bpb 1.0713 | 1xH100 10min | 13.65 MB #1252
ahmetdenizyilmaz wants to merge 1 commit into openai:main
… | 1xH100 10min | 13.65 MB

Add WARP system: word length embedding, word position attention bias, and word type logit bias. 183,820 extra params (0.7% overhead). Trained on single H100, 600s wallclock, with legal TTT + context-only SLOT.
Closing this PR. Upon further analysis, I discovered that WARP-Len (word length embedding) contains a causality violation: it computes the total number of tokens in each word using scatter_add over the full sequence, which means tokens at position t receive information about future tokens at positions t+1, t+2, etc. (specifically, whether those future tokens start a new word or continue the current one). While the information leakage is mild (word length, not token identity) and the improvement is modest (-0.06 BPB), it is technically non-causal at the embedding level and therefore the reported results are not valid for a fair comparison. I am working on a causal version of WARP-Len that replaces total word length with "tokens seen so far in current word" (position_in_word + 1), which only depends on past/current positions. I will resubmit once I have verified results with the fully causal version. WARP-Pos (word position in attention Q/K) and WARP-Type (word type logit bias) remain fully causal and are not affected by this issue. Thank you for hosting this challenge -- it has been an incredible learning experience.
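For concreteness, a minimal sketch of the leak and of the proposed causal replacement (illustrative code, not the PR's actual implementation; it assumes a 1-D `word_starts` bool tensor with position 0 marking a word start):

```python
import torch

def word_lengths_noncausal(word_starts: torch.Tensor) -> torch.Tensor:
    # word_starts: (T,) bool, True where a token begins a new word.
    word_id = torch.cumsum(word_starts.long(), dim=0) - 1   # word index per token
    counts = torch.zeros(int(word_id[-1]) + 1, dtype=torch.long)
    counts.scatter_add_(0, word_id, torch.ones_like(word_id))
    # LEAK: counts[word_id[t]] is the TOTAL word length, which includes
    # tokens at positions > t that continue the same word.
    return counts[word_id]

def position_in_word_causal(word_starts: torch.Tensor) -> torch.Tensor:
    # "Tokens seen so far in the current word": depends only on positions <= t.
    idx = torch.arange(word_starts.numel())
    last_start = torch.cummax(torch.where(word_starts, idx, torch.zeros_like(idx)), dim=0).values
    return idx - last_start  # the PR's proposed causal feature is this value + 1
```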
Approaches revamped (old eval-only approaches removed):
- 01: Low-Rank Factored MLP (18 layers in 16MB via rank-128 MLP factors)
- 02: Reptile Meta-Learning Warmdown (meta-optimize for TTT adaptability)
- 03: SVD + Quantized Factors (13 layers via spectral compression)
- 04: Multi-Token Prediction + BPB-Weighted Loss (training loss innovation)
- 05: Gram-Newton-Schulz + FP8 Training (30% more steps in 10 min)

Unmerged PR research saved to unmerged_runs/:
- PR openai#1263: SLOT (0.9354 BPB, legality contested)
- PR openai#1246: Trinity Ternary (0.9650 BPB)
- PR openai#1241: MDLM Diffusion (0.9901 BPB)
- PR openai#1252: WARP (1.0713 BPB)
- PR openai#1257: Complement Training (1.0855 BPB)
- PR openai#1274: Parallel Residuals + Depth Recurrence (1.0876 BPB)
- PR openai#1260: MuonEq-R + Depth Recurrence (1.0929 BPB)
- PR openai#1254: XSA + LoRA TTT (1.1070 BPB)

Key finding: without eval tricks, the frontier is ~1.09 BPB (PR openai#1260).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
WARP (Word-Aware Representation Priors) — val_bpb 1.0713
val_bpb = 1.0713 | 13.65 MB | 1xH100 SXM, 600s wallclock
Note: I am not sure if this qualifies as a record since it was trained on a single H100 rather than the standard 8xH100 track. I would appreciate guidance from maintainers on whether this is eligible or if 8xH100 verification is required.
This method is part of an in-progress paper (WARP: Word-Aware Representation Priors for Subword Language Models) currently being written up. The results exceeded my expectations significantly and I honestly suspect there may be an error somewhere — I would welcome the community's help in verifying or finding issues.
Results
Key Innovation: WARP (Word-Aware Representation Priors)
BPE tokenization destroys word boundary information that the model must re-learn through attention. WARP restores this information at three injection points with only 183,820 additional parameters (0.7% overhead):
WARP-Len (6,657 params) — Word length embedding at layer 0. Each token receives an embedding based on how many BPE tokens its word contains. Injected before RMSNorm, before any attention.
WARP-Pos (1,035 params) — Word position bias in Q and K. Learned embeddings based on within-word position (0-7), applied to both queries and keys before RoPE; the embedding table is shared across all 11 layers, with per-layer learned scales.
WARP-Type (176,128 params) — Word type logit bias at output. A 2-layer classifier (512->192->64 types) produces soft type probabilities. These are multiplied with a learned type-vocabulary bias matrix (64x1024) and added directly to logits before softcap. No auxiliary loss — gradient flows from cross-entropy through the bias to the classifier.
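A minimal sketch of the WARP-Type path, assuming the stated shapes (512->192->64 classifier, 64x1024 bias matrix); module names, the activation, and the exact wiring are illustrative, not the PR's code:

```python
import torch
import torch.nn as nn

class WARPTypeBias(nn.Module):
    def __init__(self, d_model: int = 512, n_types: int = 64, vocab: int = 1024):
        super().__init__()
        # 2-layer classifier: 512 -> 192 -> 64 soft word types (activation assumed)
        self.classifier = nn.Sequential(
            nn.Linear(d_model, 192), nn.ReLU(), nn.Linear(192, n_types)
        )
        # learned type-vocabulary bias matrix (64 x 1024 as stated above)
        self.type_vocab_bias = nn.Parameter(torch.zeros(n_types, vocab))

    def forward(self, h: torch.Tensor, logits: torch.Tensor) -> torch.Tensor:
        # h: (B, T, d_model) hidden states; logits: (B, T, vocab)
        type_probs = self.classifier(h).softmax(dim=-1)        # (B, T, n_types)
        # .float() on both operands mirrors the bf16/fp32 fix noted under
        # Additional Findings; there is no auxiliary loss, so the cross-entropy
        # gradient flows through this bias back into the classifier.
        bias = type_probs.float() @ self.type_vocab_bias.float()
        return logits + bias                                   # softcap applied by caller
```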
All three modules share `compute_word_boundary_maps()`, which detects word starts from SentencePiece's leading-space convention using only token IDs. Fully `torch.compile` compatible.
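A sketch of how this kind of boundary detection can be done from token IDs alone (the interface is an assumption; the PR's actual `compute_word_boundary_maps()` is not shown in the description). A boolean table precomputed once from the tokenizer makes the per-batch work pure tensor indexing, which keeps it `torch.compile` friendly:

```python
import torch

def build_word_start_table(sp) -> torch.Tensor:
    # sp: a sentencepiece.SentencePieceProcessor. Pieces beginning with the
    # SentencePiece whitespace marker "\u2581" start a new word.
    starts = [sp.id_to_piece(i).startswith("\u2581") for i in range(sp.get_piece_size())]
    return torch.tensor(starts, dtype=torch.bool)

def compute_word_boundary_maps(token_ids: torch.Tensor, start_table: torch.Tensor):
    # token_ids: (B, T) int64. Returns a word-start mask and the 0-based
    # position of each token within its word (WARP-Pos would clamp this to 0-7).
    start_table = start_table.to(token_ids.device)
    word_starts = start_table[token_ids]                       # (B, T) bool
    idx = torch.arange(token_ids.shape[1], device=token_ids.device).expand_as(token_ids)
    last_start = torch.cummax(torch.where(word_starts, idx, torch.zeros_like(idx)), dim=1).values
    return word_starts, idx - last_start
```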
Additional Findings

`type_vocab_bias.float()` is required to prevent a bf16/fp32 dtype mismatch during SLOT's `compute_logits` call.

Architecture
Built on the stack from the LeakyReLU-squared + Legal TTT + Parallel Muon submission (by @abaybektursun), with the WARP modifications described above.
Legality
Training: Standard transformer training with architectural modifications (WARP modules). No validation data accessed. GPTQ calibrates on training data.
TTT: Score-first protocol — each chunk is scored under `torch.inference_mode()` before any weight update. Same legal pattern as the LeakyReLU-squared + Legal TTT submission by @abaybektursun.

SLOT: Context-only variant — the delta is optimized on positions 0-1983 (already-scored context); new tokens (1984-2047) are excluded from the loss and contribute zero gradient (see the sketch after this list). Based on the MuonEq-R + Context-Only SLOT submission by @abaybektursun.
WARP: Purely architectural. Word boundaries derived from token IDs via SentencePiece vocabulary lookup table. No external data, tools, or target token access.
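As referenced in the SLOT item above, a minimal sketch of the context-only loss mask (shapes and the helper name are assumptions, not the PR's code):

```python
import torch
import torch.nn.functional as F

def context_only_slot_loss(logits: torch.Tensor, targets: torch.Tensor,
                           context_len: int = 1984) -> torch.Tensor:
    # logits: (T, V) for one 2048-token chunk; targets: (T,) token IDs.
    losses = F.cross_entropy(logits, targets, reduction="none")   # (T,)
    mask = torch.zeros_like(losses)
    mask[:context_len] = 1.0   # positions 0-1983: already-scored context
    # positions 1984-2047 stay masked: excluded from the loss, so the
    # SLOT delta receives zero gradient from the yet-unscored tokens.
    return (losses * mask).sum() / mask.sum()
```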
Note on Hardware
This result was obtained on a single H100 (1xH100, 600s, ~1260 steps). On 8xH100 (the standard track hardware), the model would complete ~7185 steps in the same wallclock time, which should yield significantly better results. The single-GPU result therefore represents a conservative lower bound.
I also ran a longer 50-minute training (6436 steps, matching 8xH100 compute) which achieved 0.9766 post-SLOT, but the artifact was 16.81 MB (over the 16 MB limit). This suggests the method scales well with more training but needs artifact size optimization for the full compute budget.
Credits
Reproduction
```bash
# Single H100
PYTHONUNBUFFERED=1 USE_COMPILE=1 MAX_WALLCLOCK_SECONDS=600 \
ITERATIONS=20000 SEED=1337 TRAIN_BATCH_TOKENS=524288 \
TRAIN_SEQ_LEN=2048 VAL_LOSS_EVERY=250 WARMUP_STEPS=20 \
WARMDOWN_ITERS=250 SWA_EVERY=50 USE_GPTQ=1 EMA_DECAY=0.0 \
TTT_ENABLED=1 TTT_LR=0.002 TTT_EPOCHS=2 TTT_FREEZE_BLOCKS=2 \
TTT_MUON=1 SLOT_ENABLED=1 SLOT_LR=0.005 SLOT_STEPS=8 \
python -u train_gpt.py
```

Test Plan