MDLM Diffusion — val_var_bpb 0.9901, EOS learning + full dataset shard rotation, 33M params, 1x AWS A10G #1241
Approaches revamped (old eval-only approaches removed):

- 01: Low-Rank Factored MLP (18 layers in 16MB via rank-128 MLP factors)
- 02: Reptile Meta-Learning Warmdown (meta-optimize for TTT adaptability)
- 03: SVD + Quantized Factors (13 layers via spectral compression)
- 04: Multi-Token Prediction + BPB-Weighted Loss (training-loss innovation)
- 05: Gram-Newton-Schulz + FP8 Training (30% more steps in 10 min)

Unmerged PR research saved to unmerged_runs/:

- PR openai#1263: SLOT (0.9354 BPB, legality contested)
- PR openai#1246: Trinity Ternary (0.9650 BPB)
- PR openai#1241: MDLM Diffusion (0.9901 BPB)
- PR openai#1252: WARP (1.0713 BPB)
- PR openai#1257: Complement Training (1.0855 BPB)
- PR openai#1274: Parallel Residuals + Depth Recurrence (1.0876 BPB)
- PR openai#1260: MuonEq-R + Depth Recurrence (1.0929 BPB)
- PR openai#1254: XSA + LoRA TTT (1.1070 BPB)

Key finding: without eval tricks, the frontier is ~1.09 BPB (PR openai#1260).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Community Review — MDLM Diffusion v5 (EOS Learning + Shard Rotation)

What this does: builds on PR #1106's Masked Discrete Language Model (MDLM) baseline and adds (a) document-boundary anchoring (token 1, <s>, is never masked during diffusion) and (b) full-dataset shard rotation. Smoke-test results (CT2038 proteus-engine, 2026-04-11) and the architecture summary taken from the code appear in the review notes below.
Central question: is val_var_bpb comparable to the leaderboard's autoregressive BPB? This is the core issue I'd like the author / mods to weigh in on. Two specific concerns are visible in the code, both at the eval step.
In short: the headline 0.9901 is (a) a variational upper bound on the MDLM-factorised NLL, (b) computed against a hardcoded 4.3 bytes/token denominator rather than the exact SentencePiece (SP) byte LUTs, and (c) computed on 500 fixed contiguous validation sequences.
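To make concern (b) concrete, here is a minimal sketch of the two denominators, assuming the usual nats-to-bits conversion. The function names are mine, and the exact-byte variant is what per-token byte LUTs would give; only the 4.3 constant is from the PR.

```python
import math

def bpb_hardcoded(nll_nats_per_token: float, bytes_per_token: float = 4.3) -> float:
    """Validation NLL (nats/token) -> bits per byte via a fixed bytes/token estimate."""
    return nll_nats_per_token / (math.log(2) * bytes_per_token)

def bpb_exact(total_nll_nats: float, total_utf8_bytes: int) -> float:
    """NLL summed over the eval set, divided by the exact UTF-8 byte count
    (what per-token SentencePiece byte LUTs provide)."""
    return total_nll_nats / (math.log(2) * total_utf8_bytes)
```

If the eval windows actually averaged, say, 4.0 bytes/token, dividing by 4.3 would understate BPB by about 7%, enough to move a 0.99 headline above 1.06; that is why the denominator choice matters for leaderboard comparability.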
Other observations (not blockers, just notes):

Verdict: QUESTIONS RAISED — interesting architecture and a clean MDLM extension, but the headline 0.9901 cannot be compared directly to the leaderboard's autoregressive BPB until the metric questions above are resolved.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica:
Reviewed by @MatoTeziTanka — The Agora. CPU smoke test (CT2038 proteus-engine, 2026-04-11): IMPORT_OK 0.02s; HAS_HYPERPARAMETERS/HAS_GPT N/A (DiffusionLM, not AR template). AI tooling: review drafted with Claude Code (Sonnet/Opus) using an internal review template; all citations, file paths, and compliance audits were verified against the PR's actual code at its head SHA.
Builds on PR #1106 (MDLM stack). Two additions:
EOS learning: Token 1 (<s>) is used as a document boundary anchor — never masked during diffusion. A dedicated PAD_ID=1025 (separate from MASK_ID=1024) fills post-EOS positions and is excluded from the loss, preventing collision between structural padding and diffusion masking.
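A minimal sketch of that masking rule, assuming PyTorch; the function and variable names are mine, while the three token IDs are from the PR text:

```python
import torch

EOS_ID = 1      # <s>, document-boundary anchor (ID from the PR)
MASK_ID = 1024  # diffusion mask token (ID from the PR)
PAD_ID = 1025   # structural padding, excluded from the loss (ID from the PR)

def corrupt(tokens: torch.Tensor, mask_prob: torch.Tensor):
    """Mask tokens at rate `mask_prob`, never touching EOS anchors or PAD."""
    # Bernoulli draw per position at the sampled corruption rate.
    candidate = torch.rand_like(tokens, dtype=torch.float32) < mask_prob
    # EOS stays visible (anchor); PAD stays PAD and never enters the loss.
    maskable = (tokens != EOS_ID) & (tokens != PAD_ID)
    is_masked = candidate & maskable
    noisy = torch.where(is_masked, torch.full_like(tokens, MASK_ID), tokens)
    return noisy, is_masked  # compute loss only where is_masked is True
```

Because PAD has its own ID, a masked position can never be confused with padding, which is exactly the collision the PR calls out.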
Shard rotation: ShardedDataLoader loads N shards at a time and rotates between groups across training, enabling full FineWeb 10B training without loading the entire dataset into RAM. Explicit memory freeing between groups; shards loaded one-at-a-time into a pre-allocated buffer to avoid 2× peak allocation.
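A minimal sketch of the rotation idea under stated assumptions (the class and method names are invented, and uint16 token shards of equal length are assumed; only the one-group-in-RAM and preallocated-buffer behavior is from the PR):

```python
import gc
import numpy as np

class ShardRotator:
    """Hold `group_size` shards in RAM; rotate through all shards over training."""

    def __init__(self, shard_paths, group_size, tokens_per_shard):
        self.shard_paths = shard_paths
        self.group_size = group_size
        # One pre-allocated buffer reused for every group: peak RAM stays at
        # ~1x the group size instead of 2x (old group + new group).
        self.buffer = np.empty((group_size, tokens_per_shard), dtype=np.uint16)
        self.group_idx = 0

    def next_group(self) -> np.ndarray:
        start = self.group_idx * self.group_size
        for i in range(self.group_size):
            path = self.shard_paths[(start + i) % len(self.shard_paths)]
            # Read one shard at a time directly into its buffer row,
            # avoiding a transient second copy of the whole group.
            with open(path, "rb") as f:
                f.readinto(memoryview(self.buffer[i]))
        self.group_idx += 1
        gc.collect()  # explicit free between groups, per the PR description
        return self.buffer
```

Overwriting the buffer in place is what keeps peak allocation at one group plus a single in-flight file read.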
Ablation finding: val BPB is flat across attention head counts {2, 4, 8, 16, 32} at fixed model dim — performance appears invariant to head count in bidirectional diffusion LMs.
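For context on what that sweep varies, a tiny illustrative sketch (the width constant is hypothetical, not the PR's actual config): at fixed model dim, changing head count inversely changes per-head dimension.

```python
# Illustrative only: N_EMBD is a hypothetical width, not the PR's config.
# At fixed model dim, sweeping n_head also sweeps head_dim = n_embd // n_head,
# so the flat-BPB finding covers head_dim from 192 down to 12 here.
N_EMBD = 384

for n_head in (2, 4, 8, 16, 32):
    assert N_EMBD % n_head == 0
    print(f"n_head={n_head:2d} -> head_dim={N_EMBD // n_head}")
```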
Non-record reason: Trained on 1× AWS A10G (1267 min). Requires 8×H100 SXM for wall-clock compliance.