
MDLM Diffusion — val_var_bpb 0.9901, EOS learning + full dataset shard rotation, 33M params, 1x AWS A10G #1241

Open
aiejvn wants to merge 2 commits into openai:main from aiejvn:submission-diffusion-shard-rotation+eos-learning

Conversation


@aiejvn commented Apr 2, 2026

Builds on PR #1106 (MDLM stack). Two additions:

EOS learning: Token 1 (<s>) is used as a document boundary anchor — never masked during diffusion. A dedicated PAD_ID=1025 (separate from MASK_ID=1024) fills post-EOS positions and is excluded from the loss, preventing collision between structural padding and diffusion masking.
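
A minimal sketch of the masking/padding separation as described above; `forward_mask` and `loss_positions` are hypothetical names, and the actual schedule in train_mdlm.py may differ:

```python
import torch

MASK_ID = 1024  # absorbing-state mask token (from the PR)
PAD_ID = 1025   # structural padding, excluded from the loss (from the PR)
EOS_ID = 1      # <s>, document-boundary anchor, never masked

def forward_mask(x0: torch.Tensor, mask_prob: torch.Tensor) -> torch.Tensor:
    """Absorbing-state forward process that spares EOS anchors and padding.

    x0: (B, T) clean token ids; mask_prob: (B, 1) per-sequence mask rate.
    """
    u = torch.rand(x0.shape, device=x0.device)
    maskable = (x0 != EOS_ID) & (x0 != PAD_ID)  # anchors/padding stay visible
    return torch.where((u < mask_prob) & maskable, torch.full_like(x0, MASK_ID), x0)

def loss_positions(x0: torch.Tensor, xt: torch.Tensor) -> torch.Tensor:
    # Loss is taken only on positions that were masked AND carry real content.
    return (xt == MASK_ID) & (x0 != PAD_ID)
```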

Shard rotation: ShardedDataLoader loads N shards at a time and rotates between groups across training, enabling full FineWeb 10B training without loading the entire dataset into RAM. Explicit memory freeing between groups; shards loaded one-at-a-time into a pre-allocated buffer to avoid 2× peak allocation.
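
A sketch of the rotation scheme under the stated constraints (shards loaded one at a time into a pre-allocated buffer, explicit freeing between groups). The class shape, argument names, and uniform shard size are assumptions, not the PR's actual API:

```python
import gc
import numpy as np
import torch

class ShardedDataLoader:
    """Keep N shards in RAM; rotate to the next group once it is consumed.

    Each incoming shard is copied into its pre-allocated buffer slot, so peak
    RAM stays near one group's size instead of 2x (old group + new group).
    """

    def __init__(self, shard_paths, shards_per_group=4, tokens_per_shard=100_000_000):
        assert len(shard_paths) % shards_per_group == 0
        self.paths = shard_paths
        self.n = shards_per_group
        self.slot = tokens_per_shard
        self.buf = torch.empty(self.n * self.slot, dtype=torch.int32)
        self.group = -1

    def load_next_group(self):
        self.group = (self.group + 1) % (len(self.paths) // self.n)
        for i in range(self.n):
            path = self.paths[self.group * self.n + i]
            shard = torch.from_numpy(np.fromfile(path, dtype=np.int32))
            self.buf[i * self.slot : i * self.slot + shard.numel()] = shard
            del shard        # free each shard before loading the next one
        gc.collect()         # explicit memory freeing between groups
```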

Ablation finding: Val BPB is flat across attention head counts {2, 4, 8, 16, 32} at fixed model dim — val BPB appears invariant to head count for bidirectional diffusion LMs.

Non-record reason: Trained on 1× AWS A10G (1267 min). Requires 8×H100 SXM for wall-clock compliance.

| Model | BPB |
| --- | --- |
| This (MDLM v5) | 0.9901 |
| PR #1106 (prior best diffusion) | 1.1465 |
| AR baseline | 1.2244 |

@aiejvn changed the title from "Non-record: MDLM Diffusion — val_var_bpb 0.9901, EOS learning + full dataset shard rotation, 33M params, 1x AWS A10G" to "MDLM Diffusion — val_var_bpb 0.9901, EOS learning + full dataset shard rotation, 33M params, 1x AWS A10G" on Apr 2, 2026
HateBunnyPlzzz added a commit to Itssshikhar/parameter-golf that referenced this pull request Apr 2, 2026
Approaches revamped (old eval-only approaches removed):
- 01: Low-Rank Factored MLP (18 layers in 16MB via rank-128 MLP factors)
- 02: Reptile Meta-Learning Warmdown (meta-optimize for TTT adaptability)
- 03: SVD + Quantized Factors (13 layers via spectral compression)
- 04: Multi-Token Prediction + BPB-Weighted Loss (training loss innovation)
- 05: Gram-Newton-Schulz + FP8 Training (30% more steps in 10 min)

Unmerged PR research saved to unmerged_runs/:
- PR openai#1263: SLOT (0.9354 BPB, legality contested)
- PR openai#1246: Trinity Ternary (0.9650 BPB)
- PR openai#1241: MDLM Diffusion (0.9901 BPB)
- PR openai#1252: WARP (1.0713 BPB)
- PR openai#1257: Complement Training (1.0855 BPB)
- PR openai#1274: Parallel Residuals + Depth Recurrence (1.0876 BPB)
- PR openai#1260: MuonEq-R + Depth Recurrence (1.0929 BPB)
- PR openai#1254: XSA + LoRA TTT (1.1070 BPB)

Key finding: without eval tricks, frontier is ~1.09 BPB (PR openai#1260)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@MatoTeziTanka

Community Review — MDLM Diffusion v5 (EOS Learning + Shard Rotation)

Reported: val_var_bpb 0.9901 | Seeds: [42] (1 seed) | Hardware: 1x AWS A10G (1267 min) | Track: non-record

What this does: Builds on PR #1106's Masked Discrete Language Model (MDLM) baseline. Adds (a) document-boundary anchoring — token 1 (<s>) is treated as EOS and is never masked during forward diffusion, with a dedicated PAD_ID=1025 (separate from MASK_ID=1024) filling post-EOS positions and excluded from the loss; and (b) ShardedDataLoader, which rotates groups of N FineWeb shards through RAM so that the full 80-shard SP-1024 corpus can be visited on a 24GB consumer/cloud GPU. Also reports a head-count ablation showing val BPB is roughly invariant to n_heads ∈ {2,4,8,16,32} for bidirectional diffusion LMs at fixed MODEL_DIM=512.

Smoke test (CT2038 proteus-engine, 2026-04-11): IMPORT_OK seconds=0.02. Top-level parses cleanly on CPU. The harness's HAS_HYPERPARAMETERS / HAS_GPT checks are not applicable here — this is a DiffusionLM module, not the AR GPT/Hyperparameters template. No top-level CUDA assumptions in module scope.

Architecture (factual, from train_mdlm.py):

  • 11 layers, MODEL_DIM=512, 8 heads, MLP mult 3× (ReLU²), RoPE, AdaLN sigma conditioning, bidirectional is_causal=False SDPA. ~33M params (header).
  • TOTAL_VOCAB = 1026 (1024 SP + MASK + PAD); PADDED_VOCAB = 1088 for the embedding/unembed table; head(...)[..., :TOTAL_VOCAB] slices outputs.
  • subs_log_probs blocks MASK_ID and PAD_ID from prediction (-1e6), renormalises, and freezes visible positions to identity — standard MDLM substitution parameterisation (sketched after this list).
  • Training loss: continuous-time MDLM ELBO with antithetic time sampling, dsigma * -log p(x0|xt) masked to mask-only positions, weighted by content_mask = (x0 != PAD).
  • Eval: variational_elbo_bits runs a 128-step Riemann discretisation of the discrete absorbing-mask ELBO over 500 fixed SEQ_LEN=2048 slices of the val split.
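
For readers unfamiliar with the SUBS parameterisation, a minimal reconstruction of what subs_log_probs does per the bullets above (my sketch, not the PR's code):

```python
import torch
import torch.nn.functional as F

NEG = -1e6

def subs_log_probs(logits, xt, mask_id=1024, pad_id=1025):
    """logits: (B, T, V); xt: (B, T) noisy tokens.

    The model may never predict MASK or PAD, and positions already visible
    in xt are pinned to log-prob 0 (probability 1) on their own token.
    """
    logits = logits.clone()
    logits[..., mask_id] = NEG                 # forbid the absorbing state
    logits[..., pad_id] = NEG                  # forbid structural padding
    log_probs = F.log_softmax(logits, dim=-1)  # renormalise over the rest

    identity = torch.full_like(log_probs, NEG)
    identity.scatter_(-1, xt.unsqueeze(-1), 0.0)  # log 1 = 0 at observed id
    visible = (xt != mask_id).unsqueeze(-1)
    return torch.where(visible, identity, log_probs)
```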

Central question — is val_var_bpb directly comparable to the leaderboard metric?

This is the core issue I'd like the author / mods to weigh in on. Two specific concerns visible in the code, both at the eval step:

  1. Variational upper bound, not exact NLL. The reported number is the discrete absorbing-mask ELBO computed by variational_elbo_bits (train_mdlm.py ~L173–215). For diffusion LMs this is a variational upper bound on the true negative log-likelihood. The leaderboard metric, by contrast, is the exact per-token NLL of an autoregressive factorisation evaluated via sliding-window stride-64 on the canonical eval path. ELBO ≥ true NLL, so 0.9901 is an upper bound on this model's BPB under the MDLM factorisation — the actual NLL could be lower. But that cuts both ways: it is simply not the same quantity that AR submissions report. A 0.9901 ELBO from a diffusion model and a 0.9901 sliding-window NLL from an AR model are not interchangeable, and at minimum the comparison to "AR SOTA 1.1194" / "AR baseline 1.2244" in the README needs that asterisk attached. (See #1461, the HDC_DSV_Hadamard_Spiral record submission, for an analogous metric-definition discussion on a non-AR submission; the direction of the bound is restated after this list.)

  2. Bytes/token uses a hardcoded 4.3 constant, not the per-token SP byte LUT. Both the inline progress estimate and the final reported number compute bpb = total_bits / (total_content * 4.3) (train_mdlm.py L361, ~L398, ~L407). The competition scoring formula divides total bits by the exact sum of UTF-8 bytes per token from build_sentencepiece_luts over the eval slice — not by n_tokens × avg_bytes_per_token. For SP-1024 on FineWeb, average bytes/token is roughly in the 4.2–4.4 range, so 4.3 is in the right neighbourhood, but it is not what the canonical evaluator computes. The README claims "Competition BPB uses sentencepiece byte-count LUTs (exact bytes per token, matching competition scoring formula)" — but the code in this PR does not actually do that; it uses the constant. Whether the resulting number is high or low vs the canonical computation depends on the exact byte-distribution of the 500 eval slices, and is not something I can determine without running it. (A denominator-fix sketch also follows this list.)
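
For concern 1, the direction of the gap is just the generic variational (Jensen) inequality, independent of the particular 128-step discretisation in variational_elbo_bits:

$$
-\log p_\theta(x_0) = -\log \mathbb{E}_{q(x_t \mid x_0)}\!\left[\frac{p_\theta(x_0, x_t)}{q(x_t \mid x_0)}\right] \le \mathbb{E}_{q(x_t \mid x_0)}\!\left[-\log \frac{p_\theta(x_0, x_t)}{q(x_t \mid x_0)}\right]
$$

so an ELBO-derived bits figure can only overstate the model's NLL under its own factorisation; it remains a different quantity from an AR model's exact sliding-window NLL.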
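
For concern 2, a sketch of the denominator fix. `byte_lut` is assumed to be the per-token UTF-8 byte array produced by the repo's build_sentencepiece_luts; the function name and shapes here are assumptions:

```python
import numpy as np

def bpb_exact(total_bits: float, eval_tokens: np.ndarray,
              byte_lut: np.ndarray, pad_id: int = 1025) -> float:
    """total_bits divided by the exact UTF-8 byte count of the content
    tokens, instead of total_bits / (n_content_tokens * 4.3)."""
    content = eval_tokens[eval_tokens != pad_id]
    total_bytes = int(byte_lut[content].sum())  # exact per-token bytes
    return total_bits / total_bytes
```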

In short: the headline 0.9901 is (a) a variational upper bound on the MDLM-factorised NLL, (b) computed against a hardcoded 4.3 bytes/token denominator rather than the exact SP byte LUTs, and (c) computed on 500 fixed contiguous SEQ_LEN=2048 slices of val rather than the canonical sliding-window stride-64 path. Each of those is a separate axis of incomparability; the net effect is that this number cannot be placed on the leaderboard against AR submissions without a mod ruling on what "BPB" means for a diffusion LM, and ideally a re-run through the canonical evaluator.

Other observations (not blockers, just notes):

  • This is correctly filed as non-record, and the PR body and submission.json both flag the A10G hardware and the need to rerun on 8xH100 SXM. So none of the above is a record-track compliance issue — it's a "what does the number on the title actually mean" question for the non-record / novel-architecture track.
  • submission.json reports a single seed (42); the README's invariance ablation is across head counts at one seed, not across seeds for the headline number.
  • No custom tokenizer; SP-1024 is the standard path, so the #897 U+2581 byte-fallback concerns ("bpb underestimated when tokenizer does not contain the U+2581 space token") do not apply.
  • TTT / SLOT / n-gram cache rulings (#402 TTT information leakage, #677 illegal-submissions megathread, #1336 context-only SLOT legality) do not apply — this is pure pretraining + ELBO eval, no test-time adaptation, no eval-time caches.
  • The MDLM_EOS_FullDataset_A10G/ directory adds a train_mdlm.py rather than the canonical train_gpt.py — worth noting that any automated runner wired to train_gpt.py will not pick this up as-is.
  • Head-count invariance ablation is a genuinely interesting empirical claim for bidirectional diffusion LMs and worth keeping in the record regardless of the metric question.

Verdict: QUESTIONS RAISED — interesting architecture and a clean MDLM extension, but the headline val_var_bpb 0.9901 is a variational upper bound computed with a hardcoded bytes/token constant, not the canonical sliding-window stride-64 NLL the leaderboard uses. Worth keeping in the non-record track for the technique, but the number should not be compared directly to AR submissions without a mod ruling and ideally a canonical-evaluator re-run.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica:

  • HOLD on any leaderboard placement pending a ruling on (a) whether ELBO-based BPB is comparable to exact-NLL BPB for the non-record track and (b) whether the 4.3 bytes/token constant in the eval path needs to be replaced with the SP byte LUT before the number is quoted alongside AR submissions.
  • NEEDS AUTHOR ACTION (optional, if author wants the number to be directly comparable): swap total_content * 4.3 for sum(byte_lut[t] for t in eval_tokens) and re-run the 128-step ELBO; ideally also report the exact-NLL via the canonical sliding-window path on the same checkpoint, so both numbers are on the table.
  • Otherwise the technique itself (EOS anchoring, PAD/MASK separation, shard rotation, head-count invariance ablation) is a clean and well-documented contribution to the diffusion-LM line and worth keeping in the non-record track on its own merits.

Reviewed by @MatoTeziTanka / The Agora. CPU smoke test (CT2038 proteus-engine, 2026-04-11): IMPORT_OK 0.02s; HAS_HYPERPARAMETERS/HAS_GPT N/A (DiffusionLM, not AR template). AI tooling: review drafted with Claude Code (Sonnet/Opus) using an internal review template; all citations, file paths, and compliance audits were verified against the PR's actual code at SHA 5d46f39f69290735adc989930c7f5fe4e0779158.
