Non-record: MDLM Diffusion — val_var_bpb 1.1465 (first diffusion to beat AR baseline) by agalimova · Pull Request #1106 · openai/parameter-golf

agalimova · 2026-03-30T00:41:22Z

Summary

val_var_bpb: 1.1465 (512 eval steps) | 33M params | 2xH100 80GB HBM3 | Non-record

First discrete diffusion model in parameter-golf to beat the AR baseline (1.2244 BPB). It improves on the previous diffusion submissions (#820 at 1.625 BPB; #1053 at 1.360 BPB) by at least 0.21 BPB.

Results

Model	BPB
AR SOTA (merged #1)	1.1194
This (MDLM)	1.1465
AR baseline	1.2244
#1053 MDLM	1.360
#820 MDLM	1.625

Approach

MDLM (Sahoo et al. 2024) with log-linear noise, adaLN timestep conditioning, frozen visible-token logits, antithetic sampling, and discrete absorbing-mask ELBO eval. Architecture and training: 11 layers, model dimension 512, 6000 steps, AdamW.
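Two of the ingredients named above, antithetic timestep sampling and frozen visible-token logits, can be sketched in a few lines of plain Python. This is a minimal illustration under assumptions, not the PR's code: the function names, the `t_min` floor, and the list-of-lists logit layout are all hypothetical, and the stratified draw follows the scheme used in the MDLM reference implementation as far as I understand it.

```python
import random

NEG_INF = float("-inf")

def antithetic_timesteps(batch_size, rng, t_min=1e-3):
    """Stratified draw: one timestep from each equal-width bin of [0, 1).

    Spreading the batch evenly over t lowers the variance of the loss
    estimate compared with independent uniform draws.
    t_min is an assumed lower bound on t, not a value taken from the PR.
    """
    ts = []
    for i in range(batch_size):
        u = (rng.random() + i) / batch_size   # u lies in [i/B, (i+1)/B)
        ts.append(t_min + (1.0 - t_min) * u)  # rescale to [t_min, 1)
    return ts

def freeze_visible_logits(logits, tokens, mask_id):
    """Pin visible positions to their observed token.

    Masked positions keep the model's logits, except that the mask token
    itself can never be predicted. Visible positions get a one-hot
    distribution on the token that is already there.
    """
    frozen = []
    for row, tok in zip(logits, tokens):
        if tok == mask_id:
            new_row = list(row)
            new_row[mask_id] = NEG_INF
        else:
            new_row = [NEG_INF] * len(row)
            new_row[tok] = 0.0
        frozen.append(new_row)
    return frozen
```

With frozen logits, visible tokens contribute nothing to the loss, so the model's capacity goes to the masked positions only.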

Key findings from 27 experiments

Masking eps=0.1 works far better than eps=0.001 (the single biggest win)
Eval method matters: on the same model, the MC ELBO gives 2.41 BPB and the discrete ELBO gives 1.15 BPB
AR tricks that don't transfer: LeakyReLU^2, BigramHash
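Both evaluators in the list above estimate a bound on the negative log-likelihood, and the reported BPB is that bound rescaled to bits per byte. The conversion can be sketched as follows. The function name and the convention of summing nats over the whole validation set are assumptions for illustration; the PR's own evaluation code is not shown here.

```python
import math

def bits_per_byte(total_nll_nats, total_bytes):
    """Convert a summed negative log-likelihood bound (nats) to bits per byte.

    total_nll_nats: bound on -log p(data), summed over all evaluated tokens.
    total_bytes:    number of raw bytes those tokens decode to.
    """
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return total_nll_nats / (math.log(2) * total_bytes)
```

Because the conversion is identical for both evaluators, the gap between 2.41 and 1.15 BPB comes entirely from how the bound itself is estimated.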

Hardware

Developed on GB10 (Project DIGITS). Validated on 2xH100 (TensorPool, 31 min). 8xH100 unavailable (#821). Extrapolated: ~8 min on 8xH100.
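The ~8 min figure is consistent with a simple linear-scaling estimate from the 2xH100 run. The sketch below assumes perfect scaling with GPU count, which ignores communication overhead and is an assumption rather than a measurement; the function name is hypothetical.

```python
def extrapolate_minutes(measured_minutes, measured_gpus, target_gpus):
    """Estimate wall-clock time on target_gpus, assuming perfect linear scaling."""
    return measured_minutes * measured_gpus / target_gpus

print(extrapolate_minutes(31, 2, 8))  # 7.75
```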

First discrete diffusion model to beat the AR baseline (1.22) in parameter-golf. MDLM with log-linear noise, adaLN, frozen visible-token logits, discrete ELBO eval. 27 hyperparameter experiments. Validated on 2xH100 (TensorPool), 31 min training.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

PR openai#1106 found that eps=0.1 instead of 0.001 was the single biggest improvement. With eps=0.1, 10% of tokens remain visible at t=1, giving the model anchors for denoising. The terminal KL is larger, but the task is much easier. Also revert to lr=1e-3, warmdown=1000 (v8's lr=2e-3 made the artifact >16MB).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
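The effect of eps can be made concrete with a small masking sketch. It assumes eps enters as in the log-linear schedule of the MDLM reference implementation, where the per-token mask probability at time t is (1 - eps) * t; that matches the statement that 10% of tokens stay visible at t=1 with eps=0.1, but the PR's actual code is not shown here. The function names and `mask_id` are hypothetical.

```python
import random

def mask_probability(t, eps):
    """Per-token mask probability under an assumed log-linear schedule.

    At t=1 this equals 1 - eps, so a fraction eps of tokens stays visible.
    """
    return (1.0 - eps) * t

def apply_mask(tokens, t, eps, mask_id, rng):
    """Independently replace each token with mask_id with probability (1 - eps) * t."""
    p = mask_probability(t, eps)
    return [mask_id if rng.random() < p else tok for tok in tokens]
```

Under this assumption, eps=0.001 leaves about 1 token in 1000 visible at t=1, while eps=0.1 leaves about 1 in 10, which is the source of the "anchors" mentioned above.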

- train_mdlm_combined.py: full MDLM training script (PR openai#1053 infra + PR openai#1106 MDLM + our innovations)
- sweep.sh/sweep2.sh: 12-experiment hyperparameter sweep (eps, arch, loss, seq_len)
- results.tsv: updated with v10-v13 experiments, corrected descriptions

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

valerio-oai

Selected for the notable non-record submissions section.

aiejvn mentioned this pull request Apr 2, 2026

MDLM Diffusion — val_var_bpb 0.9901, EOS learning + full dataset shard rotation, 33M params, 1x AWS A10G #1241

Open

valerio-oai approved these changes May 3, 2026

valerio-oai merged commit 16af8e1 into openai:main May 3, 2026

valerio-oai merged 1 commit into openai:main from agalimova:submission/mdlm-diffusion
