Finetune DiffusionGemma with NeMo Automodel #2509

zyzhou5 (Collaborator) · 2026-06-10T18:25:06Z

NeMo Automodel now supports supervised fine-tuning — both full-parameter and LoRA — of DiffusionGemma 26B-A4B, Google's block-diffusion Mixture-of-Experts model (PR #2506).

What is DiffusionGemma?

DiffusionGemma is a block-diffusion language model. Instead of generating tokens left-to-right, it iteratively denoises a fixed-length response "canvas": a causal encoder reads the clean prompt + response, and a bidirectional decoder refines the corrupted canvas over several steps, so tokens are produced in parallel.
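To make the control flow concrete, here is a minimal toy sketch of canvas refinement. It is not the DiffusionGemma sampler: `toy_denoiser`, `denoise_canvas`, `TARGET`, the random-token starting canvas, and the strided update rule are all hypothetical stand-ins chosen only to show that every step rewrites the whole canvas rather than appending one token.

```python
import random

# Hypothetical toy target; a real model would predict tokens from the prompt.
TARGET = [3, 1, 4, 1, 5, 9, 2, 6]


def toy_denoiser(prompt, canvas, step):
    """Stand-in for the bidirectional decoder (illustrative only).

    Each call rewrites a strided subset of positions, so several tokens
    change per step and none of them depend on left-to-right order.
    """
    return [TARGET[i] if i % 4 == step % 4 else tok
            for i, tok in enumerate(canvas)]


def denoise_canvas(prompt, canvas_len, vocab_size, denoiser, num_steps, rng):
    """Refine a fixed-length canvas for num_steps passes."""
    # Assumption: the canvas starts as uniform random tokens.
    canvas = [rng.randrange(vocab_size) for _ in range(canvas_len)]
    for step in range(num_steps):
        # The whole canvas is passed in and returned on every step.
        canvas = denoiser(prompt, canvas, step)
    return canvas


result = denoise_canvas(
    prompt=[7, 7], canvas_len=len(TARGET), vocab_size=10,
    denoiser=toy_denoiser, num_steps=4, rng=random.Random(0),
)
```

With four steps the strided toy rule touches every position once, so `result` ends up equal to `TARGET`; with fewer steps part of the canvas is still noise.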

Checkpoint: google/diffusiongemma-26B-A4B-it.

BlockDiffusionStrategy (nemo_automodel/recipes/dllm/strategy.py)

Corruption via corrupt_uniform_random: draws a per-example noise level t ~ U(eps, 1) and replaces supervised canvas positions with uniform random vocab tokens (there is no [MASK]).
Batch shape: the encoder sees the clean full sequence (prompt + response); the decoder canvas is the noised response region, with a block-causal attention mask built per example. The decoder is bidirectional, so no causal attention_mask is passed.
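The corruption step can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the code in strategy.py: the name `toy_corrupt_uniform_random`, its signature, the default `eps`, and the rule that each supervised position is corrupted independently with probability t are all assumptions made for the example.

```python
import random


def toy_corrupt_uniform_random(canvas, supervised, vocab_size, eps=1e-3, rng=None):
    """Toy version of uniform-random corruption for one example.

    canvas:     list of token ids for the response canvas
    supervised: list of bools, True where the position is supervised
    Returns (noised_canvas, corrupted_flags, t).
    """
    rng = rng or random.Random()
    # Per-example noise level t ~ U(eps, 1).
    t = rng.uniform(eps, 1.0)
    noised, corrupted = [], []
    for tok, sup in zip(canvas, supervised):
        # Assumption: each supervised position is hit independently with prob. t.
        hit = sup and rng.random() < t
        # No [MASK] token: the replacement is a uniform draw from the vocabulary.
        noised.append(rng.randrange(vocab_size) if hit else tok)
        corrupted.append(hit)
    return noised, corrupted, t


canvas = [5, 6, 7, 8, 9, 10]
supervised = [False, False, True, True, True, True]
noised, corrupted, t = toy_corrupt_uniform_random(
    canvas, supervised, vocab_size=100, rng=random.Random(0)
)
```

Unsupervised positions are never touched, and a corrupted position can by chance receive its original token back, since the replacement is drawn from the full vocabulary.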

Loss = diffusion (decoder canvas) + co-trained AR (encoder) (nemo_automodel/components/loss/dllm_loss.py)

The training objective combines two terms, with encoder_loss_weight scaling the AR term: total = diffusion + encoder_loss_weight * AR.

Diffusion (BlockDiffusionCrossEntropyLoss) — flat cross-entropy over all supervised canvas positions (corrupted and uncorrupted alike). This supervises the model in its bidirectional denoiser mode: refining the corrupted response canvas.
Autoregressive encoder (encoder_ar_loss) — standard causal next-token cross-entropy on the encoder's logits over the clean full sequence. This supervises the model in its causal encoder mode: the same weights reading the clean context that conditions the denoiser.

Because DiffusionGemma uses one set of weights in two modes (Google's "encoder-denoiser patch"), the two terms supervise the two modes the model actually runs at inference: the diffusion term shapes denoising, the AR term shapes the causal read of the prompt that the denoiser is conditioned on.
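The two-term objective can be sketched in plain Python as follows. This is a simplified illustration, not the code in dllm_loss.py: the function names and signatures, the mean reduction, and applying the AR term to every position of the sequence are assumptions made for the example.

```python
import math


def cross_entropy(logits, target):
    """Negative log-likelihood of `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def diffusion_loss(canvas_logits, clean_canvas, supervised):
    """Flat mean CE over all supervised canvas positions, corrupted or not."""
    losses = [cross_entropy(l, y)
              for l, y, s in zip(canvas_logits, clean_canvas, supervised) if s]
    return sum(losses) / max(len(losses), 1)


def toy_encoder_ar_loss(encoder_logits, tokens):
    """Causal next-token CE: logits at position i predict tokens[i + 1]."""
    losses = [cross_entropy(encoder_logits[i], tokens[i + 1])
              for i in range(len(tokens) - 1)]
    return sum(losses) / max(len(losses), 1)


def total_loss(diffusion, ar, encoder_loss_weight):
    """total = diffusion + encoder_loss_weight * AR."""
    return diffusion + encoder_loss_weight * ar


# Uniform logits over a 4-token vocabulary give a CE of log(4) everywhere.
uniform = [0.0, 0.0, 0.0, 0.0]
d = diffusion_loss([uniform] * 3, [0, 1, 2], [True, False, True])
a = toy_encoder_ar_loss([uniform] * 4, [0, 1, 2, 3])
total = total_loss(d, a, encoder_loss_weight=0.5)
```

Setting encoder_loss_weight to 0 in this sketch reduces the objective to the diffusion term alone, which makes the contribution of the AR term easy to isolate.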

Other supported features: FSDP2 + Expert Parallelism, mixed precision (fp32 master weights + bf16 compute), frozen router, and two-pass self-conditioning.

Results

SFT: [figure]

LoRA: [figure]

Many thanks to @zyzhou5 and @pthombre for their contributions!
