Finetune DiffusionGemma with NeMo Automodel #2509
zyzhou5 started this conversation in Show and tell
NeMo Automodel now supports supervised fine-tuning — both full-parameter and LoRA — of DiffusionGemma 26B-A4B, Google's block-diffusion Mixture-of-Experts model (PR #2506).
What is DiffusionGemma?
DiffusionGemma is a block-diffusion language model. Instead of generating tokens left-to-right, it iteratively denoises a fixed-length response "canvas": a causal encoder reads the clean prompt + response, and a bidirectional decoder refines the corrupted canvas over several steps, so tokens are produced in parallel.
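The control flow of canvas denoising can be illustrated with a toy loop. This is only a sketch of the idea, not DiffusionGemma's sampler: `toy_denoiser`, `generate`, `fix_prob`, and the random initial canvas are all hypothetical, and the stand-in denoiser is handed the target so the loop has something to converge to.

```python
import random


def toy_denoiser(prompt, canvas, target, rng, fix_prob):
    """Hypothetical stand-in for the bidirectional decoder.

    A real decoder would predict every canvas position from the prompt and
    the whole current canvas. Here each wrong position is simply corrected
    with probability fix_prob.
    """
    return [
        tgt if tok == tgt or rng.random() < fix_prob else tok
        for tok, tgt in zip(canvas, target)
    ]


def generate(prompt, target, vocab_size, num_steps, fix_prob=0.5, seed=0):
    rng = random.Random(seed)
    # Fixed-length canvas; starting from random token ids is an assumption.
    canvas = [rng.randrange(vocab_size) for _ in target]
    for _ in range(num_steps):
        # Every position is refined in the same call, not one token at a time.
        canvas = toy_denoiser(prompt, canvas, target, rng, fix_prob)
    return canvas


if __name__ == "__main__":
    print(generate(prompt=[1, 2], target=[3, 4, 5, 6], vocab_size=50, num_steps=8))
```

The point of the sketch is the shape of the loop: the canvas length never changes, and each step rewrites all positions at once.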
Checkpoint: google/diffusiongemma-26B-A4B-it.

BlockDiffusionStrategy (nemo_automodel/recipes/dllm/strategy.py)
- corrupt_uniform_random: draws a per-example noise level t ~ U(eps, 1) and replaces supervised canvas positions with uniform random vocab tokens (there is no [MASK]).
- attention_mask is passed.

Loss = diffusion (decoder canvas) + co-trained AR (encoder) (nemo_automodel/components/loss/dllm_loss.py)

The training objective combines two terms, total = diffusion + encoder_loss_weight * AR:
- diffusion (BlockDiffusionCrossEntropyLoss): flat cross-entropy over all supervised canvas positions (corrupted and uncorrupted alike). This supervises the model in its bidirectional denoiser mode: refining the corrupted response canvas.
- AR (encoder_ar_loss): standard causal next-token cross-entropy on the encoder's logits over the clean full sequence. This supervises the model in its causal encoder mode: the same weights reading the clean context that conditions the denoiser.

Because DiffusionGemma uses one set of weights in two modes (Google's "encoder-denoiser patch"), the two terms supervise the two modes the model actually runs at inference: the diffusion term shapes denoising, and the AR term shapes the causal read of the prompt that the denoiser is conditioned on.
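The corruption step and the two-term objective can be sketched in plain Python on token-id lists. This is an illustration under stated assumptions, not the code in strategy.py or dllm_loss.py: the function signatures, the `IGNORE` label value, replacing each supervised position with probability t, and averaging (rather than summing) the per-position losses are all assumptions made for the example.

```python
import math
import random

IGNORE = -100  # assumed label value for unsupervised positions


def corrupt_uniform_random(canvas, supervised, vocab_size, eps=1e-3, rng=random):
    """Draw t ~ U(eps, 1), then replace supervised positions with random ids.

    Replacing each supervised position independently with probability t is
    an assumption; unsupervised positions are never touched.
    """
    t = rng.uniform(eps, 1.0)
    noisy = list(canvas)
    for i, is_supervised in enumerate(supervised):
        if is_supervised and rng.random() < t:
            noisy[i] = rng.randrange(vocab_size)  # no [MASK]; any id may appear
    return noisy, t


def cross_entropy(logits, target):
    """Negative log-likelihood of target under softmax(logits)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def diffusion_loss(decoder_logits, labels):
    """Mean cross-entropy over supervised canvas positions,
    whether or not a given position was corrupted."""
    losses = [cross_entropy(row, y)
              for row, y in zip(decoder_logits, labels) if y != IGNORE]
    return sum(losses) / len(losses)


def encoder_ar_loss(encoder_logits, tokens):
    """Causal next-token cross-entropy: position i predicts tokens[i + 1]."""
    losses = [cross_entropy(encoder_logits[i], tokens[i + 1])
              for i in range(len(tokens) - 1)]
    return sum(losses) / len(losses)


def total_loss(decoder_logits, labels, encoder_logits, tokens, encoder_loss_weight):
    """total = diffusion + encoder_loss_weight * AR"""
    return (diffusion_loss(decoder_logits, labels)
            + encoder_loss_weight * encoder_ar_loss(encoder_logits, tokens))


if __name__ == "__main__":
    noisy, t = corrupt_uniform_random([5, 6, 7, 8], [False, True, True, True],
                                      vocab_size=10, rng=random.Random(0))
    print(noisy, round(t, 3))
```

In this sketch the diffusion term reads the decoder's logits for the corrupted canvas, while the AR term reads the encoder's logits for the clean sequence; in the real model both sets of logits come from the same weights.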
Other supported features: FSDP2 + Expert Parallelism, mixed precision (fp32 master weights + bf16 compute), frozen router, and two-pass self-conditioning.
Results
SFT

LoRA
Many thanks to @zyzhou5 @pthombre for all contributions!