Gemma4 Assistant model support for MTP #2481

athitten · 2026-06-09T19:54:43Z

athitten
Jun 9, 2026
Collaborator

NeMo-Automodel recently added support for the Gemma 4 assistant model (also called the drafter model), which targets faster inference through Multi-Token Prediction (MTP). Assistant models are separate checkpoints (for example, google/gemma-4-E4B-it-assistant) that are significantly smaller (~70M params) than the base/target model. They help predict multiple tokens in the time the base model alone takes to predict one.

NeMo-Automodel specifically adds support for jointly fine-tuning the Gemma 4 base and Gemma 4 drafter/assistant models, so end users can fine-tune both on custom data and get the benefits of speculative decoding when running inference on their own model.
In our experiments, inference after joint fine-tuning was ~1.98x faster with MTP than without it.
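
For readers new to speculative decoding, here is a minimal toy sketch of the greedy draft-then-verify loop. It is not the NeMo-Automodel or transformers implementation: `target_next`, `draft_next`, `speculative_step`, and `generate` are hypothetical names, and both "models" are simple stand-in functions over integer tokens. The point it illustrates is that the output matches plain greedy decoding with the target, while the number of target steps drops whenever the drafter guesses correctly.

```python
from typing import Callable, List, Tuple

Token = int
NextTokenFn = Callable[[List[Token]], Token]


def target_next(context: List[Token]) -> Token:
    # Toy "base model" (assumption for illustration): counts upward modulo 10.
    return (context[-1] + 1) % 10


def draft_next(context: List[Token]) -> Token:
    # Toy "drafter": agrees with the target except when the last token % 3 == 2.
    if context[-1] % 3 == 2:
        return 0  # deliberately wrong guess
    return (context[-1] + 1) % 10


def speculative_step(target: NextTokenFn, draft: NextTokenFn,
                     context: List[Token], k: int) -> List[Token]:
    """One draft-then-verify step; returns the tokens emitted in this step."""
    # 1) The drafter proposes k tokens autoregressively.
    proposal: List[Token] = []
    for _ in range(k):
        proposal.append(draft(context + proposal))

    # 2) The target checks the proposal left to right. A real system does this
    #    in one batched forward pass; here it is a loop for clarity.
    accepted: List[Token] = []
    for tok in proposal:
        expected = target(context + accepted)
        if tok != expected:
            accepted.append(expected)  # keep the target's token, drop the rest
            return accepted
        accepted.append(tok)

    # 3) Every drafted token matched, so the target contributes one extra token.
    accepted.append(target(context + accepted))
    return accepted


def generate(target: NextTokenFn, draft: NextTokenFn, prompt: List[Token],
             max_new_tokens: int, k: int) -> Tuple[List[Token], int]:
    out = list(prompt)
    steps = 0
    while len(out) - len(prompt) < max_new_tokens:
        out.extend(speculative_step(target, draft, out, k))
        steps += 1
    return out[: len(prompt) + max_new_tokens], steps


def greedy(target: NextTokenFn, prompt: List[Token], max_new_tokens: int) -> List[Token]:
    # Reference: plain one-token-per-step decoding with the target only.
    out = list(prompt)
    for _ in range(max_new_tokens):
        out.append(target(out))
    return out


tokens, steps = generate(target_next, draft_next, [0], max_new_tokens=12, k=3)
print(tokens == greedy(target_next, [0], 12), steps)
```

In this sketch, the number of new tokens divided by `steps` plays the role of the "mean accepted tokens/step" metric: without a drafter it is exactly 1, and it grows as the drafter agrees with the target more often.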

The drafter is co-trained end-to-end with the Gemma 4 base via a composite model (Gemma4WithDrafter) that wires up shared K/V states, sqrt(H_b)-scaled embeddings, and a K-step recurrent forward pass, matching the Gemma 4 drafter tech report.

Joint Finetuning Recipes

We provide two reference configs for joint fine-tuning of gemma-4-E4B-it and gemma-4-E4B-it-assistant. The first, gemma4_4b_joint_drafter_medpix.yaml, uses the MedPix-VQA dataset. The second, gemma4_4b_joint_drafter_tulu_magicoder_mix.yaml, uses a text-only Tulu-3 + Magicoder mix, which is a larger dataset than MedPix-VQA. We also provide an inference benchmark script, benchmark_mtp_inference.py, that validates speculative-decode throughput on the saved checkpoint.

Results

Fine-tuning the joint model on MedPix-VQA using the recipe examples/vlm_finetune/gemma4_joint_drafter/gemma4_4b_joint_drafter_medpix.yaml
Loss curve:

Large-scale fine-tuning on a Tulu-3 (80%) + Magicoder (20%) mix for 500 steps, using the recipe examples/vlm_finetune/gemma4_joint_drafter/gemma4_4b_joint_drafter_tulu_magicoder_mix.yaml. The fine-tuning run was stable.
Loss curve:

Inference run on the Tulu-3 + Magicoder fine-tuned checkpoint after 500 steps. We ran benchmark_mtp_inference.py (which uses transformers generate) against the saved base/ + drafter/ pair. Results below:

Metric	no MTP	with MTP
# prompts	20	20
total new tokens	3,627	3,655
total decode time (s)	144.80	73.82
aggregate tokens/sec	25.05	49.51
mean accepted tokens/step	1.000	2.473
wall-clock throughput speed-up (with MTP / no MTP)	n/a	1.98×
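
The throughput and speed-up rows can be re-derived from the token counts and decode times in the table. The short check below uses only the numbers reported above; the dictionary layout and the helper name `tokens_per_sec` are just for illustration.

```python
# Values copied from the results table above.
results = {
    "no_mtp": {"new_tokens": 3627, "decode_s": 144.80},
    "with_mtp": {"new_tokens": 3655, "decode_s": 73.82},
}


def tokens_per_sec(run: dict) -> float:
    # Aggregate throughput: total new tokens divided by total decode time.
    return run["new_tokens"] / run["decode_s"]


tps_base = tokens_per_sec(results["no_mtp"])
tps_mtp = tokens_per_sec(results["with_mtp"])

# Speed-up expressed as a ratio of throughputs (with MTP over no MTP).
speedup = tps_mtp / tps_base

# Ratio of raw decode times, for comparison.
time_ratio = results["no_mtp"]["decode_s"] / results["with_mtp"]["decode_s"]

print(f"{tps_base:.2f} tok/s, {tps_mtp:.2f} tok/s, "
      f"throughput ratio {speedup:.2f}x, time ratio {time_ratio:.2f}x")
```

The 1.98× figure corresponds to the ratio of aggregate tokens/sec. The ratio of raw decode times is slightly lower (about 1.96×) because the two runs generated slightly different numbers of tokens.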

MTP support for other models in NeMo-Automodel

While the Gemma 4 assistant is a separate model, for which NeMo-Automodel provides scaffolding for joint fine-tuning with the base model, we also support MTP for models that have built-in MTP layers, such as Nemotron V3, DeepSeek V4 (Flash), and Qwen3.6 dense and MoE. Check out the respective model recipes to enable it.

Thanks to @adil-a @khazic @HuiyingLi for this!!
