[trainer] feat: Add Nemo-Automodel as alternative training engine #5407
ISEEKYAN merged 21 commits into verl-project:main
Conversation
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Code Review
This pull request introduces a new automodel SFT backend, which leverages nemo_automodel for distributed training. The changes include adding the engine implementation, configuration files, and test scripts. I've identified a configuration issue in the test script and a maintainability concern in the engine implementation. Overall, this is a significant feature addition.
```python
if isinstance(output, torch.Tensor):
    from types import SimpleNamespace

    output = SimpleNamespace(logits=output)
```
The model's output is conditionally wrapped in a SimpleNamespace if it's a raw tensor. This suggests an inconsistent return type from self.module, which can make the code harder to maintain and reason about. It would be more robust to enforce a consistent, structured return type (like CausalLMOutput) from the model to avoid such conditional handling.
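The reviewer's suggestion can be sketched as a small normalization helper (hypothetical name, not code from this PR) that gives callers a single structured type regardless of what the module returns:

```python
from types import SimpleNamespace


def normalize_output(output):
    """Return an object that always exposes `.logits`.

    Mirrors the conditional in the diff above: a structured output
    (e.g. a HuggingFace CausalLMOutput) passes through unchanged,
    while a raw tensor is wrapped so downstream code never branches.
    """
    if hasattr(output, "logits"):
        return output
    return SimpleNamespace(logits=output)
```

Centralizing the wrapping in one helper keeps the inconsistent return type from leaking into the rest of the engine.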
Hi @HuiyingLi, thanks for your great contribution.
Hi @ISEEKYAN ,

Great, is your experiment on H100? If so, the MFU looks good, but it would be better to have a fair comparison with FSDP or Megatron. This is not a blocker for merging this PR; it is just a good reference for users adopting AutoModel. It would also be good to add a doc showing the comparison and an example so users can easily get hands-on.
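For reference, the MFU mentioned above is typically estimated with the standard 6·N·T approximation for dense decoder training FLOPs; this is the common convention (e.g. from the PaLM paper), not a formula defined in this PR:

```python
def mfu(num_params, tokens_per_sec, peak_flops_per_sec):
    """Model FLOPs utilization: achieved training FLOPs over hardware peak.

    6 * params * tokens/s approximates forward+backward FLOP/s for a
    dense decoder-only model (the usual PaLM-style convention).
    """
    achieved_flops_per_sec = 6 * num_params * tokens_per_sec
    return achieved_flops_per_sec / peak_flops_per_sec
```

For example, a 7B-parameter model training at 10k tokens/s on hardware with 1e15 peak FLOP/s gives 6 * 7e9 * 1e4 / 1e15 = 0.42 MFU.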
@HuiyingLi Thanks for your great contribution! Could you please sign the CLA?
```
**Requirements**

- Automodel r0.3.0
```
Maybe in another PR, we should refactor docs/start/install.rst to cover install methods for all model engines and rollout engines, and present them more clearly so users can choose between them.

We should also prepare some CI tests for Nemo-Automodel.
@ETOgaosion Thank you for reviewing! I've signed the CLA. Could you please suggest how to pass the CI for this PR? For the license error, do you suggest adding NVIDIA to the list? Should we add CI tests in this PR? Thank you!
For the Nemo-Automodel CI, I'm wondering whether verl's docker image contains the components Nemo-AutoModel needs. If there are no dependency conflicts with Megatron, we can directly reuse the current Megatron docker images for this backend by adding the needed packages. For adding CI, we can use another PR to test new images and merge this one first. Once the image is ready, you can enable the test in tests/special_e2e/sft/test_sft_engine_all.sh, as you implemented here.
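A minimal sketch of the CI step being discussed, assuming the Megatron image can be reused. The pip package name is an assumption; the test-script path comes from the comment above. The snippet only constructs and prints the command rather than executing it:

```shell
# Hypothetical CI step: extend the existing Megatron image with the Automodel
# dependency, then run the SFT engine test matrix. The package name
# "nemo-automodel" is an assumption, not confirmed by this PR.
EXTRA_PKG="nemo-automodel"
TEST_SCRIPT="tests/special_e2e/sft/test_sft_engine_all.sh"
CMD="pip install ${EXTRA_PKG} && bash ${TEST_SCRIPT}"
echo "${CMD}"
```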
```
@@ -0,0 +1,20 @@
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
```
@HuiyingLi For the failed CIs, could you rebase main and retry?
### What does this PR do?

Add NeMo-Automodel as a training engine. The SFT trainer is tested with Qwen2.5-0.5B.

- The automodel engine matches the FSDP engine exactly for the SFT trainer (TP1/TP2, rmpad=True/False).
- use_remove_padding=True matches use_remove_padding=False.
- EP support tested with Kimi Moonlight 16B.

Relevant PRs:

- RFC verl-project#5245
- Add VeOmni verl-project#4072
- Add TorchTitan verl-project#5051

### Checklist Before Starting

- [ ] Search for similar PRs. Paste at least one query link here: ...
- [ ] Format the PR title as `[{modules}] {type}: {description}` (this will be checked by the CI)
  - `{modules}` include `fsdp`, `megatron`, `veomni`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`, `cfg`, `reward`, `fully_async`, `one_step_off`
  - If this PR involves multiple modules, separate them with `,` like `[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
  - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title.
    - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

### Test
automodel backend, 1 GPU & 4 GPU TP1/TP2, against FSDP backend, 1 GPU, rmpad true/false (screenshot in original PR)

automodel backend finetuning Moonlight 16B with EP8 on 8xH100 (screenshot in original PR)

automodel backend finetuning Qwen3 30B with EP8 on 8xH100 (screenshot in original PR)

automodel backend finetuning Qwen2.5-7B on 4xH100 with FSDP2 (screenshot in original PR)
### API and Usage Example

```python
# Add code snippet or script demonstrating how to use this
```

### Design & Code Changes

### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

- [ ] Apply pre-commit checks: `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- [ ] Once your PR is ready for CI, send a message in the `ci-request` channel in the `verl` Slack workspace. (If not accessible, please try the Feishu group (飞书群).)
- [ ] If your PR is related to the `recipe` submodule, please also update the reference to the submodule commit via `git submodule update --remote` or `cd recipe && git pull origin main`.