
[trainer] feat: Add Nemo-Automodel as alternative training engine#5407

Merged
ISEEKYAN merged 21 commits into verl-project:main from HuiyingLi:add_automodel_sft_backend
Mar 20, 2026
Conversation

@HuiyingLi
Contributor

@HuiyingLi HuiyingLi commented Feb 26, 2026

What does this PR do?

Add NeMo-Automodel as a training engine. The SFT trainer is tested with Qwen2.5-0.5B.

  • The automodel engine matches the FSDP engine exactly for the SFT trainer (TP1/TP2, rmpad=True/False)
  • use_remove_padding=True matches use_remove_padding=False
  • EP support tested with Kimi Moonlight 16B
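The rmpad parity claimed above reduces to a simple identity: the mean per-token loss is the same whether sequences are padded and masked or packed into one flat token stream. A minimal pure-Python sketch of that identity (illustrative only; this is not the PR's actual test, and the toy logits below are made up):

```python
import math

def token_loss(logits, target):
    # cross-entropy of one token: -log softmax(logits)[target]
    m = max(logits)
    z = sum(math.exp(x - m) for x in logits)
    return -(logits[target] - m - math.log(z))

# two sequences of different lengths, vocab size 3: (logits, target) per token
seqs = [
    [([2.0, 0.1, -1.0], 0), ([0.5, 1.5, 0.0], 1)],
    [([0.0, 0.0, 3.0], 2), ([1.0, -1.0, 0.5], 0), ([0.2, 0.4, 0.1], 1)],
]

# padded + masked: pad both to length 3, mask out the pad positions
pad = ([0.0, 0.0, 0.0], 0)
padded = [s + [pad] * (3 - len(s)) for s in seqs]
mask = [[1] * len(s) + [0] * (3 - len(s)) for s in seqs]
masked_sum = sum(token_loss(*tok) * m
                 for row, mrow in zip(padded, mask)
                 for tok, m in zip(row, mrow))
masked_mean = masked_sum / sum(sum(m) for m in mask)

# packed ("padding removed"): concatenate only the valid tokens
flat = [tok for s in seqs for tok in s]
packed_mean = sum(token_loss(*t) for t in flat) / len(flat)

assert abs(masked_mean - packed_mean) < 1e-12
```

The two paths differ only in where the invalid positions are dropped, which is why the engines are expected to match bit-for-bit up to floating-point accumulation order.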

Relevant PRs:

  • RFC verl-project#5245
  • Add VeOmni verl-project#4072
  • Add TorchTitan verl-project#5051

Checklist Before Starting

  • Search for similar PRs. Paste at least one query link here: ...
  • Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
    • {modules} include fsdp, megatron, veomni, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data, cfg, reward, fully_async, one_step_off
    • If this PR involves multiple modules, separate them with , like [megatron, fsdp, doc]
    • {type} is in feat, fix, refactor, chore, test
    • If this PR breaks any API (CLI arguments, config, function signature, etc.), add [BREAKING] to the beginning of the title.
    • Example: [BREAKING][fsdp, megatron] feat: dynamic batching

Test

Automodel backend, 1 GPU and 4 GPU (TP1/TP2), against the FSDP backend on 1 GPU, with rmpad true/false.
[image]

Automodel backend finetuning Moonlight 16B with EP8 on 8×H100.
[image]

Automodel backend finetuning Qwen3 30B with EP8 on 8×H100.
[image]

Automodel backend finetuning Qwen2.5-7B on 4×H100 with FSDP2.
[image]

API and Usage Example

> Demonstrate how the API changes if any, and provide usage example(s) if possible.

```python
# Add code snippet or script demonstrating how to use this
```

Design & Code Changes

> Demonstrate the high-level design if this PR is complex, and list the specific changes.

Checklist Before Submitting

Important

Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
@CLAassistant

CLAassistant commented Feb 26, 2026

CLA assistant check
All committers have signed the CLA.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a new automodel SFT backend, which leverages nemo_automodel for distributed training. The changes include adding the engine implementation, configuration files, and test scripts. I've identified a configuration issue in the test script and a maintainability concern in the engine implementation. Overall, this is a significant feature addition.

Comment on lines +549 to +552

```python
if isinstance(output, torch.Tensor):
    from types import SimpleNamespace

    output = SimpleNamespace(logits=output)
```
high

The model's output is conditionally wrapped in a SimpleNamespace if it's a raw tensor. This suggests an inconsistent return type from self.module, which can make the code harder to maintain and reason about. It would be more robust to enforce a consistent, structured return type (like CausalLMOutput) from the model to avoid such conditional handling.
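One way to act on this suggestion is to normalize the forward output once, behind a small structured type, instead of conditionally wrapping raw tensors at each call site. A hypothetical sketch (not the PR's code; `CausalLMOutput` and `normalize_output` are illustrative names, and `Any` stands in for `torch.Tensor` to keep the sketch self-contained):

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class CausalLMOutput:
    logits: Any          # torch.Tensor in a real engine
    loss: Any = None

def normalize_output(output) -> CausalLMOutput:
    """Accept a raw logits tensor or any object exposing .logits,
    and always return a CausalLMOutput."""
    if isinstance(output, CausalLMOutput):
        return output
    if hasattr(output, "logits"):
        # HF-style ModelOutput: reuse its fields
        return CausalLMOutput(logits=output.logits,
                              loss=getattr(output, "loss", None))
    return CausalLMOutput(logits=output)  # raw tensor case
```

Callers would then do `out = normalize_output(self.module(**inputs))` and rely on `out.logits` unconditionally, confining the inconsistent return type to a single boundary function.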

@HuiyingLi HuiyingLi changed the title Add automodel sft backend [trainer] feat: Add Nemo-Automodel as alternative training engine Feb 27, 2026
@HuiyingLi HuiyingLi marked this pull request as ready for review February 27, 2026 10:10
@ISEEKYAN
Collaborator

hi @HuiyingLi, thanks for your great contribution.
I found that the MFU of automodel is lower than FSDP on the 0.5B model, and the MFU is below 1% on the 16B MoE model. Is this expected? Could you provide a fair comparison on popular models such as a 7B dense or 30B MoE?

@HuiyingLi
Contributor Author

> hi @HuiyingLi , thanks to your great contribution. I found the MFU of automodel is lower than FSDP on 0.5B model and the MFU is less than 1% on 16B MoE model, is this supposed to be right? Could you provide some fair comparison on popular models such as 7B dense or 30B MoE?

Hi @ISEEKYAN ,
Thank you!

  • For the 0.5B model, the FSDP engine ran on a single GPU while automodel used 4 GPUs. I've updated the chart with a single-GPU automodel run for comparison.
  • The low MFU on the 16B model was due to a very small sequence length and batch size. I've updated the chart with a larger seqlen and global batch size, and added charts for Qwen 30B MoE and Qwen 7B dense.
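The dependence of MFU on throughput can be made concrete with the standard 6·N rule of thumb (roughly 6 training FLOPs per active parameter per token, forward plus backward). A back-of-envelope sketch; the parameter counts and token rates below are made up for illustration, not taken from this PR's runs:

```python
def mfu(active_params, tokens_per_sec_per_gpu, peak_flops_per_gpu):
    # training needs ~6 FLOPs per (active) parameter per token (fwd + bwd);
    # for MoE models, use the *active* parameter count, not the total
    achieved = 6 * active_params * tokens_per_sec_per_gpu
    return achieved / peak_flops_per_gpu

H100_BF16_PEAK = 989e12  # dense BF16 peak, FLOP/s (without sparsity)

# e.g. a 16B-total MoE with ~3B active params at 10k tokens/s/GPU:
print(f"{mfu(3e9, 10_000, H100_BF16_PEAK):.1%}")
```

Since MFU scales linearly with tokens per second, a tiny sequence length and batch size (which starve the GPU of work) directly show up as sub-1% MFU, which is consistent with the explanation above.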

@ISEEKYAN
Collaborator

ISEEKYAN commented Mar 2, 2026

Great. Were your experiments on H100? If so, the MFU looks good, but a fair comparison with FSDP or Megatron would still be better. This is not a blocker for merging this PR; it is just a useful reference for users adopting AutoModel. It would also be good to add a doc showing the comparison, and an example so users can easily get hands-on.

@ETOgaosion
Collaborator

@HuiyingLi Thanks for your great contribution! Could you please sign the CLA~


**Requirements**

- Automodel r0.3.0
Collaborator

Maybe in another PR we should refactor docs/start/install.rst to cover the install methods for all model engines and rollout engines, and present the options more clearly so users can choose between them.

@ETOgaosion
Collaborator

We should also prepare some CI tests for Nemo-Automodel

@HuiyingLi
Contributor Author

@ETOgaosion Thank you for reviewing! I've signed the CLA. Could you suggest how to get CI passing for this PR? For the license error, do you suggest adding NVIDIA to the list? Should we add CI tests in this PR? Thank you!

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@ETOgaosion
Collaborator

For the Nemo-Automodel CI, I'm wondering whether verl's docker image already contains the components Nemo-AutoModel needs.

Or, if there are no dependency conflicts with Megatron, we can directly reuse the current Megatron docker images to run this backend by adding the needed packages.

As for adding CI, we can use another PR to test new images and merge this one first.

Once the image is ready, you can enable the test in tests/special_e2e/sft/test_sft_engine_all.sh, as you implemented here.

@@ -0,0 +1,20 @@
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
Collaborator

License LGTM

@ETOgaosion
Collaborator

@HuiyingLi For the failed CIs, could you rebase onto main and retry?

@ISEEKYAN ISEEKYAN merged commit 2a4b096 into verl-project:main Mar 20, 2026
88 of 103 checks passed
sijyang pushed a commit to sijyang/verl that referenced this pull request Apr 1, 2026
ZouKexin-522 pushed a commit to ZouKexin-522/verl that referenced this pull request Apr 8, 2026