[trainer] feat: Add Nemo-Automodel as alternative training engine #5407
ISEEKYAN merged 21 commits into verl-project:main
Conversation
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Code Review
This pull request introduces a new automodel SFT backend, which leverages nemo_automodel for distributed training. The changes include adding the engine implementation, configuration files, and test scripts. I've identified a configuration issue in the test script and a maintainability concern in the engine implementation. Overall, this is a significant feature addition.
```python
if isinstance(output, torch.Tensor):
    from types import SimpleNamespace

    output = SimpleNamespace(logits=output)
```
The model's output is conditionally wrapped in a SimpleNamespace if it's a raw tensor. This suggests an inconsistent return type from self.module, which can make the code harder to maintain and reason about. It would be more robust to enforce a consistent, structured return type (like CausalLMOutput) from the model to avoid such conditional handling.
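The reviewer's suggestion can be sketched as a small normalization helper (hypothetical name, not code from this PR) that gives callers a single structured type regardless of what the module returns:

```python
from types import SimpleNamespace


def normalize_output(output):
    """Return an object that always exposes `.logits`.

    Mirrors the conditional in the diff above: a structured output
    (e.g. a HuggingFace CausalLMOutput) passes through unchanged,
    while a raw tensor is wrapped so downstream code never branches.
    """
    if hasattr(output, "logits"):
        return output
    return SimpleNamespace(logits=output)
```

Centralizing the wrapping in one helper keeps the inconsistent return type from leaking into the rest of the engine.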
Hi @HuiyingLi, thanks for your great contribution.
Hi @ISEEKYAN ,

Great, is your experiment on H100? If so, the MFU looks good, but it would be better to have a fair comparison with FSDP or Megatron. This is not a blocker for merging this PR; it is just a good reference for users adopting AutoModel. It would also be good to add a doc showing the comparison and an example so users can easily get hands-on.
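For reference, the MFU mentioned above is typically estimated with the standard 6·N·T approximation for dense decoder training FLOPs; this is the common convention (e.g. from the PaLM paper), not a formula defined in this PR:

```python
def mfu(num_params, tokens_per_sec, peak_flops_per_sec):
    """Model FLOPs utilization: achieved training FLOPs over hardware peak.

    6 * params * tokens/s approximates forward+backward FLOP/s for a
    dense decoder-only model (the usual PaLM-style convention).
    """
    achieved_flops_per_sec = 6 * num_params * tokens_per_sec
    return achieved_flops_per_sec / peak_flops_per_sec
```

For example, a 7B-parameter model training at 10k tokens/s on hardware with 1e15 peak FLOP/s gives 6 * 7e9 * 1e4 / 1e15 = 0.42 MFU.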
@HuiyingLi Thanks for your great contribution! Could you please sign the CLA?
```
**Requirements**

- Automodel r0.3.0
```
Maybe in another PR, we should refactor docs/start/install.rst to cover install methods for all model engines and rollout engines, and present them more clearly so users can choose between them.

We should also prepare some CI tests for Nemo-Automodel.
@ETOgaosion Thank you for reviewing! I've signed the CLA. Could you please suggest how to pass the CI for this PR? For the license error, do you suggest adding NVIDIA to the list? Should we add CI tests in this PR? Thank you!
For the Nemo-Automodel CI, I'm wondering whether verl's docker image contains the components Nemo-AutoModel needs. If there are no dependency conflicts with Megatron, we can directly reuse the current Megatron docker images for this backend by adding the needed packages. For adding CI, we can use another PR to test new images and merge this one first. Once the image is ready, you can enable the test in tests/special_e2e/sft/test_sft_engine_all.sh, as you implemented here.
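A minimal sketch of the CI step being discussed, assuming the Megatron image can be reused. The pip package name is an assumption; the test-script path comes from the comment above. The snippet only constructs and prints the command rather than executing it:

```shell
# Hypothetical CI step: extend the existing Megatron image with the Automodel
# dependency, then run the SFT engine test matrix. The package name
# "nemo-automodel" is an assumption, not confirmed by this PR.
EXTRA_PKG="nemo-automodel"
TEST_SCRIPT="tests/special_e2e/sft/test_sft_engine_all.sh"
CMD="pip install ${EXTRA_PKG} && bash ${TEST_SCRIPT}"
echo "${CMD}"
```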
```
@@ -0,0 +1,20 @@
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
```
@HuiyingLi For the failed CIs, could you rebase main and retry?
### What does this PR do?

Add NeMo-Automodel as a training engine. The SFT trainer is tested with Qwen2.5-0.5B.

- The automodel engine matches the FSDP engine exactly for the SFT trainer (TP1/TP2, rmpad=True/False).
- use_remove_padding=True matches use_remove_padding=False.
- EP support tested with Kimi Moonlight 16B.

Relevant PRs:

- RFC verl-project#5245
- Add VeOmni verl-project#4072
- Add TorchTitan verl-project#5051

### Checklist Before Starting

- [ ] Search for similar PRs. Paste at least one query link here: ...
- [ ] Format the PR title as `[{modules}] {type}: {description}` (this will be checked by the CI)
  - `{modules}` include `fsdp`, `megatron`, `veomni`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`, `cfg`, `reward`, `fully_async`, `one_step_off`
  - If this PR involves multiple modules, separate them with `,` like `[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
  - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title.
    - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

### Test
automodel backend, 1 GPU & 4 GPU TP1/TP2, against FSDP backend, 1 GPU, rmpad true/false (screenshot in original PR)

automodel backend finetuning Moonlight 16B with EP8 on 8xH100 (screenshot in original PR)

automodel backend finetuning Qwen3 30B with EP8 on 8xH100 (screenshot in original PR)

automodel backend finetuning Qwen2.5-7B on 4xH100 with FSDP2 (screenshot in original PR)
### API and Usage Example

```python
# Add code snippet or script demonstrating how to use this
```

### Design & Code Changes

### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

- [ ] Apply pre-commit checks: `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- [ ] Once your PR is ready for CI, send a message in the `ci-request` channel in the `verl` Slack workspace. (If not accessible, please try the Feishu group (飞书群).)
- [ ] If your PR is related to the `recipe` submodule, please also update the reference to the submodule commit via `git submodule update --remote` or `cd recipe && git pull origin main`.