[megatron] fix: the NPU error that occurs after migrating from megatron worker to engine worker. #6135
Code Review
This pull request introduces MindspeedEngineWithValueHead and MegatronEngineWithValueHead across the engine worker modules to support value models. The review feedback identifies three improvements: the unused import of repatch in engine_workers.py should be removed, the redundant __init__ method in MindspeedEngineWithValueHead can be deleted, and the duplicated patching logic in _init_device_mesh should be refactored into a shared helper function to improve maintainability.
    try:
        from verl.workers.engine.mindspeed.transformer_impl import repatch
    except ImportError:
        repatch = None
The import of repatch in verl/workers/engine_workers.py is unused and redundant. The repatch function is already imported and called within the specific engine implementations (e.g., in verl/workers/engine/mindspeed/transformer_impl.py) during their initialization. Adding it here at the top level of the worker file introduces unnecessary module loading and dead code.
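The point of this comment can be illustrated with a minimal, self-contained sketch. The module name `hypothetical_npu_patch`, the helper `load_optional`, and the `Engine` class are placeholders invented for illustration and are not part of verl; the sketch only shows an optional hook being resolved inside the component that calls it, rather than at the top of an unrelated module.

```python
import importlib


def load_optional(module_name: str, attr: str):
    """Return `attr` from `module_name`, or None if it cannot be imported."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    return getattr(module, attr, None)


class Engine:
    """Toy engine (hypothetical) that resolves its optional patch hook itself."""

    def __init__(self, patch_module: str = "hypothetical_npu_patch"):
        # The optional import lives next to its only call site.
        self._repatch = load_optional(patch_module, "repatch")

    def initialize(self) -> bool:
        # Call the hook only when it was importable; report whether it ran.
        if self._repatch is not None:
            self._repatch()
            return True
        return False
```

With this layout, a worker module that never calls the hook has no reason to import it.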
    def __init__(
        self,
        model_config: HFModelConfig,
        engine_config: McoreEngineConfig,
        optimizer_config: McoreOptimizerConfig,
        checkpoint_config: CheckpointConfig,
    ):
        super().__init__(model_config, engine_config, optimizer_config, checkpoint_config)
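Why an `__init__` like this one is redundant can be shown with a small sketch. The class names below are hypothetical stand-ins, not the verl classes; the sketch assumes only standard Python method resolution.

```python
class BaseEngine:
    def __init__(self, model_config, engine_config):
        self.model_config = model_config
        self.engine_config = engine_config


class WithPassThroughInit(BaseEngine):
    # Redundant: forwards exactly the same arguments to the parent.
    def __init__(self, model_config, engine_config):
        super().__init__(model_config, engine_config)


class WithoutInit(BaseEngine):
    # No __init__ defined; Python finds BaseEngine.__init__ through the MRO.
    pass


a = WithPassThroughInit("model", "engine")
b = WithoutInit("model", "engine")
```

Both instances end up with identical attributes, so deleting the pass-through method does not change behavior.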
Please format code before commit: https://github.com/verl-project/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting
What does this PR do?
After migrating from the megatron worker to the engine worker, there is no longer a Critic worker to use.
This PR fixes the error that occurs on NPU: AssertionError: Unknown device: npu for model_type: value_model and backend: megatron.
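The shape of this failure can be sketched with a toy registry keyed by model type, backend, and device. Everything below (`_REGISTRY`, `register`, `get_engine_cls`, and the `Toy...` classes) is a hypothetical illustration and not verl's actual registry code; only the error message format is taken from the report above.

```python
from typing import Callable, Dict, Tuple, Type

# Hypothetical registry: (model_type, backend, device) -> engine class.
_REGISTRY: Dict[Tuple[str, str, str], Type] = {}


def register(model_type: str, backend: str, device: str) -> Callable[[Type], Type]:
    """Class decorator that records an engine class under its lookup key."""
    def decorator(cls: Type) -> Type:
        _REGISTRY[(model_type, backend, device)] = cls
        return cls
    return decorator


def get_engine_cls(model_type: str, backend: str, device: str) -> Type:
    key = (model_type, backend, device)
    assert key in _REGISTRY, (
        f"Unknown device: {device} for model_type: {model_type} and backend: {backend}"
    )
    return _REGISTRY[key]


@register("language_model", "megatron", "npu")
class ToyNpuEngine:
    pass


# Only the language model is registered for npu so far, so this lookup fails.
try:
    get_engine_cls("value_model", "megatron", "npu")
except AssertionError as exc:
    message = str(exc)


# Registering a value-head variant for the same backend and device
# makes the lookup succeed.
@register("value_model", "megatron", "npu")
class ToyNpuEngineWithValueHead(ToyNpuEngine):
    pass
```

Under this assumption, the fix amounts to making sure a value-model engine exists for every backend and device combination that previously relied on the Critic worker.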
Checklist Before Starting
Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
{modules} include fsdp, megatron, veomni, sglang, vllm, vllm_omni, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data, cfg, reward, fully_async, one_step_off
If the PR involves multiple modules, separate them with commas, like [megatron, fsdp, doc]
{type} is in feat, fix, refactor, chore, test
If the PR breaks any API, add [BREAKING] to the beginning of the title. Example: [BREAKING][fsdp, megatron] feat: dynamic batching
Test
API and Usage Example
# Add code snippet or script demonstrating how to use this
Design & Code Changes
Checklist Before Submitting
Important
Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
Apply pre-commit checks: pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always
Once the PR is ready for CI, send a message in the ci-request channel in the verl Slack workspace. (If not accessible, please try the Feishu group (飞书群).)
If the PR is related to the recipe submodule, please also update the reference to the submodule commit via git submodule update --remote or cd recipe && git pull origin main.