[sglang, rollout] feat: support sglang as rollout engine in fully async policy#4191
Conversation
|
Hi, I tried your PR and attempted to replace FSDP with Megatron, but I encountered this error. Have you come across it before? |
Sorry, I have not adapted it for Megatron yet, so I have not encountered this issue. |
Thanks, I have fixed it. |
|
How did you fix this problem? I have hit the same error. @lizipao
|
I added this in recipe\fully_async_policy\fully_async_trainer.py: |
17d9bad to 815ebb2
|
@chenhaiq @zhaochenyang20 Could you please trigger the CI? |
ArronHZG left a comment
I think the README should also be updated to indicate that SGLang is now supported, along with supplementary experimental data.
3 wiki:
https://github.com/volcengine/verl/blob/main/docs/advance/fully_async.md
https://github.com/volcengine/verl/blob/main/recipe/fully_async_policy/README.md
The two above are exactly the same.
| await asyncio.gather(*self.active_tasks, return_exceptions=True) | ||
| self.active_tasks.clear() | ||
| print("[FullyAsyncRollouter][Public][Pause] All active tasks completed") | ||
| print("[FullyAsyncRollouter][Public][Pause] Ready to reset prefix cache") |
Should we unify the use of clear_kv_cache as the interface here? Modifications can be made by rebasing on the main branch.
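One way to read this suggestion: the rollouter drains its in-flight tasks and then calls a single `clear_kv_cache` method, leaving each engine to map that call onto its own mechanism. The sketch below is a minimal, self-contained illustration under that assumption, not verl's actual code. `RolloutEngine`, `FakeEngine`, and `pause_and_clear` are hypothetical names; only `clear_kv_cache` and the `asyncio.gather(..., return_exceptions=True)` call come from the thread.

```python
import asyncio
from typing import Protocol


class RolloutEngine(Protocol):
    """Hypothetical engine-agnostic interface (not verl's actual class)."""

    async def clear_kv_cache(self) -> None: ...


class FakeEngine:
    """Stand-in engine that only records how often its cache was cleared."""

    def __init__(self) -> None:
        self.cleared = 0

    async def clear_kv_cache(self) -> None:
        self.cleared += 1


async def pause_and_clear(active_tasks: set, engine: RolloutEngine) -> list:
    """Wait for in-flight tasks, then clear the cache through one interface."""
    # return_exceptions=True keeps one failed task from aborting the pause.
    results = await asyncio.gather(*active_tasks, return_exceptions=True)
    active_tasks.clear()
    await engine.clear_kv_cache()
    return results


async def _demo():
    async def ok() -> str:
        return "done"

    async def boom() -> str:
        raise RuntimeError("rollout failed")

    engine = FakeEngine()
    tasks = {asyncio.create_task(ok()), asyncio.create_task(boom())}
    results = await pause_and_clear(tasks, engine)
    return results, engine.cleared


results, cleared = asyncio.run(_demo())
```

With this shape, the pause path would not need to know whether the backend is vLLM or SGLang. Whether that matches the interface on the main branch would have to be checked after rebasing.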
| "mem_fraction_static": self.config.gpu_memory_utilization, | ||
| "disable_cuda_graph": self.config.enforce_eager, | ||
| "enable_memory_saver": True, | ||
| "enable_memory_saver": False, |
Will this affect the existing logic?
…er synchronization time.
|
We further reduced the parameter synchronization time in f6c7589. Experiments on 32 H20 GPUs, using data from steps 20 to 120, show that the average parameter synchronization time decreased from 10.36 seconds to 1.34 seconds, a reduction of approximately 87%. |
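The reported percentage can be checked from the two averages. This small sketch only restates the arithmetic; `relative_reduction` is a hypothetical helper name, and the two inputs are the figures quoted above.

```python
def relative_reduction(before: float, after: float) -> float:
    """Return the fractional decrease from `before` to `after`."""
    return (before - after) / before


# Average parameter synchronization times quoted in the comment (seconds).
reduction = relative_reduction(10.36, 1.34)
speedup = 10.36 / 1.34

print(f"reduction: {reduction:.1%}, speedup: {speedup:.1f}x")
```

The result rounds to the stated 87%, which corresponds to synchronization being roughly 7.7 times faster.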
e0ba55d to f424edb
| ray.get(dependency_ref) | ||
| print("[FullyAsyncRollouter][Public][Resume]") | ||
| async with self.lock: | ||
| if self.config.async_training.partial_rollout: |
Why was this line removed?
Yep, this line has been added back!
| if self.vanilla_bridge: | ||
| from verl.models.mcore.mbridge import AutoBridge | ||
|
|
||
| bridge = AutoBridge.from_config(self.model_config.hf_config, dtype=self.param_dtype) |
The new mbridge version works. Fine!
| async with self.lock: | ||
| while self.paused: | ||
| self.idle_start_time = time.time() |
But idle_start_time is only set once, when idle_start_time is None? Is this right?
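To make the question concrete, here is a minimal sketch of the set-once pattern the reviewer appears to be describing: the timestamp is recorded only on the first pass through the `while self.paused` loop, so later wake-ups do not reset the measured idle interval. `PausableWorker`, `notify`, and `total_idle` are hypothetical names, and the use of `asyncio.Condition` is an assumption about how the loop waits; none of this is taken from the PR itself.

```python
import asyncio
import time
from typing import Optional


class PausableWorker:
    """Hypothetical stand-in for the rollouter's pause bookkeeping."""

    def __init__(self) -> None:
        self.paused = True
        self.idle_start_time: Optional[float] = None
        self.total_idle = 0.0
        self.condition = asyncio.Condition()

    async def wait_until_resumed(self) -> None:
        async with self.condition:
            while self.paused:
                # Record the start only once; an unconditional assignment
                # here would restart the clock on every wake-up.
                if self.idle_start_time is None:
                    self.idle_start_time = time.time()
                await self.condition.wait()
            if self.idle_start_time is not None:
                self.total_idle += time.time() - self.idle_start_time
                self.idle_start_time = None

    async def notify(self, resume: bool) -> None:
        async with self.condition:
            if resume:
                self.paused = False
            self.condition.notify_all()


async def _demo():
    worker = PausableWorker()
    waiter = asyncio.create_task(worker.wait_until_resumed())
    await asyncio.sleep(0.05)
    first_start = worker.idle_start_time
    await worker.notify(resume=False)  # wake-up while still paused
    await asyncio.sleep(0.05)
    start_after_wakeup = worker.idle_start_time
    await worker.notify(resume=True)
    await waiter
    return first_start, start_after_wakeup, worker.total_idle


first_start, start_after_wakeup, total_idle = asyncio.run(_demo())
```

If the loop body instead assigns `time.time()` unconditionally, each wake-up that finds the worker still paused overwrites the start time, and the idle duration computed on resume would cover only the last wait.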
verl/workers/megatron_workers.py
Outdated
| rollout_device_mesh["infer_tp"].get_local_rank() == 0 | ||
| and rollout_device_mesh["infer_pp"].get_local_rank() == 0 | ||
| ) | ||
| if self.config.rollout.mode == "async" and self.config.rollout.name == "sglang": |
Will there be any code duplication here?
| @ray.remote(num_cpus=1) | ||
| class SGLangHttpServer: |
See the current vLLM changes for reference: remove @ray.remote(num_cpus=1) here and use
self.server_class = ray.remote(SGLangHttpServer)
…nc policy (verl-project#4191)

### What does this PR do?

Extend the fully async policy recipe by adding SGLang as an alternative rollout engine to vLLM when using FSDP

### Checklist Before Starting

- [x] Search for similar PRs. Paste at least one query link here: ...
- [x] Format the PR title as `[{modules}] {type}: {description}` (This will be checked by the CI)
  - `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`
  - If this PR involves multiple modules, separate them with `,` like `[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
  - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title.
  - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

### Test

> For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

### API and Usage Example

> Demonstrate how the API changes if any, and provide usage example(s) if possible.

```python
# Add code snippet or script demonstrating how to use this
```

### Design & Code Changes

> Demonstrate the high-level design if this PR is complex, and list the specific changes.

### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

- [ ] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [ ] Apply [pre-commit checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- [ ] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs).
- [ ] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ...
- [ ] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ). (If not accessible, please try [the Feishu group (飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).)

---------

Co-authored-by: jsfanfanfan <2981866535@qq.com>
Co-authored-by: jsfanfanfan <2981856535@qq.com>
Co-authored-by: jsfanfanfan <71052636+jsfanfanfan@users.noreply.github.com>