[misc, fully_async] feat: add Qwen3-VL-8B fully async GRPO training script on geo3k #6006
Conversation
Code Review
This pull request introduces a shell script for launching a fully asynchronous GRPO training job for the Qwen3-VL-8B model on the geo3k dataset using FSDP2. Feedback suggests improving the portability of the data paths by using environment variables, and adjusting the GPU allocation to fit standard 8-accelerator node configurations so the job does not hang waiting for placement.
```bash
train_path=$HOME/data/geo3k/train.parquet
test_path=$HOME/data/geo3k/test.parquet
```
Hardcoding data paths to $HOME makes the script non-portable and brittle across different environments (e.g., CI/CD or other developers' machines). It is better to use environment variables with these paths as defaults to allow for easier overrides.
```diff
-train_path=$HOME/data/geo3k/train.parquet
-test_path=$HOME/data/geo3k/test.parquet
+train_path=${train_path:-"$HOME/data/geo3k/train.parquet"}
+test_path=${test_path:-"$HOME/data/geo3k/test.parquet"}
```
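A minimal sketch of the `${var:-default}` pattern suggested above (file paths are illustrative): if the caller exports the variable, that value wins; otherwise the hardcoded default is used.

```shell
#!/bin/sh
# ${var:-default} expands to $var when it is set and non-empty,
# and to the default otherwise, so callers can override without
# editing the script.
train_path=${train_path:-"$HOME/data/geo3k/train.parquet"}
test_path=${test_path:-"$HOME/data/geo3k/test.parquet"}
echo "train: $train_path"
echo "test:  $test_path"
```

With this in place, `train_path=/mnt/shared/geo3k/train.parquet bash <script>.sh` picks up the override, while a plain `bash <script>.sh` keeps the original behavior.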
```bash
n_gpus_rollout=16
n_gpus_training=16
n_nodes_rollout=1
n_nodes_train=1
```
The current configuration specifies 16 GPUs per node (n_gpus_training=16 and n_nodes_train=1). Most standard NPU/GPU nodes (such as Huawei Atlas 800 or NVIDIA H100/A100 clusters) contain 8 accelerators per node. If this script is executed on such hardware, Ray will fail to find a node with 16 available accelerators, causing the job to hang indefinitely. To utilize 16 accelerators total for each role, it is recommended to configure 2 nodes with 8 accelerators each.
```diff
-n_gpus_rollout=16
-n_gpus_training=16
-n_nodes_rollout=1
-n_nodes_train=1
+n_gpus_rollout=8
+n_gpus_training=8
+n_nodes_rollout=2
+n_nodes_train=2
```
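To make the failure mode described above visible up front, a pre-launch sanity check along these lines could fail fast instead of letting Ray wait indefinitely for a node that does not exist. This is an illustrative sketch, not part of the PR script; `GPUS_PER_NODE` is an assumed variable, and the variable names mirror the script.

```shell
#!/bin/sh
# Fail fast if a role requests more accelerators per node than the
# hardware provides (Ray would otherwise hang waiting for placement).
GPUS_PER_NODE=${GPUS_PER_NODE:-8}   # assumption: standard 8-accelerator nodes

n_gpus_training=8
n_nodes_train=2

if [ "$n_gpus_training" -gt "$GPUS_PER_NODE" ]; then
    echo "error: n_gpus_training=$n_gpus_training exceeds $GPUS_PER_NODE per node" >&2
    exit 1
fi
echo "training role: $((n_nodes_train * n_gpus_training)) accelerators total"
```

With the suggested 2-node x 8-GPU layout the check passes and the total remains 16 accelerators per role, matching the original intent of the 16/16 configuration.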
What does this PR do?
Add a fully async GRPO training script for Qwen3-VL-8B on the geo3k dataset under `verl/experimental/fully_async_policy/shell/`.

Unlike the standard sync training script, this script separates training and rollout onto different GPU groups (fully async mode), improving GPU utilization by overlapping training and inference. Key differences from the sync script:
- Uses `verl.experimental.fully_async_policy.fully_async_main` instead of `main_ppo`
- Training and rollout GPUs are allocated independently via `n_gpus_training` / `n_gpus_rollout`
- Adds async-specific parameters: `staleness_threshold`, `trigger_parameter_sync_step`, `require_batches`, `partial_rollout`
- Applies `rollout_correction` (sequence-level TIS + geometric RS) for importance sampling correction under staleness

Checklist Before Starting
- [x] Search for similar PRs:
  - https://github.com/verl-project/verl/pulls?q=fully+async
  - https://github.com/verl-project/verl/pulls?q=Qwen3-VL+async
- [x] PR title: `[examples, fully_async] feat: add Qwen3-VL-8B fully async GRPO training script on geo3k`
  - Modules: `examples`, `fully_async`
  - Type: `feat`
  - No `[BREAKING]`: new script only, no existing API changes

Test
Environment
Tested on Ascend NPU. Refer to [Ascend Quickstart](https://github.com/volcengine/verl/blob/main/docs/ascend_tutorial/quick_start/ascend_quick_start.rst) for full installation instructions. Core versions:

| Software | Version |
|---|---|
| CANN | 8.5.0 |
| torch | 2.8.0 |
| torch_npu | 2.8.0 |
| vllm | 0.13.0 |
| vllm-ascend | 0.13.0 |
| transformers | 4.57.6 |
Results
Validated by a long-run experiment on geo3k with Qwen3-VL-8B. The critic rewards mean curve shows a stable upward trend from ~0.45 to ~0.60 over 70+ steps, with no reward hacking or training collapse observed.

API and Usage Example
No API changes. To run: `bash verl/experimental/fully_async_policy/shell/geo3k_qwen3vl_8b_fsdp2_16_16_npu.sh`
Design & Code Changes
Added `verl/experimental/fully_async_policy/shell/geo3k_qwen3vl_8b_fsdp2_16_16_npu.sh`. Parameters are organized into named config blocks (`DATA_CONFIG`, `ACTOR_CONFIG`, `REF_CONFIG`, `ROLLOUT_CONFIG`, `ALGORITHM_CONFIG`, `TRAINER_CONFIG`, `ASYNC_CONFIG`) following the existing script convention in the repo.

Checklist Before Submitting
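The named-config-block convention mentioned above can be sketched roughly as follows. This is illustrative only: the block names come from the PR, but the keys shown (including `async_training.staleness_threshold`) are hypothetical placeholders, not the script's actual contents.

```shell
#!/bin/sh
# Illustrative structure: each block collects related Hydra-style
# key=value overrides, and the launch command expands all blocks.
DATA_CONFIG="data.train_files=$HOME/data/geo3k/train.parquet data.val_files=$HOME/data/geo3k/test.parquet"
ASYNC_CONFIG="async_training.staleness_threshold=1"   # hypothetical key, for illustration

# The real script launches fully_async_main with every block expanded;
# echo here just shows the assembled command line.
echo python3 -m verl.experimental.fully_async_policy.fully_async_main \
    $DATA_CONFIG \
    $ASYNC_CONFIG
```

Grouping overrides into named blocks keeps a long launch command readable and lets each section (data, actor, rollout, async) be edited in isolation.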
Important
Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
- [x] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [ ] Apply [pre-commit checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- [ ] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs).
- [x] Add unit or end-to-end test(s) to the CI workflow. Not applicable: this is a shell script example; validation is covered by the experiment results above.
- [ ] Once your PR is ready for CI, send a message in the `ci-request` channel in the `verl` Slack workspace.
- [x] Not related to the `recipe` submodule.