
[misc, fully_async] feat: add Qwen3-VL-8B fully async GRPO training script on geo3k#6006

Merged
wuxibin89 merged 1 commit into verl-project:main from Silas-11:examples/qwen3-vl-8b-async
Apr 15, 2026

Conversation

Contributor

@Silas-11 Silas-11 commented Apr 14, 2026

What does this PR do?

Add a fully async GRPO training script for Qwen3-VL-8B on the geo3k dataset under verl/experimental/fully_async_policy/shell/.

Unlike the standard sync training script, this script separates training and rollout onto different GPU groups (fully async mode), improving GPU utilization by overlapping training and inference. Key differences from the sync script:

  • Uses verl.experimental.fully_async_policy.fully_async_main instead of main_ppo
  • Training and rollout GPUs are allocated independently via n_gpus_training / n_gpus_rollout
  • Adds async-specific parameters: staleness_threshold, trigger_parameter_sync_step, require_batches, partial_rollout
  • Applies rollout_correction (sequence-level TIS + geometric RS) for importance sampling correction under staleness
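The `rollout_correction` step above can be illustrated with a minimal, self-contained sketch. This is not verl's implementation; the function names, the truncation cap, and the acceptance band are illustrative assumptions. It assumes only per-token log-probabilities of the same sampled sequence under the (possibly stale) rollout policy and the current policy:

```python
import math

def sequence_tis_weight(logp_new, logp_old, cap=2.0):
    # Sequence-level truncated importance sampling (TIS): one importance
    # ratio per sequence, truncated at `cap` to bound the variance
    # introduced by stale (off-policy) rollouts.
    log_ratio = sum(n - o for n, o in zip(logp_new, logp_old))
    return min(math.exp(log_ratio), cap)

def geometric_rs_keep(logp_new, logp_old, low=0.5, high=2.0):
    # Geometric rejection sampling (RS): keep a sequence only if the
    # per-token geometric-mean ratio stays inside [low, high], i.e. the
    # rollout policy has not drifted too far from the current policy.
    mean_log_ratio = sum(n - o for n, o in zip(logp_new, logp_old)) / len(logp_new)
    return low <= math.exp(mean_log_ratio) <= high

# An on-policy sequence gets weight 1.0 and is kept; a badly stale one
# hits the truncation cap and is rejected by the geometric filter.
print(sequence_tis_weight([-1.0, -1.0], [-1.0, -1.0]))  # 1.0
print(geometric_rs_keep([-0.5, -0.5], [-2.0, -2.0]))    # False
```

The TIS weight would multiply that sequence's policy-gradient loss term, while rejected sequences are dropped from the batch; both mechanisms limit how much a stale rollout can distort the update.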

Checklist Before Starting

- [x] Search for similar PRs:
  - https://github.com/verl-project/verl/pulls?q=fully+async
  - https://github.com/verl-project/verl/pulls?q=Qwen3-VL+async
- [x] PR title: `[examples, fully_async] feat: add Qwen3-VL-8B fully async GRPO training script on geo3k`
  - Modules: `examples`, `fully_async`
  - Type: `feat`
  - No `[BREAKING]`: new script only, no existing API changes

Test

Environment

Tested on Ascend NPU. Refer to the [Ascend Quickstart](https://github.com/volcengine/verl/blob/main/docs/ascend_tutorial/quick_start/ascend_quick_start.rst) for full installation instructions. Core versions:

| Software      | Version |
|---------------|---------|
| CANN          | 8.5.0   |
| torch         | 2.8.0   |
| torch_npu     | 2.8.0   |
| vllm          | 0.13.0  |
| vllm-ascend   | 0.13.0  |
| transformers  | 4.57.6  |

Results

Validated by a long-run experiment on geo3k with Qwen3-VL-8B. The critic rewards mean curve shows a stable upward trend from ~0.45 to ~0.60 over 70+ steps, with no reward hacking or training collapse observed.

API and Usage Example

No API changes. To run:

```bash
bash verl/experimental/fully_async_policy/shell/geo3k_qwen3vl_8b_fsdp2_16_16_npu.sh
```

Design & Code Changes

Added verl/experimental/fully_async_policy/shell/geo3k_qwen3vl_8b_fsdp2_16_16_npu.sh.

Parameters are organized into named config blocks (DATA_CONFIG, ACTOR_CONFIG, REF_CONFIG, ROLLOUT_CONFIG, ALGORITHM_CONFIG, TRAINER_CONFIG, ASYNC_CONFIG) following the existing script convention in the repo.

Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

- [x] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [ ] Apply [pre-commit checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- [ ] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs).
- [x] Add unit or end-to-end test(s) to the CI workflow. Not applicable: this is a shell script example; validation is covered by the experiment results above.
- [ ] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ).
- [x] Not related to the `recipe` submodule.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a shell script for launching a fully asynchronous PPO training job for the Qwen3-VL-8B model on the geo3k dataset using FSDP2. Feedback suggests improving the portability of data paths by using environment variables and adjusting the GPU allocation to fit standard 8-accelerator node configurations to prevent potential execution hangs.

Comment on lines +7 to +8
```bash
train_path=$HOME/data/geo3k/train.parquet
test_path=$HOME/data/geo3k/test.parquet
```

Severity: high

Hardcoding data paths to $HOME makes the script non-portable and brittle across different environments (e.g., CI/CD or other developers' machines). It is better to use environment variables with these paths as defaults to allow for easier overrides.

Suggested change:

```diff
-train_path=$HOME/data/geo3k/train.parquet
-test_path=$HOME/data/geo3k/test.parquet
+train_path=${train_path:-"$HOME/data/geo3k/train.parquet"}
+test_path=${test_path:-"$HOME/data/geo3k/test.parquet"}
```
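For readers less familiar with shell parameter expansion, the `${var:-default}` idiom can be mirrored in Python. This is a hypothetical helper for illustration only (not part of verl): it uses the environment value when set and non-empty, and falls back to the default otherwise, which is exactly what `:-` does in POSIX shells.

```python
import os

def path_with_default(var_name, default):
    # Mirror of the shell `${var:-default}` idiom: an unset or empty
    # environment variable falls back to the default; any non-empty
    # value overrides it.
    value = os.environ.get(var_name, "")
    return value if value else default

# Unset -> default is used; set -> the environment wins.
os.environ.pop("DEMO_TRAIN_PATH", None)
print(path_with_default("DEMO_TRAIN_PATH", "/data/train.parquet"))  # /data/train.parquet
os.environ["DEMO_TRAIN_PATH"] = "/tmp/train.parquet"
print(path_with_default("DEMO_TRAIN_PATH", "/data/train.parquet"))  # /tmp/train.parquet
```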

Comment on lines +21 to +24
```bash
n_gpus_rollout=16
n_gpus_training=16
n_nodes_rollout=1
n_nodes_train=1
```
The reason will be displayed to describe this comment to others. Learn more.

Severity: high

The current configuration specifies 16 GPUs per node (n_gpus_training=16 and n_nodes_train=1). Most standard NPU/GPU nodes (such as Huawei Atlas 800 or NVIDIA H100/A100 clusters) contain 8 accelerators per node. If this script is executed on such hardware, Ray will fail to find a node with 16 available accelerators, causing the job to hang indefinitely. To utilize 16 accelerators total for each role, it is recommended to configure 2 nodes with 8 accelerators each.

Suggested change:

```diff
-n_gpus_rollout=16
-n_gpus_training=16
-n_nodes_rollout=1
-n_nodes_train=1
+n_gpus_rollout=8
+n_gpus_training=8
+n_nodes_rollout=2
+n_nodes_train=2
```
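The reviewer's arithmetic can be made concrete with a short sketch. The helper below is hypothetical (not verl or Ray code) and assumes, as the comment states, that the script's `n_gpus_*` values are per-node counts that Ray must match against physical node sizes:

```python
def check_allocation(n_gpus_per_node, n_nodes, accel_per_node=8):
    # Return (total_accelerators, fits) for one role.
    # `fits` is False when the per-node request exceeds the hardware,
    # which is the case the reviewer warns would leave Ray waiting
    # forever for a node that does not exist.
    total = n_gpus_per_node * n_nodes
    return total, n_gpus_per_node <= accel_per_node

# Original config: 16 accelerators on 1 node -> 16 total, does not fit an 8-card node.
print(check_allocation(16, 1))  # (16, False)
# Suggested config: 8 accelerators on each of 2 nodes -> same 16 total, fits.
print(check_allocation(8, 2))   # (16, True)
```

Both layouts reach the same 16 accelerators per role; only the second can actually be scheduled on standard 8-accelerator hosts.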

@Silas-11 changed the title from "[examples, fully_async] feat: add Qwen3-VL-8B fully async GRPO training script on geo3k" to "[misc, fully_async] feat: add Qwen3-VL-8B fully async GRPO training script on geo3k" on Apr 14, 2026
@wuxibin89 wuxibin89 merged commit 014fd56 into verl-project:main Apr 15, 2026
5 of 8 checks passed
huaiyizhao pushed a commit to huaiyizhao/verl that referenced this pull request Apr 15, 2026
…cript on geo3k (verl-project#6006)

