[misc, fully_async] feat: add Qwen3-VL-8B fully async GRPO training script on geo3k #6006
Conversation
Code Review
This pull request introduces a shell script for launching a fully asynchronous GRPO training job for the Qwen3-VL-8B model on the geo3k dataset using FSDP2. Feedback suggests improving the portability of the data paths by using environment variables, and adjusting the GPU allocation to fit standard 8-accelerator node configurations so the job does not hang waiting for placement.
```bash
train_path=$HOME/data/geo3k/train.parquet
test_path=$HOME/data/geo3k/test.parquet
```
Hardcoding data paths to $HOME makes the script non-portable and brittle across different environments (e.g., CI/CD or other developers' machines). It is better to use environment variables with these paths as defaults to allow for easier overrides.
```diff
-train_path=$HOME/data/geo3k/train.parquet
-test_path=$HOME/data/geo3k/test.parquet
+train_path=${train_path:-"$HOME/data/geo3k/train.parquet"}
+test_path=${test_path:-"$HOME/data/geo3k/test.parquet"}
```
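A minimal sketch of the `${var:-default}` pattern suggested above (file paths are illustrative): if the caller exports the variable, that value wins; otherwise the hardcoded default is used.

```shell
#!/bin/sh
# ${var:-default} expands to $var when it is set and non-empty,
# and to the default otherwise, so callers can override without
# editing the script.
train_path=${train_path:-"$HOME/data/geo3k/train.parquet"}
test_path=${test_path:-"$HOME/data/geo3k/test.parquet"}
echo "train: $train_path"
echo "test:  $test_path"
```

With this in place, `train_path=/mnt/shared/geo3k/train.parquet bash <script>.sh` picks up the override, while a plain `bash <script>.sh` keeps the original behavior.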
```bash
n_gpus_rollout=16
n_gpus_training=16
n_nodes_rollout=1
n_nodes_train=1
```
The current configuration specifies 16 GPUs per node (n_gpus_training=16 and n_nodes_train=1). Most standard NPU/GPU nodes (such as Huawei Atlas 800 or NVIDIA H100/A100 clusters) contain 8 accelerators per node. If this script is executed on such hardware, Ray will fail to find a node with 16 available accelerators, causing the job to hang indefinitely. To utilize 16 accelerators total for each role, it is recommended to configure 2 nodes with 8 accelerators each.
```diff
-n_gpus_rollout=16
-n_gpus_training=16
-n_nodes_rollout=1
-n_nodes_train=1
+n_gpus_rollout=8
+n_gpus_training=8
+n_nodes_rollout=2
+n_nodes_train=2
```
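To make the failure mode described above visible up front, a pre-launch sanity check along these lines could fail fast instead of letting Ray wait indefinitely for a node that does not exist. This is an illustrative sketch, not part of the PR script; `GPUS_PER_NODE` is an assumed variable, and the variable names mirror the script.

```shell
#!/bin/sh
# Fail fast if a role requests more accelerators per node than the
# hardware provides (Ray would otherwise hang waiting for placement).
GPUS_PER_NODE=${GPUS_PER_NODE:-8}   # assumption: standard 8-accelerator nodes

n_gpus_training=8
n_nodes_train=2

if [ "$n_gpus_training" -gt "$GPUS_PER_NODE" ]; then
    echo "error: n_gpus_training=$n_gpus_training exceeds $GPUS_PER_NODE per node" >&2
    exit 1
fi
echo "training role: $((n_nodes_train * n_gpus_training)) accelerators total"
```

With the suggested 2-node x 8-GPU layout the check passes and the total remains 16 accelerators per role, matching the original intent of the 16/16 configuration.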
What does this PR do?
Add a fully async GRPO training script for Qwen3-VL-8B on the geo3k dataset under `verl/experimental/fully_async_policy/shell/`.

Unlike the standard sync training script, this script separates training and rollout onto different GPU groups (fully async mode), improving GPU utilization by overlapping training and inference. Key differences from the sync script:
- Uses `verl.experimental.fully_async_policy.fully_async_main` instead of `main_ppo`
- Training and rollout GPUs are allocated independently via `n_gpus_training` / `n_gpus_rollout`
- Adds async-specific parameters: `staleness_threshold`, `trigger_parameter_sync_step`, `require_batches`, `partial_rollout`
- Applies `rollout_correction` (sequence-level TIS + geometric RS) for importance sampling correction under staleness

Checklist Before Starting
- [x] Search for similar PRs:
  - https://github.com/verl-project/verl/pulls?q=fully+async
  - https://github.com/verl-project/verl/pulls?q=Qwen3-VL+async
- [x] PR title: `[examples, fully_async] feat: add Qwen3-VL-8B fully async GRPO training script on geo3k`
  - Modules: `examples`, `fully_async`
  - Type: `feat`
  - No `[BREAKING]`: new script only, no existing API changes

Test
Environment
Tested on Ascend NPU. Refer to [Ascend Quickstart](https://github.com/volcengine/verl/blob/main/docs/ascend_tutorial/quick_start/ascend_quick_start.rst) for full installation instructions. Core versions:

| Software | Version |
|---|---|
| CANN | 8.5.0 |
| torch | 2.8.0 |
| torch_npu | 2.8.0 |
| vllm | 0.13.0 |
| vllm-ascend | 0.13.0 |
| transformers | 4.57.6 |
Results
Validated by a long-run experiment on geo3k with Qwen3-VL-8B. The critic rewards mean curve shows a stable upward trend from ~0.45 to ~0.60 over 70+ steps, with no reward hacking or training collapse observed.

API and Usage Example
No API changes. To run: `bash verl/experimental/fully_async_policy/shell/geo3k_qwen3vl_8b_fsdp2_16_16_npu.sh`
Design & Code Changes
Added `verl/experimental/fully_async_policy/shell/geo3k_qwen3vl_8b_fsdp2_16_16_npu.sh`. Parameters are organized into named config blocks (`DATA_CONFIG`, `ACTOR_CONFIG`, `REF_CONFIG`, `ROLLOUT_CONFIG`, `ALGORITHM_CONFIG`, `TRAINER_CONFIG`, `ASYNC_CONFIG`) following the existing script convention in the repo.

Checklist Before Submitting
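The named-config-block convention mentioned above can be sketched roughly as follows. This is illustrative only: the block names come from the PR, but the keys shown (including `async_training.staleness_threshold`) are hypothetical placeholders, not the script's actual contents.

```shell
#!/bin/sh
# Illustrative structure: each block collects related Hydra-style
# key=value overrides, and the launch command expands all blocks.
DATA_CONFIG="data.train_files=$HOME/data/geo3k/train.parquet data.val_files=$HOME/data/geo3k/test.parquet"
ASYNC_CONFIG="async_training.staleness_threshold=1"   # hypothetical key, for illustration

# The real script launches fully_async_main with every block expanded;
# echo here just shows the assembled command line.
echo python3 -m verl.experimental.fully_async_policy.fully_async_main \
    $DATA_CONFIG \
    $ASYNC_CONFIG
```

Grouping overrides into named blocks keeps a long launch command readable and lets each section (data, actor, rollout, async) be edited in isolation.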
Important
Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
- [x] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [ ] Apply [pre-commit checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- [ ] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs).
- [x] Add unit or end-to-end test(s) to the CI workflow. Not applicable: this is a shell script example; validation is covered by the experiment results above.
- [ ] Once your PR is ready for CI, send a message in the `ci-request` channel in the `verl` Slack workspace.
- [x] Not related to the `recipe` submodule.