Commit 365df24
[fully_async, rollout, trainer, tool, cfg] fix: ROCm async training compatibility for AMD MI300X (#6062)
## What does this PR do?
Fix multiple issues that prevent fully async FSDP2 training from working
on AMD ROCm platforms (MI300X series).
**Environment:**
- AMD Instinct MI3xx (8× GPU, 192 GB HBM each), ROCm 7.2, PyTorch
2.10+rocm7.2, vLLM v0.18.1rc1
- Cross-validated on NVIDIA H20 with CUDA (no regression observed)
**Training curves (MI3xx vs H20) and training script will be attached in
PR comments.**
[dapo_7b_fully_async.sh](https://github.com/user-attachments/files/26700697/dapo_7b_fully_async.sh)
<img width="2234" height="1181" alt="qwen2 5_7b_fully_async_1"
src="https://github.com/user-attachments/assets/81bd6651-9f1e-4450-b8b9-68149264536f"
/>
### Checklist Before Starting
- [x] Search for similar PRs:
https://github.com/verl-project/verl/pulls?q=is%3Apr+rocm+async
- [x] Format: `[{modules}] {type}: {description}`
### Test
Validated with fully async FSDP2 DAPO/GRPO RL + ReTool training on AMD
MI3xx:
- 140+ training steps completed without errors, deadlocks, or OOMs
- Reward improved from -0.8 to 0 over 12+ hours of training
- Cross-validated on NVIDIA H20: all changes are platform-safe
(AMD-specific env vars are ignored on NVIDIA/Ascend; ZMQ handle changes
use platform-independent rank logic)
### API and Usage Example
No API changes. All fixes are internal implementation details.
### Design & Code Changes
1. **Add `HSA_NO_SCRATCH_RECLAIM` env var** (`constants_ppo.py`)
- Required by AMD RCCL on MI300X; without it, FSDP initialization fails
with `ncclSystemError`
- Added alongside existing platform-specific vars (HCCL for Ascend);
ignored on non-AMD platforms
2. **Fix `numpy.bool_` JSON serialization** (`ray_trainer.py`)
- Add `default=str` fallback for `json.dumps` since numpy 2.x `bool_` is
no longer a Python `bool` subclass
3. **ZMQ IPC handle: use `(replica_rank, local_rank)` instead of GPU
UUID** (`vllm_rollout.py`, `utils.py`, `vllm_async_server.py`)
- On ROCm, `CheckpointEngineWorker` and vLLM worker see different GPU
UUIDs due to different `CUDA_VISIBLE_DEVICES`/`HIP_VISIBLE_DEVICES`
settings
- Sender uses `rollout_rank % local_world_size` to derive node-local
rank, matching vLLM worker's `local_rank` on every node (fixes
multi-node mismatch)
- `replica_rank` prefix avoids socket collisions when multiple replicas
share a node
- `VERL_REPLICA_RANK` env var is set in `vLLMHttpServer.__init__` and
inherited by vLLM worker subprocesses
- Both `vLLMColocateWorkerExtension` and
`vLLMOmniColocateWorkerExtension` are updated
4. **Clean up stale ZMQ IPC socket files**
(`bucketed_weight_transfer.py`)
- Remove leftover `.sock` files before `bind()` and after `_cleanup()`
to prevent `Address already in use` on restart
5. **Fix Hydra searchpath** (`fully_async_ppo_trainer.yaml`)
- Use `pkg://verl.trainer.config` instead of
`file://verl/trainer/config` for editable installs
6. **Sandbox Ray actor reuse** (`sandbox_fusion_tools.py`)
- Add `name` and `get_if_exists=True` to prevent duplicate
`ExecutionWorker` actor creation
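
The searchpath change in fix 5 amounts to one line in the Hydra config; a minimal sketch of the relevant fragment:

```yaml
# fully_async_ppo_trainer.yaml (fragment, illustrative)
hydra:
  searchpath:
    # pkg:// resolves through the installed Python package, so it also
    # works for editable (pip install -e) installs where a
    # file://verl/trainer/config path may not exist relative to the
    # working directory.
    - pkg://verl.trainer.config
```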
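
Fix 1 can be sketched as follows; the dict and helper names are illustrative (only `HSA_NO_SCRATCH_RECLAIM` itself comes from this PR):

```python
import os

# Illustrative sketch of fix 1: register HSA_NO_SCRATCH_RECLAIM alongside
# the other platform-specific env vars in constants_ppo.py.
PLATFORM_ENV_VARS = {
    # AMD ROCm (MI300X): without this, FSDP initialization can fail with
    # ncclSystemError. NVIDIA/Ascend runtimes simply ignore the variable.
    "HSA_NO_SCRATCH_RECLAIM": "1",
}


def apply_platform_env() -> None:
    """Export platform-specific vars without clobbering user overrides."""
    for key, value in PLATFORM_ENV_VARS.items():
        os.environ.setdefault(key, value)


apply_platform_env()
```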
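
The serialization failure behind fix 2 and the `default=str` fallback look like this (a minimal sketch, not verl's exact call site):

```python
import json

import numpy as np

# numpy scalar types such as np.bool_ are not JSON-serializable out of
# the box, so a plain json.dumps(metrics) raises TypeError on them.
metrics = {"converged": np.bool_(True), "step": np.int64(140)}

# default=str is invoked for every non-serializable value and stringifies
# it, which is enough for logging/metrics output.
safe = json.dumps(metrics, default=str)
```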
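
The rank arithmetic in fix 3 can be sketched as below; the function name and path layout are hypothetical, only the `(replica_rank, local_rank)` derivation follows the PR:

```python
import os


def ipc_handle(rollout_rank: int, local_world_size: int) -> str:
    """Key the ZMQ IPC socket by (replica_rank, local_rank), not GPU UUID.

    On ROCm the CheckpointEngineWorker and the vLLM worker can observe
    different GPU UUIDs (different CUDA_VISIBLE_DEVICES /
    HIP_VISIBLE_DEVICES), so UUIDs are not a stable shared key; rank
    arithmetic is platform-independent.
    """
    # Set in vLLMHttpServer.__init__ and inherited by vLLM worker
    # subprocesses; defaults to "0" for single-replica setups.
    replica_rank = int(os.environ.get("VERL_REPLICA_RANK", "0"))
    # Node-local rank: matches the vLLM worker's local_rank on every node.
    local_rank = rollout_rank % local_world_size
    return f"ipc:///tmp/verl_weights_{replica_rank}_{local_rank}.sock"
```

The `replica_rank` prefix keeps socket paths distinct when multiple replicas share a node.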
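
Fix 4 relies on the fact that ZMQ's `ipc://` transport backs each address with a filesystem socket file that survives a crashed process. A minimal sketch (helper name is illustrative):

```python
import os


def remove_stale_socket(ipc_addr: str) -> None:
    """Delete a leftover IPC socket file so a restarted process does not
    fail bind() with 'Address already in use'.

    Intended to run before socket.bind(ipc_addr) and again in _cleanup().
    """
    sock_file = ipc_addr.removeprefix("ipc://")
    if os.path.exists(sock_file):
        os.remove(sock_file)
```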
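
Fix 6 uses Ray's named-actor get-or-create pattern. A sketch, assuming an `ExecutionWorker` actor class as in `sandbox_fusion_tools.py` (the actor name string is hypothetical, and the snippet needs a running Ray cluster to execute):

```python
import ray


@ray.remote
class ExecutionWorker:
    def run(self, code: str) -> str:
        ...


# With name= and get_if_exists=True, Ray returns a handle to the existing
# named actor if one is alive, instead of creating a duplicate.
worker = ExecutionWorker.options(
    name="sandbox_execution_worker",  # hypothetical actor name
    get_if_exists=True,
).remote()
```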
### Platform Compatibility
All changes are safe for non-AMD platforms:
- `HSA_NO_SCRATCH_RECLAIM`: AMD-specific env var, silently ignored on
NVIDIA/Ascend
- ZMQ handle changes: use platform-independent rank arithmetic;
`VERL_REPLICA_RANK` defaults to `"0"` for single-replica setups
- Other changes (json default=str, IPC cleanup, Hydra pkg://, actor
reuse) are pure logic improvements
### Checklist Before Submitting
> [!IMPORTANT]
> Please check all the following items before requesting a review,
> otherwise the reviewer might deprioritize this PR for review.
- [x] Read the [Contribute
Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [x] Apply [pre-commit
checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting):
`pre-commit install && pre-commit run --all-files --show-diff-on-failure
--color=always`
- [x] Add / Update [the
documentation](https://github.com/volcengine/verl/tree/main/docs). The
official documents will be compiled after the merger.
- [ ] Add unit or end-to-end test(s) to [the CI
workflow](https://github.com/volcengine/verl/tree/main/.github/workflows)
to cover all the code. If not feasible, explain why: These fixes target
ROCm-specific runtime behavior (HIP memory management, RCCL env vars,
GPU UUID mismatch) that cannot be reproduced in CI without AMD GPU
hardware.
- [ ] Once your PR is ready for CI, send a message in [the `ci-request`
channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the
`verl` Slack
workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ).
(If not accessible, please try [the Feishu group
(飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).)
- [x] If your PR is related to the `recipe` submodule, please also
update the reference to the submodule commit via `git submodule update
--remote` or `cd recipe && git pull origin main`.
---------
Co-authored-by: root <root@mi308-ccs-aus-e04-40.prov.aus.ccs.cpe.ice.amd.com>
**Files changed** (8 files, +39/−12):
- verl/
  - experimental/fully_async_policy/config
  - tools
  - trainer/ppo
  - workers/rollout/vllm_rollout