[fully_async] feat: reuse trainer worker group for hybrid rollout to do validation (#6076)
### Overview
This PR restores the `use_trainer_do_validate` capability, which was broken in fully-async training mode, by dynamically adding and removing rollout replicas at runtime.
It also provides the infrastructure components needed for future elastic-scheduling and resilience work.

### What Changed
#### 1. Merged Handle Registry into GlobalRequestLoadBalancer
The original architecture kept a local `servers: dict[str, ActorHandle]` cache inside each `LLMServerClient`.
This made elastic scaling impossible without broadcasting updates to every client/worker.
**Before (2 RPCs per acquire):**
```python
server_id = LB.acquire(request_id)  # RPC 1
handle = client.servers[server_id]  # local lookup (stale if an elastic add/remove happened)
```
**After (1 atomic RPC per acquire):**
```python
(server_id, handle) = LB.acquire(request_id)  # single RPC, always consistent
```
The `GlobalRequestLoadBalancer` now owns both the routing pool (`_inflight_requests`) and the handle mapping (`_servers`).
Elastic `add_replica()` / `remove_replica()` each require only **one Ray RPC**; no client/worker notification is needed.
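To make the atomic-acquire idea concrete, here is a minimal single-process sketch of the merged design. The names `_servers`, `_inflight_requests`, `acquire`, `add_replica`, and `remove_replica` follow the PR text; the least-loaded routing policy and the `release` method are illustrative assumptions, not the actual verl implementation (which runs as a Ray actor).

```python
class GlobalRequestLoadBalancer:
    """Sketch: routing pool and handle registry merged into one object."""

    def __init__(self):
        self._servers = {}            # server_id -> actor handle
        self._inflight_requests = {}  # server_id -> set of in-flight request ids

    def add_replica(self, server_id, handle):
        # One call updates both maps together; no client/worker broadcast needed.
        self._servers[server_id] = handle
        self._inflight_requests[server_id] = set()

    def remove_replica(self, server_id):
        self._servers.pop(server_id, None)
        self._inflight_requests.pop(server_id, None)

    def acquire(self, request_id):
        # Pick the least-loaded replica and return (id, handle) in one step,
        # so callers never consult a stale local handle cache.
        server_id = min(self._inflight_requests,
                        key=lambda s: len(self._inflight_requests[s]))
        self._inflight_requests[server_id].add(request_id)
        return server_id, self._servers[server_id]

    def release(self, server_id, request_id):
        self._inflight_requests[server_id].discard(request_id)
```

Because both maps live behind the same actor boundary, a replica added or removed mid-flight is immediately visible to the very next `acquire` call.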
#### 2. FullyAsyncLLMServerManager: Two-Phase Initialization + Elastic
Lifecycle
New subclass of `LLMServerManager` that supports:
- **Phase 1 — Elastic hybrid replicas** (rank 0..N_e-1): Backed by
trainer GPUs via injected worker group;
initialized then immediately slept to free GPU memory for training
- **Phase 2 — Fixed standalone replicas** (rank N_e..N_e+N_f-1): On
dedicated rollout GPUs
- Runtime `add_replica(resource_id)` / `remove_replica(resource_id)`
with atomic LB operations
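The two-phase initialization above can be sketched as follows. This is a simplified stand-in: the `Replica` class and constructor signature are hypothetical, and only the rank layout and sleep behavior described in the PR are modeled.

```python
class Replica:
    """Placeholder for a rollout server replica."""

    def __init__(self, rank, elastic):
        self.rank = rank
        self.elastic = elastic
        self.asleep = False

    def sleep(self):
        self.asleep = True  # free GPU memory for training

    def wake_up(self):
        self.asleep = False


class FullyAsyncLLMServerManager:
    """Sketch of two-phase initialization; not the real verl class body."""

    def __init__(self, n_elastic, n_fixed):
        self.replicas = []
        # Phase 1: elastic hybrid replicas on trainer GPUs (ranks 0..N_e-1),
        # initialized then immediately slept so training can use the memory.
        for rank in range(n_elastic):
            replica = Replica(rank, elastic=True)
            replica.sleep()
            self.replicas.append(replica)
        # Phase 2: fixed standalone replicas on dedicated rollout GPUs
        # (ranks N_e..N_e+N_f-1), kept awake for generation.
        for rank in range(n_elastic, n_elastic + n_fixed):
            self.replicas.append(Replica(rank, elastic=False))
```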
#### 3. Trainer-Side Validation (`use_trainer_do_validate=True`)
This mode was previously broken (asserted out) in fully-async training. It is now fully functional via a three-phase validation cycle:
| Phase | Action |
|------------------------|--------|
| **1. TRAIN → ROLLOUT** | Sync weights → abort all replicas → activate elastic replicas in LB → resume generation |
| **2. Validate** | Execute validation via RPC to rollouter |
| **3. ROLLOUT → TRAIN** | Abort all replicas → deactivate elastic replicas → sleep elastic GPUs → resume fixed replicas |
Uses a dedicated `hybrid_checkpoint_manager` (naive backend) for the
elastic replica pool,
separate from the existing `checkpoint_manager` for fixed rollout
replicas.
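The ordering in the three phases matters (weights must be synced before the elastic replicas serve validation, and the elastic pool must be back asleep before training resumes). A minimal sketch that records the sequence, with every step name a placeholder for the corresponding manager/LB RPC rather than the real verl API:

```python
def trainer_validation_cycle():
    """Record the ordered steps of the three-phase validation cycle (sketch)."""
    steps = []
    # Phase 1: TRAIN -> ROLLOUT
    steps += ["sync_weights", "abort_all_replicas",
              "activate_elastic_in_lb", "resume_generation"]
    # Phase 2: run validation on the hybrid replicas via the rollouter
    steps += ["run_validation"]
    # Phase 3: ROLLOUT -> TRAIN
    steps += ["abort_all_replicas", "deactivate_elastic_in_lb",
              "sleep_elastic_gpus", "resume_fixed_replicas"]
    return steps
```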
#### 4. KV-Cache-Only Weight Sync Optimization
vLLM's `sleep(level=1)` mode allows restoring **weights first, then KV
cache separately**.
During parameter synchronization we now call `release_kv_cache` → NCCL
sync → `resume_kv_cache`
instead of the heavier full `sleep` → sync → `wake_up` cycle, reducing
memory pressure.
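The lighter sync path can be sketched as below. `release_kv_cache` and `resume_kv_cache` mirror the calls named above; `nccl_sync_weights` is a placeholder for the actual NCCL broadcast, and the server object is a stand-in.

```python
def sync_weights_kv_cache_only(server, nccl_sync_weights):
    """Sketch: sync weights while keeping them resident on the GPU.

    Old path: server.sleep() -> sync -> server.wake_up(), which drops
    both weights and KV cache and then restores both. Here only the KV
    cache is released, so memory pressure during the sync is lower.
    """
    server.release_kv_cache()
    nccl_sync_weights()
    server.resume_kv_cache()
```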
#### 5. Abort State Tracking
Both vLLM and SGLang servers now track an `_is_aborted` flag:
- `abort_all_requests()` sets it → subsequent `generate()` calls return
immediately with `stop_reason="aborted"`
- `resume_generation()` clears it
- Prevents post-abort processing errors (e.g., `IndexError` on empty
outputs)
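The flag-based abort handling amounts to a small state machine. A sketch, where `_is_aborted` and the method names follow the PR description but the server internals and the return payload shape are illustrative stand-ins:

```python
class Server:
    """Sketch of abort-state tracking for a rollout server."""

    def __init__(self):
        self._is_aborted = False

    def abort_all_requests(self):
        self._is_aborted = True

    def resume_generation(self):
        self._is_aborted = False

    def generate(self, prompt):
        if self._is_aborted:
            # Return immediately instead of reading empty engine outputs,
            # preventing post-abort failures such as IndexError.
            return {"text": "", "stop_reason": "aborted"}
        # Stand-in for real inference.
        return {"text": prompt.upper(), "stop_reason": "stop"}
```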
### Design Goals
1. **Single Responsibility**: Manager owns lifecycle; LB handles routing
+ handle mapping (merged); Client only sends
requests
2. **Elastic Convergence**: replica add/remove operates on a single LB
Ray Actor (internal handle registry); no
client/worker notification
3. **Elastic Resources**: `FullyAsyncLLMServerManager` implements
elastic resource registration with two-phase init
4. **Trainer-side Validation**: `use_trainer_do_validate=True` supported
via elastic hybrid replicas on trainer GPUs
### Checklist Before Starting
- [x] Search for similar PRs. Paste at least one query link here: ...
- [x] Format the PR title as `[{modules}] {type}: {description}` (This
will be checked by the CI)
- `{modules}` include `fsdp`, `megatron`, `veomni`, `sglang`, `vllm`,
`vllm_omni`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`,
`hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`,
`perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`, `cfg`,
`reward`, `fully_async`, `one_step_off`
- If this PR involves multiple modules, separate them with `,` like
`[megatron, fsdp, doc]`
- `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
- If this PR breaks any API (CLI arguments, config, function signature,
etc.), add `[BREAKING]` to the beginning of the title.
- Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`
### Test
> For changes that can not be tested by CI (e.g., algorithm
implementation, new model support), validate by experiment(s) and show
results like training curve plots, evaluation results, etc.
### API and Usage Example
> Demonstrate how the API changes if any, and provide usage example(s)
if possible.
```python
# Add code snippet or script demonstrating how to use this
```
### Design & Code Changes
> Demonstrate the high-level design if this PR is complex, and list the
specific changes.
### Checklist Before Submitting
> [!IMPORTANT]
> Please check all the following items before requesting a review,
otherwise the reviewer might deprioritize this PR for review.
- [x] Read the [Contribute
Guide](https://github.com/verl-project/verl/blob/main/CONTRIBUTING.md).
- [x] Apply [pre-commit
checks](https://github.com/verl-project/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting):
`pre-commit install && pre-commit run --all-files --show-diff-on-failure
--color=always`
- [x] Add / Update [the
documentation](https://github.com/verl-project/verl/tree/main/docs).
- [ ] Add unit or end-to-end test(s) to [the CI
workflow](https://github.com/verl-project/verl/tree/main/.github/workflows)
to cover all the code. If not feasible, explain why: ...
- [x] Once your PR is ready for CI, send a message in [the `ci-request`
channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the
`verl` Slack
workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ).
(If not accessible, please try [the Feishu group
(飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).)
- [ ] If your PR is related to the `recipe` submodule, please also
update the reference to the submodule commit via `git submodule update
--remote` or `cd recipe && git pull origin main`.