Description
Checklist
- This feature will maintain backward compatibility with the current APIs in `areal/api/`. If not, please raise a refactor issue first.
Background
Currently, AReaL supports both SPMD and single-controller modes. SPMD mode supports a local launcher for single-node training, as well as Ray and Slurm launchers for distributed training. However, the single-controller mode supports only the local scheduler. This limitation makes it difficult to run large-scale distributed training on GPU/NPU clusters through the more user-friendly single-controller mode.
Potential Solution
Following the design doc, we propose implementing a multi-node scheduler based on Ray.
Overview
Similarly to how the LocalScheduler interacts with (HTTP) RPC servers, we plan to implement:
- A `RayScheduler` as the RPC client.
- A `RayRPCServer` Ray actor class as the RPC server.
For communication, we will use Ray's native actor-based RPC mechanism.
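As a point of reference, Ray actor method calls already behave like RPCs, so no separate HTTP layer is needed. The toy snippet below (illustrative only; the class and method names are not from AReaL) shows the client/server pattern the scheduler and actors would follow:

```python
import ray

ray.init()

@ray.remote
class EchoServer:
    """A stand-in for the future RayRPCServer actor."""

    def health(self) -> str:
        return "ok"

# The actor handle effectively acts as the RPC client's stub.
server = EchoServer.remote()
assert ray.get(server.health.remote()) == "ok"  # synchronous RPC round-trip
```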
RayScheduler
The RayScheduler object will be created by the single-controller user script. It must implement the Scheduler API defined here.
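The concrete method list lives in the linked Scheduler API and is not reproduced here; the fragment below is only a rough sketch of the client role, and every name in it beyond `RayScheduler` is hypothetical:

```python
import ray

class RayScheduler:
    """Sketch of the RPC client; would implement the Scheduler API referenced above."""

    def __init__(self, config):
        # Attach to the Ray cluster started by the launcher.
        ray.init(address="auto", ignore_reinit_error=True)
        self._config = config
        self._servers = {}  # worker name -> RayRPCServer actor handle (hypothetical layout)
```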
In the current SPMD Ray launcher, there are two placement groups (PGs): one for rollout and one for training.
To better support potential elastic scaling, we propose creating one PG for each "instance":
- Rollout: one PG per SGLang/vLLM server, which may manage multiple accelerators internally.
- Training: currently fixed to one PG; we may revisit this when introducing a potential `RemoteTrainEngine`.
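A minimal sketch of this per-instance PG layout follows, assuming hypothetical sizes (`num_rollout_servers`, `gpus_per_server`) that would in practice come from the experiment config:

```python
import ray
from ray.util.placement_group import placement_group

ray.init(address="auto", ignore_reinit_error=True)

num_rollout_servers = 4   # hypothetical; read from the experiment config in practice
gpus_per_server = 2       # accelerators managed by one SGLang/vLLM server

# One PG per rollout server instance; "PACK" keeps a server's bundles close together.
# On NPU clusters the resource key would be a custom resource rather than "GPU".
rollout_pgs = [
    placement_group(bundles=[{"GPU": 1, "CPU": 1}] * gpus_per_server, strategy="PACK")
    for _ in range(num_rollout_servers)
]

# A single PG for training, as in the current SPMD Ray launcher.
train_pg = placement_group(bundles=[{"GPU": 1, "CPU": 1}] * 8, strategy="PACK")

# Block until the cluster has reserved all requested resources.
ray.get([pg.ready() for pg in rollout_pgs + [train_pg]])
```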
We will reuse the utility function get_placement_group_master_ip_and_port() to determine master IPs and ports.
Using these PGs, we will create RayRPCServer actors as described below.
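Creating the actors on top of those PGs could use Ray's placement-group scheduling strategy; the sketch below assumes the `rollout_pgs` list from the previous snippet and a `RayRPCServer` class decorated with `@ray.remote`:

```python
import ray
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy

servers = [
    RayRPCServer.options(
        num_gpus=1,
        scheduling_strategy=PlacementGroupSchedulingStrategy(
            placement_group=pg,
            placement_group_bundle_index=0,  # pin the actor to the PG's first bundle
        ),
    ).remote()
    for pg in rollout_pgs
]

# Wait until every server actor is alive and answering RPCs.
ray.get([s.health.remote() for s in servers])
```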
RayRPCServer
Each worker will host a RayRPCServer actor that implements the following methods, matching those of the HTTP RPC server:
- `health`
- `configure`
- `set_env`
- `create_engine`
- `call`
In particular, we would like to avoid (un)pickling Platform objects and properly delay the initialization of current_platform in the Ray worker processes.
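A skeleton of the actor, with placeholder signatures (only the method names come from the list above) and the platform lookup deferred until it runs inside the worker process, might look like this:

```python
import os
import ray

@ray.remote
class RayRPCServer:
    """One actor per worker, mirroring the HTTP RPC server's endpoints."""

    def __init__(self):
        self._config = None
        self._engine = None

    def health(self) -> str:
        return "ok"

    def configure(self, config: dict) -> None:
        # Receive plain data (dicts / dataclass fields) so no Platform object
        # ever needs to be pickled on the controller side.
        self._config = config

    def set_env(self, env: dict) -> None:
        os.environ.update(env)

    def create_engine(self, engine_args: dict) -> None:
        # Resolve the platform lazily, inside the Ray worker process.
        # (Import path is hypothetical; the point is that `current_platform`
        # is only touched after the worker has started.)
        from areal.platforms import current_platform  # hypothetical path

        ...  # construct the SGLang/vLLM or training engine from engine_args

    def call(self, method: str, *args, **kwargs):
        # Generic dispatch to the engine created above.
        return getattr(self._engine, method)(*args, **kwargs)
```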
We plan to refine the design and implementation details over the coming weeks.
Additional Information
N/A