Description
Checklist
- This feature will maintain backward compatibility with the current APIs in `areal/api/`. If not, please raise a refactor issue first.
Background
Currently, AReaL supports both SPMD and single-controller modes. SPMD mode supports a local launcher for single-node training, as well as Ray and Slurm launchers for distributed training. However, the single-controller mode supports only the local scheduler. This limitation makes it difficult to run large-scale distributed training on GPU/NPU clusters through the more user-friendly single-controller mode.
Potential Solution
Following the design doc, we propose implementing a multi-node scheduler based on Ray.
Overview
Similarly to how the LocalScheduler interacts with (HTTP) RPC servers, we plan to implement:
- A `RayScheduler` as the RPC client.
- A `RayRPCServer` Ray actor class as the RPC server.
For communication, we will use Ray's native actor-based RPC mechanism.
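As a point of reference, Ray actor method calls already behave like RPCs, so no separate HTTP layer is needed. The toy snippet below (illustrative only; the class and method names are not from AReaL) shows the client/server pattern the scheduler and actors would follow:

```python
import ray

ray.init()

@ray.remote
class EchoServer:
    """A stand-in for the future RayRPCServer actor."""

    def health(self) -> str:
        return "ok"

# The actor handle effectively acts as the RPC client's stub.
server = EchoServer.remote()
assert ray.get(server.health.remote()) == "ok"  # synchronous RPC round-trip
```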
RayScheduler
The RayScheduler object will be created by the single-controller user script. It must implement the Scheduler API defined here.
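The concrete method list lives in the linked Scheduler API and is not reproduced here; the fragment below is only a rough sketch of the client role, and every name in it beyond `RayScheduler` is hypothetical:

```python
import ray

class RayScheduler:
    """Sketch of the RPC client; would implement the Scheduler API referenced above."""

    def __init__(self, config):
        # Attach to the Ray cluster started by the launcher.
        ray.init(address="auto", ignore_reinit_error=True)
        self._config = config
        self._servers = {}  # worker name -> RayRPCServer actor handle (hypothetical layout)
```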
In the current SPMD Ray launcher, there are two placement groups (PGs): one for rollout and one for training.
To better support potential elastic scaling, we propose creating one PG for each "instance":
- Rollout: one PG per SGLang/vLLM server, which may manage multiple accelerators internally.
- Training: currently fixed to one PG; we may revisit this when introducing a potential `RemoteTrainEngine`.
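A minimal sketch of this per-instance PG layout follows, assuming hypothetical sizes (`num_rollout_servers`, `gpus_per_server`) that would in practice come from the experiment config:

```python
import ray
from ray.util.placement_group import placement_group

ray.init(address="auto", ignore_reinit_error=True)

num_rollout_servers = 4   # hypothetical; read from the experiment config in practice
gpus_per_server = 2       # accelerators managed by one SGLang/vLLM server

# One PG per rollout server instance; "PACK" keeps a server's bundles close together.
# On NPU clusters the resource key would be a custom resource rather than "GPU".
rollout_pgs = [
    placement_group(bundles=[{"GPU": 1, "CPU": 1}] * gpus_per_server, strategy="PACK")
    for _ in range(num_rollout_servers)
]

# A single PG for training, as in the current SPMD Ray launcher.
train_pg = placement_group(bundles=[{"GPU": 1, "CPU": 1}] * 8, strategy="PACK")

# Block until the cluster has reserved all requested resources.
ray.get([pg.ready() for pg in rollout_pgs + [train_pg]])
```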
We will reuse the utility function get_placement_group_master_ip_and_port() to determine master IPs and ports.
Using these PGs, we will create RayRPCServer actors as described below.
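Creating the actors on top of those PGs could use Ray's placement-group scheduling strategy; the sketch below assumes the `rollout_pgs` list from the previous snippet and a `RayRPCServer` class decorated with `@ray.remote`:

```python
import ray
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy

servers = [
    RayRPCServer.options(
        num_gpus=1,
        scheduling_strategy=PlacementGroupSchedulingStrategy(
            placement_group=pg,
            placement_group_bundle_index=0,  # pin the actor to the PG's first bundle
        ),
    ).remote()
    for pg in rollout_pgs
]

# Wait until every server actor is alive and answering RPCs.
ray.get([s.health.remote() for s in servers])
```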
RayRPCServer
Each worker will host a RayRPCServer actor that implements the following methods, matching those of the HTTP RPC server:
- `health`
- `configure`
- `set_env`
- `create_engine`
- `call`
In particular, we would like to avoid (un)pickling Platform objects and properly delay the initialization of current_platform in the Ray worker processes.
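A skeleton of the actor, with placeholder signatures (only the method names come from the list above) and the platform lookup deferred until it runs inside the worker process, might look like this:

```python
import os
import ray

@ray.remote
class RayRPCServer:
    """One actor per worker, mirroring the HTTP RPC server's endpoints."""

    def __init__(self):
        self._config = None
        self._engine = None

    def health(self) -> str:
        return "ok"

    def configure(self, config: dict) -> None:
        # Receive plain data (dicts / dataclass fields) so no Platform object
        # ever needs to be pickled on the controller side.
        self._config = config

    def set_env(self, env: dict) -> None:
        os.environ.update(env)

    def create_engine(self, engine_args: dict) -> None:
        # Resolve the platform lazily, inside the Ray worker process.
        # (Import path is hypothetical; the point is that `current_platform`
        # is only touched after the worker has started.)
        from areal.platforms import current_platform  # hypothetical path

        ...  # construct the SGLang/vLLM or training engine from engine_args

    def call(self, method: str, *args, **kwargs):
        # Generic dispatch to the engine created above.
        return getattr(self._engine, method)(*args, **kwargs)
```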
We plan to refine the design and implementation details over the coming weeks.
Additional Information
N/A