
[Feature] RayScheduler support in single-controller mode #661

@HwVanICI

Description


Checklist

  • This feature will maintain backward compatibility with the current APIs in
    areal/api/. If not, please raise a refactor issue first.

Background

Currently, AReaL supports both SPMD and single-controller mode. SPMD mode supports a local launcher for single-node training, or Ray and Slurm launchers for distributed training. However, the single-controller mode supports only the local scheduler. This limitation makes it challenging to run large-scale distributed training on GPU/NPU clusters using the more user-friendly single-controller mode.

Potential Solution

Following the design doc, we propose implementing a multi-node scheduler based on Ray.

Overview

Similar to how the LocalScheduler interacts with (HTTP) RPC servers, we plan to implement:

  • A RayScheduler as the RPC client.
  • A RayRPCServer Ray actor class as the RPC server.

For communication, we will use Ray's native actor-based RPC mechanism.
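As a rough illustration of this pattern, the sketch below shows how an actor class can serve as the RPC server while its handle acts as the client stub. `EchoServer` and its methods are hypothetical stand-ins for illustration, not the actual AReaL classes:

```python
# A minimal sketch of Ray's native actor-based RPC pattern that the
# RayScheduler/RayRPCServer pair could build on. "EchoServer" is a
# hypothetical stand-in, not an actual AReaL class.
class EchoServer:
    """Plain class so the RPC logic stays testable without a cluster."""

    def health(self) -> str:
        return "ok"

    def call(self, method: str, *args):
        # Dispatch by method name, mirroring the generic `call`
        # endpoint of the HTTP RPC server.
        return getattr(self, method)(*args)


if __name__ == "__main__":
    import ray

    ray.init(ignore_reinit_error=True)
    # Decorating the class turns it into the RPC "server"; the actor
    # handle plays the role of the client stub held by the scheduler.
    server = ray.remote(EchoServer).remote()
    assert ray.get(server.health.remote()) == "ok"
    assert ray.get(server.call.remote("health")) == "ok"
    ray.shutdown()
```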

RayScheduler

The RayScheduler object will be created by the single-controller user script. It must implement the Scheduler API defined here.

In the current SPMD Ray launcher, there are two placement groups (PGs): one for rollout and one for training.
To better support potential elastic scaling, we propose creating one PG for each "instance":

  • Rollout: one PG per SGLang/vLLM server, which may manage multiple accelerators internally.
  • Training: currently fixed to one PG; we may revisit this when introducing a potential RemoteTrainEngine.

We will reuse the utility function get_placement_group_master_ip_and_port() to determine master IPs and ports.

Using these PGs, we will create RayRPCServer actors as described below.
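One possible shape for the per-instance PG layout, assuming Ray's standard placement-group API (bundle sizes and helper names here are illustrative, not part of the design):

```python
# Sketch of the "one placement group per instance" layout. Bundle
# sizes and helper names are illustrative assumptions.
from typing import Dict, List


def rollout_bundles(gpus_per_server: int,
                    cpus_per_server: int = 4) -> List[Dict[str, float]]:
    """One bundle per rollout instance: a single SGLang/vLLM server
    that manages `gpus_per_server` accelerators internally."""
    return [{"GPU": float(gpus_per_server), "CPU": float(cpus_per_server)}]


def make_instance_pgs(n_servers: int, gpus_per_server: int):
    """Create one PG per rollout server (requires a running Ray
    cluster; shown for shape only)."""
    import ray
    from ray.util.placement_group import placement_group

    pgs = [
        placement_group(rollout_bundles(gpus_per_server), strategy="PACK")
        for _ in range(n_servers)
    ]
    # Block until each instance's resources are actually reserved.
    ray.get([pg.ready() for pg in pgs])
    return pgs
```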

RayRPCServer

Each worker will host a RayRPCServer actor that implements the following methods, matching those of the HTTP RPC server:

  • health
  • configure
  • set_env
  • create_engine
  • call

In particular, we would like to avoid (un)pickling Platform objects and properly delay the initialization of current_platform in the Ray worker processes.
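A hedged sketch of what such an actor class might look like, with the platform passed as a plain string and resolved lazily inside the worker process so that no Platform object crosses the pickle boundary (all names and the config schema below are assumptions, not the final API):

```python
# Hypothetical RayRPCServer sketch matching the HTTP RPC server's
# surface. Only picklable primitives (strings, dicts) cross the actor
# boundary; any heavy platform initialization is deferred to first use
# inside the worker process.
import os
from typing import Any, Dict, Optional


class RayRPCServer:
    def __init__(self) -> None:
        self._platform_name: Optional[str] = None
        self._engine: Any = None

    def health(self) -> str:
        return "ok"

    def configure(self, config: Dict[str, Any]) -> None:
        # Store the platform as a plain string; resolving it to a
        # Platform object happens lazily in this worker, never on the
        # controller side.
        self._platform_name = config.get("platform", "cuda")

    def set_env(self, env: Dict[str, str]) -> None:
        os.environ.update(env)

    def create_engine(self, engine_cls_path: str, **kwargs: Any) -> None:
        # Import the engine class by dotted path inside the worker, so
        # the controller never pickles engine (or Platform) objects.
        module_name, _, cls_name = engine_cls_path.rpartition(".")
        module = __import__(module_name, fromlist=[cls_name])
        self._engine = getattr(module, cls_name)(**kwargs)

    def call(self, method: str, *args: Any, **kwargs: Any) -> Any:
        return getattr(self._engine, method)(*args, **kwargs)
```

Wrapping this plain class with `ray.remote(...)` at scheduling time (rather than decorating it) would keep the RPC logic testable without a cluster.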

We plan to refine the design and implementation details over the following weeks.

Additional Information

N/A
