Skip to content

[RFC] Elastic EP Scale #22788

@zackyoray

Description

@zackyoray

Motivation

Today, large MoE models (like DeepSeek V3) run with a fixed number of GPUs decided at launch time. The process groups, expert sharding, and dispatch paths are all set up once and never change.

That's fine until it isn't:

  • Traffic spikes and you need more capacity, but the server is already running.
  • A cluster job gets preempted or hardware goes down and you lose GPUs.
  • Off-peak hours and you're paying for GPUs you don't need.

Right now the only option is to restart the whole server with a different topology. That means killing in-flight requests, waiting for model reload, and hoping nothing else breaks.

This RFC proposes adding and removing EP ranks at runtime, without restarting.

Scenario What happens
Load spike Add GPU ranks on the fly
Hardware reclaim Remove ranks gracefully
Fault recovery Replace a crashed rank (already in PR #15771)
Cost savings Shrink to fewer ranks during low traffic

Background

Expert Parallelism in SGLang

EP splits the MoE expert layers across ep_size GPU ranks. Each rank holds a subset of experts. During a forward pass, tokens get dispatched to the right expert rank (via NIXL or Mooncake all-to-all), computed locally, then combined back.

active_ranks

SGLang already has a tensor that tracks which EP ranks are alive:

active_ranks: torch.Tensor  # shape [max_ep_size], dtype int32
                             # 1 = rank is participating, 0 = inactive

This tensor is read by MoE dispatchers, EPLB, process groups, the scheduler, and ElasticEPStateManager. The key idea of this RFC is that scaling is just flipping bits in this tensor, the same thing fault tolerance already does.

Dynamic group membership (Mooncake PG backend)

The Mooncake process group backend provides APIs for changing rank membership on the fly:

API What it does
extend_group_size_to(N) Grows the process group to N slots. Doesn't block.
get_peer_state(ranks) Checks if all active peers can see a rank as connected. Non-blocking poll.
recover_ranks(ranks) Marks ranks as active and publishes sync state so joining ranks can finish their setup.
join_group() Called by the joining rank. Publishes its metadata, waits for connections, reads sync state.

The isExtension flag lets a new rank create its process group without blocking during init; it defers the actual connection to join_group() later.

PR #15771: Rank Recovery

PR #15771 added the ability to recover a crashed rank: the admin relaunches it, it creates process groups in extension mode, loads the model, and calls join_group(). Meanwhile, healthy ranks keep serving and poll each forward pass with get_peer_state(). When the recovered rank is ready, they call recover_ranks() and it rejoins.

The key insight here: scaling can piggyback on this exact same flow. The only difference is that recovery fills an existing slot (the rank was there before), while scale-up adds slots beyond the original group. That means we need one extra call (extend_group_size_to()) to make the group aware of the new slots before the recovery-style join can kick in.

MoE buffer connections (NIXL)

NIXL manages connections between ranks via connect_ranks() and disconnect_ranks() work per-pair, so there's no need to rebuild everything. Buffers are pre-allocated to max_ep_size at init, so scaling only changes which connections are active.

One important detail: connect_ranks() is a two-sided handshake. This means both sides need to be calling connect_ranks() at roughly the same time.


Design Considerations

Reusing the recovery interface

PR #15771 already solves the hard problem of dynamically joining a rank into a live serving cluster: extension mode init, join_group, non-blocking polling, recover_ranks. That machinery works for any rank that wasn't present at startup, not just crashed ones. Rather than building a separate scaling system, we add one call (extend_group_size_to) to grow the group and reuse everything else. This keeps the codebase simpler and means both paths benefit from the same bug fixes.

API style

We considered three ways to express a scale operation:

Style Example Tradeoff
Absolute target (chosen) { "new_tp_size": 8 } Idempotent, simple validation, maps directly to extend_group_size_to(N)
Rank-explicit { "rank_ids": [4,5,6,7] } More flexible (supports non-contiguous recovery), but requires deeper changes across ElasticEPStateManager, buffer backends, and DP shard validation
Delta { "count": 4 } Human-friendly, but not idempotent

We go with absolute target for the initial design. Rank-explicit can come later once the core path is proven.

Rank discovery

Existing ranks discover new ranks the same way they discover recovered ranks in #15771: a non-blocking get_peer_state() poll that runs at the end of each forward pass. The scale API doesn't introduce a new discovery mechanism; it just widens the poll range by bumping effective_ep_size so the new slots become visible.

MoE buffer connection ordering

NIXL connections update lazily: each dispatch checks _scale_to != _connected_ep_size and calls connect_ranks if needed. The critical constraint is that on_scale() must be called after activate_ranks(), not in the HTTP handler. If triggered too early, connect_ranks blocks in the dispatch path waiting for the new rank's metadata, but the new rank can't publish that metadata until its join_group() completes, which needs activate_ranks(). Calling on_scale earlier creates a circular wait. Details in the MoE buffer ordering section below.

Distinguishing scale from recovery

The joining rank declares its intent with --ep-join-mode scale or --ep-join-mode recover. On the existing rank side, we need a way to determine whether joining ranks are recovering (stale metadata → skip EPLB) or scaling (fresh → let EPLB fire). We wrap this in is_recovery_join(rank_ids). The exact implementation is TBD (could use rank index, PG join metadata, or the flag itself), but the contract is clear: "were these ranks previously active?"


Design Proposal

System Architecture

%%{init: {
  'theme': 'base',
  'themeVariables': {
    'primaryColor': '#76b900',
    'primaryTextColor': '#fff',
    'primaryBorderColor': '#000',
    'lineColor': '#000',
    'secondaryColor': '#0071c5',
    'tertiaryColor': '#5e5e5e',
    'background': '#fff',
    'mainBkg': '#76b900',
    'secondBkg': '#0071c5',
    'tertiaryBkg': '#5e5e5e'
  }
}}%%
flowchart TB
    subgraph external ["External"]
        Orch["Orchestrator"]
    end

    subgraph sglang_server ["SGLang Server"]
        HTTP["POST /scale_elastic_ep"]
        ZMQ["ZMQ Pipeline"]
    end

    subgraph state ["State Manager"]
        SM["ElasticEPStateManager"]
    end

    subgraph poll ["Forward Loop"]
        MJR["maybe_join_ep_ranks"]
    end

    subgraph backends ["Backends"]
        MC["Process Group"]
        NIXL["NixlEPBuffer"]
    end

    subgraph auto ["Automatic"]
        EPLB["EPLB"]
    end

    subgraph ranks ["GPU Ranks"]
        R0["Existing Ranks"]
        RN["New Ranks"]
    end

    Orch -->|"launch ranks"| RN
    Orch -->|"POST /scale"| HTTP
    HTTP --> ZMQ --> SM
    MJR -->|"reads effective_ep_size"| SM
    MJR -->|"activate_ranks"| MC
    MJR -->|"on_scale"| NIXL
    MJR -->|"rebalance"| EPLB
    MC --> R0
    MC --> RN
    NIXL --> R0
    NIXL --> RN
    RN -->|"join_group"| MC

    classDef nvidiaGreen fill:#76b900,stroke:#000,stroke-width:2px,color:#fff
    classDef nvidiaBlue fill:#0071c5,stroke:#000,stroke-width:2px,color:#fff
    classDef nvidiaBlack fill:#000,stroke:#fff,stroke-width:2px,color:#fff
    classDef nvidiaGray fill:#5e5e5e,stroke:#000,stroke-width:2px,color:#fff
    classDef nvidiaLightGray fill:#cdcdcd,stroke:#000,stroke-width:2px,color:#000

    class Orch nvidiaLightGray
    class HTTP,ZMQ nvidiaGreen
    class SM nvidiaBlack
    class MJR nvidiaBlue
    class MC,NIXL nvidiaBlue
    class EPLB nvidiaGray
    class R0,RN nvidiaGray
Loading

Pre-allocation with max_ep_size

At server launch, we pre-allocate active_ranks and MoE backend resources to max_ep_size (via --max-ep-size):

Launch with --tp 4 --max-ep-size 16:
  active_ranks = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
                  ← active →  ← reserved for future scale-up →

The zeros beyond ep_size are empty slots with no processes running and no connections established. But the tensor and buffers are already sized for up to 16 ranks. Which zeros the system actively monitors (vs ignores) is controlled by effective_ep_size, defined in detail below.

MoE backends like NIXL also need to know the maximum number of ranks upfront to allocate their internal resources (RDMA buffers, metadata structures, per-rank connection slots). These can't be resized on the fly without tearing down and rebuilding the buffer. So --max-ep-size does two things: it sizes the active_ranks tensor for SGLang, and it tells the MoE backend how much capacity to prepare for. Scaling after that only changes which connections are active, not the buffer layout.

Core idea: Scale = Extend + Recover

Scale-up reuses the recovery join flow, with one extra step at the front:

Step Recovery (PR #15771) Scale-up (this RFC)
1. Prepare group n/a (slot already exists) extend_group_size_to(N)
2. Set effective size n/a (unchanged) effective_ep_size = N
3. Launch new rank --ep-join-mode recover --ep-join-mode scale
4–10. Join flow join_group, poll, recover_ranks, metadata sync Same
11. Post-join Skip EPLB, reset active_ranks Fire EPLB, expand active_ranks

Steps 3 through 10 are identical between recovery and scale. The only differences are the group-extend at the start and how we handle EPLB afterwards.

New ranks declare their intent when they launch. This replaces PR #15771's boolean --elastic-ep-rejoin flag with a multi-value --ep-join-mode, where --ep-join-mode recover is equivalent to the old --elastic-ep-rejoin:

sglang serve ... --ep-join-mode scale     # brand new rank (this RFC)
sglang serve ... --ep-join-mode recover   # same as --elastic-ep-rejoin (PR #15771)
Mode EPLB behavior active_ranks
scale Fire: redistribute experts to include new rank Set new slot to 1
recover Skip: sync weights directly (avoids stale metadata issue) Reset to original world

activate_ranks: unified abstraction

Both recovery and scale-up end up doing the same thing per process group: "accept these ranks." We wrap that in activate_ranks(pg, rank_ids):

def activate_ranks(pg, rank_ids):
    # Backend-specific acceptance. Publishes sync state so
    # joining ranks can complete their join_group()
    recover_ranks(pg, rank_ids)

The poll loop calls activate_ranks per PG for both cases. Everything that differs (EPLB handling, MoE buffer connections) happens explicitly in the poll loop after the PG loop, not hidden inside activate_ranks.

ElasticEPStateManager: proposed API

ElasticEPStateManager is the singleton that owns active_ranks and coordinates scaling state:

Method What it does
set_effective_ep_size(N) Sets how many slots the poll loop should scan.
get_effective_ep_size() Returns the current effective size (may include pending/unjoined slots).
get_num_active_ranks() Count of 1s in active_ranks, i.e. the actually-serving ranks.
original_ep_size The ep_size from server launch. Used to tell recovery apart from scale (is the joining rank within the original world or beyond it?).
snapshot_active_to_last() Copies active_rankslast_active_ranks so model_runner can detect changes next forward pass.
on_scale(new_ep_size) Sets _scale_to for the NIXL dispatch path. Called after activate_ranks for scale-up, or directly in the HTTP handler for scale-down.

activate_ranks(pg, rank_ids) is a separate per-PG function, not on the state manager.

Why we need the HTTP API at all: The poll loop only scans active_ranks[0..effective_ep_size). In recovery, the crashed rank's slot is already in that range, so the zero shows up automatically. For scale-up, the new slots are beyond effective_ep_size and the poll loop can't see them. The HTTP call (POST /scale_elastic_ep) extends the group and bumps effective_ep_size, which makes the new slots visible to the poll loop. After that, everything is the same as recovery.


Scale-Up Flow

Dependency chain

Each step below depends on the one before it:

POST /scale_elastic_ep
  → extend_group_size_to(N)     grow group so new slots exist
    → new rank's join_group()   can now complete (was blocked before extend)
      → activate_ranks()        accept ranks, sync metadata
        → on_scale()            tell NIXL about the new size
          → new rank unblocked  enters its forward loop
            → MoE buffer connect  both sides do the TCPStore handshake
              → EPLB rebalance    experts get redistributed

The root of the chain is extend_group_size_to: nothing can happen before the scale API call triggers it.

On existing ranks

  # ── NEW: HTTP handler (via ZMQ pipeline), returns right away ──
+ def handle_scale_elastic_ep(new_tp_size):
+     for pg in iter_live_process_groups():
+         extend_group_size_to(pg, new_tp_size)
+     ElasticEPStateManager.set_effective_ep_size(new_tp_size)
+     # Don't trigger MoE buffer connections yet. See ordering section below.

  # Every forward pass, non-blocking poll
  # (was maybe_recover_ep_ranks in PR #15771, renamed to handle both paths)
  def maybe_join_ep_ranks():
-     ranks_to_join = [i for i in range(len(active_ranks)) if not active_ranks[i]]
+     ranks_to_join = [i for i in range(effective_ep_size) if not active_ranks[i]]
      if not ranks_to_join:
          return

      # NOTE: get_peer_state is an allreduce, all active ranks must call it.
      # See open questions re: rank agreement before entering this collective.
      if not all(get_peer_state(ranks_to_join)):  # existing from PR #15771
          return

      # Accept into each process group
      for pg in iter_live_process_groups():       # existing from PR #15771
          activate_ranks(pg, ranks_to_join)        # wraps recover_ranks

      broadcast_expert_location_metadata()         # existing from PR #15771

-     eplb_manager.reset_generator()               # PR #15771: always reset
+     # NEW: only reset EPLB for recovery (stale metadata).
+     # Scale-up lets EPLB fire normally (no stale data).
+     if is_recovery_join(ranks_to_join):
+         eplb_manager.reset_generator()

+     # NEW: tell NIXL about the new size (after activate_ranks, not in
+     # HTTP handler to avoid deadlock, see ordering section).
+     on_scale(ElasticEPStateManager.get_effective_ep_size())

      ElasticEPStateManager.snapshot_active_to_last()  # existing from PR #15771

On new ranks

sglang serve --model-path <model> --tp 4 --dp 4 \
    --elastic-ep-backend mooncake --moe-a2a-backend nixl \
    --ep-join-mode scale --node-rank 1
# 1. Create process group in extension mode (doesn't wait for peers)
init_process_group(isExtension=True)

# 2. Load model from disk (completely independent, no PG needed)
load_model()

# 3. Capture CUDA graphs
capture_cuda_graphs()

# 4. Join all process groups
#    This blocks until existing ranks call extend + activate_ranks
join_process_groups()

# 5. Get expert placement info from existing ranks
broadcast_expert_location_metadata()

# 6. Start serving (NIXL buffer connects on first MoE dispatch)

MoE buffer connection ordering

Since connect_ranks() is a two-sided handshake, both sides need to be in the call at the same time, so we have to be careful about when we trigger it:

1. extend_group_size_to(N), effective_ep_size = N    HTTP handler
2. get_peer_state() poll                              forward loop, non-blocking
3. join_group() completes                             new rank finishes PG-level connect
4. activate_ranks()                                   unblocks the new rank
5. on_scale(N) → _scale_to = N                        tells NIXL to update connections
6. New rank: first dispatch → publishes its NIXL metadata
7. Existing rank: next dispatch → connect_ranks() → both sides handshake

Why on_scale() has to come after step 4, not earlier: If we called it in the HTTP handler, existing ranks would try connect_ranks on their next MoE dispatch. But connect_ranks blocks waiting for the new rank's TCPStore key, and the new rank can't publish that key until join_group() finishes, which needs activate_ranks() from the existing ranks. That's a circular wait = deadlock.

Between steps 6 and 7, existing ranks briefly block in MoE dispatch while waiting for the new rank's NIXL key. In practice this is a few seconds since the new rank is already in its forward loop by then.

NIXL dispatch path: lazy connection update

On every MoE dispatch, get_nixl_buffer() checks if the connections need updating:

def get_nixl_buffer():
    if _scale_to != _connected_ep_size:
        _update_connections(_scale_to)
        # scale-up:   connect_ranks(new_rank_ids)
        # scale-down: disconnect_ranks(removed_ids)
        _connected_ep_size = _scale_to
    return _buffer

When nothing is scaling, this is a single integer compare, zero cost. When scaling is happening, connect_ranks runs once on the first dispatch after on_scale().

New ranks allocate their NixlEPBuffer (pre-sized to max_ep_size) on first get_nixl_buffer() call and connect to all active ranks.

effective_ep_size: what's managed vs what's reserved

effective_ep_size is the upper bound of the poll range: the number of rank slots the system actively manages (including both active ranks and pending/faulted slots that the poll loop monitors). It is NOT the number of currently serving ranks (that's get_num_active_ranks()), and it's NOT the hard maximum (that's max_ep_size).

  • At launch: effective_ep_size = ep_size (e.g., 4)
  • After POST /scale_elastic_ep {new_tp_size: 8}: effective_ep_size = 8
  • After scale-down to 6: effective_ep_size = 6

Slots beyond effective_ep_size are reserved and invisible to the poll loop:

At launch (ep_size=4, effective_ep_size=4, max_ep_size=16):
  [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
   ← active →  ← reserved (not polled) →
   poll range [0..4), all active, nothing to poll for

After scale API (effective_ep_size=8):
  [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
   ← active →  ← poll zone →  ← reserved →
   poll range [0..8), zeros at [4..7] trigger get_peer_state

After new ranks join:
  [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
   ← all active →             ← reserved →
   poll range [0..8), all active, nothing to poll for

The poll loop only looks at [0..effective_ep_size). This way we don't waste time polling for ranks that nobody has asked to add yet.

A zero in active_ranks means different things depending on where it sits:

State active_ranks[i] Where Meaning
Active 1 anywhere Rank is serving, dispatchers route to it
Failed 0 i < effective_ep_size Was active, crashed. Poll loop monitors for recovery
Pending 0 i < effective_ep_size Never active. Poll loop monitors for scale-up join
Reserved 0 i >= effective_ep_size Slot exists in the tensor but not managed yet

The tensor itself doesn't distinguish failed from pending; both are zeros within the poll range. The poll loop treats them the same (poll, activate when ready). The distinction only matters for post-join EPLB behavior (see is_recovery_join).


Scale-Up Sequence Diagram

%%{init: {
  'theme': 'base',
  'themeVariables': {
    'primaryColor': '#76b900',
    'primaryTextColor': '#fff',
    'primaryBorderColor': '#000',
    'lineColor': '#000',
    'secondaryColor': '#0071c5',
    'tertiaryColor': '#5e5e5e',
    'background': '#fff',
    'mainBkg': '#76b900',
    'secondBkg': '#0071c5',
    'tertiaryBkg': '#5e5e5e'
  }
}}%%
sequenceDiagram
    participant Orch as Orchestrator
    participant HTTP as HTTP Server
    participant SM as StateManager
    participant MR as model_runner
    participant EPLB as EPLB
    participant R4 as New Rank 4

    Note over Orch,R4: Phase 1 - Launch new ranks
    Orch->>R4: sglang serve ... --ep-join-mode scale
    R4->>R4: init (extension mode) + load model

    Note over Orch,R4: Phase 2 - Scale request
    Orch->>HTTP: POST /scale_elastic_ep {new_tp_size: 8}
    HTTP->>SM: extend_group_size_to(8), effective_ep_size = 8
    HTTP-->>Orch: scaling initiated

    Note over Orch,R4: Phase 3 - Poll (healthy ranks keep serving at ep=4)
    MR->>MR: forward() → maybe_join_ep_ranks()
    MR->>SM: get_peer_state([4..7]) → not ready

    R4->>SM: join_group() - connects at PG level
    MR->>SM: get_peer_state([4..7]) → all ready

    Note over Orch,R4: Phase 4 - Accept + sync
    MR->>SM: activate_ranks([4..7]) per PG
    MR->>R4: broadcast expert metadata
    SM->>SM: active_ranks[4:8] = 1

    Note over Orch,R4: Phase 5 - MoE buffer connect
    MR->>MR: on_scale(8)
    R4->>R4: first dispatch → publishes buffer metadata
    MR->>MR: next dispatch → connect_ranks([4..7]) → handshake

    Note over Orch,R4: Phase 6 - EPLB
    MR->>EPLB: rebalance()
    EPLB->>EPLB: redistribute experts to 8 ranks
    Note over MR,R4: Inference continues at ep_size=8
Loading

Scale-Down Flow

Scale-down is simpler than scale-up since there's no join flow needed. The main concern is graceful shutdown: if we just kill ranks, the FT path would detect them as faulted and try to recover them every forward pass. So we need to explicitly shrink effective_ep_size to stop the poll loop from scanning those slots.

def handle_scale_down(new_tp_size):
    # 1. Stop routing to departing ranks. Dispatchers and EPLB read this.
    active_ranks[new_tp_size:effective_ep_size] = 0

    # 2. EPLB fires on next forward pass, consolidates experts onto
    #    remaining ranks before those ranks disconnect.

    # 3. Shrink poll range. Without this, FT would treat the killed
    #    ranks as faults and try to recover them every forward pass.
    ElasticEPStateManager.set_effective_ep_size(new_tp_size)

    # 4. Wait for in-flight work on departing ranks to finish.
    #    Requests already owned by those ranks complete naturally.

    # 5. Disconnect NIXL (one-sided, no handshake needed).
    #    Safe now because departing ranks have finished their work.
    on_scale(new_tp_size)

    # 6. Orchestrator kills the departing ranks.
%%{init: {
  'theme': 'base',
  'themeVariables': {
    'primaryColor': '#76b900',
    'primaryTextColor': '#fff',
    'primaryBorderColor': '#000',
    'lineColor': '#000',
    'secondaryColor': '#0071c5',
    'tertiaryColor': '#5e5e5e',
    'background': '#fff',
    'mainBkg': '#76b900',
    'secondBkg': '#0071c5',
    'tertiaryBkg': '#5e5e5e'
  }
}}%%
sequenceDiagram
    participant Orch as Orchestrator
    participant HTTP as HTTP Server
    participant SM as StateManager
    participant MR as model_runner
    participant EPLB as EPLB
    participant R7 as Removing Rank 7

    Orch->>HTTP: POST /scale_elastic_ep {new_tp_size: 6}
    HTTP->>SM: active_ranks[6:8] = 0
    HTTP->>SM: set_effective_ep_size(6)

    MR->>EPLB: rebalance() → consolidate experts onto 6 ranks

    Note over R7: In-flight requests finish naturally

    MR->>MR: on_scale(6) → disconnect_ranks([6,7])
    Orch->>R7: shutdown
    Note over MR: Inference continues at ep_size=6
Loading

User-Facing API

Server launch

# Start with 4 GPUs, allow scaling up to 16
sglang serve --model-path <model> --tp 4 --dp 4 \
    --elastic-ep-backend mooncake --moe-a2a-backend nixl \
    --max-ep-size 16

# Later: add a new rank (scale-up)
sglang serve --model-path <model> \
    --elastic-ep-backend mooncake --moe-a2a-backend nixl \
    --ep-join-mode scale --node-rank 1

# Or: bring back a crashed rank (recovery)
sglang serve --model-path <model> \
    --elastic-ep-backend mooncake --moe-a2a-backend nixl \
    --ep-join-mode recover --node-rank 1

--max-ep-size 16 tells SGLang to pre-allocate everything for up to 16 ranks. The process group starts at 4 and grows on demand via extend_group_size_to().

HTTP API

POST /scale_elastic_ep        → kick off a scale operation
POST /is_scaling_elastic_ep   → check if scaling is in progress

Request:

{ "new_tp_size": 8 }

Response:

{
  "message": "Scaling initiated from 4 to 8, polling for new ranks",
  "old_tp_size": 4,
  "new_tp_size": 8
}

This returns right away; it just extends the group and sets effective_ep_size. The actual rank joining happens in the background through the poll loop.

Internal: ZMQ Pipeline

The HTTP handler can't call ElasticEPStateManager directly; SGLang routes control messages through ZMQ. The scale request follows the same pattern as UpdateWeightReqInput:

POST /scale_elastic_ep
  → http_server (FastAPI)
      → TokenizerManager (ZMQ send)
          → Scheduler (ZMQ recv)
              → extend_group_size_to + set_effective_ep_size

This requires defining ScaleElasticEPReqInput / ScaleElasticEPReqOutput message types in io_struct.py and wiring the handlers in TokenizerManager and Scheduler.


dp_attention Topology

In dp_attention mode, each GPU is its own DP shard:

Before (tp=4, dp=4, ep=4):
  DP0/TP0/EP0  DP1/TP1/EP1  DP2/TP2/EP2  DP3/TP3/EP3

After scale to 8:
  DP0/TP0/EP0  ...  DP3/TP3/EP3  DP4/TP4/EP4  ...  DP7/TP7/EP7

Constraint: (new_ep_size - effective_ep_size) % attn_tp_size == 0

New shards create their own attention groups at startup. Existing shards don't change.


Assumptions

What we assume
Backend Mooncake for PG elasticity, NIXL for MoE transport
Topology dp_attention mode with ep_size == tp_size
Rank ordering Sequential: scale-up always appends new ranks at effective_ep_size, never fills fault gaps in the middle. Gaps are handled by recovery, not scaling.
Pre-allocation active_ranks and MoE buffers sized to max_ep_size from the start
Process readiness New ranks need to reach join_group() within a reasonable time after the group is extended
Orchestration Launching and killing rank processes is someone else's job

Open Questions and Out of Scope

Still need to figure out

Topic Question
EPLB after scale-up Recovery skips EPLB because of stale metadata. Scale-up adds brand-new ranks with no stale data, so EPLB should be fine, but we should verify this.
Scale-up with faulted ranks Should the scale API reject a request if there are faulted (inactive) ranks within the current effective_ep_size? Allowing it works (scale appends past the gap), but blocking until all ranks are healthy simplifies the flow and avoids mixed recovery + scale in the same poll cycle.
Rank agreement before get_peer_state get_peer_state() is an allreduce, so all active ranks must call it. But each rank decides independently whether to call it, based on its local ranks_to_join. If ranks process the scale signal on different forward passes, some enter the allreduce and others skip, causing a hang. This same race exists in PR #15771's recovery path (fault detected on different passes). Two possible fixes: (a) Timeout on allreduce: add a timeout to get_peer_state's work->wait() (already supported by Mooncake). If timeout, treat as "not ready" and retry next pass. Simple, but the ahead rank wastes up to N seconds blocked. (b) Staged barrier (vLLM-style): before entering get_peer_state, ranks write intent to a shared store and wait briefly. If not all agree, skip and retry. Zero-wait when ranks agree, costs one extra forward pass when they don't. Better for latency, more code to implement.
CUDA graph recapture for existing ranks After a topology change, do existing ranks need to recapture CUDA graphs? PR #15771 recovery doesn't recapture and works. Depends on whether any graph-captured operations encode the EP topology.
Drain for scale-down Scale-up doesn't need drain (non-blocking). Scale-down might, since we need to move experts off the departing ranks before they shut down.
world_size for new ranks When a new rank calls init_process_group, should it pass the extended size or the original? The extension flag might bypass this.

Not in scope for this RFC

  • P2P weight transfer: loading from disk is simpler; P2P would need a "weights ready" barrier.
  • Rank-explicit API: add_ranks([4,5,6,7]) or remove_rank(6) style (see vLLM PR #38862 for rank-specific removal during scale-down). Currently not supported; scale-down removes from the tail of the active range.

Related Work

Reference How it relates
SGLang PR #15771 (rank recovery) The foundation for this RFC. We reuse join_group, recover_ranks, get_peer_state, and the non-blocking poll loop. Scale-up adds extend_group_size_to as the one new call. We propose --ep-join-mode {scale|recover} to unify both paths under one flag.
vLLM NixlEPAll2AllManager.get_handle() Uses a similar lazy dispatch-path pattern: checks buffer size against PG size on every call. Inspired the _scale_to != _connected_ep_size check here.
NIXL EP Reference implementation of elastic EP communication built on NIXL's device API. Provides connect_ranks() / disconnect_ranks() per-pair APIs and an elastic test suite with multi-phase scaling plans. Our NIXL buffer integration follows the same pattern.
SGLang PR #8961 (Elastic EP roadmap) The fault tolerance milestones that this work builds on.

Metadata

Metadata

Labels

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions