[RFC] Elastic EP Scale

## Motivation

Today, large MoE models (like DeepSeek V3) run with a fixed number of GPUs decided at launch time. The process groups, expert sharding, and dispatch paths are all set up once and never change.

That's fine until it isn't:

- Traffic spikes and you need more capacity, but the server is already running.
- A cluster job gets preempted or hardware goes down and you lose GPUs.
- Off-peak hours and you're paying for GPUs you don't need.

Right now the only option is to restart the whole server with a different topology. That means killing in-flight requests, waiting for model reload, and hoping nothing else breaks.

**This RFC proposes adding and removing EP ranks at runtime, without restarting.**

| Scenario | What happens |
|----------|-------------|
| Load spike | Add GPU ranks on the fly |
| Hardware reclaim | Remove ranks gracefully |
| Fault recovery | Replace a crashed rank (already in PR #15771) |
| Cost savings | Shrink to fewer ranks during low traffic |

---

## Background

### Expert Parallelism in SGLang

EP splits the MoE expert layers across `ep_size` GPU ranks. Each rank holds a subset of experts. During a forward pass, tokens get dispatched to the right expert rank (via NIXL or Mooncake all-to-all), computed locally, then combined back.

### `active_ranks`

SGLang already has a tensor that tracks which EP ranks are alive:

```
active_ranks: torch.Tensor  # shape [max_ep_size], dtype int32
                             # 1 = rank is participating, 0 = inactive
```

This tensor is read by MoE dispatchers, EPLB, process groups, the scheduler, and `ElasticEPStateManager`. The key idea of this RFC is that **scaling is just flipping bits in this tensor**, the same thing fault tolerance already does.

### Dynamic group membership (Mooncake PG backend)

The **Mooncake process group backend** provides APIs for changing rank membership on the fly:

| API | What it does |
|-----|-------------|
| `extend_group_size_to(N)` | Grows the process group to N slots. Doesn't block. |
| `get_peer_state(ranks)` | Checks if all active peers can see a rank as connected. Non-blocking poll. |
| `recover_ranks(ranks)` | Marks ranks as active and publishes sync state so joining ranks can finish their setup. |
| `join_group()` | Called by the joining rank. Publishes its metadata, waits for connections, reads sync state. |

The `isExtension` flag lets a new rank create its process group without blocking during init; it defers the actual connection to `join_group()` later.

### PR #15771: Rank Recovery

PR #15771 added the ability to recover a crashed rank: the admin relaunches it, it creates process groups in extension mode, loads the model, and calls `join_group()`. Meanwhile, healthy ranks keep serving and poll each forward pass with `get_peer_state()`. When the recovered rank is ready, they call `recover_ranks()` and it rejoins.

**The key insight here:** scaling can piggyback on this exact same flow. The only difference is that recovery fills an *existing* slot (the rank was there before), while scale-up adds slots *beyond* the original group. That means we need one extra call (`extend_group_size_to()`) to make the group aware of the new slots before the recovery-style join can kick in.

### MoE buffer connections (NIXL)

NIXL manages connections between ranks via `connect_ranks()` and `disconnect_ranks()` work per-pair, so there's no need to rebuild everything. Buffers are pre-allocated to `max_ep_size` at init, so scaling only changes which connections are active.

One important detail: `connect_ranks()` is a **two-sided handshake**. This means both sides need to be calling `connect_ranks()` at roughly the same time.

---

## Design Considerations

### Reusing the recovery interface

PR #15771 already solves the hard problem of dynamically joining a rank into a live serving cluster: extension mode init, `join_group`, non-blocking polling, `recover_ranks`. That machinery works for any rank that wasn't present at startup, not just crashed ones. Rather than building a separate scaling system, we add one call (`extend_group_size_to`) to grow the group and reuse everything else. This keeps the codebase simpler and means both paths benefit from the same bug fixes.

### API style

We considered three ways to express a scale operation:

| Style | Example | Tradeoff |
|-------|---------|----------|
| **Absolute target** *(chosen)* | `{ "new_tp_size": 8 }` | Idempotent, simple validation, maps directly to `extend_group_size_to(N)` |
| Rank-explicit | `{ "rank_ids": [4,5,6,7] }` | More flexible (supports non-contiguous recovery), but requires deeper changes across `ElasticEPStateManager`, buffer backends, and DP shard validation |
| Delta | `{ "count": 4 }` | Human-friendly, but not idempotent |

We go with absolute target for the initial design. Rank-explicit can come later once the core path is proven.

### Rank discovery

Existing ranks discover new ranks the same way they discover recovered ranks in #15771: a non-blocking `get_peer_state()` poll that runs at the end of each forward pass. The scale API doesn't introduce a new discovery mechanism; it just widens the poll range by bumping `effective_ep_size` so the new slots become visible.

### MoE buffer connection ordering

NIXL connections update lazily: each dispatch checks `_scale_to != _connected_ep_size` and calls `connect_ranks` if needed. The critical constraint is that `on_scale()` must be called *after* `activate_ranks()`, not in the HTTP handler. If triggered too early, `connect_ranks` blocks in the dispatch path waiting for the new rank's metadata, but the new rank can't publish that metadata until its `join_group()` completes, which needs `activate_ranks()`. Calling `on_scale` earlier creates a circular wait. Details in the MoE buffer ordering section below.

### Distinguishing scale from recovery

The joining rank declares its intent with `--ep-join-mode scale` or `--ep-join-mode recover`. On the existing rank side, we need a way to determine whether joining ranks are recovering (stale metadata → skip EPLB) or scaling (fresh → let EPLB fire). We wrap this in `is_recovery_join(rank_ids)`. The exact implementation is TBD (could use rank index, PG join metadata, or the flag itself), but the contract is clear: "were these ranks previously active?"

---

## Design Proposal

### System Architecture

```mermaid
%%{init: {
  'theme': 'base',
  'themeVariables': {
    'primaryColor': '#76b900',
    'primaryTextColor': '#fff',
    'primaryBorderColor': '#000',
    'lineColor': '#000',
    'secondaryColor': '#0071c5',
    'tertiaryColor': '#5e5e5e',
    'background': '#fff',
    'mainBkg': '#76b900',
    'secondBkg': '#0071c5',
    'tertiaryBkg': '#5e5e5e'
  }
}}%%
flowchart TB
    subgraph external ["External"]
        Orch["Orchestrator"]
    end

    subgraph sglang_server ["SGLang Server"]
        HTTP["POST /scale_elastic_ep"]
        ZMQ["ZMQ Pipeline"]
    end

    subgraph state ["State Manager"]
        SM["ElasticEPStateManager"]
    end

    subgraph poll ["Forward Loop"]
        MJR["maybe_join_ep_ranks"]
    end

    subgraph backends ["Backends"]
        MC["Process Group"]
        NIXL["NixlEPBuffer"]
    end

    subgraph auto ["Automatic"]
        EPLB["EPLB"]
    end

    subgraph ranks ["GPU Ranks"]
        R0["Existing Ranks"]
        RN["New Ranks"]
    end

    Orch -->|"launch ranks"| RN
    Orch -->|"POST /scale"| HTTP
    HTTP --> ZMQ --> SM
    MJR -->|"reads effective_ep_size"| SM
    MJR -->|"activate_ranks"| MC
    MJR -->|"on_scale"| NIXL
    MJR -->|"rebalance"| EPLB
    MC --> R0
    MC --> RN
    NIXL --> R0
    NIXL --> RN
    RN -->|"join_group"| MC

    classDef nvidiaGreen fill:#76b900,stroke:#000,stroke-width:2px,color:#fff
    classDef nvidiaBlue fill:#0071c5,stroke:#000,stroke-width:2px,color:#fff
    classDef nvidiaBlack fill:#000,stroke:#fff,stroke-width:2px,color:#fff
    classDef nvidiaGray fill:#5e5e5e,stroke:#000,stroke-width:2px,color:#fff
    classDef nvidiaLightGray fill:#cdcdcd,stroke:#000,stroke-width:2px,color:#000

    class Orch nvidiaLightGray
    class HTTP,ZMQ nvidiaGreen
    class SM nvidiaBlack
    class MJR nvidiaBlue
    class MC,NIXL nvidiaBlue
    class EPLB nvidiaGray
    class R0,RN nvidiaGray
```

### Pre-allocation with `max_ep_size`

At server launch, we pre-allocate `active_ranks` and MoE backend resources to `max_ep_size` (via `--max-ep-size`):

```
Launch with --tp 4 --max-ep-size 16:
  active_ranks = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
                  ← active →  ← reserved for future scale-up →
```

The zeros beyond `ep_size` are empty slots with no processes running and no connections established. But the tensor and buffers are already sized for up to 16 ranks. Which zeros the system actively monitors (vs ignores) is controlled by `effective_ep_size`, defined in detail below.

MoE backends like NIXL also need to know the maximum number of ranks upfront to allocate their internal resources (RDMA buffers, metadata structures, per-rank connection slots). These can't be resized on the fly without tearing down and rebuilding the buffer. So `--max-ep-size` does two things: it sizes the `active_ranks` tensor for SGLang, and it tells the MoE backend how much capacity to prepare for. Scaling after that only changes which connections are active, not the buffer layout.

### Core idea: Scale = Extend + Recover

Scale-up reuses the recovery join flow, with one extra step at the front:

| Step | Recovery (PR #15771) | Scale-up (this RFC) |
|------|---------------------|---------------------|
| 1. Prepare group | n/a (slot already exists) | `extend_group_size_to(N)` |
| 2. Set effective size | n/a (unchanged) | `effective_ep_size = N` |
| 3. Launch new rank | `--ep-join-mode recover` | `--ep-join-mode scale` |
| 4–10. Join flow | `join_group`, poll, `recover_ranks`, metadata sync | **Same** |
| 11. Post-join | Skip EPLB, reset `active_ranks` | **Fire EPLB**, expand `active_ranks` |

Steps 3 through 10 are identical between recovery and scale. The only differences are the group-extend at the start and how we handle EPLB afterwards.

New ranks declare their intent when they launch. This replaces PR #15771's boolean `--elastic-ep-rejoin` flag with a multi-value `--ep-join-mode`, where `--ep-join-mode recover` is equivalent to the old `--elastic-ep-rejoin`:

```bash
sglang serve ... --ep-join-mode scale     # brand new rank (this RFC)
sglang serve ... --ep-join-mode recover   # same as --elastic-ep-rejoin (PR #15771)
```

| Mode | EPLB behavior | `active_ranks` |
|------|--------------|----------------|
| `scale` | Fire: redistribute experts to include new rank | Set new slot to 1 |
| `recover` | Skip: sync weights directly (avoids stale metadata issue) | Reset to original world |

### `activate_ranks`: unified abstraction

Both recovery and scale-up end up doing the same thing per process group: "accept these ranks." We wrap that in `activate_ranks(pg, rank_ids)`:

```python
def activate_ranks(pg, rank_ids):
    # Backend-specific acceptance. Publishes sync state so
    # joining ranks can complete their join_group()
    recover_ranks(pg, rank_ids)
```

The poll loop calls `activate_ranks` per PG for both cases. Everything that differs (EPLB handling, MoE buffer connections) happens explicitly in the poll loop after the PG loop, not hidden inside `activate_ranks`.

### `ElasticEPStateManager`: proposed API

`ElasticEPStateManager` is the singleton that owns `active_ranks` and coordinates scaling state:

| Method | What it does |
|--------|-------------|
| `set_effective_ep_size(N)` | Sets how many slots the poll loop should scan. |
| `get_effective_ep_size()` | Returns the current effective size (may include pending/unjoined slots). |
| `get_num_active_ranks()` | Count of 1s in `active_ranks`, i.e. the actually-serving ranks. |
| `original_ep_size` | The `ep_size` from server launch. Used to tell recovery apart from scale (is the joining rank within the original world or beyond it?). |
| `snapshot_active_to_last()` | Copies `active_ranks` → `last_active_ranks` so `model_runner` can detect changes next forward pass. |
| `on_scale(new_ep_size)` | Sets `_scale_to` for the NIXL dispatch path. Called after `activate_ranks` for scale-up, or directly in the HTTP handler for scale-down. |

`activate_ranks(pg, rank_ids)` is a separate per-PG function, not on the state manager.

**Why we need the HTTP API at all:** The poll loop only scans `active_ranks[0..effective_ep_size)`. In recovery, the crashed rank's slot is already in that range, so the zero shows up automatically. For scale-up, the new slots are *beyond* `effective_ep_size` and the poll loop can't see them. The HTTP call (`POST /scale_elastic_ep`) extends the group and bumps `effective_ep_size`, which makes the new slots visible to the poll loop. After that, everything is the same as recovery.

---

## Scale-Up Flow

### Dependency chain

Each step below depends on the one before it:

```
POST /scale_elastic_ep
  → extend_group_size_to(N)     grow group so new slots exist
    → new rank's join_group()   can now complete (was blocked before extend)
      → activate_ranks()        accept ranks, sync metadata
        → on_scale()            tell NIXL about the new size
          → new rank unblocked  enters its forward loop
            → MoE buffer connect  both sides do the TCPStore handshake
              → EPLB rebalance    experts get redistributed
```

The root of the chain is `extend_group_size_to`: nothing can happen before the scale API call triggers it.

### On existing ranks

```diff
  # ── NEW: HTTP handler (via ZMQ pipeline), returns right away ──
+ def handle_scale_elastic_ep(new_tp_size):
+     for pg in iter_live_process_groups():
+         extend_group_size_to(pg, new_tp_size)
+     ElasticEPStateManager.set_effective_ep_size(new_tp_size)
+     # Don't trigger MoE buffer connections yet. See ordering section below.

  # Every forward pass, non-blocking poll
  # (was maybe_recover_ep_ranks in PR #15771, renamed to handle both paths)
  def maybe_join_ep_ranks():
-     ranks_to_join = [i for i in range(len(active_ranks)) if not active_ranks[i]]
+     ranks_to_join = [i for i in range(effective_ep_size) if not active_ranks[i]]
      if not ranks_to_join:
          return

      # NOTE: get_peer_state is an allreduce, all active ranks must call it.
      # See open questions re: rank agreement before entering this collective.
      if not all(get_peer_state(ranks_to_join)):  # existing from PR #15771
          return

      # Accept into each process group
      for pg in iter_live_process_groups():       # existing from PR #15771
          activate_ranks(pg, ranks_to_join)        # wraps recover_ranks

      broadcast_expert_location_metadata()         # existing from PR #15771

-     eplb_manager.reset_generator()               # PR #15771: always reset
+     # NEW: only reset EPLB for recovery (stale metadata).
+     # Scale-up lets EPLB fire normally (no stale data).
+     if is_recovery_join(ranks_to_join):
+         eplb_manager.reset_generator()

+     # NEW: tell NIXL about the new size (after activate_ranks, not in
+     # HTTP handler to avoid deadlock, see ordering section).
+     on_scale(ElasticEPStateManager.get_effective_ep_size())

      ElasticEPStateManager.snapshot_active_to_last()  # existing from PR #15771
```

### On new ranks

```bash
sglang serve --model-path <model> --tp 4 --dp 4 \
    --elastic-ep-backend mooncake --moe-a2a-backend nixl \
    --ep-join-mode scale --node-rank 1
```

```python
# 1. Create process group in extension mode (doesn't wait for peers)
init_process_group(isExtension=True)

# 2. Load model from disk (completely independent, no PG needed)
load_model()

# 3. Capture CUDA graphs
capture_cuda_graphs()

# 4. Join all process groups
#    This blocks until existing ranks call extend + activate_ranks
join_process_groups()

# 5. Get expert placement info from existing ranks
broadcast_expert_location_metadata()

# 6. Start serving (NIXL buffer connects on first MoE dispatch)
```

### MoE buffer connection ordering

Since `connect_ranks()` is a two-sided handshake, both sides need to be in the call at the same time, so we have to be careful about when we trigger it:

```
1. extend_group_size_to(N), effective_ep_size = N    HTTP handler
2. get_peer_state() poll                              forward loop, non-blocking
3. join_group() completes                             new rank finishes PG-level connect
4. activate_ranks()                                   unblocks the new rank
5. on_scale(N) → _scale_to = N                        tells NIXL to update connections
6. New rank: first dispatch → publishes its NIXL metadata
7. Existing rank: next dispatch → connect_ranks() → both sides handshake
```

**Why `on_scale()` has to come after step 4, not earlier:** If we called it in the HTTP handler, existing ranks would try `connect_ranks` on their next MoE dispatch. But `connect_ranks` blocks waiting for the new rank's TCPStore key, and the new rank can't publish that key until `join_group()` finishes, which needs `activate_ranks()` from the existing ranks. That's a circular wait = deadlock.

Between steps 6 and 7, existing ranks briefly block in MoE dispatch while waiting for the new rank's NIXL key. In practice this is a few seconds since the new rank is already in its forward loop by then.

### NIXL dispatch path: lazy connection update

On every MoE dispatch, `get_nixl_buffer()` checks if the connections need updating:

```python
def get_nixl_buffer():
    if _scale_to != _connected_ep_size:
        _update_connections(_scale_to)
        # scale-up:   connect_ranks(new_rank_ids)
        # scale-down: disconnect_ranks(removed_ids)
        _connected_ep_size = _scale_to
    return _buffer
```

When nothing is scaling, this is a single integer compare, zero cost. When scaling is happening, `connect_ranks` runs once on the first dispatch after `on_scale()`.

New ranks allocate their NixlEPBuffer (pre-sized to `max_ep_size`) on first `get_nixl_buffer()` call and connect to all active ranks.

### `effective_ep_size`: what's managed vs what's reserved

`effective_ep_size` is the **upper bound of the poll range**: the number of rank slots the system actively manages (including both active ranks and pending/faulted slots that the poll loop monitors). It is NOT the number of currently serving ranks (that's `get_num_active_ranks()`), and it's NOT the hard maximum (that's `max_ep_size`).

- At launch: `effective_ep_size = ep_size` (e.g., 4)
- After `POST /scale_elastic_ep {new_tp_size: 8}`: `effective_ep_size = 8`
- After scale-down to 6: `effective_ep_size = 6`

Slots beyond `effective_ep_size` are reserved and invisible to the poll loop:

```
At launch (ep_size=4, effective_ep_size=4, max_ep_size=16):
  [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
   ← active →  ← reserved (not polled) →
   poll range [0..4), all active, nothing to poll for

After scale API (effective_ep_size=8):
  [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
   ← active →  ← poll zone →  ← reserved →
   poll range [0..8), zeros at [4..7] trigger get_peer_state

After new ranks join:
  [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
   ← all active →             ← reserved →
   poll range [0..8), all active, nothing to poll for
```

The poll loop only looks at `[0..effective_ep_size)`. This way we don't waste time polling for ranks that nobody has asked to add yet.

A zero in `active_ranks` means different things depending on where it sits:

| State | `active_ranks[i]` | Where | Meaning |
|-------|-------------------|-------|---------|
| Active | 1 | anywhere | Rank is serving, dispatchers route to it |
| Failed | 0 | `i < effective_ep_size` | Was active, crashed. Poll loop monitors for recovery |
| Pending | 0 | `i < effective_ep_size` | Never active. Poll loop monitors for scale-up join |
| Reserved | 0 | `i >= effective_ep_size` | Slot exists in the tensor but not managed yet |

The tensor itself doesn't distinguish failed from pending; both are zeros within the poll range. The poll loop treats them the same (poll, activate when ready). The distinction only matters for post-join EPLB behavior (see `is_recovery_join`).

---

## Scale-Up Sequence Diagram

```mermaid
%%{init: {
  'theme': 'base',
  'themeVariables': {
    'primaryColor': '#76b900',
    'primaryTextColor': '#fff',
    'primaryBorderColor': '#000',
    'lineColor': '#000',
    'secondaryColor': '#0071c5',
    'tertiaryColor': '#5e5e5e',
    'background': '#fff',
    'mainBkg': '#76b900',
    'secondBkg': '#0071c5',
    'tertiaryBkg': '#5e5e5e'
  }
}}%%
sequenceDiagram
    participant Orch as Orchestrator
    participant HTTP as HTTP Server
    participant SM as StateManager
    participant MR as model_runner
    participant EPLB as EPLB
    participant R4 as New Rank 4

    Note over Orch,R4: Phase 1 - Launch new ranks
    Orch->>R4: sglang serve ... --ep-join-mode scale
    R4->>R4: init (extension mode) + load model

    Note over Orch,R4: Phase 2 - Scale request
    Orch->>HTTP: POST /scale_elastic_ep {new_tp_size: 8}
    HTTP->>SM: extend_group_size_to(8), effective_ep_size = 8
    HTTP-->>Orch: scaling initiated

    Note over Orch,R4: Phase 3 - Poll (healthy ranks keep serving at ep=4)
    MR->>MR: forward() → maybe_join_ep_ranks()
    MR->>SM: get_peer_state([4..7]) → not ready

    R4->>SM: join_group() - connects at PG level
    MR->>SM: get_peer_state([4..7]) → all ready

    Note over Orch,R4: Phase 4 - Accept + sync
    MR->>SM: activate_ranks([4..7]) per PG
    MR->>R4: broadcast expert metadata
    SM->>SM: active_ranks[4:8] = 1

    Note over Orch,R4: Phase 5 - MoE buffer connect
    MR->>MR: on_scale(8)
    R4->>R4: first dispatch → publishes buffer metadata
    MR->>MR: next dispatch → connect_ranks([4..7]) → handshake

    Note over Orch,R4: Phase 6 - EPLB
    MR->>EPLB: rebalance()
    EPLB->>EPLB: redistribute experts to 8 ranks
    Note over MR,R4: Inference continues at ep_size=8
```

---

## Scale-Down Flow

Scale-down is simpler than scale-up since there's no join flow needed. The main concern is graceful shutdown: if we just kill ranks, the FT path would detect them as faulted and try to recover them every forward pass. So we need to explicitly shrink `effective_ep_size` to stop the poll loop from scanning those slots.

```python
def handle_scale_down(new_tp_size):
    # 1. Stop routing to departing ranks. Dispatchers and EPLB read this.
    active_ranks[new_tp_size:effective_ep_size] = 0

    # 2. EPLB fires on next forward pass, consolidates experts onto
    #    remaining ranks before those ranks disconnect.

    # 3. Shrink poll range. Without this, FT would treat the killed
    #    ranks as faults and try to recover them every forward pass.
    ElasticEPStateManager.set_effective_ep_size(new_tp_size)

    # 4. Wait for in-flight work on departing ranks to finish.
    #    Requests already owned by those ranks complete naturally.

    # 5. Disconnect NIXL (one-sided, no handshake needed).
    #    Safe now because departing ranks have finished their work.
    on_scale(new_tp_size)

    # 6. Orchestrator kills the departing ranks.
```

```mermaid
%%{init: {
  'theme': 'base',
  'themeVariables': {
    'primaryColor': '#76b900',
    'primaryTextColor': '#fff',
    'primaryBorderColor': '#000',
    'lineColor': '#000',
    'secondaryColor': '#0071c5',
    'tertiaryColor': '#5e5e5e',
    'background': '#fff',
    'mainBkg': '#76b900',
    'secondBkg': '#0071c5',
    'tertiaryBkg': '#5e5e5e'
  }
}}%%
sequenceDiagram
    participant Orch as Orchestrator
    participant HTTP as HTTP Server
    participant SM as StateManager
    participant MR as model_runner
    participant EPLB as EPLB
    participant R7 as Removing Rank 7

    Orch->>HTTP: POST /scale_elastic_ep {new_tp_size: 6}
    HTTP->>SM: active_ranks[6:8] = 0
    HTTP->>SM: set_effective_ep_size(6)

    MR->>EPLB: rebalance() → consolidate experts onto 6 ranks

    Note over R7: In-flight requests finish naturally

    MR->>MR: on_scale(6) → disconnect_ranks([6,7])
    Orch->>R7: shutdown
    Note over MR: Inference continues at ep_size=6
```

---

## User-Facing API

### Server launch

```bash
# Start with 4 GPUs, allow scaling up to 16
sglang serve --model-path <model> --tp 4 --dp 4 \
    --elastic-ep-backend mooncake --moe-a2a-backend nixl \
    --max-ep-size 16

# Later: add a new rank (scale-up)
sglang serve --model-path <model> \
    --elastic-ep-backend mooncake --moe-a2a-backend nixl \
    --ep-join-mode scale --node-rank 1

# Or: bring back a crashed rank (recovery)
sglang serve --model-path <model> \
    --elastic-ep-backend mooncake --moe-a2a-backend nixl \
    --ep-join-mode recover --node-rank 1
```

`--max-ep-size 16` tells SGLang to pre-allocate everything for up to 16 ranks. The process group starts at 4 and grows on demand via `extend_group_size_to()`.

### HTTP API

```
POST /scale_elastic_ep        → kick off a scale operation
POST /is_scaling_elastic_ep   → check if scaling is in progress
```

**Request:**

```json
{ "new_tp_size": 8 }
```

**Response:**

```json
{
  "message": "Scaling initiated from 4 to 8, polling for new ranks",
  "old_tp_size": 4,
  "new_tp_size": 8
}
```

This returns right away; it just extends the group and sets `effective_ep_size`. The actual rank joining happens in the background through the poll loop.

### Internal: ZMQ Pipeline

The HTTP handler can't call `ElasticEPStateManager` directly; SGLang routes control messages through ZMQ. The scale request follows the same pattern as `UpdateWeightReqInput`:

```
POST /scale_elastic_ep
  → http_server (FastAPI)
      → TokenizerManager (ZMQ send)
          → Scheduler (ZMQ recv)
              → extend_group_size_to + set_effective_ep_size
```

This requires defining `ScaleElasticEPReqInput` / `ScaleElasticEPReqOutput` message types in `io_struct.py` and wiring the handlers in `TokenizerManager` and `Scheduler`.

---

## dp_attention Topology

In `dp_attention` mode, each GPU is its own DP shard:

```
Before (tp=4, dp=4, ep=4):
  DP0/TP0/EP0  DP1/TP1/EP1  DP2/TP2/EP2  DP3/TP3/EP3

After scale to 8:
  DP0/TP0/EP0  ...  DP3/TP3/EP3  DP4/TP4/EP4  ...  DP7/TP7/EP7
```

**Constraint:** `(new_ep_size - effective_ep_size) % attn_tp_size == 0`

New shards create their own attention groups at startup. Existing shards don't change.

---

## Assumptions

| | What we assume |
|--|---------------|
| Backend | Mooncake for PG elasticity, NIXL for MoE transport |
| Topology | `dp_attention` mode with `ep_size == tp_size` |
| Rank ordering | Sequential: scale-up always appends new ranks at `effective_ep_size`, never fills fault gaps in the middle. Gaps are handled by recovery, not scaling. |
| Pre-allocation | `active_ranks` and MoE buffers sized to `max_ep_size` from the start |
| Process readiness | New ranks need to reach `join_group()` within a reasonable time after the group is extended |
| Orchestration | Launching and killing rank processes is someone else's job |

---

## Open Questions and Out of Scope

### Still need to figure out

| Topic | Question |
|-------|---------|
| EPLB after scale-up | Recovery skips EPLB because of stale metadata. Scale-up adds brand-new ranks with no stale data, so EPLB should be fine, but we should verify this. |
| Scale-up with faulted ranks | Should the scale API reject a request if there are faulted (inactive) ranks within the current `effective_ep_size`? Allowing it works (scale appends past the gap), but blocking until all ranks are healthy simplifies the flow and avoids mixed recovery + scale in the same poll cycle. |
| Rank agreement before `get_peer_state` | `get_peer_state()` is an allreduce, so all active ranks must call it. But each rank decides independently whether to call it, based on its local `ranks_to_join`. If ranks process the scale signal on different forward passes, some enter the allreduce and others skip, causing a hang. This same race exists in PR #15771's recovery path (fault detected on different passes). Two possible fixes: **(a) Timeout on allreduce**: add a timeout to `get_peer_state`'s `work->wait()` (already supported by Mooncake). If timeout, treat as "not ready" and retry next pass. Simple, but the ahead rank wastes up to N seconds blocked. **(b) Staged barrier** (vLLM-style): before entering `get_peer_state`, ranks write intent to a shared store and wait briefly. If not all agree, skip and retry. Zero-wait when ranks agree, costs one extra forward pass when they don't. Better for latency, more code to implement. |
| CUDA graph recapture for existing ranks | After a topology change, do existing ranks need to recapture CUDA graphs? PR #15771 recovery doesn't recapture and works. Depends on whether any graph-captured operations encode the EP topology. |
| Drain for scale-down | Scale-up doesn't need drain (non-blocking). Scale-down might, since we need to move experts off the departing ranks before they shut down. |
| `world_size` for new ranks | When a new rank calls `init_process_group`, should it pass the extended size or the original? The extension flag might bypass this. |

### Not in scope for this RFC

- **P2P weight transfer**: loading from disk is simpler; P2P would need a "weights ready" barrier.
- **Rank-explicit API**: `add_ranks([4,5,6,7])` or `remove_rank(6)` style (see vLLM PR #38862 for rank-specific removal during scale-down). Currently not supported; scale-down removes from the tail of the active range.

---

## Related Work

| Reference | How it relates |
|-----------|---------------|
| SGLang PR #15771 (rank recovery) | **The foundation for this RFC.** We reuse `join_group`, `recover_ranks`, `get_peer_state`, and the non-blocking poll loop. Scale-up adds `extend_group_size_to` as the one new call. We propose `--ep-join-mode {scale\|recover}` to unify both paths under one flag. |
| vLLM `NixlEPAll2AllManager.get_handle()` | Uses a similar lazy dispatch-path pattern: checks buffer size against PG size on every call. Inspired the `_scale_to != _connected_ep_size` check here. |
| [NIXL EP](https://github.com/ai-dynamo/nixl/tree/main/examples/device/ep) | Reference implementation of elastic EP communication built on NIXL's device API. Provides `connect_ranks()` / `disconnect_ranks()` per-pair APIs and an elastic test suite with multi-phase scaling plans. Our NIXL buffer integration follows the same pattern. |
| SGLang PR #8961 (Elastic EP roadmap) | The fault tolerance milestones that this work builds on. |


Method	What it does
`set_effective_ep_size(N)`	Sets how many slots the poll loop should scan.
`get_effective_ep_size()`	Returns the current effective size (may include pending/unjoined slots).
`get_num_active_ranks()`	Count of 1s in `active_ranks`, i.e. the actually-serving ranks.
`original_ep_size`	The `ep_size` from server launch. Used to tell recovery apart from scale (is the joining rank within the original world or beyond it?).
`snapshot_active_to_last()`	Copies `active_ranks` → `last_active_ranks` so `model_runner` can detect changes next forward pass.
`on_scale(new_ep_size)`	Sets `_scale_to` for the NIXL dispatch path. Called after `activate_ranks` for scale-up, or directly in the HTTP handler for scale-down.

Scenario	What happens
Load spike	Add GPU ranks on the fly
Hardware reclaim	Remove ranks gracefully
Fault recovery	Replace a crashed rank (already in PR #15771)
Cost savings	Shrink to fewer ranks during low traffic

API	What it does
`extend_group_size_to(N)`	Grows the process group to N slots. Doesn't block.
`get_peer_state(ranks)`	Checks if all active peers can see a rank as connected. Non-blocking poll.
`recover_ranks(ranks)`	Marks ranks as active and publishes sync state so joining ranks can finish their setup.
`join_group()`	Called by the joining rank. Publishes its metadata, waits for connections, reads sync state.

Style	Example	Tradeoff
Absolute target (chosen)	`{ "new_tp_size": 8 }`	Idempotent, simple validation, maps directly to `extend_group_size_to(N)`
Rank-explicit	`{ "rank_ids": [4,5,6,7] }`	More flexible (supports non-contiguous recovery), but requires deeper changes across `ElasticEPStateManager`, buffer backends, and DP shard validation
Delta	`{ "count": 4 }`	Human-friendly, but not idempotent

Step	Recovery (PR #15771)	Scale-up (this RFC)
1. Prepare group	n/a (slot already exists)	`extend_group_size_to(N)`
2. Set effective size	n/a (unchanged)	`effective_ep_size = N`
3. Launch new rank	`--ep-join-mode recover`	`--ep-join-mode scale`
4–10. Join flow	`join_group`, poll, `recover_ranks`, metadata sync	Same
11. Post-join	Skip EPLB, reset `active_ranks`	Fire EPLB, expand `active_ranks`

Mode	EPLB behavior	`active_ranks`
`scale`	Fire: redistribute experts to include new rank	Set new slot to 1
`recover`	Skip: sync weights directly (avoids stale metadata issue)	Reset to original world

State	`active_ranks[i]`	Where	Meaning
Active	1	anywhere	Rank is serving, dispatchers route to it
Failed	0	`i < effective_ep_size`	Was active, crashed. Poll loop monitors for recovery
Pending	0	`i < effective_ep_size`	Never active. Poll loop monitors for scale-up join
Reserved	0	`i >= effective_ep_size`	Slot exists in the tensor but not managed yet

	What we assume
Backend	Mooncake for PG elasticity, NIXL for MoE transport
Topology	`dp_attention` mode with `ep_size == tp_size`
Rank ordering	Sequential: scale-up always appends new ranks at `effective_ep_size`, never fills fault gaps in the middle. Gaps are handled by recovery, not scaling.
Pre-allocation	`active_ranks` and MoE buffers sized to `max_ep_size` from the start
Process readiness	New ranks need to reach `join_group()` within a reasonable time after the group is extended
Orchestration	Launching and killing rank processes is someone else's job

Topic	Question
EPLB after scale-up	Recovery skips EPLB because of stale metadata. Scale-up adds brand-new ranks with no stale data, so EPLB should be fine, but we should verify this.
Scale-up with faulted ranks	Should the scale API reject a request if there are faulted (inactive) ranks within the current `effective_ep_size`? Allowing it works (scale appends past the gap), but blocking until all ranks are healthy simplifies the flow and avoids mixed recovery + scale in the same poll cycle.
Rank agreement before `get_peer_state`	`get_peer_state()` is an allreduce, so all active ranks must call it. But each rank decides independently whether to call it, based on its local `ranks_to_join`. If ranks process the scale signal on different forward passes, some enter the allreduce and others skip, causing a hang. This same race exists in PR #15771's recovery path (fault detected on different passes). Two possible fixes: (a) Timeout on allreduce: add a timeout to `get_peer_state`'s `work->wait()` (already supported by Mooncake). If timeout, treat as "not ready" and retry next pass. Simple, but the ahead rank wastes up to N seconds blocked. (b) Staged barrier (vLLM-style): before entering `get_peer_state`, ranks write intent to a shared store and wait briefly. If not all agree, skip and retry. Zero-wait when ranks agree, costs one extra forward pass when they don't. Better for latency, more code to implement.
CUDA graph recapture for existing ranks	After a topology change, do existing ranks need to recapture CUDA graphs? PR #15771 recovery doesn't recapture and works. Depends on whether any graph-captured operations encode the EP topology.
Drain for scale-down	Scale-up doesn't need drain (non-blocking). Scale-down might, since we need to move experts off the departing ranks before they shut down.
`world_size` for new ranks	When a new rank calls `init_process_group`, should it pass the extended size or the original? The extension flag might bypass this.

Reference	How it relates
SGLang PR #15771 (rank recovery)	The foundation for this RFC. We reuse `join_group`, `recover_ranks`, `get_peer_state`, and the non-blocking poll loop. Scale-up adds `extend_group_size_to` as the one new call. We propose `--ep-join-mode {scale\|recover}` to unify both paths under one flag.
vLLM `NixlEPAll2AllManager.get_handle()`	Uses a similar lazy dispatch-path pattern: checks buffer size against PG size on every call. Inspired the `_scale_to != _connected_ep_size` check here.
NIXL EP	Reference implementation of elastic EP communication built on NIXL's device API. Provides `connect_ranks()` / `disconnect_ranks()` per-pair APIs and an elastic test suite with multi-phase scaling plans. Our NIXL buffer integration follows the same pattern.
SGLang PR #8961 (Elastic EP roadmap)	The fault tolerance milestones that this work builds on.

[RFC] Elastic EP Scale #22788

Description

Motivation

Background

Expert Parallelism in SGLang

active_ranks

Dynamic group membership (Mooncake PG backend)

PR #15771: Rank Recovery

MoE buffer connections (NIXL)

Design Considerations

Reusing the recovery interface

API style

Rank discovery

MoE buffer connection ordering

Distinguishing scale from recovery

Design Proposal

System Architecture

Pre-allocation with max_ep_size

Core idea: Scale = Extend + Recover

activate_ranks: unified abstraction

ElasticEPStateManager: proposed API

Scale-Up Flow

Dependency chain

On existing ranks

On new ranks

MoE buffer connection ordering

NIXL dispatch path: lazy connection update

effective_ep_size: what's managed vs what's reserved

Scale-Up Sequence Diagram

Scale-Down Flow

User-Facing API

Server launch

HTTP API

Internal: ZMQ Pipeline

dp_attention Topology

Assumptions

Open Questions and Out of Scope

Still need to figure out

Not in scope for this RFC

Related Work

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

`active_ranks`

Pre-allocation with `max_ep_size`

`activate_ranks`: unified abstraction

`ElasticEPStateManager`: proposed API

`effective_ep_size`: what's managed vs what's reserved