Motivation
Today, large MoE models (like DeepSeek V3) run with a fixed number of GPUs decided at launch time. The process groups, expert sharding, and dispatch paths are all set up once and never change.
That's fine until it isn't:
Traffic spikes and you need more capacity, but the server is already running.
A cluster job gets preempted or hardware goes down and you lose GPUs.
Off-peak hours and you're paying for GPUs you don't need.
Right now the only option is to restart the whole server with a different topology. That means killing in-flight requests, waiting for model reload, and hoping nothing else breaks.
This RFC proposes adding and removing EP ranks at runtime, without restarting.
Background
Expert Parallelism in SGLang
EP splits the MoE expert layers across ep_size GPU ranks. Each rank holds a subset of experts. During a forward pass, tokens get dispatched to the right expert rank (via NIXL or Mooncake all-to-all), computed locally, then combined back.
active_ranks
SGLang already has a tensor that tracks which EP ranks are alive:
This tensor is read by MoE dispatchers, EPLB, process groups, the scheduler, and ElasticEPStateManager. The key idea of this RFC is that scaling is just flipping bits in this tensor, the same thing fault tolerance already does.
Dynamic group membership (Mooncake PG backend)
The Mooncake process group backend provides APIs for changing rank membership on the fly:
| API | What it does |
| --- | --- |
| extend_group_size_to(N) | Grows the process group to N slots. Doesn't block. |
| get_peer_state(ranks) | Checks if all active peers can see a rank as connected. Non-blocking poll. |
| recover_ranks(ranks) | Marks ranks as active and publishes sync state so joining ranks can finish their setup. |
| join_group() | Called by the joining rank. Publishes its metadata, waits for connections, reads sync state. |
The isExtension flag lets a new rank create its process group without blocking during init; it defers the actual connection to join_group() later.
PR #15771: Rank Recovery
PR #15771 added the ability to recover a crashed rank: the admin relaunches it, it creates process groups in extension mode, loads the model, and calls join_group(). Meanwhile, healthy ranks keep serving and poll each forward pass with get_peer_state(). When the recovered rank is ready, they call recover_ranks() and it rejoins.
The key insight here: scaling can piggyback on this exact same flow. The only difference is that recovery fills an existing slot (the rank was there before), while scale-up adds slots beyond the original group. That means we need one extra call (extend_group_size_to()) to make the group aware of the new slots before the recovery-style join can kick in.
MoE buffer connections (NIXL)
NIXL manages connections between ranks via connect_ranks() and disconnect_ranks(), which work per pair, so there's no need to rebuild everything when membership changes. Buffers are pre-allocated to max_ep_size at init, so scaling only changes which connections are active.
One important detail: connect_ranks() is a two-sided handshake. This means both sides need to be calling connect_ranks() at roughly the same time.
Design Considerations
Reusing the recovery interface
PR #15771 already solves the hard problem of dynamically joining a rank into a live serving cluster: extension mode init, join_group, non-blocking polling, recover_ranks. That machinery works for any rank that wasn't present at startup, not just crashed ones. Rather than building a separate scaling system, we add one call (extend_group_size_to) to grow the group and reuse everything else. This keeps the codebase simpler and means both paths benefit from the same bug fixes.
API style
We considered three ways to express a scale operation:
| Style | Example | Tradeoff |
| --- | --- | --- |
| Absolute target (chosen) | { "new_tp_size": 8 } | Idempotent, simple validation, maps directly to extend_group_size_to(N) |
| Rank-explicit | { "rank_ids": [4,5,6,7] } | More flexible (supports non-contiguous recovery), but requires deeper changes across ElasticEPStateManager, buffer backends, and DP shard validation |
| Delta | { "count": 4 } | Human-friendly, but not idempotent |
We go with absolute target for the initial design. Rank-explicit can come later once the core path is proven.
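To make the idempotency tradeoff concrete, here is a minimal sketch. The functions apply_absolute and apply_delta are hypothetical and operate on a bare integer rather than real server state; only the payload keys new_tp_size and count come from the table above.

```python
def apply_absolute(current_size: int, request: dict) -> int:
    """Absolute target: the result depends only on the request."""
    return request["new_tp_size"]


def apply_delta(current_size: int, request: dict) -> int:
    """Delta: the result depends on the size at the time of the call."""
    return current_size + request["count"]


# A client that retries after a timeout sends the same request twice.
size = 4
for _ in range(2):
    size = apply_absolute(size, {"new_tp_size": 8})
absolute_result = size  # the retry is harmless

size = 4
for _ in range(2):
    size = apply_delta(size, {"count": 4})
delta_result = size  # the retry grows the group a second time
```

With the absolute form, a duplicated request leaves the target unchanged, which is why it is the safer default for an orchestrator that may retry.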
Rank discovery
Existing ranks discover new ranks the same way they discover recovered ranks in #15771: a non-blocking get_peer_state() poll that runs at the end of each forward pass. The scale API doesn't introduce a new discovery mechanism; it just widens the poll range by bumping effective_ep_size so the new slots become visible.
MoE buffer connection ordering
NIXL connections update lazily: each dispatch checks _scale_to != _connected_ep_size and calls connect_ranks if needed. The critical constraint is that on_scale() must be called after activate_ranks(), not in the HTTP handler. If triggered too early, connect_ranks blocks in the dispatch path waiting for the new rank's metadata, but the new rank can't publish that metadata until its join_group() completes, which needs activate_ranks(). Calling on_scale earlier creates a circular wait. Details in the MoE buffer ordering section below.
Distinguishing scale from recovery
The joining rank declares its intent with --ep-join-mode scale or --ep-join-mode recover. On the existing rank side, we need a way to determine whether joining ranks are recovering (stale metadata → skip EPLB) or scaling (fresh → let EPLB fire). We wrap this in is_recovery_join(rank_ids). The exact implementation is TBD (could use rank index, PG join metadata, or the flag itself), but the contract is clear: "were these ranks previously active?"
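One possible reading of that contract, as a minimal sketch: assume existing ranks keep a set of rank indices that have served at some point. The ever_active set is hypothetical and not part of the RFC, which leaves the real mechanism open.

```python
def is_recovery_join(rank_ids, ever_active):
    """Return True when every joining rank has served before.

    ever_active is a hypothetical set of rank indices that were active at
    some point; the RFC leaves the actual mechanism TBD.
    """
    return all(rank in ever_active for rank in rank_ids)


# Launched with ranks 0..3; rank 2 crashed and is coming back.
ever_active = {0, 1, 2, 3}
recovering = is_recovery_join([2], ever_active)

# Ranks 4..7 are brand new, so this is a scale-up join.
scaling = is_recovery_join([4, 5, 6, 7], ever_active)
```

This sketch treats a mixed batch (some recovering, some new) as a scale-up join. Whether mixed batches should be allowed at all is one of the open questions listed later.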
Design Proposal
System Architecture
%%{init: {
'theme': 'base',
'themeVariables': {
'primaryColor': '#76b900',
'primaryTextColor': '#fff',
'primaryBorderColor': '#000',
'lineColor': '#000',
'secondaryColor': '#0071c5',
'tertiaryColor': '#5e5e5e',
'background': '#fff',
'mainBkg': '#76b900',
'secondBkg': '#0071c5',
'tertiaryBkg': '#5e5e5e'
}
}}%%
flowchart TB
subgraph external ["External"]
Orch["Orchestrator"]
end
subgraph sglang_server ["SGLang Server"]
HTTP["POST /scale_elastic_ep"]
ZMQ["ZMQ Pipeline"]
end
subgraph state ["State Manager"]
SM["ElasticEPStateManager"]
end
subgraph poll ["Forward Loop"]
MJR["maybe_join_ep_ranks"]
end
subgraph backends ["Backends"]
MC["Process Group"]
NIXL["NixlEPBuffer"]
end
subgraph auto ["Automatic"]
EPLB["EPLB"]
end
subgraph ranks ["GPU Ranks"]
R0["Existing Ranks"]
RN["New Ranks"]
end
Orch -->|"launch ranks"| RN
Orch -->|"POST /scale"| HTTP
HTTP --> ZMQ --> SM
MJR -->|"reads effective_ep_size"| SM
MJR -->|"activate_ranks"| MC
MJR -->|"on_scale"| NIXL
MJR -->|"rebalance"| EPLB
MC --> R0
MC --> RN
NIXL --> R0
NIXL --> RN
RN -->|"join_group"| MC
classDef nvidiaGreen fill:#76b900,stroke:#000,stroke-width:2px,color:#fff
classDef nvidiaBlue fill:#0071c5,stroke:#000,stroke-width:2px,color:#fff
classDef nvidiaBlack fill:#000,stroke:#fff,stroke-width:2px,color:#fff
classDef nvidiaGray fill:#5e5e5e,stroke:#000,stroke-width:2px,color:#fff
classDef nvidiaLightGray fill:#cdcdcd,stroke:#000,stroke-width:2px,color:#000
class Orch nvidiaLightGray
class HTTP,ZMQ nvidiaGreen
class SM nvidiaBlack
class MJR nvidiaBlue
class MC,NIXL nvidiaBlue
class EPLB nvidiaGray
class R0,RN nvidiaGray
Pre-allocation with max_ep_size
At server launch, we pre-allocate active_ranks and MoE backend resources to max_ep_size (via --max-ep-size):
The zeros beyond ep_size are empty slots with no processes running and no connections established. But the tensor and buffers are already sized for up to 16 ranks. Which zeros the system actively monitors (vs ignores) is controlled by effective_ep_size, defined in detail below.
MoE backends like NIXL also need to know the maximum number of ranks upfront to allocate their internal resources (RDMA buffers, metadata structures, per-rank connection slots). These can't be resized on the fly without tearing down and rebuilding the buffer. So --max-ep-size does two things: it sizes the active_ranks tensor for SGLang, and it tells the MoE backend how much capacity to prepare for. Scaling after that only changes which connections are active, not the buffer layout.
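As a rough illustration of the pre-allocation, the sketch below builds the mask for the launch configuration used in this RFC (ep_size=4, max_ep_size=16). A plain Python list stands in for the tensor, and make_active_ranks is a hypothetical helper name.

```python
def make_active_ranks(ep_size: int, max_ep_size: int) -> list[int]:
    """Build a max_ep_size-long mask with the first ep_size slots set to 1."""
    if not 0 < ep_size <= max_ep_size:
        raise ValueError("ep_size must be in 1..max_ep_size")
    return [1] * ep_size + [0] * (max_ep_size - ep_size)


# Four serving ranks, twelve empty slots reserved for later scale-up.
active_ranks = make_active_ranks(ep_size=4, max_ep_size=16)
```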
Core idea: Scale = Extend + Recover
Scale-up reuses the recovery join flow, with one extra step at the front:
Steps 3 through 10 are identical between recovery and scale. The only differences are the group-extend at the start and how we handle EPLB afterwards.
New ranks declare their intent when they launch. This replaces PR #15771's boolean --elastic-ep-rejoin flag with a multi-value --ep-join-mode, where --ep-join-mode recover is equivalent to the old --elastic-ep-rejoin:
sglang serve ... --ep-join-mode scale # brand new rank (this RFC)
sglang serve ... --ep-join-mode recover # same as --elastic-ep-rejoin (PR #15771)
| Mode | EPLB behavior | active_ranks |
| --- | --- | --- |
| scale | Fire: redistribute experts to include new rank | Set new slot to 1 |
| recover | Skip: sync weights directly (avoids stale metadata issue) | Reset to original world |
activate_ranks: unified abstraction
Both recovery and scale-up end up doing the same thing per process group: "accept these ranks." We wrap that in activate_ranks(pg, rank_ids):
def activate_ranks(pg, rank_ids):
    # Backend-specific acceptance. Publishes sync state so
    # joining ranks can complete their join_group()
    recover_ranks(pg, rank_ids)
The poll loop calls activate_ranks per PG for both cases. Everything that differs (EPLB handling, MoE buffer connections) happens explicitly in the poll loop after the PG loop, not hidden inside activate_ranks.
ElasticEPStateManager: proposed API
ElasticEPStateManager is the singleton that owns active_ranks and coordinates scaling state:
| Method | What it does |
| --- | --- |
| set_effective_ep_size(N) | Sets how many slots the poll loop should scan. |
| get_effective_ep_size() | Returns the current effective size (may include pending/unjoined slots). |
| get_num_active_ranks() | Count of 1s in active_ranks, i.e. the actually-serving ranks. |
| original_ep_size | The ep_size from server launch. Used to tell recovery apart from scale (is the joining rank within the original world or beyond it?). |
| snapshot_active_to_last() | Copies active_ranks → last_active_ranks so model_runner can detect changes next forward pass. |
| on_scale(new_ep_size) | Sets _scale_to for the NIXL dispatch path. Called after activate_ranks for scale-up, or directly in the HTTP handler for scale-down. |
activate_ranks(pg, rank_ids) is a separate per-PG function, not on the state manager.
Why we need the HTTP API at all: The poll loop only scans active_ranks[0..effective_ep_size). In recovery, the crashed rank's slot is already in that range, so the zero shows up automatically. For scale-up, the new slots are beyond effective_ep_size and the poll loop can't see them. The HTTP call (POST /scale_elastic_ep) extends the group and bumps effective_ep_size, which makes the new slots visible to the poll loop. After that, everything is the same as recovery.
Scale-Up Flow
Dependency chain
Each step below depends on the one before it:
POST /scale_elastic_ep
→ extend_group_size_to(N) grow group so new slots exist
→ new rank's join_group() can now complete (was blocked before extend)
→ activate_ranks() accept ranks, sync metadata
→ on_scale() tell NIXL about the new size
→ new rank unblocked enters its forward loop
→ MoE buffer connect both sides do the TCPStore handshake
→ EPLB rebalance experts get redistributed
The root of the chain is extend_group_size_to: nothing can happen before the scale API call triggers it.
On existing ranks
# ── NEW: HTTP handler (via ZMQ pipeline), returns right away ──
+ def handle_scale_elastic_ep(new_tp_size):
+     for pg in iter_live_process_groups():
+         extend_group_size_to(pg, new_tp_size)
+     ElasticEPStateManager.set_effective_ep_size(new_tp_size)
+     # Don't trigger MoE buffer connections yet. See ordering section below.

# Every forward pass, non-blocking poll
# (was maybe_recover_ep_ranks in PR #15771, renamed to handle both paths)
  def maybe_join_ep_ranks():
-     ranks_to_join = [i for i in range(len(active_ranks)) if not active_ranks[i]]
+     ranks_to_join = [i for i in range(effective_ep_size) if not active_ranks[i]]
      if not ranks_to_join:
          return

      # NOTE: get_peer_state is an allreduce, all active ranks must call it.
      # See open questions re: rank agreement before entering this collective.
      if not all(get_peer_state(ranks_to_join)):  # existing from PR #15771
          return

      # Accept into each process group
      for pg in iter_live_process_groups():  # existing from PR #15771
          activate_ranks(pg, ranks_to_join)  # wraps recover_ranks

      broadcast_expert_location_metadata()  # existing from PR #15771

-     eplb_manager.reset_generator()  # PR #15771: always reset
+     # NEW: only reset EPLB for recovery (stale metadata).
+     # Scale-up lets EPLB fire normally (no stale data).
+     if is_recovery_join(ranks_to_join):
+         eplb_manager.reset_generator()

+     # NEW: tell NIXL about the new size (after activate_ranks, not in
+     # HTTP handler to avoid deadlock, see ordering section).
+     on_scale(ElasticEPStateManager.get_effective_ep_size())

      ElasticEPStateManager.snapshot_active_to_last()  # existing from PR #15771
On new ranks
# 1. Create process group in extension mode (doesn't wait for peers)
init_process_group(isExtension=True)

# 2. Load model from disk (completely independent, no PG needed)
load_model()

# 3. Capture CUDA graphs
capture_cuda_graphs()

# 4. Join all process groups
# This blocks until existing ranks call extend + activate_ranks
join_process_groups()

# 5. Get expert placement info from existing ranks
broadcast_expert_location_metadata()

# 6. Start serving (NIXL buffer connects on first MoE dispatch)
MoE buffer connection ordering
Since connect_ranks() is a two-sided handshake, both sides need to be in the call at the same time, so we have to be careful about when we trigger it:
1. extend_group_size_to(N), effective_ep_size = N HTTP handler
2. get_peer_state() poll forward loop, non-blocking
3. join_group() completes new rank finishes PG-level connect
4. activate_ranks() unblocks the new rank
5. on_scale(N) → _scale_to = N tells NIXL to update connections
6. New rank: first dispatch → publishes its NIXL metadata
7. Existing rank: next dispatch → connect_ranks() → both sides handshake
Why on_scale() has to come after step 4, not earlier: If we called it in the HTTP handler, existing ranks would try connect_ranks on their next MoE dispatch. But connect_ranks blocks waiting for the new rank's TCPStore key, and the new rank can't publish that key until join_group() finishes, which needs activate_ranks() from the existing ranks. That's a circular wait = deadlock.
Between steps 6 and 7, existing ranks briefly block in MoE dispatch while waiting for the new rank's NIXL key. In practice this is a few seconds since the new rank is already in its forward loop by then.
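The circular wait can be checked mechanically by modelling "who waits for whom" as a small graph and looking for a cycle. This is only an illustrative sketch: the node labels are informal names invented here, and the edge from activate_ranks back to connect_ranks encodes the assumption that the poll loop cannot run while the same forward pass is blocked in MoE dispatch.

```python
def has_cycle(waits_for):
    """Detect a cycle in a waits-for graph with a depth-first search."""
    WHITE, GREY, BLACK = 0, 1, 2
    color = {node: WHITE for node in waits_for}

    def visit(node):
        color[node] = GREY
        for dep in waits_for.get(node, ()):
            state = color.get(dep, WHITE)
            if state == GREY:
                return True  # back edge: dep is already on the current path
            if state == WHITE and visit(dep):
                return True
        color[node] = BLACK
        return False

    return any(color[n] == WHITE and visit(n) for n in list(waits_for))


# on_scale() called in the HTTP handler: dispatch blocks before the poll loop.
early = {
    "existing: connect_ranks": ["new: publish NIXL key"],
    "new: publish NIXL key": ["new: join_group done"],
    "new: join_group done": ["existing: activate_ranks"],
    "existing: activate_ranks": ["existing: connect_ranks"],
}

# on_scale() called after activate_ranks(): the chain is linear.
late = {
    "existing: activate_ranks": [],
    "new: join_group done": ["existing: activate_ranks"],
    "new: publish NIXL key": ["new: join_group done"],
    "existing: connect_ranks": ["existing: activate_ranks", "new: publish NIXL key"],
}

early_deadlocks = has_cycle(early)
late_deadlocks = has_cycle(late)
```

Only the early ordering contains a cycle, which matches the deadlock argument above.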
NIXL dispatch path: lazy connection update
On every MoE dispatch, get_nixl_buffer() checks if the connections need updating:
When nothing is scaling, this is a single integer compare, zero cost. When scaling is happening, connect_ranks runs once on the first dispatch after on_scale().
New ranks allocate their NixlEPBuffer (pre-sized to max_ep_size) on first get_nixl_buffer() call and connect to all active ranks.
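The lazy check described in this section could look roughly like the toy model below. The names _scale_to, _connected_ep_size, and on_scale come from this RFC; the class LazyEPConnections, the method get_buffer, and the log list are hypothetical stand-ins, and no real NIXL call is made.

```python
class LazyEPConnections:
    """Toy stand-in for the NIXL buffer's connection bookkeeping."""

    def __init__(self, ep_size, max_ep_size):
        self.max_ep_size = max_ep_size
        self._connected_ep_size = ep_size
        self._scale_to = ep_size
        self.log = []  # records (operation, ranks) instead of touching NIXL

    def on_scale(self, new_ep_size):
        if not 0 < new_ep_size <= self.max_ep_size:
            raise ValueError("new_ep_size exceeds pre-allocated capacity")
        self._scale_to = new_ep_size

    def get_buffer(self):
        # Fast path: a single integer compare when nothing is scaling.
        if self._scale_to != self._connected_ep_size:
            lo, hi = sorted((self._connected_ep_size, self._scale_to))
            growing = self._scale_to > self._connected_ep_size
            self.log.append(("connect" if growing else "disconnect", list(range(lo, hi))))
            self._connected_ep_size = self._scale_to
        return self


conn = LazyEPConnections(ep_size=4, max_ep_size=16)
conn.get_buffer()          # nothing pending, no connection work
conn.on_scale(8)
conn.get_buffer()          # first dispatch after on_scale connects ranks 4..7
conn.get_buffer()          # already up to date, no-op
conn.on_scale(6)
conn.get_buffer()          # scale-down disconnects ranks 6 and 7
```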
effective_ep_size: what's managed vs what's reserved
effective_ep_size is the upper bound of the poll range: the number of rank slots the system actively manages (including both active ranks and pending/faulted slots that the poll loop monitors). It is NOT the number of currently serving ranks (that's get_num_active_ranks()), and it's NOT the hard maximum (that's max_ep_size).
At launch: effective_ep_size = ep_size (e.g., 4)
After POST /scale_elastic_ep {new_tp_size: 8}: effective_ep_size = 8
After scale-down to 6: effective_ep_size = 6
Slots beyond effective_ep_size are reserved and invisible to the poll loop:
At launch (ep_size=4, effective_ep_size=4, max_ep_size=16):
[1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
← active → ← reserved (not polled) →
poll range [0..4), all active, nothing to poll for
After scale API (effective_ep_size=8):
[1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
← active → ← poll zone → ← reserved →
poll range [0..8), zeros at [4..7] trigger get_peer_state
After new ranks join:
[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
← all active → ← reserved →
poll range [0..8), all active, nothing to poll for
The poll loop only looks at [0..effective_ep_size). This way we don't waste time polling for ranks that nobody has asked to add yet.
A zero in active_ranks means different things depending on where it sits:
| State | active_ranks[i] | Where | Meaning |
| --- | --- | --- | --- |
| Active | 1 | anywhere | Rank is serving, dispatchers route to it |
| Failed | 0 | i < effective_ep_size | Was active, crashed. Poll loop monitors for recovery |
| Pending | 0 | i < effective_ep_size | Never active. Poll loop monitors for scale-up join |
| Reserved | 0 | i >= effective_ep_size | Slot exists in the tensor but not managed yet |
The tensor itself doesn't distinguish failed from pending; both are zeros within the poll range. The poll loop treats them the same (poll, activate when ready). The distinction only matters for post-join EPLB behavior (see is_recovery_join).
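A small sketch of how the poll range and the slot states relate, using a plain list in place of the tensor. The helper names ranks_to_join and slot_state are hypothetical; the list comprehension mirrors the one in maybe_join_ep_ranks.

```python
def ranks_to_join(active_ranks, effective_ep_size):
    """Zeros inside the poll range: failed or pending slots."""
    return [i for i in range(effective_ep_size) if not active_ranks[i]]


def slot_state(active_ranks, effective_ep_size, i):
    """Classify slot i; failed and pending are indistinguishable here."""
    if active_ranks[i]:
        return "active"
    return "failed_or_pending" if i < effective_ep_size else "reserved"


mask = [1, 1, 1, 1] + [0] * 12        # ep_size=4, max_ep_size=16
before = ranks_to_join(mask, 4)       # at launch: nothing to poll for
after = ranks_to_join(mask, 8)        # after the scale API widened the range
```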
Scale-Up Sequence Diagram
%%{init: {
'theme': 'base',
'themeVariables': {
'primaryColor': '#76b900',
'primaryTextColor': '#fff',
'primaryBorderColor': '#000',
'lineColor': '#000',
'secondaryColor': '#0071c5',
'tertiaryColor': '#5e5e5e',
'background': '#fff',
'mainBkg': '#76b900',
'secondBkg': '#0071c5',
'tertiaryBkg': '#5e5e5e'
}
}}%%
sequenceDiagram
participant Orch as Orchestrator
participant HTTP as HTTP Server
participant SM as StateManager
participant MR as model_runner
participant EPLB as EPLB
participant R4 as New Rank 4
Note over Orch,R4: Phase 1 - Launch new ranks
Orch->>R4: sglang serve ... --ep-join-mode scale
R4->>R4: init (extension mode) + load model
Note over Orch,R4: Phase 2 - Scale request
Orch->>HTTP: POST /scale_elastic_ep {new_tp_size: 8}
HTTP->>SM: extend_group_size_to(8), effective_ep_size = 8
HTTP-->>Orch: scaling initiated
Note over Orch,R4: Phase 3 - Poll (healthy ranks keep serving at ep=4)
MR->>MR: forward() → maybe_join_ep_ranks()
MR->>SM: get_peer_state([4..7]) → not ready
R4->>SM: join_group() - connects at PG level
MR->>SM: get_peer_state([4..7]) → all ready
Note over Orch,R4: Phase 4 - Accept + sync
MR->>SM: activate_ranks([4..7]) per PG
MR->>R4: broadcast expert metadata
SM->>SM: active_ranks[4:8] = 1
Note over Orch,R4: Phase 5 - MoE buffer connect
MR->>MR: on_scale(8)
R4->>R4: first dispatch → publishes buffer metadata
MR->>MR: next dispatch → connect_ranks([4..7]) → handshake
Note over Orch,R4: Phase 6 - EPLB
MR->>EPLB: rebalance()
EPLB->>EPLB: redistribute experts to 8 ranks
Note over MR,R4: Inference continues at ep_size=8
Scale-Down Flow
Scale-down is simpler than scale-up since there's no join flow needed. The main concern is graceful shutdown: if we just kill ranks, the FT path would detect them as faulted and try to recover them every forward pass. So we need to explicitly shrink effective_ep_size to stop the poll loop from scanning those slots.
def handle_scale_down(new_tp_size):
    # 1. Stop routing to departing ranks. Dispatchers and EPLB read this.
    active_ranks[new_tp_size:effective_ep_size] = 0

    # 2. EPLB fires on next forward pass, consolidates experts onto
    #    remaining ranks before those ranks disconnect.

    # 3. Shrink poll range. Without this, FT would treat the killed
    #    ranks as faults and try to recover them every forward pass.
    ElasticEPStateManager.set_effective_ep_size(new_tp_size)

    # 4. Wait for in-flight work on departing ranks to finish.
    #    Requests already owned by those ranks complete naturally.

    # 5. Disconnect NIXL (one-sided, no handshake needed).
    #    Safe now because departing ranks have finished their work.
    on_scale(new_tp_size)

    # 6. Orchestrator kills the departing ranks.
%%{init: {
'theme': 'base',
'themeVariables': {
'primaryColor': '#76b900',
'primaryTextColor': '#fff',
'primaryBorderColor': '#000',
'lineColor': '#000',
'secondaryColor': '#0071c5',
'tertiaryColor': '#5e5e5e',
'background': '#fff',
'mainBkg': '#76b900',
'secondBkg': '#0071c5',
'tertiaryBkg': '#5e5e5e'
}
}}%%
sequenceDiagram
participant Orch as Orchestrator
participant HTTP as HTTP Server
participant SM as StateManager
participant MR as model_runner
participant EPLB as EPLB
participant R7 as Removing Rank 7
Orch->>HTTP: POST /scale_elastic_ep {new_tp_size: 6}
HTTP->>SM: active_ranks[6:8] = 0
HTTP->>SM: set_effective_ep_size(6)
MR->>EPLB: rebalance() → consolidate experts onto 6 ranks
Note over R7: In-flight requests finish naturally
MR->>MR: on_scale(6) → disconnect_ranks([6,7])
Orch->>R7: shutdown
Note over MR: Inference continues at ep_size=6
User-Facing API
Server launch
# Start with 4 GPUs, allow scaling up to 16
sglang serve --model-path <model> --tp 4 --dp 4 \
--elastic-ep-backend mooncake --moe-a2a-backend nixl \
--max-ep-size 16
# Later: add a new rank (scale-up)
sglang serve --model-path <model> \
--elastic-ep-backend mooncake --moe-a2a-backend nixl \
--ep-join-mode scale --node-rank 1
# Or: bring back a crashed rank (recovery)
sglang serve --model-path <model> \
--elastic-ep-backend mooncake --moe-a2a-backend nixl \
--ep-join-mode recover --node-rank 1
--max-ep-size 16 tells SGLang to pre-allocate everything for up to 16 ranks. The process group starts at 4 and grows on demand via extend_group_size_to().
HTTP API
POST /scale_elastic_ep → kick off a scale operation
POST /is_scaling_elastic_ep → check if scaling is in progress
Request:
{ "new_tp_size": 8 }
Response:
{
"message": "Scaling initiated from 4 to 8, polling for new ranks",
"old_tp_size": 4,
"new_tp_size": 8
}
This returns right away; it just extends the group and sets effective_ep_size. The actual rank joining happens in the background through the poll loop.
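From an orchestrator's point of view, the call could be built as in the sketch below. The endpoint path and the new_tp_size field come from this RFC; the helper build_scale_request and the base URL (host and port) are placeholders, and the request is only constructed here, not sent.

```python
import json
import urllib.request


def build_scale_request(base_url: str, new_tp_size: int) -> urllib.request.Request:
    """Build (but do not send) the POST request for a scale operation."""
    body = json.dumps({"new_tp_size": new_tp_size}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/scale_elastic_ep",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder address; substitute the real server host and port.
req = build_scale_request("http://localhost:30000", 8)
# An orchestrator would send it with urllib.request.urlopen(req).
```

Because the call returns before the new ranks have joined, an orchestrator would typically follow it by polling /is_scaling_elastic_ep.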
Internal: ZMQ Pipeline
The HTTP handler can't call ElasticEPStateManager directly; SGLang routes control messages through ZMQ. The scale request follows the same pattern as UpdateWeightReqInput:
This requires defining ScaleElasticEPReqInput / ScaleElasticEPReqOutput message types in io_struct.py and wiring the handlers in TokenizerManager and Scheduler.
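A minimal sketch of what those message types might carry, assuming plain dataclasses. The class names come from this RFC and the fields mirror the HTTP request and response above; the handler handle_scale_request and its validation rule are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ScaleElasticEPReqInput:
    new_tp_size: int  # absolute target, as in the HTTP request body


@dataclass
class ScaleElasticEPReqOutput:
    message: str
    old_tp_size: int
    new_tp_size: int


def handle_scale_request(req, current_size, max_ep_size):
    """Hypothetical scheduler-side handler: validate, then acknowledge."""
    if not 0 < req.new_tp_size <= max_ep_size:
        raise ValueError(f"new_tp_size must be in 1..{max_ep_size}")
    return ScaleElasticEPReqOutput(
        message=f"Scaling initiated from {current_size} to {req.new_tp_size}",
        old_tp_size=current_size,
        new_tp_size=req.new_tp_size,
    )


out = handle_scale_request(ScaleElasticEPReqInput(new_tp_size=8), 4, 16)
```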
dp_attention Topology
In dp_attention mode, each GPU is its own DP shard:
Before (tp=4, dp=4, ep=4):
DP0/TP0/EP0 DP1/TP1/EP1 DP2/TP2/EP2 DP3/TP3/EP3
After scale to 8:
DP0/TP0/EP0 ... DP3/TP3/EP3 DP4/TP4/EP4 ... DP7/TP7/EP7
New shards create their own attention groups at startup. Existing shards don't change.
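The layout above can be generated with a one-line helper, which also shows that scaling only appends shards. The function dp_attention_layout is a hypothetical name used for illustration, and it assumes the dp == tp == ep topology stated in this RFC.

```python
def dp_attention_layout(size: int) -> list[str]:
    """Label each GPU as its own DP shard, TP rank, and EP rank."""
    return [f"DP{i}/TP{i}/EP{i}" for i in range(size)]


before = dp_attention_layout(4)
after = dp_attention_layout(8)
unchanged = after[: len(before)] == before  # existing shards keep their labels
```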
Assumptions
|  | What we assume |
| --- | --- |
| Backend | Mooncake for PG elasticity, NIXL for MoE transport |
| Topology | dp_attention mode with ep_size == tp_size |
| Rank ordering | Sequential: scale-up always appends new ranks at effective_ep_size, never fills fault gaps in the middle. Gaps are handled by recovery, not scaling. |
| Pre-allocation | active_ranks and MoE buffers sized to max_ep_size from the start |
| Process readiness | New ranks need to reach join_group() within a reasonable time after the group is extended |
| Orchestration | Launching and killing rank processes is someone else's job |
Open Questions and Out of Scope
Still need to figure out
| Topic | Question |
| --- | --- |
| EPLB after scale-up | Recovery skips EPLB because of stale metadata. Scale-up adds brand-new ranks with no stale data, so EPLB should be fine, but we should verify this. |
| Scale-up with faulted ranks | Should the scale API reject a request if there are faulted (inactive) ranks within the current effective_ep_size? Allowing it works (scale appends past the gap), but blocking until all ranks are healthy simplifies the flow and avoids mixed recovery + scale in the same poll cycle. |
| Rank agreement before get_peer_state | get_peer_state() is an allreduce, so all active ranks must call it. But each rank decides independently whether to call it, based on its local ranks_to_join. If ranks process the scale signal on different forward passes, some enter the allreduce and others skip, causing a hang. This same race exists in PR #15771's recovery path (fault detected on different passes). Two possible fixes: (a) Timeout on allreduce: add a timeout to get_peer_state's work->wait() (already supported by Mooncake). On timeout, treat as "not ready" and retry next pass. Simple, but the ahead rank wastes up to N seconds blocked. (b) Staged barrier (vLLM-style): before entering get_peer_state, ranks write intent to a shared store and wait briefly. If not all agree, skip and retry. Zero-wait when ranks agree, costs one extra forward pass when they don't. Better for latency, more code to implement. |
| CUDA graph recapture for existing ranks | After a topology change, do existing ranks need to recapture CUDA graphs? PR #15771 recovery doesn't recapture and works. Depends on whether any graph-captured operations encode the EP topology. |
| Drain for scale-down | Scale-up doesn't need drain (non-blocking). Scale-down might, since we need to move experts off the departing ranks before they shut down. |
| world_size for new ranks | When a new rank calls init_process_group, should it pass the extended size or the original? The extension flag might bypass this. |
Not in scope for this RFC
P2P weight transfer: loading from disk is simpler; P2P would need a "weights ready" barrier.
Rank-explicit API: add_ranks([4,5,6,7]) or remove_rank(6) style (see vLLM PR #38862 for rank-specific removal during scale-down). Currently not supported; scale-down removes from the tail of the active range.
PR #15771 is the foundation for this RFC. We reuse join_group, recover_ranks, get_peer_state, and the non-blocking poll loop. Scale-up adds extend_group_size_to as the one new call. We propose --ep-join-mode {scale|recover} to unify both paths under one flag.
vLLM NixlEPAll2AllManager.get_handle()
Uses a similar lazy dispatch-path pattern: checks buffer size against PG size on every call. Inspired the _scale_to != _connected_ep_size check here.
Reference implementation of elastic EP communication built on NIXL's device API. Provides connect_ranks() / disconnect_ranks() per-pair APIs and an elastic test suite with multi-phase scaling plans. Our NIXL buffer integration follows the same pattern.
Motivation
Today, large MoE models (like DeepSeek V3) run with a fixed number of GPUs decided at launch time. The process groups, expert sharding, and dispatch paths are all set up once and never change.
That's fine until it isn't:
Right now the only option is to restart the whole server with a different topology. That means killing in-flight requests, waiting for model reload, and hoping nothing else breaks.
This RFC proposes adding and removing EP ranks at runtime, without restarting.
Background
Expert Parallelism in SGLang
EP splits the MoE expert layers across
ep_sizeGPU ranks. Each rank holds a subset of experts. During a forward pass, tokens get dispatched to the right expert rank (via NIXL or Mooncake all-to-all), computed locally, then combined back.active_ranksSGLang already has a tensor that tracks which EP ranks are alive:
This tensor is read by MoE dispatchers, EPLB, process groups, the scheduler, and
ElasticEPStateManager. The key idea of this RFC is that scaling is just flipping bits in this tensor, the same thing fault tolerance already does.Dynamic group membership (Mooncake PG backend)
The Mooncake process group backend provides APIs for changing rank membership on the fly:
extend_group_size_to(N)get_peer_state(ranks)recover_ranks(ranks)join_group()The
isExtensionflag lets a new rank create its process group without blocking during init; it defers the actual connection tojoin_group()later.PR #15771: Rank Recovery
PR #15771 added the ability to recover a crashed rank: the admin relaunches it, it creates process groups in extension mode, loads the model, and calls
join_group(). Meanwhile, healthy ranks keep serving and poll each forward pass withget_peer_state(). When the recovered rank is ready, they callrecover_ranks()and it rejoins.The key insight here: scaling can piggyback on this exact same flow. The only difference is that recovery fills an existing slot (the rank was there before), while scale-up adds slots beyond the original group. That means we need one extra call (
extend_group_size_to()) to make the group aware of the new slots before the recovery-style join can kick in.MoE buffer connections (NIXL)
NIXL manages connections between ranks via
connect_ranks()anddisconnect_ranks()work per-pair, so there's no need to rebuild everything. Buffers are pre-allocated tomax_ep_sizeat init, so scaling only changes which connections are active.One important detail:
connect_ranks()is a two-sided handshake. This means both sides need to be callingconnect_ranks()at roughly the same time.Design Considerations
Reusing the recovery interface
PR #15771 already solves the hard problem of dynamically joining a rank into a live serving cluster: extension mode init,
join_group, non-blocking polling,recover_ranks. That machinery works for any rank that wasn't present at startup, not just crashed ones. Rather than building a separate scaling system, we add one call (extend_group_size_to) to grow the group and reuse everything else. This keeps the codebase simpler and means both paths benefit from the same bug fixes.API style
We considered three ways to express a scale operation:
{ "new_tp_size": 8 }extend_group_size_to(N){ "rank_ids": [4,5,6,7] }ElasticEPStateManager, buffer backends, and DP shard validation{ "count": 4 }We go with absolute target for the initial design. Rank-explicit can come later once the core path is proven.
Rank discovery
Existing ranks discover new ranks the same way they discover recovered ranks in #15771: a non-blocking
get_peer_state()poll that runs at the end of each forward pass. The scale API doesn't introduce a new discovery mechanism; it just widens the poll range by bumpingeffective_ep_sizeso the new slots become visible.MoE buffer connection ordering
NIXL connections update lazily: each dispatch checks
_scale_to != _connected_ep_sizeand callsconnect_ranksif needed. The critical constraint is thaton_scale()must be called afteractivate_ranks(), not in the HTTP handler. If triggered too early,connect_ranksblocks in the dispatch path waiting for the new rank's metadata, but the new rank can't publish that metadata until itsjoin_group()completes, which needsactivate_ranks(). Callingon_scaleearlier creates a circular wait. Details in the MoE buffer ordering section below.Distinguishing scale from recovery
The joining rank declares its intent with
--ep-join-mode scaleor--ep-join-mode recover. On the existing rank side, we need a way to determine whether joining ranks are recovering (stale metadata → skip EPLB) or scaling (fresh → let EPLB fire). We wrap this inis_recovery_join(rank_ids). The exact implementation is TBD (could use rank index, PG join metadata, or the flag itself), but the contract is clear: "were these ranks previously active?"Design Proposal
System Architecture
%%{init: { 'theme': 'base', 'themeVariables': { 'primaryColor': '#76b900', 'primaryTextColor': '#fff', 'primaryBorderColor': '#000', 'lineColor': '#000', 'secondaryColor': '#0071c5', 'tertiaryColor': '#5e5e5e', 'background': '#fff', 'mainBkg': '#76b900', 'secondBkg': '#0071c5', 'tertiaryBkg': '#5e5e5e' } }}%%
flowchart TB
    subgraph external ["External"]
        Orch["Orchestrator"]
    end
    subgraph sglang_server ["SGLang Server"]
        HTTP["POST /scale_elastic_ep"]
        ZMQ["ZMQ Pipeline"]
    end
    subgraph state ["State Manager"]
        SM["ElasticEPStateManager"]
    end
    subgraph poll ["Forward Loop"]
        MJR["maybe_join_ep_ranks"]
    end
    subgraph backends ["Backends"]
        MC["Process Group"]
        NIXL["NixlEPBuffer"]
    end
    subgraph auto ["Automatic"]
        EPLB["EPLB"]
    end
    subgraph ranks ["GPU Ranks"]
        R0["Existing Ranks"]
        RN["New Ranks"]
    end
    Orch -->|"launch ranks"| RN
    Orch -->|"POST /scale"| HTTP
    HTTP --> ZMQ --> SM
    MJR -->|"reads effective_ep_size"| SM
    MJR -->|"activate_ranks"| MC
    MJR -->|"on_scale"| NIXL
    MJR -->|"rebalance"| EPLB
    MC --> R0
    MC --> RN
    NIXL --> R0
    NIXL --> RN
    RN -->|"join_group"| MC
    classDef nvidiaGreen fill:#76b900,stroke:#000,stroke-width:2px,color:#fff
    classDef nvidiaBlue fill:#0071c5,stroke:#000,stroke-width:2px,color:#fff
    classDef nvidiaBlack fill:#000,stroke:#fff,stroke-width:2px,color:#fff
    classDef nvidiaGray fill:#5e5e5e,stroke:#000,stroke-width:2px,color:#fff
    classDef nvidiaLightGray fill:#cdcdcd,stroke:#000,stroke-width:2px,color:#000
    class Orch nvidiaLightGray
    class HTTP,ZMQ nvidiaGreen
    class SM nvidiaBlack
    class MJR nvidiaBlue
    class MC,NIXL nvidiaBlue
    class EPLB nvidiaGray
    class R0,RN nvidiaGray
Pre-allocation with max_ep_size
At server launch, we pre-allocate active_ranks and MoE backend resources to max_ep_size (via --max-ep-size):
The zeros beyond ep_size are empty slots with no processes running and no connections established. But the tensor and buffers are already sized for up to 16 ranks. Which zeros the system actively monitors (vs ignores) is controlled by effective_ep_size, defined in detail below.
MoE backends like NIXL also need to know the maximum number of ranks upfront to allocate their internal resources (RDMA buffers, metadata structures, per-rank connection slots). These can't be resized on the fly without tearing down and rebuilding the buffer. So --max-ep-size does two things: it sizes the active_ranks tensor for SGLang, and it tells the MoE backend how much capacity to prepare for. Scaling after that only changes which connections are active, not the buffer layout.
Core idea: Scale = Extend + Recover
Scale-up reuses the recovery join flow, with one extra step at the front:
extend_group_size_to(N)
effective_ep_size = N
--ep-join-mode recover
--ep-join-mode scale
join_group, poll, recover_ranks, metadata sync
active_ranks
active_ranks
Steps 3 through 10 are identical between recovery and scale. The only differences are the group-extend at the start and how we handle EPLB afterwards.
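As a rough illustration of "Scale = Extend + Recover", the sketch below models the shared join path with plain Python lists instead of the real tensor and process group. ToyElasticState, request_scale_up, and poll_and_join are hypothetical names introduced only for this example, and the rank-index rule inside is_recovery_join is just one of the candidate implementations the RFC leaves open.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class ToyElasticState:
    """In-memory stand-in for the scaling state (not the real ElasticEPStateManager)."""

    original_ep_size: int  # ep_size at server launch
    max_ep_size: int       # pre-allocated capacity (--max-ep-size)
    active_ranks: List[int] = field(init=False)
    effective_ep_size: int = field(init=False)
    group_size: int = field(init=False)  # stand-in for the process group's slot count

    def __post_init__(self) -> None:
        # Ones for launched ranks, zeros for the reserved slots.
        reserved = self.max_ep_size - self.original_ep_size
        self.active_ranks = [1] * self.original_ep_size + [0] * reserved
        self.effective_ep_size = self.original_ep_size
        self.group_size = self.original_ep_size


def ranks_to_join(state: ToyElasticState) -> List[int]:
    """Zeros inside the poll range [0, effective_ep_size)."""
    return [r for r in range(state.effective_ep_size) if state.active_ranks[r] == 0]


def is_recovery_join(state: ToyElasticState, rank_ids: List[int]) -> bool:
    """Assumed rule: ranks inside the original world were previously active."""
    return all(r < state.original_ep_size for r in rank_ids)


def request_scale_up(state: ToyElasticState, new_size: int) -> None:
    """The one scale-only step: extend the group and widen the poll range."""
    if not state.effective_ep_size < new_size <= state.max_ep_size:
        raise ValueError(f"cannot scale from {state.effective_ep_size} to {new_size}")
    state.group_size = new_size         # stands in for extend_group_size_to(N)
    state.effective_ep_size = new_size  # new slots become visible to the poll


def poll_and_join(state: ToyElasticState, ready: Set[int]) -> Optional[str]:
    """Shared join path for both cases.

    `ready` models the ranks that every peer already sees as connected.
    Returns "recover" (skip EPLB), "scale" (let EPLB fire), or None (keep polling).
    """
    pending = ranks_to_join(state)
    if not pending or not all(r in ready for r in pending):
        return None
    mode = "recover" if is_recovery_join(state, pending) else "scale"
    for r in pending:
        state.active_ranks[r] = 1       # accept the ranks
    return mode
```

With a state launched at 4 of 16 slots, zeroing slot 2 and polling with ready={2} takes the recovery branch, while request_scale_up(state, 8) followed by a poll with ready={4, 5, 6, 7} takes the scale branch through the same function. Mixed recovery and scale in one poll cycle is an open question in this RFC; the toy simply reports "scale" in that case.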
New ranks declare their intent when they launch. This replaces PR #15771's boolean --elastic-ep-rejoin flag with a multi-value --ep-join-mode, where --ep-join-mode recover is equivalent to the old --elastic-ep-rejoin:
active_ranks
scale
recover
activate_ranks: unified abstraction
Both recovery and scale-up end up doing the same thing per process group: "accept these ranks." We wrap that in activate_ranks(pg, rank_ids):
The poll loop calls activate_ranks per PG for both cases. Everything that differs (EPLB handling, MoE buffer connections) happens explicitly in the poll loop after the PG loop, not hidden inside activate_ranks.
ElasticEPStateManager: proposed API
ElasticEPStateManager is the singleton that owns active_ranks and coordinates scaling state:
set_effective_ep_size(N)
get_effective_ep_size()
get_num_active_ranks()
active_ranks, i.e. the actually-serving ranks.
original_ep_size
ep_size from server launch. Used to tell recovery apart from scale (is the joining rank within the original world or beyond it?).
snapshot_active_to_last()
active_ranks → last_active_ranks so model_runner can detect changes next forward pass.
on_scale(new_ep_size)
_scale_to for the NIXL dispatch path. Called after activate_ranks for scale-up, or directly in the HTTP handler for scale-down.
activate_ranks(pg, rank_ids) is a separate per-PG function, not on the state manager.
Why we need the HTTP API at all: The poll loop only scans active_ranks[0..effective_ep_size). In recovery, the crashed rank's slot is already in that range, so the zero shows up automatically. For scale-up, the new slots are beyond effective_ep_size and the poll loop can't see them. The HTTP call (POST /scale_elastic_ep) extends the group and bumps effective_ep_size, which makes the new slots visible to the poll loop. After that, everything is the same as recovery.
Scale-Up Flow
Dependency chain
Each step below depends on the one before it:
The root of the chain is extend_group_size_to: nothing can happen before the scale API call triggers it.
On existing ranks
On new ranks
MoE buffer connection ordering
Since connect_ranks() is a two-sided handshake, both sides need to be in the call at the same time, so we have to be careful about when we trigger it:
Why on_scale() has to come after step 4, not earlier: If we called it in the HTTP handler, existing ranks would try connect_ranks on their next MoE dispatch. But connect_ranks blocks waiting for the new rank's TCPStore key, and the new rank can't publish that key until join_group() finishes, which needs activate_ranks() from the existing ranks. That's a circular wait = deadlock.
Between steps 6 and 7, existing ranks briefly block in MoE dispatch while waiting for the new rank's NIXL key. In practice this is a few seconds since the new rank is already in its forward loop by then.
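The deferred handshake that on_scale() triggers can be sketched as follows. This is a toy model under stated assumptions, not the real NIXL buffer: ToyNixlBuffer and get_buffer are hypothetical names, connect_ranks and disconnect_ranks are only recorded rather than executed, and handling scale-down lazily in the same check is an assumption made for symmetry. The _scale_to and _connected_ep_size fields follow the names used elsewhere in this RFC.

```python
from typing import List, Tuple


class ToyNixlBuffer:
    """Records which (dis)connect call a dispatch would make; no real RDMA."""

    def __init__(self, ep_size: int, max_ep_size: int) -> None:
        self.max_ep_size = max_ep_size
        self._connected_ep_size = ep_size  # ranks this buffer is connected to
        self._scale_to = ep_size           # target set by on_scale()
        self.calls: List[Tuple[str, List[int]]] = []

    def on_scale(self, new_ep_size: int) -> None:
        """Only record the target. No handshake happens here, so calling this
        after activate_ranks cannot block the poll loop."""
        if not 0 < new_ep_size <= self.max_ep_size:
            raise ValueError(f"new_ep_size {new_ep_size} outside (0, {self.max_ep_size}]")
        self._scale_to = new_ep_size

    def get_buffer(self) -> "ToyNixlBuffer":
        """Called on every MoE dispatch; a single integer compare when idle."""
        if self._scale_to != self._connected_ep_size:
            if self._scale_to > self._connected_ep_size:
                new_ranks = list(range(self._connected_ep_size, self._scale_to))
                self.calls.append(("connect_ranks", new_ranks))
            else:
                old_ranks = list(range(self._scale_to, self._connected_ep_size))
                self.calls.append(("disconnect_ranks", old_ranks))
            self._connected_ep_size = self._scale_to
        return self
```

In this model, on_scale(8) on a buffer connected to 4 ranks records nothing by itself; the first get_buffer() afterwards records one connect_ranks call for ranks 4 to 7, and later dispatches record nothing further. Keeping the handshake out of on_scale() is what lets the caller choose when the blocking step may start.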
NIXL dispatch path: lazy connection update
On every MoE dispatch, get_nixl_buffer() checks if the connections need updating:
When nothing is scaling, this is a single integer compare, so the cost is negligible. When scaling is happening, connect_ranks runs once on the first dispatch after on_scale().
New ranks allocate their NixlEPBuffer (pre-sized to max_ep_size) on first get_nixl_buffer() call and connect to all active ranks.
effective_ep_size: what's managed vs what's reserved
effective_ep_size is the upper bound of the poll range: the number of rank slots the system actively manages (including both active ranks and pending/faulted slots that the poll loop monitors). It is NOT the number of currently serving ranks (that's get_num_active_ranks()), and it's NOT the hard maximum (that's max_ep_size).
effective_ep_size = ep_size (e.g., 4)
POST /scale_elastic_ep {new_tp_size: 8}: effective_ep_size = 8
effective_ep_size = 6
Slots beyond effective_ep_size are reserved and invisible to the poll loop:
The poll loop only looks at [0..effective_ep_size). This way we don't waste time polling for ranks that nobody has asked to add yet.
A zero in active_ranks means different things depending on where it sits:
active_ranks[i]
i < effective_ep_size
i < effective_ep_size
i >= effective_ep_size
The tensor itself doesn't distinguish failed from pending; both are zeros within the poll range. The poll loop treats them the same (poll, activate when ready). The distinction only matters for post-join EPLB behavior (see is_recovery_join).
Scale-Up Sequence Diagram
%%{init: { 'theme': 'base', 'themeVariables': { 'primaryColor': '#76b900', 'primaryTextColor': '#fff', 'primaryBorderColor': '#000', 'lineColor': '#000', 'secondaryColor': '#0071c5', 'tertiaryColor': '#5e5e5e', 'background': '#fff', 'mainBkg': '#76b900', 'secondBkg': '#0071c5', 'tertiaryBkg': '#5e5e5e' } }}%%
sequenceDiagram
    participant Orch as Orchestrator
    participant HTTP as HTTP Server
    participant SM as StateManager
    participant MR as model_runner
    participant EPLB as EPLB
    participant R4 as New Rank 4
    Note over Orch,R4: Phase 1 - Launch new ranks
    Orch->>R4: sglang serve ... --ep-join-mode scale
    R4->>R4: init (extension mode) + load model
    Note over Orch,R4: Phase 2 - Scale request
    Orch->>HTTP: POST /scale_elastic_ep {new_tp_size: 8}
    HTTP->>SM: extend_group_size_to(8), effective_ep_size = 8
    HTTP-->>Orch: scaling initiated
    Note over Orch,R4: Phase 3 - Poll (healthy ranks keep serving at ep=4)
    MR->>MR: forward() → maybe_join_ep_ranks()
    MR->>SM: get_peer_state([4..7]) → not ready
    R4->>SM: join_group() - connects at PG level
    MR->>SM: get_peer_state([4..7]) → all ready
    Note over Orch,R4: Phase 4 - Accept + sync
    MR->>SM: activate_ranks([4..7]) per PG
    MR->>R4: broadcast expert metadata
    SM->>SM: active_ranks[4:8] = 1
    Note over Orch,R4: Phase 5 - MoE buffer connect
    MR->>MR: on_scale(8)
    R4->>R4: first dispatch → publishes buffer metadata
    MR->>MR: next dispatch → connect_ranks([4..7]) → handshake
    Note over Orch,R4: Phase 6 - EPLB
    MR->>EPLB: rebalance()
    EPLB->>EPLB: redistribute experts to 8 ranks
    Note over MR,R4: Inference continues at ep_size=8
Scale-Down Flow
Scale-down is simpler than scale-up since there's no join flow needed. The main concern is graceful shutdown: if we just kill ranks, the FT path would detect them as faulted and try to recover them every forward pass. So we need to explicitly shrink effective_ep_size to stop the poll loop from scanning those slots.
%%{init: { 'theme': 'base', 'themeVariables': { 'primaryColor': '#76b900', 'primaryTextColor': '#fff', 'primaryBorderColor': '#000', 'lineColor': '#000', 'secondaryColor': '#0071c5', 'tertiaryColor': '#5e5e5e', 'background': '#fff', 'mainBkg': '#76b900', 'secondBkg': '#0071c5', 'tertiaryBkg': '#5e5e5e' } }}%%
sequenceDiagram
    participant Orch as Orchestrator
    participant HTTP as HTTP Server
    participant SM as StateManager
    participant MR as model_runner
    participant EPLB as EPLB
    participant R7 as Removing Rank 7
    Orch->>HTTP: POST /scale_elastic_ep {new_tp_size: 6}
    HTTP->>SM: active_ranks[6:8] = 0
    HTTP->>SM: set_effective_ep_size(6)
    MR->>EPLB: rebalance() → consolidate experts onto 6 ranks
    Note over R7: In-flight requests finish naturally
    MR->>MR: on_scale(6) → disconnect_ranks([6,7])
    Orch->>R7: shutdown
    Note over MR: Inference continues at ep_size=6
User-Facing API
Server launch
--max-ep-size 16 tells SGLang to pre-allocate everything for up to 16 ranks. The process group starts at 4 and grows on demand via extend_group_size_to().
HTTP API
Request:
{ "new_tp_size": 8 }
Response:
{ "message": "Scaling initiated from 4 to 8, polling for new ranks", "old_tp_size": 4, "new_tp_size": 8 }
This returns right away; it just extends the group and sets effective_ep_size. The actual rank joining happens in the background through the poll loop.
Internal: ZMQ Pipeline
The HTTP handler can't call ElasticEPStateManager directly; SGLang routes control messages through ZMQ. The scale request follows the same pattern as UpdateWeightReqInput:
This requires defining ScaleElasticEPReqInput / ScaleElasticEPReqOutput message types in io_struct.py and wiring the handlers in TokenizerManager and Scheduler.
dp_attention Topology
In dp_attention mode, each GPU is its own DP shard:
Constraint: (new_ep_size - effective_ep_size) % attn_tp_size == 0
New shards create their own attention groups at startup. Existing shards don't change.
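A request handler could check this constraint before touching any state. The helper below is a minimal sketch: validate_scale_request is a hypothetical name, and the bounds check against max_ep_size and the rejection of no-op requests are assumptions added for the example, not behaviour specified by this RFC.

```python
def validate_scale_request(
    new_ep_size: int,
    effective_ep_size: int,
    max_ep_size: int,
    attn_tp_size: int,
) -> int:
    """Return the number of attention shards added (negative when removed).

    Raises ValueError when the request cannot be satisfied.
    """
    if attn_tp_size <= 0:
        raise ValueError("attn_tp_size must be positive")
    if not 0 < new_ep_size <= max_ep_size:
        raise ValueError(
            f"new_ep_size {new_ep_size} is outside (0, {max_ep_size}]; "
            "buffers are pre-allocated for max_ep_size only"
        )
    if new_ep_size == effective_ep_size:
        raise ValueError("new_ep_size equals the current effective_ep_size")

    delta = new_ep_size - effective_ep_size
    # Constraint from the text: ranks are added or removed in whole shards.
    if delta % attn_tp_size != 0:
        raise ValueError(
            f"change of {delta} ranks is not a multiple of attn_tp_size={attn_tp_size}"
        )
    return delta // attn_tp_size
```

For example, validate_scale_request(8, 4, 16, 2) returns 2 (two new shards of two ranks each), while a request to go from 4 to 7 ranks with attn_tp_size=2 is rejected.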
Assumptions
dp_attention mode with ep_size == tp_size
effective_ep_size, never fills fault gaps in the middle. Gaps are handled by recovery, not scaling.
active_ranks and MoE buffers sized to max_ep_size from the start
join_group() within a reasonable time after the group is extended
Open Questions and Out of Scope
Still need to figure out
effective_ep_size? Allowing it works (scale appends past the gap), but blocking until all ranks are healthy simplifies the flow and avoids mixed recovery + scale in the same poll cycle.
get_peer_state
get_peer_state() is an allreduce, so all active ranks must call it. But each rank decides independently whether to call it, based on its local ranks_to_join. If ranks process the scale signal on different forward passes, some enter the allreduce and others skip it, causing a hang. This same race exists in PR #15771's recovery path (fault detected on different passes). Two possible fixes:
(a) Timeout on allreduce: add a timeout to get_peer_state's work->wait() (already supported by Mooncake). If it times out, treat the result as "not ready" and retry next pass. Simple, but the rank that is ahead wastes up to N seconds blocked.
(b) Staged barrier (vLLM-style): before entering get_peer_state, ranks write intent to a shared store and wait briefly. If not all agree, skip and retry. Zero-wait when ranks agree, costs one extra forward pass when they don't. Better for latency, more code to implement.
world_size for new ranks
init_process_group, should it pass the extended size or the original? The extension flag might bypass this.
Not in scope for this RFC
add_ranks([4,5,6,7]) or remove_rank(6) style (see vLLM PR #38862 for rank-specific removal during scale-down). Currently not supported; scale-down removes from the tail of the active range.
Related Work
join_group, recover_ranks, get_peer_state, and the non-blocking poll loop. Scale-up adds extend_group_size_to as the one new call. We propose --ep-join-mode {scale|recover} to unify both paths under one flag.
NixlEPAll2AllManager.get_handle()
_scale_to != _connected_ep_size check here.
connect_ranks() / disconnect_ranks() per-pair APIs and an elastic test suite with multi-phase scaling plans. Our NIXL buffer integration follows the same pattern.