Skip to content
Draft
Show file tree
Hide file tree
Changes from 4 commits
Commits
Show all changes
72 commits
Select commit Hold shift + click to select a range
31a85e5
Add debug.md cherry-picked from PR #20 (copilot/sub-pr-19)
Copilot May 9, 2026
7952367
Revert "Add debug.md cherry-picked from PR #20 (copilot/sub-pr-19)"
Copilot May 9, 2026
a5e1f42
Add debug.md for XPU EP hang diagnosis
Copilot May 9, 2026
6311a97
Address PR review: remove MoE logs section, add patch files, add Fix …
Copilot May 9, 2026
a365c42
Add trace logging infrastructure for XPU EP hang diagnosis
Copilot May 9, 2026
1c85e35
Address code review: improve path variable readability in xpu_communi…
Copilot May 9, 2026
42a4710
Replace logger.debug with print for trace logging
Copilot May 9, 2026
4427d09
Reduce trace logging to essential xpu_communicator.py probes only
Copilot May 9, 2026
47e4809
Apply all fix patches to source files and remove patch files
Copilot May 9, 2026
918c2c3
Apply Fix 5 to xpu_communicator.py: batch list-input collectives
Copilot May 9, 2026
4abf98a
Fix Fix 5 patch: use direct return for tensor path, remove unnecessar…
Copilot May 9, 2026
af5faaf
Remove mismatch-checking COUNTER trace probes from xpu_communicator.py
Copilot May 9, 2026
7ae30fe
Add trace probe for Fix 5 same-dtype shape verification
Copilot May 9, 2026
5193eb7
Fix 5 trace: check anomalies only, don't print all data
Copilot May 9, 2026
74ded3f
Remove Fix5 same-dtype trace probes after confirming no anomalies
Copilot May 9, 2026
0aedc6a
Remove Fix 5 from xpu_communicator.py, revert to original sequential …
Copilot May 9, 2026
e3c9947
Remove Fix 4: revert all_gather_into_tensor back to original all_gath…
Copilot May 9, 2026
ad166b5
Remove xpu_communicator.py from PR: no changes remain after Fix 4 rem…
Copilot May 9, 2026
cf87e89
Add decode-phase "!!!!" output analysis to debug.md
Copilot May 9, 2026
bc2016b
Add decode-stage PAD_SLOT_ID diagnostic check for DP padding tokens
Copilot May 9, 2026
b6e3f24
Replace logger.error with print for PAD_SLOT_ID diagnostic trace
Copilot May 9, 2026
761c52e
Add decode-stage DP padding checker: print error if Fix 1 forces unne…
Copilot May 9, 2026
cfca5b6
Remove PAD_SLOT_ID checker patch from gpu_model_runner.py
Copilot May 9, 2026
8cd1920
Add reduce_scatterv padding diagnostic in combine()
Copilot May 9, 2026
6e731d6
Remove decode-stage DP padding checker diagnostic
Copilot May 9, 2026
df65fef
Enhance reduce_scatterv checker with error output for anomaly detection
Copilot May 9, 2026
5b5141b
Simplify redundant condition in reduce_scatterv checker
Copilot May 9, 2026
8c8358b
Force-zero padding positions before reduce_scatterv to prevent expert…
Copilot May 10, 2026
8af6a14
Fix force-zero patch: handle sizes/num_tokens_across_dp_cpu length mi…
Copilot May 10, 2026
ecd2e53
Add NaN/Inf detection before and after reduce_scatterv to locate NaN …
Copilot May 10, 2026
30dbda7
Fix formatting in NaN detection log output
Copilot May 10, 2026
06b02cf
Remove padding check diagnostics (REDUCE_SCATTER_CHECK and force-zero…
Copilot May 10, 2026
de74bd8
Add NaN detection after all_gatherv in dispatch() to locate NaN origi…
Copilot May 10, 2026
da8225d
Add NAN_CHECK_DISPATCH to dispatch_router_logits() for proper NaN det…
Copilot May 10, 2026
ee05974
Move NAN_CHECK_DISPATCH to prepare() in naive_dp_ep.py to ensure it e…
Copilot May 10, 2026
5161e1f
Remove old NaN/Inf detection (before/after reduce_scatterv, after dis…
Copilot May 10, 2026
4d1a206
Change torch.empty to torch.zeros in xpu_communicator.py all_gatherv …
Copilot May 10, 2026
0c9152e
Update debug.md with two-NaN-source analysis and decode-stage investi…
Copilot May 10, 2026
997d799
Change all torch.empty to torch.zeros in xpu_communicator.py to compr…
Copilot May 10, 2026
f7543d1
Add global counter to limit NaN check prints to 200 lines, add back p…
Copilot May 10, 2026
ff51398
Redesign NaN diagnostics: detail for first 50 NaN + periodic summarie…
Copilot May 10, 2026
2e13be2
Fix NaN check to skip non-floating-point tensors (quantized int8 etc)
Copilot May 10, 2026
1cfb966
Wrap NaN diagnostics in try-except to prevent any diagnostic crash; a…
Copilot May 10, 2026
7ac0d34
Revert all changes back to 5161e1f state
Copilot May 11, 2026
54a7f5d
Add NaN detection after attention output in DeepseekV2DecoderLayer
Copilot May 11, 2026
958cd83
Remove NaN checks from naive_dp_ep.py and deepseek_v2.py, add attenti…
Copilot May 11, 2026
66e172d
Move NaN attention check from qwen3_moe.py to qwen3_next.py for Qwen3.5
Copilot May 11, 2026
ae10fa5
fix(attention): zero-init attention output buffer to avoid uninit pad…
Copilot May 11, 2026
26a974b
fix(qwen3_next): zero-init self_attention_output buffer to prevent un…
Copilot May 11, 2026
b505a63
Add fine-grained NaN tracing in Qwen3NextAttention.forward() to pinpo…
Copilot May 11, 2026
77dfcc1
Remove attention output NaN checker and NAN_TRACE, add pre-attention …
Copilot May 11, 2026
a9db001
refactor(nan-check): add pre/post attention NaN trace with first-occu…
Copilot May 11, 2026
934b9a4
fix(nan-trace): use print instead of logging.error for NaN detection,…
Copilot May 11, 2026
9dce74b
Add padding vs actual token row distinction in NAN_CHECK_POST_ATTN
Copilot May 11, 2026
5ad477e
Add [ATTN_MASK_CHECK] to print seq_lens and query_start_loc before at…
Copilot May 11, 2026
f7dffcb
Optimize [ATTN_MASK_CHECK]: limit tensor-to-list to 64 entries, add m…
Copilot May 11, 2026
5ea0eba
Fix [ATTN_MASK_CHECK] and [NAN_CHECK_POST_ATTN] to properly resolve a…
Copilot May 11, 2026
b2077e1
Address code review: simplify layer_name logic and add list bounds check
Copilot May 11, 2026
a2f331e
fix: defer ATTN_MASK_CHECK flag until metadata resolves, add linear_a…
Copilot May 11, 2026
6ae7ccd
Fix [ATTN_MASK_CHECK] to only run for full_attention layers
Copilot May 11, 2026
2dbc808
Skip NaN detection and ATTN_MASK_CHECK during warmup, only run during…
Copilot May 11, 2026
e1ee9d1
Add fine-grained NaN diagnostics inside GatedDeltaNetAttention (GDN l…
Copilot May 11, 2026
e2d993c
Remove unnecessary global statement in _check_gdn_nan
Copilot May 11, 2026
1a42b2c
Remove [ATTN_MASK_CHECK] and [GDN_NAN_CHECK] diagnostics, keep only p…
Copilot May 12, 2026
6be0b13
Add xpu_arc_b60_dp_ep.py example for DP+EP inference on 4x Intel ARC …
Copilot May 12, 2026
7162b59
Clean up unnecessary changes: revert naive_dp_ep.py, gdn_linear_attn.…
Copilot May 12, 2026
3755876
Clean up debug.md: remove outdated info, update with confirmed NaN an…
Copilot May 12, 2026
44170eb
Update debug.md: add TP=4/DP=1 reference case, revise conclusion to f…
Copilot May 12, 2026
bc304ee
Add [ATTN_MASK_CHECK] one-shot diagnostic: print seq_lens and query_s…
Copilot May 12, 2026
681c61a
Update debug.md with ATTN_MASK_CHECK log analysis confirming DP paddi…
Copilot May 12, 2026
f9082e3
Clean up debug.md: remove debug directions, keep only conclusions and…
Copilot May 12, 2026
1ea756b
Fix DP padding NaN root cause: set num_actual_tokens to real token count
Copilot May 12, 2026
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
359 changes: 359 additions & 0 deletions debug.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,359 @@
# XPU EP Hang Diagnosis - Debug Summary

## Problem Statement

vLLM with Expert Parallelism (EP) on XPU hangs during inference when using
Data Parallelism (DP). The hang manifests as a silent deadlock — the process
stops producing output with no error message.

**Config**: Qwen3.5-35B-A3B, TP=2, EP=4 (MoE dispatch/combine over XCCL),
DP=2 with DP padding enabled.

---

## Confirmed Fixes

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@copilot generate the patch with these fix descrition.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generated patch files for all fixes in patches/ directory (commit 6311a97). Each fix has its own .patch file that can be applied with git apply patches/fix*.patch. Fix 5 patch is omitted as it requires more complex refactoring of the list-path logic in xpu_communicator.py.

### Fix 1 — Force DP padding when Expert Parallelism is enabled

**Status**: ✅ CONFIRMED NEEDED and applied. All COUNTER logs show
`all_gatherv/uniform` (uniform = equal-size tensors across ranks), confirming
DP padding is in effect.

**File**: `vllm/v1/worker/dp_utils.py`

**Root cause**: Without DP padding, each DP rank processes a different number
of tokens. XCCL MoE dispatch/combine collectives require equal-size tensors.
Forcing DP padding when EP is active ensures all ranks always have the same
token count.

```diff
- should_dp_pad = synced_cudagraph_mode != 0 or should_ubatch
+ should_dp_pad = (synced_cudagraph_mode != 0 or should_ubatch
+ or parallel_config.enable_expert_parallel)
```

### Fix 2 — `num_actual_tokens` mismatch when DP padding is active

**Status**: ✅ CONFIRMED FIXED by log evidence.

**File**: `vllm/v1/worker/gpu_model_runner.py`

**Log evidence** (before fix — rank 1 mismatch):
```
[TRACE] _gdn_attention_core_xpu_impl: core_attn_out.size(0)=30, num_actual_tokens=26, match=False
```

**After fix** — all ranks show `match=True`.

**Root cause**: DP padding pads `hidden_states` to the max token count across
DP ranks (30), but `num_actual_tokens` in attention metadata remained at the
real count (26). The XPU GDN kernel asserts
`core_attn_out.size(0) == num_actual_tokens` and hangs. The fix sets
`pad_attn=True` whenever DP padding is applied, aligning `num_actual_tokens`,
slot mappings, and attention metadata with the padded count.

```diff
- pad_attn = cudagraph_mode == CUDAGraphMode.FULL
+ dp_padding_applied = num_tokens_padded > num_tokens_unpadded
+ pad_attn = cudagraph_mode == CUDAGraphMode.FULL or dp_padding_applied
```

### Fix 3 — Disable async scheduling when EP + DP is active

**Status**: ✅ APPLIED. This is a **production correctness fix**, not merely a
diagnostic aid.

**File**: `vllm/v1/worker/gpu_model_runner.py`

**Root cause (production)**: With async scheduling enabled and EP+DP active,
`AsyncGPUModelRunnerOutput` returns immediately after queuing the GPU→CPU
copy. If DP ranks advance their schedulers at different speeds, one DP rank
can enter the next iteration's `_run_ar` all-reduce before the other finishes
the current iteration's GPU work, causing a cross-iteration collective
mismatch deadlock.

**Diagnostic benefit**: With async scheduling disabled, GPU-side hangs inside
the MoE forward become visible inside `sample_tokens: bookkeeping` rather than
hiding behind the async copy queue. This confirmed the hang is GPU-side (not a
CPU/scheduler race) and narrowed it to the model forward pass.

```diff
+ if (self.use_async_scheduling
+ and self.parallel_config.enable_expert_parallel
+ and self.parallel_config.data_parallel_size > 1):
+ self.use_async_scheduling = False
```

### Fix 4 — Correct `all_gatherv` uniform path

**Status**: ✅ CONFIRMED NEEDED. The original code passed a 1-element list to
`dist.all_gather`, which requires `world_size` tensors. All ranks deadlocked
waiting for the missing output slots.

**File**: `vllm/distributed/device_communicators/xpu_communicator.py`

```diff
- dist.all_gather([output_tensor], input_, group=self.device_group)
+ dist.all_gather_into_tensor(output_tensor, input_, group=self.device_group)
```
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@copilot Is this change risky for current issue? double check it.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change is low risk and safe. Added detailed risk analysis in 6311a97.

The original dist.all_gather([output_tensor], input_) is an API misuse — it passes 1 output tensor but dist.all_gather expects world_size tensors. When world_size > 1, this deadlocks.

dist.all_gather_into_tensor(output_tensor, input_) is the correct API for gathering into a single contiguous buffer. The same API is already used in gather() on line 178 of the same file, and in base_device_communicator.py line 198. The output tensor allocation at line 131 (input_size[0] * world_size) matches what all_gather_into_tensor expects.

This only affects the uniform path (when sizes is None, i.e., DP padding active). The variable-size path at line 147 is unchanged.


**Risk assessment**: This change is **low risk and safe** for the current issue.

- `dist.all_gather(tensor_list, input_)` expects `tensor_list` to be a list of
`world_size` pre-allocated tensors. The original code passes `[output_tensor]`
(1 element). When `world_size > 1`, this is an API misuse that causes a
deadlock — ranks wait forever for output slots that don't exist.
- `dist.all_gather_into_tensor(output_tensor, input_)` is the correct API for
gathering into a single contiguous tensor. It expects `output_tensor` to have
`world_size * input_size[0]` rows, which matches how `output_tensor` is
allocated at line 131: `output_size = (input_size[0] * world_size,) + input_size[1:]`.
- The same API (`all_gather_into_tensor`) is already used in the `gather()`
method of the same file (line 178) and in `base_device_communicator.py`
(line 198), confirming this is the standard pattern in vLLM.
- This fix only affects the **uniform path** (all ranks have equal tensor sizes,
i.e., `sizes is None`), which is the path used when DP padding is active
(Fix 1). The variable-size path (line 147) remains unchanged.

### Fix 5 — Eliminate sequential all_gatherv calls in list path

**Status**: ✅ APPLIED. This is a **production correctness fix**, not merely a
diagnostic change. Collapses N sequential `dist.all_gather_into_tensor` calls
(one per tensor) into a single call via int8 byte-view concatenation. This
eliminates call-order mismatch deadlocks when faster ranks submit collective #2
before slower ranks finish collective #1. Without this fix, any rank timing
skew within a MoE layer forward can cause a collective-type mismatch deadlock
on the list-path (non-uniform) all_gatherv.

**File**: `vllm/distributed/device_communicators/xpu_communicator.py`

### Fix 6 — Add `dist.barrier` before each collective in `all2all.py`

**Status**: ✅ APPLIED. Adds an XCCL barrier before each `all_gatherv` and
`reduce_scatterv` call in `AgRsAll2AllManager` to force all EP ranks to
rendezvous before submitting the collective. This eliminates the round 2
deadlock caused by rank 2 being slower than ranks 0,1,3 at the GPU-side
routing computation (softmax/topk) between rounds 1→2.

**File**: `vllm/distributed/device_communicators/all2all.py`

```diff
+ dist.barrier(group=dist_group.device_group)
gathered_tensors = dist_group.all_gatherv( # dispatch_router_logits
+ dist.barrier(group=dist_group.device_group)
gathered_tensors = dist_group.all_gatherv( # dispatch
+ dist.barrier(group=dist_group.device_group)
hidden_states = dist_group.reduce_scatterv( # combine
```

**Why `dist_group.device_group`**: `GroupCoordinator.barrier()` uses a CPU-level
group only. `dist.barrier(group=dist_group.device_group)` issues an XCCL
barrier that drains any in-flight GPU kernels (routing softmax/topk) before
the collective is submitted, ensuring all ranks reach the collective
call-site together.

---

## Patch Files

All fix patches are available in the `patches/` directory:

| Patch | Description |
|-------|-------------|
| `patches/fix1_dp_padding_for_ep.patch` | Force DP padding when EP is enabled |
| `patches/fix2_pad_attn_for_dp_padding.patch` | Align `num_actual_tokens` with padded count |
| `patches/fix3_disable_async_sched_ep_dp.patch` | Disable async scheduling for EP+DP |
| `patches/fix4_all_gatherv_uniform_path.patch` | Use `all_gather_into_tensor` for uniform path |
| `patches/fix6_barrier_before_collectives.patch` | Add XCCL barrier before MoE collectives |

Apply all patches:
```bash
git apply patches/fix*.patch
```

---

## Current Status (after all 6 fixes)

### Hang resolved — inference now completes

After applying all 6 fixes, the silent deadlock is eliminated. All 4 ranks
complete all MoE layers and the inference loop finishes. The `dist.barrier`
calls in Fix 6 prevent the rank-skew collective ordering deadlock that was the
last hang symptom.

### New symptom — incorrect output ("!!!!")

With all 6 fixes applied, inference completes but generates wrong output: every
prompt produces a long sequence of `"!"` characters regardless of input.

Example output:
```
[ARC B60] DP rank 0, Prompt: 'Hello, my name is'
Generated: '!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!'
[ARC B60] DP rank 0, Prompt: 'The capital of France is'
Generated: '!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!'
```

All prompts, all DP ranks, all iterations produce the same degenerate output.

---

## Wrong Output Analysis

### Why Fix 2 does NOT directly cause "!!!!" output

Fix 2 sets `pad_attn=True` when DP padding increases the token count, aligning
`num_actual_tokens` with the padded tensor row count (e.g., 30 instead of 26).
This causes the GDN attention kernel to process all 30 rows — including the 26
padding positions whose query vectors contain uninitialized (garbage) data.

However, Fix 2 **cannot** be the primary cause of "!!!!" through attention
corruption because of `logits_indices`:

```python
# gpu_model_runner.py — sampling step
sample_hidden_states = hidden_states[logits_indices]
```

`logits_indices` contains only the real token positions (e.g., `[0, 1, 2, 3]`
for 4 decode requests). Even if GDN writes garbage attention outputs to
`hidden_states[4:30]` for the padding rows, the final logit computation uses
only `hidden_states[0:3]` — the correct positions. Garbage at positions 4–29
is never read by the sampler.

Similarly, inside the MoE layer, each token's expert output is computed
independently (no cross-token interactions within a single expert forward).
Garbage routing for positions 4–29 does not overwrite positions 0–3.

**Fix 2 is necessary and correct.** The `num_actual_tokens` alignment is
required to prevent the GDN kernel size-check assertion failure that caused
the original hang.

---

### Fix 5 int8 byte-view — COMPLETELY RULED OUT

All XPU type punning round-trip tests pass:

```
# float16 → int8 → float16: PASSES
# float32 → int8 → float32: PASSES
# int32 → int8 → int32: PASSES
```

Fix 5 (in `xpu_communicator.py`) reduces N sequential XCCL collectives to ONE
by converting all tensors to int8 byte-view, concatenating, gathering once,
then splitting back. The round-trip tests confirm it correctly preserves bytes
for all dtypes used in MoE collectives (`hidden_states` float16,
`topk_weights` float16/float32, `topk_ids` int32). Fix 5 is **not** the
source of the "!!!!" output.

---

### New hypothesis: `sizes` mismatch between dp_metadata and padded tensor

The `dispatch` and `combine` functions in `AgRsAll2AllManager` both call
`dp_metadata.get_chunk_sizes_across_dp_rank()` to get `sizes`. Under MoE
sequence parallelism (SP), `sizes` is computed via:

```python
# forward_context.py — DPMetadata.sp_local_sizes(sp_size)
sp_tokens = (num_tokens_across_dp_cpu + sp_size - 1) // sp_size
sp_tokens = sp_tokens.repeat_interleave(sp_size)
```

With TP=2 (used as SP=2 for MoE) and `num_tokens_across_dp_cpu = [26, 30]`
(unpadded, if Fix 1 has not propagated to dp_metadata):

```
sizes = [ceil(26/2), ceil(26/2), ceil(30/2), ceil(30/2)] = [13, 13, 15, 15]
```

After Fix 2 (DP padding), hidden_states on dp_rank 0 has 30 rows. After SP
split (÷2), `hidden_states.shape[0] = 15`. But `sizes[ep_rank=0] = 13`.
The `dispatch` assertion `sizes[ep_rank] == hidden_states.shape[0]` compares
`13 != 15` → **AssertionError**, or if the assertion is absent, the
variable-size path is used with wrong slice boundaries → data corruption.

With `num_tokens_across_dp_cpu = [30, 30]` (padded, Fix 1 fully effective):

```
sizes = [15, 15, 15, 15] (uniform → sizes=None in all_gatherv)
```

**Key question**: Does Fix 1 correctly update `dp_metadata.num_tokens_across_dp_cpu`
to the padded values before the MoE forward? The [TRACE] logs already emitted
by the code will answer this directly.

---

### Recommended next steps

1. **Read the [TRACE] logs** — they are already emitted by the current code:

```
[TRACE] rank=N dispatch ENTER all_gatherv: sizes=[...], tensor_shapes=[...]
[TRACE] rank=N combine ENTER reduce_scatterv: sizes=[...], hidden_states_shape=[...]
```

- If `sizes` is uniform (e.g., `[30, 30]` for DP=2, SP=1), the collectives
use `all_gather_into_tensor` and `reduce_scatter_tensor` (uniform path) ✓
- If `sizes` is non-uniform (e.g., `[26, 30]`), an assertion will fire OR
the variable-size path is taken with mismatched tensor shapes → data corruption

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@copilot 在这个pr里生成一个commit来实现这个trace log

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implemented the trace logging infrastructure in commit a365c42. Added [TRACE] and [COUNTER] debug logs to all 5 files listed in the Tracing Infrastructure section:

  • vllm/_xpu_ops.py — ENTER/EXIT + core_attn_out.size(0) vs num_actual_tokens match check
  • vllm/v1/worker/gpu_model_runner.pyexecute_model and sample_tokens with dp=
  • vllm/v1/worker/dp_utils.py — ENTER/EXIT around dist.all_reduce in _run_ar
  • vllm/distributed/device_communicators/xpu_communicator.py — COUNTER probes with seq=N and uniform/variable-size path
  • vllm/distributed/device_communicators/all2all.py — ENTER/EXIT around dispatch_router_logits, dispatch, combine with sizes and tensor shapes

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@copilot use the print to replace logger.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replaced all logger.debug with print(..., flush=True) for all [TRACE] and [COUNTER] trace logs in 42a4710.

2. **Check for SP (sequence parallelism)**: If TP is used as SP for MoE
(sp_size > 1), `sizes` will have `dp_size * sp_size` entries (e.g., 4 for
TP=2, DP=2). Verify that `sizes[ep_rank] == hidden_states.shape[0]` holds.

3. **If sizes are correct (uniform/matching)**: The "!!!!" must originate from
within the model forward itself. Candidates:
- Padding tokens (rows 26–29) with garbage query vectors produce large
attention weights that corrupt real-token KV cache entries via attention
(GDN attention output at positions 0–25 may be affected if the padded
queries have extreme values)
- Shared experts receiving padded input: if Qwen3-MoE shared experts run
on the full padded tensor [30, d], their output for positions 26–29 is
garbage. If those positions' shared-expert output is added to the sparse
expert output via reduce_scatter, the sum may incorrectly mix garbage
with real-token results
- Zero out the padding positions before the router to test:
```python
# In gpu_model_runner.py, after DP padding is applied:
if dp_padding_applied:
hidden_states[num_tokens_unpadded:] = 0
```
If "!!!!" disappears, padding garbage values are corrupting the MoE router.

---

## Tracing Infrastructure

### Files modified

| File | Changes |
|------|---------|
| `vllm/_xpu_ops.py` | ENTER/EXIT around `gdn_attention` kernel; match check for `core_attn_out.size(0)` vs `num_actual_tokens` |
| `vllm/v1/worker/gpu_model_runner.py` | `execute_model` and `sample_tokens` traces with `dp=` and `iter=`; **Fix 2**; **Fix 3** |
| `vllm/v1/worker/dp_utils.py` | **Fix 1**; `_run_ar` deadlock risk checker (iter count mismatch warning); ENTER/EXIT around `dist.all_reduce` |
| `vllm/distributed/device_communicators/xpu_communicator.py` | **Fix 4**; **Fix 5**; COUNTER probes around `reduce_scatterv` and `all_gatherv` with seq number |
| `vllm/distributed/device_communicators/all2all.py` | **Fix 6**; ENTER/EXIT around MoE `dispatch_router_logits`, `dispatch`, and `combine` |

### How to read COUNTER logs

```
[COUNTER] rank=X seq=N all_gatherv/uniform counter=1 ← before collective
[COUNTER] rank=X seq=N all_gatherv/uniform counter=0 ← after collective (success)
```

- `counter=1` with no following `0` identifies the hanging collective.
- `seq=N` is a global call sequence number; compare across ranks to detect ordering mismatches.
- `uniform` = all ranks have the same tensor size (DP padding active); `variable-size` = sizes differ.

### DP communicator structure (TP=2, DP=2)

With TP=2, DP=2, vLLM creates two independent DP communicator groups:
- **Group A**: `{RANK=0 (dp=0,tp=0), RANK=2 (dp=1,tp=0)}` — tp=0 processes
- **Group B**: `{RANK=1 (dp=0,tp=1), RANK=3 (dp=1,tp=1)}` — tp=1 processes

Each group runs an independent `dist.all_reduce` per iteration in `_run_ar`.
Seeing two ENTER/EXIT pairs per dp_rank per iteration is normal.
11 changes: 11 additions & 0 deletions patches/fix1_dp_padding_for_ep.patch
Original file line number Diff line number Diff line change
@@ -0,0 +1,11 @@
--- a/vllm/v1/worker/dp_utils.py
+++ b/vllm/v1/worker/dp_utils.py
@@ -147,7 +147,8 @@
# Use the synced runtime cudagraph mode rather than the compilation config
# so we can avoid padding when cudagraph is not enabled for this step.
- should_dp_pad = synced_cudagraph_mode != 0 or should_ubatch
+ should_dp_pad = (synced_cudagraph_mode != 0 or should_ubatch
+ or parallel_config.enable_expert_parallel)

# Pad all DP ranks up to the maximum token count across ranks if
# should_dp_pad is True
12 changes: 12 additions & 0 deletions patches/fix2_pad_attn_for_dp_padding.patch
Original file line number Diff line number Diff line change
@@ -0,0 +1,12 @@
--- a/vllm/v1/worker/gpu_model_runner.py
+++ b/vllm/v1/worker/gpu_model_runner.py
@@ -3978,7 +3978,9 @@
if not isinstance(spec.kv_cache_spec, EncoderOnlyAttentionSpec)
)
- pad_attn = cudagraph_mode == CUDAGraphMode.FULL
+ dp_padding_applied = num_tokens_padded > num_tokens_unpadded
+ pad_attn = (cudagraph_mode == CUDAGraphMode.FULL
+ or dp_padding_applied)

if self.cache_config.mamba_cache_mode == "align":
# preprocess_mamba reads req_state.num_computed_tokens (CPU)
13 changes: 13 additions & 0 deletions patches/fix3_disable_async_sched_ep_dp.patch
Original file line number Diff line number Diff line change
@@ -0,0 +1,13 @@
--- a/vllm/v1/worker/gpu_model_runner.py
+++ b/vllm/v1/worker/gpu_model_runner.py
@@ -480,6 +480,10 @@
# Async scheduling
self.use_async_scheduling = self.scheduler_config.async_scheduling

+ if (self.use_async_scheduling
+ and self.parallel_config.enable_expert_parallel
+ and self.parallel_config.data_parallel_size > 1):
+ self.use_async_scheduling = False
+
# Sampler
self.sampler = Sampler(logprobs_mode=self.model_config.logprobs_mode)
Loading