-
Notifications
You must be signed in to change notification settings - Fork 0
Add debug.md, apply EP hang fixes (1-3, 6), add pre/post-attention NaN trace with first-occurrence print-based ERROR output and padding/actual row distinction, fix attention output buffer init in both shared and model-specific layers, add DP+EP example... #21
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from 4 commits
31a85e5
7952367
a5e1f42
6311a97
a365c42
1c85e35
42a4710
4427d09
47e4809
918c2c3
4abf98a
af5faaf
7ae30fe
5193eb7
74ded3f
0aedc6a
e3c9947
ad166b5
cf87e89
bc2016b
b6e3f24
761c52e
cfca5b6
8cd1920
6e731d6
df65fef
5b5141b
8c8358b
8af6a14
ecd2e53
30dbda7
06b02cf
de74bd8
da8225d
ee05974
5161e1f
4d1a206
0c9152e
997d799
f7543d1
ff51398
2e13be2
1cfb966
7ac0d34
54a7f5d
958cd83
66e172d
ae10fa5
26a974b
b505a63
77dfcc1
a9db001
934b9a4
9dce74b
5ad477e
f7dffcb
5ea0eba
b2077e1
a2f331e
6ae7ccd
2dbc808
e1ee9d1
e2d993c
1a42b2c
6be0b13
7162b59
3755876
44170eb
bc304ee
681c61a
f9082e3
1ea756b
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,359 @@ | ||
| # XPU EP Hang Diagnosis - Debug Summary | ||
|
|
||
| ## Problem Statement | ||
|
|
||
| vLLM with Expert Parallelism (EP) on XPU hangs during inference when using | ||
| Data Parallelism (DP). The hang manifests as a silent deadlock — the process | ||
| stops producing output with no error message. | ||
|
|
||
| **Config**: Qwen3.5-35B-A3B, TP=2, EP=4 (MoE dispatch/combine over XCCL), | ||
| DP=2 with DP padding enabled. | ||
|
|
||
| --- | ||
|
|
||
| ## Confirmed Fixes | ||
|
|
||
| ### Fix 1 — Force DP padding when Expert Parallelism is enabled | ||
|
|
||
| **Status**: ✅ CONFIRMED NEEDED and applied. All COUNTER logs show | ||
| `all_gatherv/uniform` (uniform = equal-size tensors across ranks), confirming | ||
| DP padding is in effect. | ||
|
|
||
| **File**: `vllm/v1/worker/dp_utils.py` | ||
|
|
||
| **Root cause**: Without DP padding, each DP rank processes a different number | ||
| of tokens. XCCL MoE dispatch/combine collectives require equal-size tensors. | ||
| Forcing DP padding when EP is active ensures all ranks always have the same | ||
| token count. | ||
|
|
||
| ```diff | ||
| - should_dp_pad = synced_cudagraph_mode != 0 or should_ubatch | ||
| + should_dp_pad = (synced_cudagraph_mode != 0 or should_ubatch | ||
| + or parallel_config.enable_expert_parallel) | ||
| ``` | ||
|
|
||
| ### Fix 2 — `num_actual_tokens` mismatch when DP padding is active | ||
|
|
||
| **Status**: ✅ CONFIRMED FIXED by log evidence. | ||
|
|
||
| **File**: `vllm/v1/worker/gpu_model_runner.py` | ||
|
|
||
| **Log evidence** (before fix — rank 1 mismatch): | ||
| ``` | ||
| [TRACE] _gdn_attention_core_xpu_impl: core_attn_out.size(0)=30, num_actual_tokens=26, match=False | ||
| ``` | ||
|
|
||
| **After fix** — all ranks show `match=True`. | ||
|
|
||
| **Root cause**: DP padding pads `hidden_states` to the max token count across | ||
| DP ranks (30), but `num_actual_tokens` in attention metadata remained at the | ||
| real count (26). The XPU GDN kernel asserts | ||
| `core_attn_out.size(0) == num_actual_tokens` and hangs. The fix sets | ||
| `pad_attn=True` whenever DP padding is applied, aligning `num_actual_tokens`, | ||
| slot mappings, and attention metadata with the padded count. | ||
|
|
||
| ```diff | ||
| - pad_attn = cudagraph_mode == CUDAGraphMode.FULL | ||
| + dp_padding_applied = num_tokens_padded > num_tokens_unpadded | ||
| + pad_attn = cudagraph_mode == CUDAGraphMode.FULL or dp_padding_applied | ||
| ``` | ||
|
|
||
| ### Fix 3 — Disable async scheduling when EP + DP is active | ||
|
|
||
| **Status**: ✅ APPLIED. This is a **production correctness fix**, not merely a | ||
| diagnostic aid. | ||
|
|
||
| **File**: `vllm/v1/worker/gpu_model_runner.py` | ||
|
|
||
| **Root cause (production)**: With async scheduling enabled and EP+DP active, | ||
| `AsyncGPUModelRunnerOutput` returns immediately after queuing the GPU→CPU | ||
| copy. If DP ranks advance their schedulers at different speeds, one DP rank | ||
| can enter the next iteration's `_run_ar` all-reduce before the other finishes | ||
| the current iteration's GPU work, causing a cross-iteration collective | ||
| mismatch deadlock. | ||
|
|
||
| **Diagnostic benefit**: With async scheduling disabled, GPU-side hangs inside | ||
| the MoE forward become visible inside `sample_tokens: bookkeeping` rather than | ||
| hiding behind the async copy queue. This confirmed the hang is GPU-side (not a | ||
| CPU/scheduler race) and narrowed it to the model forward pass. | ||
|
|
||
| ```diff | ||
| + if (self.use_async_scheduling | ||
| + and self.parallel_config.enable_expert_parallel | ||
| + and self.parallel_config.data_parallel_size > 1): | ||
| + self.use_async_scheduling = False | ||
| ``` | ||
|
|
||
| ### Fix 4 — Correct `all_gatherv` uniform path | ||
|
|
||
| **Status**: ✅ CONFIRMED NEEDED. The original code passed a 1-element list to | ||
| `dist.all_gather`, which requires `world_size` tensors. All ranks deadlocked | ||
| waiting for the missing output slots. | ||
|
|
||
| **File**: `vllm/distributed/device_communicators/xpu_communicator.py` | ||
|
|
||
| ```diff | ||
| - dist.all_gather([output_tensor], input_, group=self.device_group) | ||
| + dist.all_gather_into_tensor(output_tensor, input_, group=self.device_group) | ||
| ``` | ||
|
Owner
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @copilot Is this change risky for current issue? double check it.
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This change is low risk and safe. Added detailed risk analysis in The original
This only affects the uniform path (when |
||
|
|
||
| **Risk assessment**: This change is **low risk and safe** for the current issue. | ||
|
|
||
| - `dist.all_gather(tensor_list, input_)` expects `tensor_list` to be a list of | ||
| `world_size` pre-allocated tensors. The original code passes `[output_tensor]` | ||
| (1 element). When `world_size > 1`, this is an API misuse that causes a | ||
| deadlock — ranks wait forever for output slots that don't exist. | ||
| - `dist.all_gather_into_tensor(output_tensor, input_)` is the correct API for | ||
| gathering into a single contiguous tensor. It expects `output_tensor` to have | ||
| `world_size * input_size[0]` rows, which matches how `output_tensor` is | ||
| allocated at line 131: `output_size = (input_size[0] * world_size,) + input_size[1:]`. | ||
| - The same API (`all_gather_into_tensor`) is already used in the `gather()` | ||
| method of the same file (line 178) and in `base_device_communicator.py` | ||
| (line 198), confirming this is the standard pattern in vLLM. | ||
| - This fix only affects the **uniform path** (all ranks have equal tensor sizes, | ||
| i.e., `sizes is None`), which is the path used when DP padding is active | ||
| (Fix 1). The variable-size path (line 147) remains unchanged. | ||
|
|
||
| ### Fix 5 — Eliminate sequential all_gatherv calls in list path | ||
|
|
||
| **Status**: ✅ APPLIED. This is a **production correctness fix**, not merely a | ||
| diagnostic change. Collapses N sequential `dist.all_gather_into_tensor` calls | ||
| (one per tensor) into a single call via int8 byte-view concatenation. This | ||
| eliminates call-order mismatch deadlocks when faster ranks submit collective #2 | ||
| before slower ranks finish collective #1. Without this fix, any rank timing | ||
| skew within a MoE layer forward can cause a collective-type mismatch deadlock | ||
| on the list-path (non-uniform) all_gatherv. | ||
|
|
||
| **File**: `vllm/distributed/device_communicators/xpu_communicator.py` | ||
|
|
||
| ### Fix 6 — Add `dist.barrier` before each collective in `all2all.py` | ||
|
|
||
| **Status**: ✅ APPLIED. Adds an XCCL barrier before each `all_gatherv` and | ||
| `reduce_scatterv` call in `AgRsAll2AllManager` to force all EP ranks to | ||
| rendezvous before submitting the collective. This eliminates the round 2 | ||
| deadlock caused by rank 2 being slower than ranks 0,1,3 at the GPU-side | ||
| routing computation (softmax/topk) between rounds 1→2. | ||
|
|
||
| **File**: `vllm/distributed/device_communicators/all2all.py` | ||
|
|
||
| ```diff | ||
| + dist.barrier(group=dist_group.device_group) | ||
| gathered_tensors = dist_group.all_gatherv( # dispatch_router_logits | ||
| + dist.barrier(group=dist_group.device_group) | ||
| gathered_tensors = dist_group.all_gatherv( # dispatch | ||
| + dist.barrier(group=dist_group.device_group) | ||
| hidden_states = dist_group.reduce_scatterv( # combine | ||
| ``` | ||
|
|
||
| **Why `dist_group.device_group`**: `GroupCoordinator.barrier()` uses a CPU-level | ||
| group only. `dist.barrier(group=dist_group.device_group)` issues an XCCL | ||
| barrier that drains any in-flight GPU kernels (routing softmax/topk) before | ||
| the collective is submitted, ensuring all ranks reach the collective | ||
| call-site together. | ||
|
|
||
| --- | ||
|
|
||
| ## Patch Files | ||
|
|
||
| All fix patches are available in the `patches/` directory: | ||
|
|
||
| | Patch | Description | | ||
| |-------|-------------| | ||
| | `patches/fix1_dp_padding_for_ep.patch` | Force DP padding when EP is enabled | | ||
| | `patches/fix2_pad_attn_for_dp_padding.patch` | Align `num_actual_tokens` with padded count | | ||
| | `patches/fix3_disable_async_sched_ep_dp.patch` | Disable async scheduling for EP+DP | | ||
| | `patches/fix4_all_gatherv_uniform_path.patch` | Use `all_gather_into_tensor` for uniform path | | ||
| | `patches/fix6_barrier_before_collectives.patch` | Add XCCL barrier before MoE collectives | | ||
|
|
||
| Apply all patches: | ||
| ```bash | ||
| git apply patches/fix*.patch | ||
| ``` | ||
|
|
||
| --- | ||
|
|
||
| ## Current Status (after all 6 fixes) | ||
|
|
||
| ### Hang resolved — inference now completes | ||
|
|
||
| After applying all 6 fixes, the silent deadlock is eliminated. All 4 ranks | ||
| complete all MoE layers and the inference loop finishes. The `dist.barrier` | ||
| calls in Fix 6 prevent the rank-skew collective ordering deadlock that was the | ||
| last hang symptom. | ||
|
|
||
| ### New symptom — incorrect output ("!!!!") | ||
|
|
||
| With all 6 fixes applied, inference completes but generates wrong output: every | ||
| prompt produces a long sequence of `"!"` characters regardless of input. | ||
|
|
||
| Example output: | ||
| ``` | ||
| [ARC B60] DP rank 0, Prompt: 'Hello, my name is' | ||
| Generated: '!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!' | ||
| [ARC B60] DP rank 0, Prompt: 'The capital of France is' | ||
| Generated: '!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!' | ||
| ``` | ||
|
|
||
| All prompts, all DP ranks, all iterations produce the same degenerate output. | ||
|
|
||
| --- | ||
|
|
||
| ## Wrong Output Analysis | ||
|
|
||
| ### Why Fix 2 does NOT directly cause "!!!!" output | ||
|
|
||
| Fix 2 sets `pad_attn=True` when DP padding increases the token count, aligning | ||
| `num_actual_tokens` with the padded tensor row count (e.g., 30 instead of 26). | ||
| This causes the GDN attention kernel to process all 30 rows — including the 26 | ||
| padding positions whose query vectors contain uninitialized (garbage) data. | ||
|
|
||
| However, Fix 2 **cannot** be the primary cause of "!!!!" through attention | ||
| corruption because of `logits_indices`: | ||
|
|
||
| ```python | ||
| # gpu_model_runner.py — sampling step | ||
| sample_hidden_states = hidden_states[logits_indices] | ||
| ``` | ||
|
|
||
| `logits_indices` contains only the real token positions (e.g., `[0, 1, 2, 3]` | ||
| for 4 decode requests). Even if GDN writes garbage attention outputs to | ||
| `hidden_states[4:30]` for the padding rows, the final logit computation uses | ||
| only `hidden_states[0:3]` — the correct positions. Garbage at positions 4–29 | ||
| is never read by the sampler. | ||
|
|
||
| Similarly, inside the MoE layer, each token's expert output is computed | ||
| independently (no cross-token interactions within a single expert forward). | ||
| Garbage routing for positions 4–29 does not overwrite positions 0–3. | ||
|
|
||
| **Fix 2 is necessary and correct.** The `num_actual_tokens` alignment is | ||
| required to prevent the GDN kernel size-check assertion failure that caused | ||
| the original hang. | ||
|
|
||
| --- | ||
|
|
||
| ### Fix 5 int8 byte-view — COMPLETELY RULED OUT | ||
|
|
||
| All XPU type punning round-trip tests pass: | ||
|
|
||
| ``` | ||
| # float16 → int8 → float16: PASSES | ||
| # float32 → int8 → float32: PASSES | ||
| # int32 → int8 → int32: PASSES | ||
| ``` | ||
|
|
||
| Fix 5 (in `xpu_communicator.py`) reduces N sequential XCCL collectives to ONE | ||
| by converting all tensors to int8 byte-view, concatenating, gathering once, | ||
| then splitting back. The round-trip tests confirm it correctly preserves bytes | ||
| for all dtypes used in MoE collectives (`hidden_states` float16, | ||
| `topk_weights` float16/float32, `topk_ids` int32). Fix 5 is **not** the | ||
| source of the "!!!!" output. | ||
|
|
||
| --- | ||
|
|
||
| ### New hypothesis: `sizes` mismatch between dp_metadata and padded tensor | ||
|
|
||
| The `dispatch` and `combine` functions in `AgRsAll2AllManager` both call | ||
| `dp_metadata.get_chunk_sizes_across_dp_rank()` to get `sizes`. Under MoE | ||
| sequence parallelism (SP), `sizes` is computed via: | ||
|
|
||
| ```python | ||
| # forward_context.py — DPMetadata.sp_local_sizes(sp_size) | ||
| sp_tokens = (num_tokens_across_dp_cpu + sp_size - 1) // sp_size | ||
| sp_tokens = sp_tokens.repeat_interleave(sp_size) | ||
| ``` | ||
|
|
||
| With TP=2 (used as SP=2 for MoE) and `num_tokens_across_dp_cpu = [26, 30]` | ||
| (unpadded, if Fix 1 has not propagated to dp_metadata): | ||
|
|
||
| ``` | ||
| sizes = [ceil(26/2), ceil(26/2), ceil(30/2), ceil(30/2)] = [13, 13, 15, 15] | ||
| ``` | ||
|
|
||
| After Fix 2 (DP padding), hidden_states on dp_rank 0 has 30 rows. After SP | ||
| split (÷2), `hidden_states.shape[0] = 15`. But `sizes[ep_rank=0] = 13`. | ||
| The `dispatch` assertion `sizes[ep_rank] == hidden_states.shape[0]` compares | ||
| `13 != 15` → **AssertionError**, or if the assertion is absent, the | ||
| variable-size path is used with wrong slice boundaries → data corruption. | ||
|
|
||
| With `num_tokens_across_dp_cpu = [30, 30]` (padded, Fix 1 fully effective): | ||
|
|
||
| ``` | ||
| sizes = [15, 15, 15, 15] (uniform → sizes=None in all_gatherv) | ||
| ``` | ||
|
|
||
| **Key question**: Does Fix 1 correctly update `dp_metadata.num_tokens_across_dp_cpu` | ||
| to the padded values before the MoE forward? The [TRACE] logs already emitted | ||
| by the code will answer this directly. | ||
|
|
||
| --- | ||
|
|
||
| ### Recommended next steps | ||
|
|
||
| 1. **Read the [TRACE] logs** — they are already emitted by the current code: | ||
|
|
||
| ``` | ||
| [TRACE] rank=N dispatch ENTER all_gatherv: sizes=[...], tensor_shapes=[...] | ||
| [TRACE] rank=N combine ENTER reduce_scatterv: sizes=[...], hidden_states_shape=[...] | ||
| ``` | ||
|
|
||
| - If `sizes` is uniform (e.g., `[30, 30]` for DP=2, SP=1), the collectives | ||
| use `all_gather_into_tensor` and `reduce_scatter_tensor` (uniform path) ✓ | ||
| - If `sizes` is non-uniform (e.g., `[26, 30]`), an assertion will fire OR | ||
| the variable-size path is taken with mismatched tensor shapes → data corruption | ||
|
|
||
|
Owner
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @copilot 在这个pr里生成一个commit来实现这个trace log
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Implemented the trace logging infrastructure in commit
Owner
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @copilot use the print to replace logger.
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Replaced all |
||
| 2. **Check for SP (sequence parallelism)**: If TP is used as SP for MoE | ||
| (sp_size > 1), `sizes` will have `dp_size * sp_size` entries (e.g., 4 for | ||
| TP=2, DP=2). Verify that `sizes[ep_rank] == hidden_states.shape[0]` holds. | ||
|
|
||
| 3. **If sizes are correct (uniform/matching)**: The "!!!!" must originate from | ||
| within the model forward itself. Candidates: | ||
| - Padding tokens (rows 26–29) with garbage query vectors produce large | ||
| attention weights that corrupt real-token KV cache entries via attention | ||
| (GDN attention output at positions 0–25 may be affected if the padded | ||
| queries have extreme values) | ||
| - Shared experts receiving padded input: if Qwen3-MoE shared experts run | ||
| on the full padded tensor [30, d], their output for positions 26–29 is | ||
| garbage. If those positions' shared-expert output is added to the sparse | ||
| expert output via reduce_scatter, the sum may incorrectly mix garbage | ||
| with real-token results | ||
| - Zero out the padding positions before the router to test: | ||
| ```python | ||
| # In gpu_model_runner.py, after DP padding is applied: | ||
| if dp_padding_applied: | ||
| hidden_states[num_tokens_unpadded:] = 0 | ||
| ``` | ||
| If "!!!!" disappears, padding garbage values are corrupting the MoE router. | ||
|
|
||
| --- | ||
|
|
||
| ## Tracing Infrastructure | ||
|
|
||
| ### Files modified | ||
|
|
||
| | File | Changes | | ||
| |------|---------| | ||
| | `vllm/_xpu_ops.py` | ENTER/EXIT around `gdn_attention` kernel; match check for `core_attn_out.size(0)` vs `num_actual_tokens` | | ||
| | `vllm/v1/worker/gpu_model_runner.py` | `execute_model` and `sample_tokens` traces with `dp=` and `iter=`; **Fix 2**; **Fix 3** | | ||
| | `vllm/v1/worker/dp_utils.py` | **Fix 1**; `_run_ar` deadlock risk checker (iter count mismatch warning); ENTER/EXIT around `dist.all_reduce` | | ||
| | `vllm/distributed/device_communicators/xpu_communicator.py` | **Fix 4**; **Fix 5**; COUNTER probes around `reduce_scatterv` and `all_gatherv` with seq number | | ||
| | `vllm/distributed/device_communicators/all2all.py` | **Fix 6**; ENTER/EXIT around MoE `dispatch_router_logits`, `dispatch`, and `combine` | | ||
|
|
||
| ### How to read COUNTER logs | ||
|
|
||
| ``` | ||
| [COUNTER] rank=X seq=N all_gatherv/uniform counter=1 ← before collective | ||
| [COUNTER] rank=X seq=N all_gatherv/uniform counter=0 ← after collective (success) | ||
| ``` | ||
|
|
||
| - `counter=1` with no following `0` identifies the hanging collective. | ||
| - `seq=N` is a global call sequence number; compare across ranks to detect ordering mismatches. | ||
| - `uniform` = all ranks have the same tensor size (DP padding active); `variable-size` = sizes differ. | ||
|
|
||
| ### DP communicator structure (TP=2, DP=2) | ||
|
|
||
| With TP=2, DP=2, vLLM creates two independent DP communicator groups: | ||
| - **Group A**: `{RANK=0 (dp=0,tp=0), RANK=2 (dp=1,tp=0)}` — tp=0 processes | ||
| - **Group B**: `{RANK=1 (dp=0,tp=1), RANK=3 (dp=1,tp=1)}` — tp=1 processes | ||
|
|
||
| Each group runs an independent `dist.all_reduce` per iteration in `_run_ar`. | ||
| Seeing two ENTER/EXIT pairs per dp_rank per iteration is normal. | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,11 @@ | ||
| --- a/vllm/v1/worker/dp_utils.py | ||
| +++ b/vllm/v1/worker/dp_utils.py | ||
| @@ -147,7 +147,8 @@ | ||
| # Use the synced runtime cudagraph mode rather than the compilation config | ||
| # so we can avoid padding when cudagraph is not enabled for this step. | ||
| - should_dp_pad = synced_cudagraph_mode != 0 or should_ubatch | ||
| + should_dp_pad = (synced_cudagraph_mode != 0 or should_ubatch | ||
| + or parallel_config.enable_expert_parallel) | ||
|
|
||
| # Pad all DP ranks up to the maximum token count across ranks if | ||
| # should_dp_pad is True |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,12 @@ | ||
| --- a/vllm/v1/worker/gpu_model_runner.py | ||
| +++ b/vllm/v1/worker/gpu_model_runner.py | ||
| @@ -3978,7 +3978,9 @@ | ||
| if not isinstance(spec.kv_cache_spec, EncoderOnlyAttentionSpec) | ||
| ) | ||
| - pad_attn = cudagraph_mode == CUDAGraphMode.FULL | ||
| + dp_padding_applied = num_tokens_padded > num_tokens_unpadded | ||
| + pad_attn = (cudagraph_mode == CUDAGraphMode.FULL | ||
| + or dp_padding_applied) | ||
|
|
||
| if self.cache_config.mamba_cache_mode == "align": | ||
| # preprocess_mamba reads req_state.num_computed_tokens (CPU) |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,13 @@ | ||
| --- a/vllm/v1/worker/gpu_model_runner.py | ||
| +++ b/vllm/v1/worker/gpu_model_runner.py | ||
| @@ -480,6 +480,10 @@ | ||
| # Async scheduling | ||
| self.use_async_scheduling = self.scheduler_config.async_scheduling | ||
|
|
||
| + if (self.use_async_scheduling | ||
| + and self.parallel_config.enable_expert_parallel | ||
| + and self.parallel_config.data_parallel_size > 1): | ||
| + self.use_async_scheduling = False | ||
| + | ||
| # Sampler | ||
| self.sampler = Sampler(logprobs_mode=self.model_config.logprobs_mode) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@copilot generate the patch with these fix descrition.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generated patch files for all fixes in
patches/directory (commit6311a97). Each fix has its own.patchfile that can be applied withgit apply patches/fix*.patch. Fix 5 patch is omitted as it requires more complex refactoring of the list-path logic inxpu_communicator.py.