
[Bug]: cudaErrorIllegalAddress crash when enabling --performance-mode throughput for zai-org/GLM-4.7-FP8 under load #37587

@Xarbirus


Your current environment

The output of python collect_env.py
Collecting environment information...
==============================
        System Info
==============================
OS                           : Amazon Linux 2023.10.20260216 (x86_64)
GCC version                  : (GCC) 11.5.0 20240719 (Red Hat 11.5.0-5)
Clang version                : Could not collect
CMake version                : version 3.22.2
Libc version                 : glibc-2.34

==============================
       PyTorch Info
==============================
PyTorch version              : 2.9.1+cu130
Is debug build               : False
CUDA used to build PyTorch   : 13.0
ROCM used to build PyTorch   : N/A

==============================
      Python Environment
==============================
Python version               : 3.12.13 (main, Mar 10 2026, 18:17:25) [Clang 21.1.4 ] (64-bit runtime)
Python platform              : Linux-6.1.161-183.298.amzn2023.x86_64-x86_64-with-glibc2.34

==============================
       CUDA / GPU Info
==============================
Is CUDA available            : True
CUDA runtime version         : 13.0.88
CUDA_MODULE_LOADING set to   : 
GPU models and configuration : 
GPU 0: NVIDIA H200
GPU 1: NVIDIA H200
GPU 2: NVIDIA H200
GPU 3: NVIDIA H200
GPU 4: NVIDIA H200
GPU 5: NVIDIA H200
GPU 6: NVIDIA H200
GPU 7: NVIDIA H200

Nvidia driver version        : 580.126.09
cuDNN version                : Could not collect
HIP runtime version          : N/A
MIOpen runtime version       : N/A
Is XNNPACK available         : True

==============================
          CPU Info
==============================
Architecture:                            x86_64
CPU op-mode(s):                          32-bit, 64-bit
Address sizes:                           48 bits physical, 48 bits virtual
Byte Order:                              Little Endian
CPU(s):                                  192
On-line CPU(s) list:                     0-191
Vendor ID:                               AuthenticAMD
Model name:                              AMD EPYC 7R13 Processor
CPU family:                              25
Model:                                   1
Thread(s) per core:                      2
Core(s) per socket:                      48
Socket(s):                               2
Stepping:                                1
BogoMIPS:                                5300.00
Flags:                                   fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext perfctr_core invpcid_single ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save vaes vpclmulqdq rdpid
Hypervisor vendor:                       KVM
Virtualization type:                     full
L1d cache:                               3 MiB (96 instances)
L1i cache:                               3 MiB (96 instances)
L2 cache:                                48 MiB (96 instances)
L3 cache:                                384 MiB (12 instances)
NUMA node(s):                            4
NUMA node0 CPU(s):                       0-23,96-119
NUMA node1 CPU(s):                       24-47,120-143
NUMA node2 CPU(s):                       48-71,144-167
NUMA node3 CPU(s):                       72-95,168-191
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Mitigation; safe RET
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; Retpolines; IBPB conditional; IBRS_FW; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Mitigation; Clear CPU buffers
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected

==============================
Versions of relevant libraries
==============================
[pip3] flashinfer-python==0.6.3
[pip3] numpy==2.2.6
[pip3] nvidia-cublas==13.0.0.19
[pip3] nvidia-cuda-cupti==13.0.48
[pip3] nvidia-cuda-nvrtc==13.0.48
[pip3] nvidia-cuda-runtime==13.0.48
[pip3] nvidia-cudnn-cu13==9.13.0.50
[pip3] nvidia-cudnn-frontend==1.20.0
[pip3] nvidia-cufft==12.0.0.15
[pip3] nvidia-cufile==1.15.0.42
[pip3] nvidia-curand==10.4.0.35
[pip3] nvidia-cusolver==12.0.3.29
[pip3] nvidia-cusparse==12.6.2.49
[pip3] nvidia-cusparselt-cu13==0.8.0
[pip3] nvidia-cutlass-dsl==4.4.2
[pip3] nvidia-cutlass-dsl-libs-base==4.4.2
[pip3] nvidia-ml-py==13.590.48
[pip3] nvidia-nccl-cu13==2.27.7
[pip3] nvidia-nvjitlink==13.0.39
[pip3] nvidia-nvshmem-cu13==3.3.24
[pip3] nvidia-nvtx==13.0.39
[pip3] pyzmq==27.1.0
[pip3] torch==2.9.1+cu130
[pip3] torchaudio==2.9.1+cu130
[pip3] torchvision==0.24.1+cu130
[pip3] transformers==4.57.6
[pip3] triton==3.5.1
[conda] Could not collect

==============================
         vLLM Info
==============================
ROCM Version                 : Could not collect
vLLM Version                 : 0.16.0
vLLM Build Flags:
  CUDA Archs: Not Set; ROCm: Disabled
GPU Topology:
  	GPU0	GPU1	GPU2	GPU3	GPU4	GPU5	GPU6	GPU7	CPU Affinity	NUMA Affinity	GPU NUMA ID
GPU0	 X 	NV18	NV18	NV18	NV18	NV18	NV18	NV18	24-47,120-143	1		N/A
GPU1	NV18	 X 	NV18	NV18	NV18	NV18	NV18	NV18	24-47,120-143	1		N/A
GPU2	NV18	NV18	 X 	NV18	NV18	NV18	NV18	NV18	0-23,96-119	0		N/A
GPU3	NV18	NV18	NV18	 X 	NV18	NV18	NV18	NV18	0-23,96-119	0		N/A
GPU4	NV18	NV18	NV18	NV18	 X 	NV18	NV18	NV18	72-95,168-191	3		N/A
GPU5	NV18	NV18	NV18	NV18	NV18	 X 	NV18	NV18	72-95,168-191	3		N/A
GPU6	NV18	NV18	NV18	NV18	NV18	NV18	 X 	NV18	48-71,144-167	2		N/A
GPU7	NV18	NV18	NV18	NV18	NV18	NV18	NV18	 X 	48-71,144-167	2		N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

🐛 Describe the bug

Description:
I am hitting a critical crash (CUDA error: an illegal memory access was encountered, cudaErrorIllegalAddress) when serving the zai-org/GLM-4.7-FP8 model with --performance-mode throughput. The crash reproduces even on the very first request.

The service runs perfectly fine in the default and interactivity modes.

Steps to Reproduce:

  1. Start the vLLM (0.17.1) server with the zai-org/GLM-4.7-FP8 model and the following speculative decoding configuration:
vllm serve zai-org/GLM-4.7-FP8 \
--tensor-parallel-size 8 \
--speculative-config.method mtp \
--speculative-config.num_speculative_tokens 1 \
--tool-call-parser glm47 \
--reasoning-parser glm45 \
--enable-auto-tool-choice \
--async-scheduling \
--enable-prefix-caching \
--performance-mode throughput

  2. Start the benchmark with:

vllm bench serve \
--model zai-org/GLM-4.7-FP8 \
--port 8000 \
--save-result \
--save-detailed \
--backend=vllm \
--dataset-name custom \
--dataset-path SOME_DATASET \
--disable-shuffle \
--metric-percentiles "50,90,95,99" \
--percentile-metrics "ttft,tpot,e2el" \
--result-dir "./vllm_bench_results/" \
--request-rate 1
  3. Send a request to the server.
  4. The server crashes with a CUDA error during request processing.
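For triage, the crash can also be re-run with synchronous kernel launches, so the asynchronously reported cudaErrorIllegalAddress surfaces at the offending call site instead of inside the NCCL watchdog. This is a sketch: CUDA_LAUNCH_BLOCKING and NCCL_DEBUG are standard CUDA/NCCL environment variables (not flags from this report), and the serve arguments are copied from step 1 above.

```shell
# CUDA_LAUNCH_BLOCKING=1 serializes kernel launches so the faulting kernel
# appears in the Python stack trace; NCCL_DEBUG=INFO logs collective state.
# Expect a large throughput hit -- use only for debugging.
CUDA_LAUNCH_BLOCKING=1 NCCL_DEBUG=INFO \
vllm serve zai-org/GLM-4.7-FP8 \
  --tensor-parallel-size 8 \
  --speculative-config.method mtp \
  --speculative-config.num_speculative_tokens 1 \
  --tool-call-parser glm47 \
  --reasoning-parser glm45 \
  --enable-auto-tool-choice \
  --async-scheduling \
  --enable-prefix-caching \
  --performance-mode throughput
```

If the trace still points at an unrelated API call, the next step would be a single-request repro under these variables, since the crash occurs on the first request anyway.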
Error:
[rank3]:[E319 18:04:56.562835770 ProcessGroupNCCL.cpp:2093] [PG ID 2 PG GUID 3 Rank 3] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
Search for `cudaErrorIllegalAddress' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from query at /pytorch/aten/src/ATen/cuda/CUDAEvent.h:108 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9d (0x7fb0bdb72fdd in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xc0e0 (0x7fb0c15330e0 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x50 (0x7fb0065008f0 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x68 (0x7fb00650da68 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0x949 (0x7fb006511539 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::Watchdog::run() + 0x105 (0x7fb0065135d5 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xe77e4 (0x7fb0916e77e4 in /lib64/libstdc++.so.6)
frame #7: <unknown function> + 0x8b2ea (0x7fb0cf48b2ea in /lib64/libc.so.6)
frame #8: <unknown function> + 0x110b40 (0x7fb0cf510b40 in /lib64/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [PG ID 2 PG GUID 3 Rank 3] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
Search for `cudaErrorIllegalAddress' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from query at /pytorch/aten/src/ATen/cuda/CUDAEvent.h:108 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9d (0x7fb0bdb72fdd in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xc0e0 (0x7fb0c15330e0 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x50 (0x7fb0065008f0 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x68 (0x7fb00650da68 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0x949 (0x7fb006511539 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::Watchdog::run() + 0x105 (0x7fb0065135d5 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xe77e4 (0x7fb0916e77e4 in /lib64/libstdc++.so.6)
frame #7: <unknown function> + 0x8b2ea (0x7fb0cf48b2ea in /lib64/libc.so.6)
frame #8: <unknown function> + 0x110b40 (0x7fb0cf510b40 in /lib64/libc.so.6)

Exception raised from run at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:2099 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9d (0x7fb0bdb72fdd in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x68c348 (0x7fb005c8c348 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xe77e4 (0x7fb0916e77e4 in /lib64/libstdc++.so.6)
frame #3: <unknown function> + 0x8b2ea (0x7fb0cf48b2ea in /lib64/libc.so.6)
frame #4: <unknown function> + 0x110b40 (0x7fb0cf510b40 in /lib64/libc.so.6)

(Worker pid=2448772) (Worker_TP0 pid=2448772) Exception in thread WorkerAsyncOutputCopy:
(Worker pid=2448772) (Worker_TP0 pid=2448772) Traceback (most recent call last):
(Worker pid=2448772) (Worker_TP0 pid=2448772)   File "/home/ssm-user/.local/share/uv/python/cpython-3.12.13-linux-x86_64-gnu/lib/python3.12/threading.py", line 1075, in _bootstrap_inner
(Worker pid=2448772) (Worker_TP0 pid=2448772)     self.run()
(Worker pid=2448772) (Worker_TP0 pid=2448772)   File "/home/ssm-user/.local/share/uv/python/cpython-3.12.13-linux-x86_64-gnu/lib/python3.12/threading.py", line 1012, in run
(Worker pid=2448772) (Worker_TP0 pid=2448772)     self._target(*self._args, **self._kwargs)
(Worker pid=2448772) (Worker_TP0 pid=2448772)   File "/home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 860, in async_output_busy_loop
(Worker pid=2448772) (Worker_TP0 pid=2448772)     self.enqueue_output(output)
(Worker pid=2448772) (Worker_TP0 pid=2448772)   File "/home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 837, in enqueue_output
[rank0]:[E319 18:04:56.570791183 ProcessGroupNCCL.cpp:2093] [PG ID 2 PG GUID 3 Rank 0] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
Search for `cudaErrorIllegalAddress' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from query at /pytorch/aten/src/ATen/cuda/CUDAEvent.h:108 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9d (0x7fad48172fdd in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xc0e0 (0x7fad485780e0 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x50 (0x7fac8d1008f0 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x68 (0x7fac8d10da68 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0x949 (0x7fac8d111539 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::Watchdog::run() + 0x105 (0x7fac8d1135d5 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xe77e4 (0x7fad182e77e4 in /lib64/libstdc++.so.6)
frame #7: <unknown function> + 0x8b2ea (0x7fad5628b2ea in /lib64/libc.so.6)
frame #8: <unknown function> + 0x110b40 (0x7fad56310b40 in /lib64/libc.so.6)

terminate called after throwing an instance of '(Worker pid=2448772) (Worker_TP0 pid=2448772)     output = output.get_output()
c10::DistBackendError'
(Worker pid=2448772) (Worker_TP0 pid=2448772)              ^^^^^^^^^^^^^^^^^^^
(Worker pid=2448772) (Worker_TP0 pid=2448772)   File "/home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 251, in get_output
(Worker pid=2448772) (Worker_TP0 pid=2448772)     self.async_copy_ready_event.synchronize()
(Worker pid=2448772) (Worker_TP0 pid=2448772) torch.AcceleratorError: CUDA error: an illegal memory access was encountered
(Worker pid=2448772) (Worker_TP0 pid=2448772) Search for `cudaErrorIllegalAddress' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
(Worker pid=2448772) (Worker_TP0 pid=2448772) CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(Worker pid=2448772) (Worker_TP0 pid=2448772) For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(Worker pid=2448772) (Worker_TP0 pid=2448772) Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(Worker pid=2448772) (Worker_TP0 pid=2448772) 
  what():  [PG ID 2 PG GUID 3 Rank 0] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
Search for `cudaErrorIllegalAddress' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from query at /pytorch/aten/src/ATen/cuda/CUDAEvent.h:108 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9d (0x7fad48172fdd in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xc0e0 (0x7fad485780e0 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x50 (0x7fac8d1008f0 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x68 (0x7fac8d10da68 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0x949 (0x7fac8d111539 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::Watchdog::run() + 0x105 (0x7fac8d1135d5 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xe77e4 (0x7fad182e77e4 in /lib64/libstdc++.so.6)
frame #7: <unknown function> + 0x8b2ea (0x7fad5628b2ea in /lib64/libc.so.6)
frame #8: <unknown function> + 0x110b40 (0x7fad56310b40 in /lib64/libc.so.6)

Exception raised from run at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:2099 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9d (0x7fad48172fdd in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x68c348 (0x7fac8c88c348 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xe77e4 (0x7fad182e77e4 in /lib64/libstdc++.so.6)
frame #3: <unknown function> + 0x8b2ea (0x7fad5628b2ea in /lib64/libc.so.6)
frame #4: <unknown function> + 0x110b40 (0x7fad56310b40 in /lib64/libc.so.6)

[rank2]:[E319 18:04:56.577129632 ProcessGroupNCCL.cpp:2093] [PG ID 2 PG GUID 3 Rank 2] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
Search for `cudaErrorIllegalAddress' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from query at /pytorch/aten/src/ATen/cuda/CUDAEvent.h:108 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9d (0x7f7657edefdd in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xc0e0 (0x7f7657f780e0 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x50 (0x7f759cf008f0 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x68 (0x7f759cf0da68 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0x949 (0x7f759cf11539 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::Watchdog::run() + 0x105 (0x7f759cf135d5 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xe77e4 (0x7f76280e77e4 in /lib64/libstdc++.so.6)
frame #7: <unknown function> + 0x8b2ea (0x7f7665e8b2ea in /lib64/libc.so.6)
frame #8: <unknown function> + 0x110b40 (0x7f7665f10b40 in /lib64/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [PG ID 2 PG GUID 3 Rank 2] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
Search for `cudaErrorIllegalAddress' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from query at /pytorch/aten/src/ATen/cuda/CUDAEvent.h:108 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9d (0x7f7657edefdd in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xc0e0 (0x7f7657f780e0 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x50 (0x7f759cf008f0 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x68 (0x7f759cf0da68 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0x949 (0x7f759cf11539 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::Watchdog::run() + 0x105 (0x7f759cf135d5 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xe77e4 (0x7f76280e77e4 in /lib64/libstdc++.so.6)
frame #7: <unknown function> + 0x8b2ea (0x7f7665e8b2ea in /lib64/libc.so.6)
frame #8: <unknown function> + 0x110b40 (0x7f7665f10b40 in /lib64/libc.so.6)

Exception raised from run at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:2099 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9d (0x7f7657edefdd in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x68c348 (0x7f759c68c348 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xe77e4 (0x7f76280e77e4 in /lib64/libstdc++.so.6)
frame #3: <unknown function> + 0x8b2ea (0x7f7665e8b2ea in /lib64/libc.so.6)
frame #4: <unknown function> + 0x110b40 (0x7f7665f10b40 in /lib64/libc.so.6)

[rank6]:[E319 18:04:56.580444604 ProcessGroupNCCL.cpp:2093] [PG ID 2 PG GUID 3 Rank 6] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
Search for `cudaErrorIllegalAddress' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from query at /pytorch/aten/src/ATen/cuda/CUDAEvent.h:108 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9d (0x7f7be5972fdd in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xc0e0 (0x7f7be5d330e0 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x50 (0x7f7b2a9008f0 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x68 (0x7f7b2a90da68 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0x949 (0x7f7b2a911539 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::Watchdog::run() + 0x105 (0x7f7b2a9135d5 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xe77e4 (0x7f7bb5ae77e4 in /lib64/libstdc++.so.6)
frame #7: <unknown function> + 0x8b2ea (0x7f7bf3a8b2ea in /lib64/libc.so.6)
frame #8: <unknown function> + 0x110b40 (0x7f7bf3b10b40 in /lib64/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [PG ID 2 PG GUID 3 Rank 6] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
Search for `cudaErrorIllegalAddress' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from query at /pytorch/aten/src/ATen/cuda/CUDAEvent.h:108 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9d (0x7f7be5972fdd in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xc0e0 (0x7f7be5d330e0 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x50 (0x7f7b2a9008f0 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x68 (0x7f7b2a90da68 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0x949 (0x7f7b2a911539 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::Watchdog::run() + 0x105 (0x7f7b2a9135d5 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xe77e4 (0x7f7bb5ae77e4 in /lib64/libstdc++.so.6)
frame #7: <unknown function> + 0x8b2ea (0x7f7bf3a8b2ea in /lib64/libc.so.6)
frame #8: <unknown function> + 0x110b40 (0x7f7bf3b10b40 in /lib64/libc.so.6)

Exception raised from run at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:2099 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9d (0x7f7be5972fdd in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x68c348 (0x7f7b2a08c348 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xe77e4 (0x7f7bb5ae77e4 in /lib64/libstdc++.so.6)
frame #3: <unknown function> + 0x8b2ea (0x7f7bf3a8b2ea in /lib64/libc.so.6)
frame #4: <unknown function> + 0x110b40 (0x7f7bf3b10b40 in /lib64/libc.so.6)

[rank1]:[E319 18:04:56.586485967 ProcessGroupNCCL.cpp:2093] [PG ID 2 PG GUID 3 Rank 1] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
Search for `cudaErrorIllegalAddress' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from query at /pytorch/aten/src/ATen/cuda/CUDAEvent.h:108 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9d (0x7fcfa5b72fdd in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xc0e0 (0x7fcfa5f780e0 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x50 (0x7fceeab008f0 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x68 (0x7fceeab0da68 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0x949 (0x7fceeab11539 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::Watchdog::run() + 0x105 (0x7fceeab135d5 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xe77e4 (0x7fcf75ce77e4 in /lib64/libstdc++.so.6)
frame #7: <unknown function> + 0x8b2ea (0x7fcfb3c8b2ea in /lib64/libc.so.6)
frame #8: <unknown function> + 0x110b40 (0x7fcfb3d10b40 in /lib64/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [PG ID 2 PG GUID 3 Rank 1] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
Search for `cudaErrorIllegalAddress' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from query at /pytorch/aten/src/ATen/cuda/CUDAEvent.h:108 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9d (0x7fcfa5b72fdd in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xc0e0 (0x7fcfa5f780e0 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x50 (0x7fceeab008f0 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x68 (0x7fceeab0da68 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0x949 (0x7fceeab11539 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::Watchdog::run() + 0x105 (0x7fceeab135d5 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xe77e4 (0x7fcf75ce77e4 in /lib64/libstdc++.so.6)
frame #7: <unknown function> + 0x8b2ea (0x7fcfb3c8b2ea in /lib64/libc.so.6)
frame #8: <unknown function> + 0x110b40 (0x7fcfb3d10b40 in /lib64/libc.so.6)

Exception raised from run at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:2099 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9d (0x7fcfa5b72fdd in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x68c348 (0x7fceea28c348 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xe77e4 (0x7fcf75ce77e4 in /lib64/libstdc++.so.6)
frame #3: <unknown function> + 0x8b2ea (0x7fcfb3c8b2ea in /lib64/libc.so.6)
frame #4: <unknown function> + 0x110b40 (0x7fcfb3d10b40 in /lib64/libc.so.6)

[rank5]:[E319 18:04:56.588674261 ProcessGroupNCCL.cpp:2093] [PG ID 2 PG GUID 3 Rank 5] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
Search for `cudaErrorIllegalAddress' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from query at /pytorch/aten/src/ATen/cuda/CUDAEvent.h:108 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9d (0x7f87b3b72fdd in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xc0e0 (0x7f87b75330e0 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x50 (0x7f86fc5008f0 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x68 (0x7f86fc50da68 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0x949 (0x7f86fc511539 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::Watchdog::run() + 0x105 (0x7f86fc5135d5 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xe77e4 (0x7f87876e77e4 in /lib64/libstdc++.so.6)
frame #7: <unknown function> + 0x8b2ea (0x7f87c548b2ea in /lib64/libc.so.6)
frame #8: <unknown function> + 0x110b40 (0x7f87c5510b40 in /lib64/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [PG ID 2 PG GUID 3 Rank 5] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
Search for `cudaErrorIllegalAddress' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from query at /pytorch/aten/src/ATen/cuda/CUDAEvent.h:108 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9d (0x7f87b3b72fdd in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xc0e0 (0x7f87b75330e0 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x50 (0x7f86fc5008f0 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x68 (0x7f86fc50da68 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0x949 (0x7f86fc511539 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::Watchdog::run() + 0x105 (0x7f86fc5135d5 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xe77e4 (0x7f87876e77e4 in /lib64/libstdc++.so.6)
frame #7: <unknown function> + 0x8b2ea (0x7f87c548b2ea in /lib64/libc.so.6)
frame #8: <unknown function> + 0x110b40 (0x7f87c5510b40 in /lib64/libc.so.6)

Exception raised from run at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:2099 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9d (0x7f87b3b72fdd in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x68c348 (0x7f86fbc8c348 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xe77e4 (0x7f87876e77e4 in /lib64/libstdc++.so.6)
frame #3: <unknown function> + 0x8b2ea (0x7f87c548b2ea in /lib64/libc.so.6)
frame #4: <unknown function> + 0x110b40 (0x7f87c5510b40 in /lib64/libc.so.6)

[rank4]:[E319 18:04:56.594518062 ProcessGroupNCCL.cpp:2093] [PG ID 2 PG GUID 3 Rank 4] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
Search for `cudaErrorIllegalAddress' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from query at /pytorch/aten/src/ATen/cuda/CUDAEvent.h:108 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9d (0x7f4116d72fdd in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xc0e0 (0x7f41171330e0 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x50 (0x7f405bd008f0 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x68 (0x7f405bd0da68 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0x949 (0x7f405bd11539 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::Watchdog::run() + 0x105 (0x7f405bd135d5 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xe77e4 (0x7f40e6ee77e4 in /lib64/libstdc++.so.6)
frame #7: <unknown function> + 0x8b2ea (0x7f4124e8b2ea in /lib64/libc.so.6)
frame #8: <unknown function> + 0x110b40 (0x7f4124f10b40 in /lib64/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [PG ID 2 PG GUID 3 Rank 4] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
Search for `cudaErrorIllegalAddress' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from query at /pytorch/aten/src/ATen/cuda/CUDAEvent.h:108 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9d (0x7f4116d72fdd in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xc0e0 (0x7f41171330e0 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x50 (0x7f405bd008f0 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x68 (0x7f405bd0da68 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0x949 (0x7f405bd11539 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::Watchdog::run() + 0x105 (0x7f405bd135d5 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xe77e4 (0x7f40e6ee77e4 in /lib64/libstdc++.so.6)
frame #7: <unknown function> + 0x8b2ea (0x7f4124e8b2ea in /lib64/libc.so.6)
frame #8: <unknown function> + 0x110b40 (0x7f4124f10b40 in /lib64/libc.so.6)

Exception raised from run at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:2099 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9d (0x7f4116d72fdd in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x68c348 (0x7f405b48c348 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xe77e4 (0x7f40e6ee77e4 in /lib64/libstdc++.so.6)
frame #3: <unknown function> + 0x8b2ea (0x7f4124e8b2ea in /lib64/libc.so.6)
frame #4: <unknown function> + 0x110b40 (0x7f4124f10b40 in /lib64/libc.so.6)

[rank7]:[E319 18:04:56.597273915 ProcessGroupNCCL.cpp:2093] [PG ID 2 PG GUID 3 Rank 7] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
Search for `cudaErrorIllegalAddress' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from query at /pytorch/aten/src/ATen/cuda/CUDAEvent.h:108 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9d (0x7eff4badefdd in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xc0e0 (0x7eff4bb780e0 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x50 (0x7efe90b008f0 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x68 (0x7efe90b0da68 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0x949 (0x7efe90b11539 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::Watchdog::run() + 0x105 (0x7efe90b135d5 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xe77e4 (0x7eff1bce77e4 in /lib64/libstdc++.so.6)
frame #7: <unknown function> + 0x8b2ea (0x7eff59a8b2ea in /lib64/libc.so.6)
frame #8: <unknown function> + 0x110b40 (0x7eff59b10b40 in /lib64/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [PG ID 2 PG GUID 3 Rank 7] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
Search for `cudaErrorIllegalAddress' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from query at /pytorch/aten/src/ATen/cuda/CUDAEvent.h:108 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9d (0x7eff4badefdd in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xc0e0 (0x7eff4bb780e0 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x50 (0x7efe90b008f0 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x68 (0x7efe90b0da68 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0x949 (0x7efe90b11539 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::Watchdog::run() + 0x105 (0x7efe90b135d5 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xe77e4 (0x7eff1bce77e4 in /lib64/libstdc++.so.6)
frame #7: <unknown function> + 0x8b2ea (0x7eff59a8b2ea in /lib64/libc.so.6)
frame #8: <unknown function> + 0x110b40 (0x7eff59b10b40 in /lib64/libc.so.6)

Exception raised from run at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:2099 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9d (0x7eff4badefdd in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x68c348 (0x7efe9028c348 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xe77e4 (0x7eff1bce77e4 in /lib64/libstdc++.so.6)
frame #3: <unknown function> + 0x8b2ea (0x7eff59a8b2ea in /lib64/libc.so.6)
frame #4: <unknown function> + 0x110b40 (0x7eff59b10b40 in /lib64/libc.so.6)

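The watchdog traces above repeatedly advise re-running with synchronous kernel launches so the stack trace points at the kernel that actually faulted. A minimal sketch of applying that advice to the repro (the `NCCL_DEBUG=INFO` setting is my assumption for extra collective-level logging, not something the report used; remaining serve flags as in the command below):

```shell
# CUDA_LAUNCH_BLOCKING=1 forces synchronous kernel launches, so the
# illegal memory access surfaces at the faulting call instead of at a
# later, unrelated API call (as the traces above warn).
# NCCL_DEBUG=INFO (assumption) adds NCCL-side logging around the
# watchdog aborts; drop it if the output is too noisy.
CUDA_LAUNCH_BLOCKING=1 NCCL_DEBUG=INFO \
  vllm serve zai-org/GLM-4.7-FP8 \
  --tensor-parallel-size 8
  # ...plus the other flags from the configuration below, unchanged
```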
Happy path:
Starting the vLLM (0.17.1) server with the zai-org/GLM-4.7-FP8 model and the following speculative decoding configuration works fine:

vllm serve zai-org/GLM-4.7-FP8 \
--tensor-parallel-size 8 \
--speculative-config.method mtp \
--speculative-config.num_speculative_tokens 1 \
--tool-call-parser glm47 \
--reasoning-parser glm45 \
--enable-auto-tool-choice \
--async-scheduling \
--enable-prefix-caching
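For contrast, the crash from the title occurs when throughput mode is also enabled under load. A sketch of the failing invocation, assuming the flag from the title is simply appended to the working command above:

```shell
# Identical to the working configuration, plus the one flag that
# triggers the cudaErrorIllegalAddress crash under load.
vllm serve zai-org/GLM-4.7-FP8 \
  --tensor-parallel-size 8 \
  --speculative-config.method mtp \
  --speculative-config.num_speculative_tokens 1 \
  --tool-call-parser glm47 \
  --reasoning-parser glm45 \
  --enable-auto-tool-choice \
  --async-scheduling \
  --enable-prefix-caching \
  --performance-mode throughput
```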

