
[Bug]: cudaErrorIllegalAddress crash when enabling --performance-mode throughput for zai-org/GLM-4.7-FP8 under load #37587

@Xarbirus


Your current environment

The output of python collect_env.py
Collecting environment information...
==============================
        System Info
==============================
OS                           : Amazon Linux 2023.10.20260216 (x86_64)
GCC version                  : (GCC) 11.5.0 20240719 (Red Hat 11.5.0-5)
Clang version                : Could not collect
CMake version                : version 3.22.2
Libc version                 : glibc-2.34

==============================
       PyTorch Info
==============================
PyTorch version              : 2.9.1+cu130
Is debug build               : False
CUDA used to build PyTorch   : 13.0
ROCM used to build PyTorch   : N/A

==============================
      Python Environment
==============================
Python version               : 3.12.13 (main, Mar 10 2026, 18:17:25) [Clang 21.1.4 ] (64-bit runtime)
Python platform              : Linux-6.1.161-183.298.amzn2023.x86_64-x86_64-with-glibc2.34

==============================
       CUDA / GPU Info
==============================
Is CUDA available            : True
CUDA runtime version         : 13.0.88
CUDA_MODULE_LOADING set to   : 
GPU models and configuration : 
GPU 0: NVIDIA H200
GPU 1: NVIDIA H200
GPU 2: NVIDIA H200
GPU 3: NVIDIA H200
GPU 4: NVIDIA H200
GPU 5: NVIDIA H200
GPU 6: NVIDIA H200
GPU 7: NVIDIA H200

Nvidia driver version        : 580.126.09
cuDNN version                : Could not collect
HIP runtime version          : N/A
MIOpen runtime version       : N/A
Is XNNPACK available         : True

==============================
          CPU Info
==============================
Architecture:                            x86_64
CPU op-mode(s):                          32-bit, 64-bit
Address sizes:                           48 bits physical, 48 bits virtual
Byte Order:                              Little Endian
CPU(s):                                  192
On-line CPU(s) list:                     0-191
Vendor ID:                               AuthenticAMD
Model name:                              AMD EPYC 7R13 Processor
CPU family:                              25
Model:                                   1
Thread(s) per core:                      2
Core(s) per socket:                      48
Socket(s):                               2
Stepping:                                1
BogoMIPS:                                5300.00
Flags:                                   fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext perfctr_core invpcid_single ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save vaes vpclmulqdq rdpid
Hypervisor vendor:                       KVM
Virtualization type:                     full
L1d cache:                               3 MiB (96 instances)
L1i cache:                               3 MiB (96 instances)
L2 cache:                                48 MiB (96 instances)
L3 cache:                                384 MiB (12 instances)
NUMA node(s):                            4
NUMA node0 CPU(s):                       0-23,96-119
NUMA node1 CPU(s):                       24-47,120-143
NUMA node2 CPU(s):                       48-71,144-167
NUMA node3 CPU(s):                       72-95,168-191
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Mitigation; safe RET
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; Retpolines; IBPB conditional; IBRS_FW; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Mitigation; Clear CPU buffers
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected

==============================
Versions of relevant libraries
==============================
[pip3] flashinfer-python==0.6.3
[pip3] numpy==2.2.6
[pip3] nvidia-cublas==13.0.0.19
[pip3] nvidia-cuda-cupti==13.0.48
[pip3] nvidia-cuda-nvrtc==13.0.48
[pip3] nvidia-cuda-runtime==13.0.48
[pip3] nvidia-cudnn-cu13==9.13.0.50
[pip3] nvidia-cudnn-frontend==1.20.0
[pip3] nvidia-cufft==12.0.0.15
[pip3] nvidia-cufile==1.15.0.42
[pip3] nvidia-curand==10.4.0.35
[pip3] nvidia-cusolver==12.0.3.29
[pip3] nvidia-cusparse==12.6.2.49
[pip3] nvidia-cusparselt-cu13==0.8.0
[pip3] nvidia-cutlass-dsl==4.4.2
[pip3] nvidia-cutlass-dsl-libs-base==4.4.2
[pip3] nvidia-ml-py==13.590.48
[pip3] nvidia-nccl-cu13==2.27.7
[pip3] nvidia-nvjitlink==13.0.39
[pip3] nvidia-nvshmem-cu13==3.3.24
[pip3] nvidia-nvtx==13.0.39
[pip3] pyzmq==27.1.0
[pip3] torch==2.9.1+cu130
[pip3] torchaudio==2.9.1+cu130
[pip3] torchvision==0.24.1+cu130
[pip3] transformers==4.57.6
[pip3] triton==3.5.1
[conda] Could not collect

==============================
         vLLM Info
==============================
ROCM Version                 : Could not collect
vLLM Version                 : 0.16.0
vLLM Build Flags:
  CUDA Archs: Not Set; ROCm: Disabled
GPU Topology:
  	GPU0	GPU1	GPU2	GPU3	GPU4	GPU5	GPU6	GPU7	CPU Affinity	NUMA Affinity	GPU NUMA ID
GPU0	 X 	NV18	NV18	NV18	NV18	NV18	NV18	NV18	24-47,120-143	1		N/A
GPU1	NV18	 X 	NV18	NV18	NV18	NV18	NV18	NV18	24-47,120-143	1		N/A
GPU2	NV18	NV18	 X 	NV18	NV18	NV18	NV18	NV18	0-23,96-119	0		N/A
GPU3	NV18	NV18	NV18	 X 	NV18	NV18	NV18	NV18	0-23,96-119	0		N/A
GPU4	NV18	NV18	NV18	NV18	 X 	NV18	NV18	NV18	72-95,168-191	3		N/A
GPU5	NV18	NV18	NV18	NV18	NV18	 X 	NV18	NV18	72-95,168-191	3		N/A
GPU6	NV18	NV18	NV18	NV18	NV18	NV18	 X 	NV18	48-71,144-167	2		N/A
GPU7	NV18	NV18	NV18	NV18	NV18	NV18	NV18	 X 	48-71,144-167	2		N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

🐛 Describe the bug

Description:
I am hitting a critical crash (CUDA error: an illegal memory access was encountered, cudaErrorIllegalAddress) when serving the zai-org/GLM-4.7-FP8 model with --performance-mode throughput. The crash reproduces even on the very first request.

The service runs perfectly fine in the default and interactivity modes.

Steps to Reproduce:

  1. Start the vLLM (0.17.1) server with the zai-org/GLM-4.7-FP8 model and the following speculative decoding configuration:
vllm serve zai-org/GLM-4.7-FP8 \
--tensor-parallel-size 8 \
--speculative-config.method mtp \
--speculative-config.num_speculative_tokens 1 \
--tool-call-parser glm47 \
--reasoning-parser glm45 \
--enable-auto-tool-choice \
--async-scheduling \
--enable-prefix-caching \
--performance-mode throughput

  2. Start the benchmark with:

vllm bench serve \
--model zai-org/GLM-4.7-FP8 \
--port 8000 \
--save-result \
--save-detailed \
--backend=vllm \
--dataset-name custom \
--dataset-path SOME_DATASET \
--disable-shuffle \
--metric-percentiles "50,90,95,99" \
--percentile-metrics "ttft,tpot,e2el" \
--result-dir "./vllm_bench_results/" \
--request-rate 1
  3. Send a request to the server.
  4. The server crashes with a CUDA error during request processing.
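For triage, the crash can also be re-run with synchronous kernel launches, so the asynchronously reported cudaErrorIllegalAddress surfaces at the offending call site instead of inside the NCCL watchdog. This is a sketch: CUDA_LAUNCH_BLOCKING and NCCL_DEBUG are standard CUDA/NCCL environment variables (not flags from this report), and the serve arguments are copied from step 1 above.

```shell
# CUDA_LAUNCH_BLOCKING=1 serializes kernel launches so the faulting kernel
# appears in the Python stack trace; NCCL_DEBUG=INFO logs collective state.
# Expect a large throughput hit -- use only for debugging.
CUDA_LAUNCH_BLOCKING=1 NCCL_DEBUG=INFO \
vllm serve zai-org/GLM-4.7-FP8 \
  --tensor-parallel-size 8 \
  --speculative-config.method mtp \
  --speculative-config.num_speculative_tokens 1 \
  --tool-call-parser glm47 \
  --reasoning-parser glm45 \
  --enable-auto-tool-choice \
  --async-scheduling \
  --enable-prefix-caching \
  --performance-mode throughput
```

If the trace still points at an unrelated API call, the next step would be a single-request repro under these variables, since the crash occurs on the first request anyway.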
Error:
[rank3]:[E319 18:04:56.562835770 ProcessGroupNCCL.cpp:2093] [PG ID 2 PG GUID 3 Rank 3] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
Search for `cudaErrorIllegalAddress' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from query at /pytorch/aten/src/ATen/cuda/CUDAEvent.h:108 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9d (0x7fb0bdb72fdd in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xc0e0 (0x7fb0c15330e0 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x50 (0x7fb0065008f0 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x68 (0x7fb00650da68 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0x949 (0x7fb006511539 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::Watchdog::run() + 0x105 (0x7fb0065135d5 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xe77e4 (0x7fb0916e77e4 in /lib64/libstdc++.so.6)
frame #7: <unknown function> + 0x8b2ea (0x7fb0cf48b2ea in /lib64/libc.so.6)
frame #8: <unknown function> + 0x110b40 (0x7fb0cf510b40 in /lib64/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [PG ID 2 PG GUID 3 Rank 3] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
Search for `cudaErrorIllegalAddress' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from query at /pytorch/aten/src/ATen/cuda/CUDAEvent.h:108 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9d (0x7fb0bdb72fdd in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xc0e0 (0x7fb0c15330e0 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x50 (0x7fb0065008f0 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x68 (0x7fb00650da68 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0x949 (0x7fb006511539 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::Watchdog::run() + 0x105 (0x7fb0065135d5 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xe77e4 (0x7fb0916e77e4 in /lib64/libstdc++.so.6)
frame #7: <unknown function> + 0x8b2ea (0x7fb0cf48b2ea in /lib64/libc.so.6)
frame #8: <unknown function> + 0x110b40 (0x7fb0cf510b40 in /lib64/libc.so.6)

Exception raised from run at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:2099 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9d (0x7fb0bdb72fdd in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x68c348 (0x7fb005c8c348 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xe77e4 (0x7fb0916e77e4 in /lib64/libstdc++.so.6)
frame #3: <unknown function> + 0x8b2ea (0x7fb0cf48b2ea in /lib64/libc.so.6)
frame #4: <unknown function> + 0x110b40 (0x7fb0cf510b40 in /lib64/libc.so.6)

(Worker pid=2448772) (Worker_TP0 pid=2448772) Exception in thread WorkerAsyncOutputCopy:
(Worker pid=2448772) (Worker_TP0 pid=2448772) Traceback (most recent call last):
(Worker pid=2448772) (Worker_TP0 pid=2448772)   File "/home/ssm-user/.local/share/uv/python/cpython-3.12.13-linux-x86_64-gnu/lib/python3.12/threading.py", line 1075, in _bootstrap_inner
(Worker pid=2448772) (Worker_TP0 pid=2448772)     self.run()
(Worker pid=2448772) (Worker_TP0 pid=2448772)   File "/home/ssm-user/.local/share/uv/python/cpython-3.12.13-linux-x86_64-gnu/lib/python3.12/threading.py", line 1012, in run
(Worker pid=2448772) (Worker_TP0 pid=2448772)     self._target(*self._args, **self._kwargs)
(Worker pid=2448772) (Worker_TP0 pid=2448772)   File "/home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 860, in async_output_busy_loop
(Worker pid=2448772) (Worker_TP0 pid=2448772)     self.enqueue_output(output)
(Worker pid=2448772) (Worker_TP0 pid=2448772)   File "/home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 837, in enqueue_output
[rank0]:[E319 18:04:56.570791183 ProcessGroupNCCL.cpp:2093] [PG ID 2 PG GUID 3 Rank 0] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
Search for `cudaErrorIllegalAddress' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from query at /pytorch/aten/src/ATen/cuda/CUDAEvent.h:108 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9d (0x7fad48172fdd in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xc0e0 (0x7fad485780e0 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x50 (0x7fac8d1008f0 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x68 (0x7fac8d10da68 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0x949 (0x7fac8d111539 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::Watchdog::run() + 0x105 (0x7fac8d1135d5 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xe77e4 (0x7fad182e77e4 in /lib64/libstdc++.so.6)
frame #7: <unknown function> + 0x8b2ea (0x7fad5628b2ea in /lib64/libc.so.6)
frame #8: <unknown function> + 0x110b40 (0x7fad56310b40 in /lib64/libc.so.6)

terminate called after throwing an instance of '(Worker pid=2448772) (Worker_TP0 pid=2448772)     output = output.get_output()
c10::DistBackendError'
(Worker pid=2448772) (Worker_TP0 pid=2448772)              ^^^^^^^^^^^^^^^^^^^
(Worker pid=2448772) (Worker_TP0 pid=2448772)   File "/home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 251, in get_output
(Worker pid=2448772) (Worker_TP0 pid=2448772)     self.async_copy_ready_event.synchronize()
(Worker pid=2448772) (Worker_TP0 pid=2448772) torch.AcceleratorError: CUDA error: an illegal memory access was encountered
(Worker pid=2448772) (Worker_TP0 pid=2448772) Search for `cudaErrorIllegalAddress' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
(Worker pid=2448772) (Worker_TP0 pid=2448772) CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(Worker pid=2448772) (Worker_TP0 pid=2448772) For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(Worker pid=2448772) (Worker_TP0 pid=2448772) Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(Worker pid=2448772) (Worker_TP0 pid=2448772) 
  what():  [PG ID 2 PG GUID 3 Rank 0] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
Search for `cudaErrorIllegalAddress' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from query at /pytorch/aten/src/ATen/cuda/CUDAEvent.h:108 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9d (0x7fad48172fdd in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xc0e0 (0x7fad485780e0 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x50 (0x7fac8d1008f0 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x68 (0x7fac8d10da68 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0x949 (0x7fac8d111539 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::Watchdog::run() + 0x105 (0x7fac8d1135d5 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xe77e4 (0x7fad182e77e4 in /lib64/libstdc++.so.6)
frame #7: <unknown function> + 0x8b2ea (0x7fad5628b2ea in /lib64/libc.so.6)
frame #8: <unknown function> + 0x110b40 (0x7fad56310b40 in /lib64/libc.so.6)

Exception raised from run at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:2099 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9d (0x7fad48172fdd in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x68c348 (0x7fac8c88c348 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xe77e4 (0x7fad182e77e4 in /lib64/libstdc++.so.6)
frame #3: <unknown function> + 0x8b2ea (0x7fad5628b2ea in /lib64/libc.so.6)
frame #4: <unknown function> + 0x110b40 (0x7fad56310b40 in /lib64/libc.so.6)

[rank2]:[E319 18:04:56.577129632 ProcessGroupNCCL.cpp:2093] [PG ID 2 PG GUID 3 Rank 2] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
Search for `cudaErrorIllegalAddress' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from query at /pytorch/aten/src/ATen/cuda/CUDAEvent.h:108 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9d (0x7f7657edefdd in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xc0e0 (0x7f7657f780e0 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x50 (0x7f759cf008f0 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x68 (0x7f759cf0da68 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0x949 (0x7f759cf11539 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::Watchdog::run() + 0x105 (0x7f759cf135d5 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xe77e4 (0x7f76280e77e4 in /lib64/libstdc++.so.6)
frame #7: <unknown function> + 0x8b2ea (0x7f7665e8b2ea in /lib64/libc.so.6)
frame #8: <unknown function> + 0x110b40 (0x7f7665f10b40 in /lib64/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [PG ID 2 PG GUID 3 Rank 2] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
Search for `cudaErrorIllegalAddress' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from query at /pytorch/aten/src/ATen/cuda/CUDAEvent.h:108 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9d (0x7f7657edefdd in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xc0e0 (0x7f7657f780e0 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x50 (0x7f759cf008f0 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x68 (0x7f759cf0da68 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0x949 (0x7f759cf11539 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::Watchdog::run() + 0x105 (0x7f759cf135d5 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xe77e4 (0x7f76280e77e4 in /lib64/libstdc++.so.6)
frame #7: <unknown function> + 0x8b2ea (0x7f7665e8b2ea in /lib64/libc.so.6)
frame #8: <unknown function> + 0x110b40 (0x7f7665f10b40 in /lib64/libc.so.6)

Exception raised from run at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:2099 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9d (0x7f7657edefdd in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x68c348 (0x7f759c68c348 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xe77e4 (0x7f76280e77e4 in /lib64/libstdc++.so.6)
frame #3: <unknown function> + 0x8b2ea (0x7f7665e8b2ea in /lib64/libc.so.6)
frame #4: <unknown function> + 0x110b40 (0x7f7665f10b40 in /lib64/libc.so.6)

[rank6]:[E319 18:04:56.580444604 ProcessGroupNCCL.cpp:2093] [PG ID 2 PG GUID 3 Rank 6] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
Search for `cudaErrorIllegalAddress' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from query at /pytorch/aten/src/ATen/cuda/CUDAEvent.h:108 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9d (0x7f7be5972fdd in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xc0e0 (0x7f7be5d330e0 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x50 (0x7f7b2a9008f0 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x68 (0x7f7b2a90da68 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0x949 (0x7f7b2a911539 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::Watchdog::run() + 0x105 (0x7f7b2a9135d5 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xe77e4 (0x7f7bb5ae77e4 in /lib64/libstdc++.so.6)
frame #7: <unknown function> + 0x8b2ea (0x7f7bf3a8b2ea in /lib64/libc.so.6)
frame #8: <unknown function> + 0x110b40 (0x7f7bf3b10b40 in /lib64/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [PG ID 2 PG GUID 3 Rank 6] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
Search for `cudaErrorIllegalAddress' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from query at /pytorch/aten/src/ATen/cuda/CUDAEvent.h:108 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9d (0x7f7be5972fdd in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xc0e0 (0x7f7be5d330e0 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x50 (0x7f7b2a9008f0 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x68 (0x7f7b2a90da68 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0x949 (0x7f7b2a911539 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::Watchdog::run() + 0x105 (0x7f7b2a9135d5 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xe77e4 (0x7f7bb5ae77e4 in /lib64/libstdc++.so.6)
frame #7: <unknown function> + 0x8b2ea (0x7f7bf3a8b2ea in /lib64/libc.so.6)
frame #8: <unknown function> + 0x110b40 (0x7f7bf3b10b40 in /lib64/libc.so.6)

Exception raised from run at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:2099 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9d (0x7f7be5972fdd in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x68c348 (0x7f7b2a08c348 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xe77e4 (0x7f7bb5ae77e4 in /lib64/libstdc++.so.6)
frame #3: <unknown function> + 0x8b2ea (0x7f7bf3a8b2ea in /lib64/libc.so.6)
frame #4: <unknown function> + 0x110b40 (0x7f7bf3b10b40 in /lib64/libc.so.6)

[rank1]:[E319 18:04:56.586485967 ProcessGroupNCCL.cpp:2093] [PG ID 2 PG GUID 3 Rank 1] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
Search for `cudaErrorIllegalAddress' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from query at /pytorch/aten/src/ATen/cuda/CUDAEvent.h:108 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9d (0x7fcfa5b72fdd in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xc0e0 (0x7fcfa5f780e0 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x50 (0x7fceeab008f0 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x68 (0x7fceeab0da68 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0x949 (0x7fceeab11539 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::Watchdog::run() + 0x105 (0x7fceeab135d5 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xe77e4 (0x7fcf75ce77e4 in /lib64/libstdc++.so.6)
frame #7: <unknown function> + 0x8b2ea (0x7fcfb3c8b2ea in /lib64/libc.so.6)
frame #8: <unknown function> + 0x110b40 (0x7fcfb3d10b40 in /lib64/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [PG ID 2 PG GUID 3 Rank 1] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
Search for `cudaErrorIllegalAddress' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from query at /pytorch/aten/src/ATen/cuda/CUDAEvent.h:108 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9d (0x7fcfa5b72fdd in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xc0e0 (0x7fcfa5f780e0 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x50 (0x7fceeab008f0 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x68 (0x7fceeab0da68 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0x949 (0x7fceeab11539 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::Watchdog::run() + 0x105 (0x7fceeab135d5 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xe77e4 (0x7fcf75ce77e4 in /lib64/libstdc++.so.6)
frame #7: <unknown function> + 0x8b2ea (0x7fcfb3c8b2ea in /lib64/libc.so.6)
frame #8: <unknown function> + 0x110b40 (0x7fcfb3d10b40 in /lib64/libc.so.6)

Exception raised from run at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:2099 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9d (0x7fcfa5b72fdd in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x68c348 (0x7fceea28c348 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xe77e4 (0x7fcf75ce77e4 in /lib64/libstdc++.so.6)
frame #3: <unknown function> + 0x8b2ea (0x7fcfb3c8b2ea in /lib64/libc.so.6)
frame #4: <unknown function> + 0x110b40 (0x7fcfb3d10b40 in /lib64/libc.so.6)

[rank5]:[E319 18:04:56.588674261 ProcessGroupNCCL.cpp:2093] [PG ID 2 PG GUID 3 Rank 5] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
Search for `cudaErrorIllegalAddress' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from query at /pytorch/aten/src/ATen/cuda/CUDAEvent.h:108 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9d (0x7f87b3b72fdd in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xc0e0 (0x7f87b75330e0 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x50 (0x7f86fc5008f0 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x68 (0x7f86fc50da68 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0x949 (0x7f86fc511539 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::Watchdog::run() + 0x105 (0x7f86fc5135d5 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xe77e4 (0x7f87876e77e4 in /lib64/libstdc++.so.6)
frame #7: <unknown function> + 0x8b2ea (0x7f87c548b2ea in /lib64/libc.so.6)
frame #8: <unknown function> + 0x110b40 (0x7f87c5510b40 in /lib64/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [PG ID 2 PG GUID 3 Rank 5] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
Search for `cudaErrorIllegalAddress' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from query at /pytorch/aten/src/ATen/cuda/CUDAEvent.h:108 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9d (0x7f87b3b72fdd in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xc0e0 (0x7f87b75330e0 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x50 (0x7f86fc5008f0 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x68 (0x7f86fc50da68 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0x949 (0x7f86fc511539 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::Watchdog::run() + 0x105 (0x7f86fc5135d5 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xe77e4 (0x7f87876e77e4 in /lib64/libstdc++.so.6)
frame #7: <unknown function> + 0x8b2ea (0x7f87c548b2ea in /lib64/libc.so.6)
frame #8: <unknown function> + 0x110b40 (0x7f87c5510b40 in /lib64/libc.so.6)

Exception raised from run at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:2099 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9d (0x7f87b3b72fdd in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x68c348 (0x7f86fbc8c348 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xe77e4 (0x7f87876e77e4 in /lib64/libstdc++.so.6)
frame #3: <unknown function> + 0x8b2ea (0x7f87c548b2ea in /lib64/libc.so.6)
frame #4: <unknown function> + 0x110b40 (0x7f87c5510b40 in /lib64/libc.so.6)

[rank4]:[E319 18:04:56.594518062 ProcessGroupNCCL.cpp:2093] [PG ID 2 PG GUID 3 Rank 4] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
Search for `cudaErrorIllegalAddress' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from query at /pytorch/aten/src/ATen/cuda/CUDAEvent.h:108 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9d (0x7f4116d72fdd in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xc0e0 (0x7f41171330e0 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x50 (0x7f405bd008f0 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x68 (0x7f405bd0da68 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0x949 (0x7f405bd11539 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::Watchdog::run() + 0x105 (0x7f405bd135d5 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xe77e4 (0x7f40e6ee77e4 in /lib64/libstdc++.so.6)
frame #7: <unknown function> + 0x8b2ea (0x7f4124e8b2ea in /lib64/libc.so.6)
frame #8: <unknown function> + 0x110b40 (0x7f4124f10b40 in /lib64/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [PG ID 2 PG GUID 3 Rank 4] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
Search for `cudaErrorIllegalAddress' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from query at /pytorch/aten/src/ATen/cuda/CUDAEvent.h:108 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9d (0x7f4116d72fdd in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xc0e0 (0x7f41171330e0 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x50 (0x7f405bd008f0 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x68 (0x7f405bd0da68 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0x949 (0x7f405bd11539 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::Watchdog::run() + 0x105 (0x7f405bd135d5 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xe77e4 (0x7f40e6ee77e4 in /lib64/libstdc++.so.6)
frame #7: <unknown function> + 0x8b2ea (0x7f4124e8b2ea in /lib64/libc.so.6)
frame #8: <unknown function> + 0x110b40 (0x7f4124f10b40 in /lib64/libc.so.6)

Exception raised from run at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:2099 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9d (0x7f4116d72fdd in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x68c348 (0x7f405b48c348 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xe77e4 (0x7f40e6ee77e4 in /lib64/libstdc++.so.6)
frame #3: <unknown function> + 0x8b2ea (0x7f4124e8b2ea in /lib64/libc.so.6)
frame #4: <unknown function> + 0x110b40 (0x7f4124f10b40 in /lib64/libc.so.6)

[rank7]:[E319 18:04:56.597273915 ProcessGroupNCCL.cpp:2093] [PG ID 2 PG GUID 3 Rank 7] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
Search for `cudaErrorIllegalAddress' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from query at /pytorch/aten/src/ATen/cuda/CUDAEvent.h:108 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9d (0x7eff4badefdd in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xc0e0 (0x7eff4bb780e0 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x50 (0x7efe90b008f0 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x68 (0x7efe90b0da68 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0x949 (0x7efe90b11539 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::Watchdog::run() + 0x105 (0x7efe90b135d5 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xe77e4 (0x7eff1bce77e4 in /lib64/libstdc++.so.6)
frame #7: <unknown function> + 0x8b2ea (0x7eff59a8b2ea in /lib64/libc.so.6)
frame #8: <unknown function> + 0x110b40 (0x7eff59b10b40 in /lib64/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [PG ID 2 PG GUID 3 Rank 7] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
Search for `cudaErrorIllegalAddress' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from query at /pytorch/aten/src/ATen/cuda/CUDAEvent.h:108 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9d (0x7eff4badefdd in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xc0e0 (0x7eff4bb780e0 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x50 (0x7efe90b008f0 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x68 (0x7efe90b0da68 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0x949 (0x7efe90b11539 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::Watchdog::run() + 0x105 (0x7efe90b135d5 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xe77e4 (0x7eff1bce77e4 in /lib64/libstdc++.so.6)
frame #7: <unknown function> + 0x8b2ea (0x7eff59a8b2ea in /lib64/libc.so.6)
frame #8: <unknown function> + 0x110b40 (0x7eff59b10b40 in /lib64/libc.so.6)

Exception raised from run at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:2099 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9d (0x7eff4badefdd in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x68c348 (0x7efe9028c348 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xe77e4 (0x7eff1bce77e4 in /lib64/libstdc++.so.6)
frame #3: <unknown function> + 0x8b2ea (0x7eff59a8b2ea in /lib64/libc.so.6)
frame #4: <unknown function> + 0x110b40 (0x7eff59b10b40 in /lib64/libc.so.6)

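The watchdog traces above repeatedly advise re-running with synchronous kernel launches so the stack trace points at the kernel that actually faulted. A minimal sketch of applying that advice to the repro (the `NCCL_DEBUG=INFO` setting is my assumption for extra collective-level logging, not something the report used; remaining serve flags as in the command below):

```shell
# CUDA_LAUNCH_BLOCKING=1 forces synchronous kernel launches, so the
# illegal memory access surfaces at the faulting call instead of at a
# later, unrelated API call (as the traces above warn).
# NCCL_DEBUG=INFO (assumption) adds NCCL-side logging around the
# watchdog aborts; drop it if the output is too noisy.
CUDA_LAUNCH_BLOCKING=1 NCCL_DEBUG=INFO \
  vllm serve zai-org/GLM-4.7-FP8 \
  --tensor-parallel-size 8
  # ...plus the other flags from the configuration below, unchanged
```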
Happy path:
Starting the vLLM (0.17.1) server with the zai-org/GLM-4.7-FP8 model and the following speculative decoding configuration works fine:

vllm serve zai-org/GLM-4.7-FP8 \
--tensor-parallel-size 8 \
--speculative-config.method mtp \
--speculative-config.num_speculative_tokens 1 \
--tool-call-parser glm47 \
--reasoning-parser glm45 \
--enable-auto-tool-choice \
--async-scheduling \
--enable-prefix-caching
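For contrast, the crash from the title occurs when throughput mode is also enabled under load. A sketch of the failing invocation, assuming the flag from the title is simply appended to the working command above:

```shell
# Identical to the working configuration, plus the one flag that
# triggers the cudaErrorIllegalAddress crash under load.
vllm serve zai-org/GLM-4.7-FP8 \
  --tensor-parallel-size 8 \
  --speculative-config.method mtp \
  --speculative-config.num_speculative_tokens 1 \
  --tool-call-parser glm47 \
  --reasoning-parser glm45 \
  --enable-auto-tool-choice \
  --async-scheduling \
  --enable-prefix-caching \
  --performance-mode throughput
```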

