
[Bug]: cudaErrorIllegalAddress crash when running zai-org/GLM-4.7-FP8 with --max-num-batched-tokens < default (e.g. 4K) under #37599

@Xarbirus


Your current environment

The output of python collect_env.py
Collecting environment information...
==============================
        System Info
==============================
OS                           : Amazon Linux 2023.10.20260216 (x86_64)
GCC version                  : (GCC) 11.5.0 20240719 (Red Hat 11.5.0-5)
Clang version                : Could not collect
CMake version                : version 3.22.2
Libc version                 : glibc-2.34

==============================
       PyTorch Info
==============================
PyTorch version              : 2.9.1+cu130
Is debug build               : False
CUDA used to build PyTorch   : 13.0
ROCM used to build PyTorch   : N/A

==============================
      Python Environment
==============================
Python version               : 3.12.13 (main, Mar 10 2026, 18:17:25) [Clang 21.1.4 ] (64-bit runtime)
Python platform              : Linux-6.1.161-183.298.amzn2023.x86_64-x86_64-with-glibc2.34

==============================
       CUDA / GPU Info
==============================
Is CUDA available            : True
CUDA runtime version         : 13.0.88
CUDA_MODULE_LOADING set to   : 
GPU models and configuration : 
GPU 0: NVIDIA H200
GPU 1: NVIDIA H200
GPU 2: NVIDIA H200
GPU 3: NVIDIA H200
GPU 4: NVIDIA H200
GPU 5: NVIDIA H200
GPU 6: NVIDIA H200
GPU 7: NVIDIA H200

Nvidia driver version        : 580.126.09
cuDNN version                : Could not collect
HIP runtime version          : N/A
MIOpen runtime version       : N/A
Is XNNPACK available         : True

==============================
          CPU Info
==============================
Architecture:                            x86_64
CPU op-mode(s):                          32-bit, 64-bit
Address sizes:                           48 bits physical, 48 bits virtual
Byte Order:                              Little Endian
CPU(s):                                  192
On-line CPU(s) list:                     0-191
Vendor ID:                               AuthenticAMD
Model name:                              AMD EPYC 7R13 Processor
CPU family:                              25
Model:                                   1
Thread(s) per core:                      2
Core(s) per socket:                      48
Socket(s):                               2
Stepping:                                1
BogoMIPS:                                5300.00
Flags:                                   fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext perfctr_core invpcid_single ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save vaes vpclmulqdq rdpid
Hypervisor vendor:                       KVM
Virtualization type:                     full
L1d cache:                               3 MiB (96 instances)
L1i cache:                               3 MiB (96 instances)
L2 cache:                                48 MiB (96 instances)
L3 cache:                                384 MiB (12 instances)
NUMA node(s):                            4
NUMA node0 CPU(s):                       0-23,96-119
NUMA node1 CPU(s):                       24-47,120-143
NUMA node2 CPU(s):                       48-71,144-167
NUMA node3 CPU(s):                       72-95,168-191
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Mitigation; safe RET
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; Retpolines; IBPB conditional; IBRS_FW; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Mitigation; Clear CPU buffers
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected

==============================
Versions of relevant libraries
==============================
[pip3] flashinfer-python==0.6.3
[pip3] numpy==2.2.6
[pip3] nvidia-cublas==13.0.0.19
[pip3] nvidia-cuda-cupti==13.0.48
[pip3] nvidia-cuda-nvrtc==13.0.48
[pip3] nvidia-cuda-runtime==13.0.48
[pip3] nvidia-cudnn-cu13==9.13.0.50
[pip3] nvidia-cudnn-frontend==1.20.0
[pip3] nvidia-cufft==12.0.0.15
[pip3] nvidia-cufile==1.15.0.42
[pip3] nvidia-curand==10.4.0.35
[pip3] nvidia-cusolver==12.0.3.29
[pip3] nvidia-cusparse==12.6.2.49
[pip3] nvidia-cusparselt-cu13==0.8.0
[pip3] nvidia-cutlass-dsl==4.4.2
[pip3] nvidia-cutlass-dsl-libs-base==4.4.2
[pip3] nvidia-ml-py==13.590.48
[pip3] nvidia-nccl-cu13==2.27.7
[pip3] nvidia-nvjitlink==13.0.39
[pip3] nvidia-nvshmem-cu13==3.3.24
[pip3] nvidia-nvtx==13.0.39
[pip3] pyzmq==27.1.0
[pip3] torch==2.9.1+cu130
[pip3] torchaudio==2.9.1+cu130
[pip3] torchvision==0.24.1+cu130
[pip3] transformers==4.57.6
[pip3] triton==3.5.1
[conda] Could not collect

==============================
         vLLM Info
==============================
ROCM Version                 : Could not collect
vLLM Version                 : 0.16.0
vLLM Build Flags:
  CUDA Archs: Not Set; ROCm: Disabled
GPU Topology:
  	GPU0	GPU1	GPU2	GPU3	GPU4	GPU5	GPU6	GPU7	CPU Affinity	NUMA Affinity	GPU NUMA ID
GPU0	 X 	NV18	NV18	NV18	NV18	NV18	NV18	NV18	24-47,120-143	1		N/A
GPU1	NV18	 X 	NV18	NV18	NV18	NV18	NV18	NV18	24-47,120-143	1		N/A
GPU2	NV18	NV18	 X 	NV18	NV18	NV18	NV18	NV18	0-23,96-119	0		N/A
GPU3	NV18	NV18	NV18	 X 	NV18	NV18	NV18	NV18	0-23,96-119	0		N/A
GPU4	NV18	NV18	NV18	NV18	 X 	NV18	NV18	NV18	72-95,168-191	3		N/A
GPU5	NV18	NV18	NV18	NV18	NV18	 X 	NV18	NV18	72-95,168-191	3		N/A
GPU6	NV18	NV18	NV18	NV18	NV18	NV18	 X 	NV18	48-71,144-167	2		N/A
GPU7	NV18	NV18	NV18	NV18	NV18	NV18	NV18	 X 	48-71,144-167	2		N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

🐛 Describe the bug

Description:
I am hitting a critical crash (CUDA error: an illegal memory access was encountered, cudaErrorIllegalAddress) when serving the zai-org/GLM-4.7-FP8 model with --max-num-batched-tokens set below the default (e.g. 4K). The crash occurs immediately after the first requests.

The server runs fine when --max-num-batched-tokens is left at its default.
Steps to Reproduce:

  1. Start the vLLM (0.17.1) server for the zai-org/GLM-4.7-FP8 model with the following configuration:
vllm serve zai-org/GLM-4.7-FP8 \
--tensor-parallel-size 8 \
--speculative-config.method mtp \
--speculative-config.num_speculative_tokens 1 \
--tool-call-parser glm47 \
--reasoning-parser glm45 \
--enable-auto-tool-choice \
--async-scheduling \
--enable-prefix-caching \
--max-num-batched-tokens 4K

Then start the benchmark:

vllm bench serve \
--model zai-org/GLM-4.7-FP8 \
--port 8000 \
--save-result \
--save-detailed \
--backend=vllm \
--dataset-name custom \
--dataset-path SOME_DATASET \
--disable-shuffle \
--metric-percentiles "50,90,95,99" \
--percentile-metrics "ttft,tpot,e2el" \
--result-dir "./vllm_bench_results/" \
--request-rate 1
  2. Send a request to the server.
  3. The server crashes with a CUDA error during request processing.
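Note that, as the error output itself warns, the fault is reported asynchronously, so the stack traces may point at the wrong call site. To pin the trace to the failing kernel, the server can be relaunched with synchronous kernel launches (a sketch reusing the flags from the repro above; this slows execution considerably and is for debugging only):

```shell
# CUDA_LAUNCH_BLOCKING=1 makes each kernel launch synchronous, so the
# reported stack trace points at the kernel that actually faulted.
CUDA_LAUNCH_BLOCKING=1 vllm serve zai-org/GLM-4.7-FP8 \
--tensor-parallel-size 8 \
--speculative-config.method mtp \
--speculative-config.num_speculative_tokens 1 \
--tool-call-parser glm47 \
--reasoning-parser glm45 \
--enable-auto-tool-choice \
--async-scheduling \
--enable-prefix-caching \
--max-num-batched-tokens 4K
```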
Error:
[rank2]:[E319 19:55:16.862637187 ProcessGroupNCCL.cpp:2093] [PG ID 2 PG GUID 3 Rank 2] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
Search for `cudaErrorIllegalAddress' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from query at /pytorch/aten/src/ATen/cuda/CUDAEvent.h:108 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9d (0x7fdb36372fdd in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xc0e0 (0x7fdb367780e0 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x50 (0x7fda7b3008f0 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x68 (0x7fda7b30da68 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0x949 (0x7fda7b311539 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::Watchdog::run() + 0x105 (0x7fda7b3135d5 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xe77e4 (0x7fdb064e77e4 in /lib64/libstdc++.so.6)
frame #7: <unknown function> + 0x8b2ea (0x7fdb4448b2ea in /lib64/libc.so.6)
frame #8: <unknown function> + 0x110b40 (0x7fdb44510b40 in /lib64/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [PG ID 2 PG GUID 3 Rank 2] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
Search for `cudaErrorIllegalAddress' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from query at /pytorch/aten/src/ATen/cuda/CUDAEvent.h:108 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9d (0x7fdb36372fdd in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xc0e0 (0x7fdb367780e0 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x50 (0x7fda7b3008f0 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x68 (0x7fda7b30da68 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0x949 (0x7fda7b311539 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::Watchdog::run() + 0x105 (0x7fda7b3135d5 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xe77e4 (0x7fdb064e77e4 in /lib64/libstdc++.so.6)
frame #7: <unknown function> + 0x8b2ea (0x7fdb4448b2ea in /lib64/libc.so.6)
frame #8: <unknown function> + 0x110b40 (0x7fdb44510b40 in /lib64/libc.so.6)

Exception raised from run at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:2099 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9d (0x7fdb36372fdd in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x68c348 (0x7fda7aa8c348 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xe77e4 (0x7fdb064e77e4 in /lib64/libstdc++.so.6)
frame #3: <unknown function> + 0x8b2ea (0x7fdb4448b2ea in /lib64/libc.so.6)
frame #4: <unknown function> + 0x110b40 (0x7fdb44510b40 in /lib64/libc.so.6)

(Worker pid=2513317) (Worker_TP0 pid=2513317) Exception in thread WorkerAsyncOutputCopy:
(Worker pid=2513317) (Worker_TP0 pid=2513317) Traceback (most recent call last):
(Worker pid=2513317) (Worker_TP0 pid=2513317)   File "/home/ssm-user/.local/share/uv/python/cpython-3.12.13-linux-x86_64-gnu/lib/python3.12/threading.py", line 1075, in _bootstrap_inner
(Worker pid=2513317) (Worker_TP0 pid=2513317)     self.run()
(Worker pid=2513317) (Worker_TP0 pid=2513317)   File "/home/ssm-user/.local/share/uv/python/cpython-3.12.13-linux-x86_64-gnu/lib/python3.12/threading.py", line 1012, in run
(Worker pid=2513317) (Worker_TP0 pid=2513317)     self._target(*self._args, **self._kwargs)
(Worker pid=2513317) (Worker_TP0 pid=2513317)   File "/home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 860, in async_output_busy_loop
(Worker pid=2513317) (Worker_TP0 pid=2513317)     self.enqueue_output(output)
(Worker pid=2513317) (Worker_TP0 pid=2513317)   File "/home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 837, in enqueue_output
(Worker pid=2513317) (Worker_TP0 pid=2513317)     output = output.get_output()
(Worker pid=2513317) (Worker_TP0 pid=2513317)              ^^^^^^^^^^^^^^^^^^^
(Worker pid=2513317) (Worker_TP0 pid=2513317)   File "/home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 251, in get_output
(Worker pid=2513317) (Worker_TP0 pid=2513317)     self.async_copy_ready_event.synchronize()
(Worker pid=2513317) (Worker_TP0 pid=2513317) torch.AcceleratorError: CUDA error: an illegal memory access was encountered
(Worker pid=2513317) (Worker_TP0 pid=2513317) Search for `cudaErrorIllegalAddress' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
(Worker pid=2513317) (Worker_TP0 pid=2513317) CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(Worker pid=2513317) (Worker_TP0 pid=2513317) For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(Worker pid=2513317) (Worker_TP0 pid=2513317) Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(Worker pid=2513317) (Worker_TP0 pid=2513317) 
[rank0]:[W319 19:55:16.871470664 CUDAGuardImpl.h:122] Warning: CUDA warning: an illegal memory access was encountered (function destroyEvent)
terminate called after throwing an instance of 'c10::AcceleratorError'
  what():  CUDA error: an illegal memory access was encountered
Search for `cudaErrorIllegalAddress' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from currentStreamCaptureStatusMayInitCtx at /pytorch/c10/cuda/CUDAGraphsC10Utils.h:71 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9d (0x7fe47c8defdd in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xc0e0 (0x7fe47c9780e0 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libc10_cuda.so)
frame #2: <unknown function> + 0xce4b3a (0x7fe3c16e4b3a in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #3: <unknown function> + 0x7e9d4 (0x7fe47c8c09d4 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #4: c10::TensorImpl::~TensorImpl() + 0x9 (0x7fe47c8ba369 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #5: <unknown function> + 0x862f65 (0x7fe3f4e62f65 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0x863001 (0x7fe3f4e63001 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_python.so)
frame #7: VLLM::Worker_TP0() [0x1635231]
frame #8: VLLM::Worker_TP0() [0x163537b]
frame #9: VLLM::Worker_TP0() [0x1633f63]
frame #10: VLLM::Worker_TP0() [0x1633ab3]
frame #11: VLLM::Worker_TP0() [0x1633ad6]
frame #12: VLLM::Worker_TP0() [0x1633ad6]
frame #13: VLLM::Worker_TP0() [0x1633c27]
frame #14: VLLM::Worker_TP0() [0x1635231]
frame #15: _PyEval_EvalFrameDefault + 0xe5ec (0x16200ec in VLLM::Worker_TP0)
frame #16: VLLM::Worker_TP0() [0x161122d]
frame #17: VLLM::Worker_TP0() [0x1740925]
frame #18: VLLM::Worker_TP0() [0x1740861]
frame #19: <unknown function> + 0x8b2ea (0x7fe48a88b2ea in /lib64/libc.so.6)
frame #20: <unknown function> + 0x110b40 (0x7fe48a910b40 in /lib64/libc.so.6)

[rank7]:[E319 19:55:16.880866178 ProcessGroupNCCL.cpp:2093] [PG ID 2 PG GUID 3 Rank 7] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
Search for `cudaErrorIllegalAddress' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from query at /pytorch/aten/src/ATen/cuda/CUDAEvent.h:108 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9d (0x7f5bbb572fdd in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xc0e0 (0x7f5bbef330e0 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x50 (0x7f5b03f008f0 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x68 (0x7f5b03f0da68 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0x949 (0x7f5b03f11539 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::Watchdog::run() + 0x105 (0x7f5b03f135d5 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xe77e4 (0x7f5b8f0e77e4 in /lib64/libstdc++.so.6)
frame #7: <unknown function> + 0x8b2ea (0x7f5bcce8b2ea in /lib64/libc.so.6)
frame #8: <unknown function> + 0x110b40 (0x7f5bccf10b40 in /lib64/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [PG ID 2 PG GUID 3 Rank 7] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
Search for `cudaErrorIllegalAddress' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from query at /pytorch/aten/src/ATen/cuda/CUDAEvent.h:108 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9d (0x7f5bbb572fdd in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xc0e0 (0x7f5bbef330e0 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x50 (0x7f5b03f008f0 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x68 (0x7f5b03f0da68 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0x949 (0x7f5b03f11539 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::Watchdog::run() + 0x105 (0x7f5b03f135d5 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xe77e4 (0x7f5b8f0e77e4 in /lib64/libstdc++.so.6)
frame #7: <unknown function> + 0x8b2ea (0x7f5bcce8b2ea in /lib64/libc.so.6)
frame #8: <unknown function> + 0x110b40 (0x7f5bccf10b40 in /lib64/libc.so.6)

Exception raised from run at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:2099 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9d (0x7f5bbb572fdd in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x68c348 (0x7f5b0368c348 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xe77e4 (0x7f5b8f0e77e4 in /lib64/libstdc++.so.6)
frame #3: <unknown function> + 0x8b2ea (0x7f5bcce8b2ea in /lib64/libc.so.6)
frame #4: <unknown function> + 0x110b40 (0x7f5bccf10b40 in /lib64/libc.so.6)

[rank5]:[E319 19:55:16.897002377 ProcessGroupNCCL.cpp:2093] [PG ID 2 PG GUID 3 Rank 5] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
Search for `cudaErrorIllegalAddress' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from query at /pytorch/aten/src/ATen/cuda/CUDAEvent.h:108 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9d (0x7f88880defdd in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xc0e0 (0x7f88881780e0 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x50 (0x7f87cd1008f0 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x68 (0x7f87cd10da68 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0x949 (0x7f87cd111539 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::Watchdog::run() + 0x105 (0x7f87cd1135d5 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xe77e4 (0x7f88582e77e4 in /lib64/libstdc++.so.6)
frame #7: <unknown function> + 0x8b2ea (0x7f889608b2ea in /lib64/libc.so.6)
frame #8: <unknown function> + 0x110b40 (0x7f8896110b40 in /lib64/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [PG ID 2 PG GUID 3 Rank 5] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
Search for `cudaErrorIllegalAddress' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from query at /pytorch/aten/src/ATen/cuda/CUDAEvent.h:108 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9d (0x7f88880defdd in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xc0e0 (0x7f88881780e0 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x50 (0x7f87cd1008f0 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x68 (0x7f87cd10da68 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0x949 (0x7f87cd111539 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::Watchdog::run() + 0x105 (0x7f87cd1135d5 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xe77e4 (0x7f88582e77e4 in /lib64/libstdc++.so.6)
frame #7: <unknown function> + 0x8b2ea (0x7f889608b2ea in /lib64/libc.so.6)
frame #8: <unknown function> + 0x110b40 (0x7f8896110b40 in /lib64/libc.so.6)

Exception raised from run at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:2099 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9d (0x7f88880defdd in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x68c348 (0x7f87cc88c348 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xe77e4 (0x7f88582e77e4 in /lib64/libstdc++.so.6)
frame #3: <unknown function> + 0x8b2ea (0x7f889608b2ea in /lib64/libc.so.6)
frame #4: <unknown function> + 0x110b40 (0x7f8896110b40 in /lib64/libc.so.6)

[rank4]:[E319 19:55:16.899273792 ProcessGroupNCCL.cpp:2093] [PG ID 2 PG GUID 3 Rank 4] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
Search for `cudaErrorIllegalAddress' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from query at /pytorch/aten/src/ATen/cuda/CUDAEvent.h:108 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9d (0x7f196e0defdd in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xc0e0 (0x7f196e1780e0 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x50 (0x7f18b31008f0 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x68 (0x7f18b310da68 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0x949 (0x7f18b3111539 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::Watchdog::run() + 0x105 (0x7f18b31135d5 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xe77e4 (0x7f193e2e77e4 in /lib64/libstdc++.so.6)
frame #7: <unknown function> + 0x8b2ea (0x7f197c08b2ea in /lib64/libc.so.6)
frame #8: <unknown function> + 0x110b40 (0x7f197c110b40 in /lib64/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [PG ID 2 PG GUID 3 Rank 4] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
Search for `cudaErrorIllegalAddress' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from query at /pytorch/aten/src/ATen/cuda/CUDAEvent.h:108 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9d (0x7f196e0defdd in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xc0e0 (0x7f196e1780e0 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x50 (0x7f18b31008f0 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x68 (0x7f18b310da68 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0x949 (0x7f18b3111539 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::Watchdog::run() + 0x105 (0x7f18b31135d5 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xe77e4 (0x7f193e2e77e4 in /lib64/libstdc++.so.6)
frame #7: <unknown function> + 0x8b2ea (0x7f197c08b2ea in /lib64/libc.so.6)
frame #8: <unknown function> + 0x110b40 (0x7f197c110b40 in /lib64/libc.so.6)

Exception raised from run at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:2099 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9d (0x7f196e0defdd in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x68c348 (0x7f18b288c348 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xe77e4 (0x7f193e2e77e4 in /lib64/libstdc++.so.6)
frame #3: <unknown function> + 0x8b2ea (0x7f197c08b2ea in /lib64/libc.so.6)
frame #4: <unknown function> + 0x110b40 (0x7f197c110b40 in /lib64/libc.so.6)

[rank6]:[E319 19:55:16.904900358 ProcessGroupNCCL.cpp:2093] [PG ID 2 PG GUID 3 Rank 6] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
Search for `cudaErrorIllegalAddress' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from query at /pytorch/aten/src/ATen/cuda/CUDAEvent.h:108 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9d (0x7fbf066defdd in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xc0e0 (0x7fbf067780e0 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x50 (0x7fbe4b7008f0 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x68 (0x7fbe4b70da68 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0x949 (0x7fbe4b711539 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::Watchdog::run() + 0x105 (0x7fbe4b7135d5 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xe77e4 (0x7fbed68e77e4 in /lib64/libstdc++.so.6)
frame #7: <unknown function> + 0x8b2ea (0x7fbf1468b2ea in /lib64/libc.so.6)
frame #8: <unknown function> + 0x110b40 (0x7fbf14710b40 in /lib64/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [PG ID 2 PG GUID 3 Rank 6] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
Search for `cudaErrorIllegalAddress' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from query at /pytorch/aten/src/ATen/cuda/CUDAEvent.h:108 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9d (0x7fbf066defdd in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xc0e0 (0x7fbf067780e0 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x50 (0x7fbe4b7008f0 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x68 (0x7fbe4b70da68 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0x949 (0x7fbe4b711539 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::Watchdog::run() + 0x105 (0x7fbe4b7135d5 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xe77e4 (0x7fbed68e77e4 in /lib64/libstdc++.so.6)
frame #7: <unknown function> + 0x8b2ea (0x7fbf1468b2ea in /lib64/libc.so.6)
frame #8: <unknown function> + 0x110b40 (0x7fbf14710b40 in /lib64/libc.so.6)

Exception raised from run at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:2099 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9d (0x7fbf066defdd in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x68c348 (0x7fbe4ae8c348 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xe77e4 (0x7fbed68e77e4 in /lib64/libstdc++.so.6)
frame #3: <unknown function> + 0x8b2ea (0x7fbf1468b2ea in /lib64/libc.so.6)
frame #4: <unknown function> + 0x110b40 (0x7fbf14710b40 in /lib64/libc.so.6)

[rank1]:[E319 19:55:16.907649710 ProcessGroupNCCL.cpp:2093] [PG ID 2 PG GUID 3 Rank 1] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
Search for `cudaErrorIllegalAddress' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from query at /pytorch/aten/src/ATen/cuda/CUDAEvent.h:108 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9d (0x7f5a900defdd in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xc0e0 (0x7f5a901780e0 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x50 (0x7f59d51008f0 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x68 (0x7f59d510da68 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0x949 (0x7f59d5111539 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::Watchdog::run() + 0x105 (0x7f59d51135d5 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xe77e4 (0x7f5a602e77e4 in /lib64/libstdc++.so.6)
frame #7: <unknown function> + 0x8b2ea (0x7f5a9e08b2ea in /lib64/libc.so.6)
frame #8: <unknown function> + 0x110b40 (0x7f5a9e110b40 in /lib64/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [PG ID 2 PG GUID 3 Rank 1] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
Search for `cudaErrorIllegalAddress' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from query at /pytorch/aten/src/ATen/cuda/CUDAEvent.h:108 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9d (0x7f5a900defdd in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xc0e0 (0x7f5a901780e0 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x50 (0x7f59d51008f0 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x68 (0x7f59d510da68 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0x949 (0x7f59d5111539 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::Watchdog::run() + 0x105 (0x7f59d51135d5 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xe77e4 (0x7f5a602e77e4 in /lib64/libstdc++.so.6)
frame #7: <unknown function> + 0x8b2ea (0x7f5a9e08b2ea in /lib64/libc.so.6)
frame #8: <unknown function> + 0x110b40 (0x7f5a9e110b40 in /lib64/libc.so.6)

Exception raised from run at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:2099 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9d (0x7f5a900defdd in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x68c348 (0x7f59d488c348 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xe77e4 (0x7f5a602e77e4 in /lib64/libstdc++.so.6)
frame #3: <unknown function> + 0x8b2ea (0x7f5a9e08b2ea in /lib64/libc.so.6)
frame #4: <unknown function> + 0x110b40 (0x7f5a9e110b40 in /lib64/libc.so.6)

[rank3]:[E319 19:55:16.908538094 ProcessGroupNCCL.cpp:2093] [PG ID 2 PG GUID 3 Rank 3] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
Search for `cudaErrorIllegalAddress' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from query at /pytorch/aten/src/ATen/cuda/CUDAEvent.h:108 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9d (0x7ff3f7b72fdd in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xc0e0 (0x7ff3fb5330e0 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x50 (0x7ff3405008f0 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x68 (0x7ff34050da68 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0x949 (0x7ff340511539 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::Watchdog::run() + 0x105 (0x7ff3405135d5 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xe77e4 (0x7ff3cb6e77e4 in /lib64/libstdc++.so.6)
frame #7: <unknown function> + 0x8b2ea (0x7ff40948b2ea in /lib64/libc.so.6)
frame #8: <unknown function> + 0x110b40 (0x7ff409510b40 in /lib64/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [PG ID 2 PG GUID 3 Rank 3] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
Search for `cudaErrorIllegalAddress' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from query at /pytorch/aten/src/ATen/cuda/CUDAEvent.h:108 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9d (0x7ff3f7b72fdd in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xc0e0 (0x7ff3fb5330e0 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x50 (0x7ff3405008f0 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x68 (0x7ff34050da68 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0x949 (0x7ff340511539 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::Watchdog::run() + 0x105 (0x7ff3405135d5 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xe77e4 (0x7ff3cb6e77e4 in /lib64/libstdc++.so.6)
frame #7: <unknown function> + 0x8b2ea (0x7ff40948b2ea in /lib64/libc.so.6)
frame #8: <unknown function> + 0x110b40 (0x7ff409510b40 in /lib64/libc.so.6)

Exception raised from run at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:2099 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9d (0x7ff3f7b72fdd in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x68c348 (0x7ff33fc8c348 in /home/ssm-user/mikhail.podvitskii/vllm-env/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xe77e4 (0x7ff3cb6e77e4 in /lib64/libstdc++.so.6)
frame #3: <unknown function> + 0x8b2ea (0x7ff40948b2ea in /lib64/libc.so.6)
frame #4: <unknown function> + 0x110b40 (0x7ff409510b40 in /lib64/libc.so.6)

Happy path:
Start the vLLM (0.17.1) server with the zai-org/GLM-4.7-FP8 model and the default max-num-batched-tokens; everything works fine:

vllm serve zai-org/GLM-4.7-FP8 \
--tensor-parallel-size 8 \
--speculative-config.method mtp \
--speculative-config.num_speculative_tokens 1 \
--tool-call-parser glm47 \
--reasoning-parser glm45 \
--enable-auto-tool-choice \
--async-scheduling \
--enable-prefix-caching
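
Failing path (illustrative sketch, inferred from the issue title): the same invocation, with the only assumed change being a --max-num-batched-tokens below the default (4096 here is an example value from the title's "e.g. 4K"), triggers the cudaErrorIllegalAddress crash shown in the logs above:

vllm serve zai-org/GLM-4.7-FP8 \
--tensor-parallel-size 8 \
--speculative-config.method mtp \
--speculative-config.num_speculative_tokens 1 \
--tool-call-parser glm47 \
--reasoning-parser glm45 \
--enable-auto-tool-choice \
--async-scheduling \
--enable-prefix-caching \
--max-num-batched-tokens 4096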

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
