[Bug] TP worker segfaults in NCCL all-gather during CUDA graph capture

### Checklist

- [x] 1. I have searched related issues but cannot get the expected help.
- [x] 2. The bug has not been fixed in the latest version.
- [x] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
- [x] 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
- [x] 5. Please use English, otherwise it will be closed.

### Describe the bug

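All four TP workers hit a segmentation fault inside `ncclAllGather` as soon as CUDA graph capture starts. I traced the regression to #5728.
cc @Edenzzzz @merrymercy
Possibly related to `CUDA_VISIBLE_DEVICES`.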
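
To help confirm or rule out the `CUDA_VISIBLE_DEVICES` theory, here is a minimal diagnostic sketch that could be called from each TP worker before graph capture. It is not part of SGLang: `parse_visible_devices` and `describe_worker` are hypothetical helper names, and the sketch only reports what each process sees in its environment. It does not touch NCCL or CUDA.

```python
import os
from typing import List, Optional


def parse_visible_devices(value: Optional[str]) -> Optional[List[str]]:
    """Split a CUDA_VISIBLE_DEVICES string into its entries.

    Returns None when the variable is unset (no restriction applied),
    and an empty list when it is set to an empty string.
    """
    if value is None:
        return None
    return [entry.strip() for entry in value.split(",") if entry.strip()]


def describe_worker(tp_rank: int, value: Optional[str]) -> str:
    """Build a one-line summary of what this process sees (hypothetical helper)."""
    devices = parse_visible_devices(value)
    shown = "unset" if devices is None else ",".join(devices) or "(empty)"
    return f"pid={os.getpid()} tp_rank={tp_rank} CUDA_VISIBLE_DEVICES={shown}"


if __name__ == "__main__":
    # tp_rank=0 is a placeholder; a real worker would pass its own rank.
    print(describe_worker(0, os.environ.get("CUDA_VISIBLE_DEVICES")))
```

If the ranks print different values, or fewer entries than the TP size, that would point toward the environment variable rather than NCCL itself.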

```
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  5.99it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  5.98it/s]

[2025-04-26 20:55:30 TP3] Load weight end. type=Qwen2ForCausalLM, dtype=torch.bfloat16, avail mem=90.00 GB, mem usage=0.94 GB.
[2025-04-26 20:55:30 TP0] Load weight end. type=Qwen2ForCausalLM, dtype=torch.bfloat16, avail mem=89.06 GB, mem usage=0.94 GB.
[2025-04-26 20:55:30 TP1] Load weight end. type=Qwen2ForCausalLM, dtype=torch.bfloat16, avail mem=88.82 GB, mem usage=0.94 GB.
[2025-04-26 20:55:30 TP2] Load weight end. type=Qwen2ForCausalLM, dtype=torch.bfloat16, avail mem=88.82 GB, mem usage=0.94 GB.
[2025-04-26 20:55:30 TP2] KV Cache is allocated. #tokens: 4635840, K size: 30.95 GB, V size: 30.95 GB
[2025-04-26 20:55:30 TP0] KV Cache is allocated. #tokens: 4635840, K size: 30.95 GB, V size: 30.95 GB
[2025-04-26 20:55:30 TP0] Memory pool end. avail mem=27.14 GB
[2025-04-26 20:55:30 TP2] Memory pool end. avail mem=26.90 GB
[2025-04-26 20:55:30 TP3] KV Cache is allocated. #tokens: 4635840, K size: 30.95 GB, V size: 30.95 GB
[2025-04-26 20:55:30 TP3] Memory pool end. avail mem=28.07 GB
[2025-04-26 20:55:30 TP1] KV Cache is allocated. #tokens: 4635840, K size: 30.95 GB, V size: 30.95 GB
[2025-04-26 20:55:30 TP1] Memory pool end. avail mem=26.90 GB
[2025-04-26 20:55:30 TP0] Capture cuda graph begin. This can take up to several minutes. avail mem=27.04 GB
[2025-04-26 20:55:30 TP3] Capture cuda graph begin. This can take up to several minutes. avail mem=27.98 GB
[2025-04-26 20:55:30 TP2] Capture cuda graph begin. This can take up to several minutes. avail mem=26.81 GB
[2025-04-26 20:55:30 TP1] Capture cuda graph begin. This can take up to several minutes. avail mem=26.81 GB
Capturing batches (avail_mem=27.02 GB):   0%|          | 0/8 [00:00<?, ?it/s][TENCENT64:293198:0:293198] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x78ba1)
[TENCENT64:293195:0:293195] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x78ba1)
BFD: DWARF error: section .debug_info is larger than its filesize! (0x5d4f30 vs 0x436a28)
BFD: DWARF error: section .debug_info is larger than its filesize! (0x5d4f30 vs 0x436a28)
[TENCENT64:293197:0:293197] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x78ba1)
[TENCENT64:293196:0:293196] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x78ba1)
BFD: DWARF error: section .debug_info is larger than its filesize! (0x5d4f30 vs 0x436a28)
BFD: DWARF error: section .debug_info is larger than its filesize! (0x5d4f30 vs 0x436a28)
BFD: DWARF error: section .debug_info is larger than its filesize! (0x5d4f30 vs 0x436a28)
BFD: DWARF error: section .debug_info is larger than its filesize! (0x5d4f30 vs 0x436a28)
BFD: DWARF error: section .debug_info is larger than its filesize! (0x5d4f30 vs 0x436a28)
BFD: DWARF error: section .debug_info is larger than its filesize! (0x5d4f30 vs 0x436a28)
==== backtrace (tid: 293195) ====
 0 0x0000000000042520 __sigaction()  ???:0
 1 0x000000000005bb28 ncclGroupCommJoin()  /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/include/group.h:113
 2 0x000000000005bb28 taskAppend()  /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/enqueue.cc:2152
 3 0x000000000005bb28 ncclEnqueueCheck()  /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/enqueue.cc:2224
 4 0x000000000004e991 ncclAllGather()  /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/collectives.cc:88
 5 0x0000000000007e2e ffi_prep_go_closure()  ???:0
 6 0x0000000000004493 ???()  /lib/x86_64-linux-gnu/libffi.so.8:0
 7 0x000000000000a3e9 ???()  /usr/lib/python3.10/lib-dynload/_ctypes.cpython-310-x86_64-linux-gnu.so:0
 8 0x0000000000013302 ???()  /usr/lib/python3.10/lib-dynload/_ctypes.cpython-310-x86_64-linux-gnu.so:0
 9 0x000000000018139b _PyObject_MakeTpCall()  ???:0
10 0x000000000017aa97 _PyEval_EvalFrameDefault()  ???:0
11 0x000000000018b38c _PyFunction_Vectorcall()  ???:0
12 0x000000000017597f _PyEval_EvalFrameDefault()  ???:0
13 0x000000000018b38c _PyFunction_Vectorcall()  ???:0
14 0x000000000017597f _PyEval_EvalFrameDefault()  ???:0
15 0x000000000018b38c _PyFunction_Vectorcall()  ???:0
16 0x000000000017597f _PyEval_EvalFrameDefault()  ???:0
17 0x000000000018b38c _PyFunction_Vectorcall()  ???:0
18 0x0000000000a6d3a7 pybind11::detail::object_api<pybind11::handle>::operator()<(pybind11::return_value_policy)1, pybind11::detail::args_proxy, pybind11::detail::kwargs_proxy>()  :0
19 0x0000000000d96640 torch::impl::dispatch::PythonKernelHolder::operator()()  :0
20 0x00000000058bc27b c10::OperatorHandle::redispatchBoxed()  :0
21 0x00000000058b9af9 torch::autograd::basicAutogradNotImplementedFallbackImpl()  autograd_not_implemented_fallback.cpp:0
22 0x0000000001aca9f8 c10::BoxedKernel::make_boxed_function<&(anonymous namespace)::autograd_fallback>()  VariableFallbackKernel.cpp:0
23 0x0000000000da1457 c10::Dispatcher::callBoxed()  ???:0
24 0x0000000000b2c2e6 torch::jit::invokeOperatorFromPython()  ???:0
25 0x0000000000b2c647 torch::jit::_get_operation_for_overload_or_packet()  ???:0
26 0x0000000000a1b592 pybind11::cpp_function::initialize<torch::jit::initJITBindings(_object*)::{lambda(std::string const&)#217}::operator()(std::string const&) const::{lambda(pybind11::args const&, pybind11::kwargs const&)#1}, pybind11::object, pybind11::args const&, pybind11::kwargs const&, pybind11::name, pybind11::doc>(torch::jit::initJITBindings(_object*)::{lambda(std::string const&)#217}::operator()(std::string const&) const::{lambda(pybind11::args const&, pybind11::kwargs const&)#1}&&, pybind11::object (*)(pybind11::args const&, pybind11::kwargs const&), pybind11::name const&, pybind11::doc const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN()  init.cpp:0
27 0x0000000000518d37 pybind11::cpp_function::dispatcher()  :0
28 0x000000000018ab32 PyObject_CallFunctionObjArgs()  ???:0
29 0x000000000019910b PyObject_Call()  ???:0
30 0x000000000017b6ef _PyEval_EvalFrameDefault()  ???:0
31 0x000000000018b38c _PyFunction_Vectorcall()  ???:0
32 0x000000000018061d _PyObject_FastCallDictTstate()  ???:0
33 0x000000000019562c _PyObject_Call_Prepend()  ???:0
34 0x000000000029d464 PyInit__datetime()  ???:0
35 0x000000000018139b _PyObject_MakeTpCall()  ???:0
36 0x000000000017b99e _PyEval_EvalFrameDefault()  ???:0
37 0x000000000018b38c _PyFunction_Vectorcall()  ???:0
38 0x000000000017597f _PyEval_EvalFrameDefault()  ???:0
39 0x000000000018b38c _PyFunction_Vectorcall()  ???:0
40 0x000000000017597f _PyEval_EvalFrameDefault()  ???:0
41 0x000000000018b38c _PyFunction_Vectorcall()  ???:0
42 0x0000000000175790 _PyEval_EvalFrameDefault()  ???:0
43 0x00000000001984d1 PyMethod_New()  ???:0
44 0x000000000017a702 _PyEval_EvalFrameDefault()  ???:0
45 0x000000000019861e PyMethod_New()  ???:0
46 0x0000000000177c30 _PyEval_EvalFrameDefault()  ???:0
47 0x000000000019861e PyMethod_New()  ???:0
48 0x0000000000177c30 _PyEval_EvalFrameDefault()  ???:0
49 0x0000000000180574 _PyObject_FastCallDictTstate()  ???:0
50 0x000000000019562c _PyObject_Call_Prepend()  ???:0
51 0x000000000029d464 PyInit__datetime()  ???:0
52 0x000000000018139b _PyObject_MakeTpCall()  ???:0
53 0x000000000017b009 _PyEval_EvalFrameDefault()  ???:0
54 0x000000000018b38c _PyFunction_Vectorcall()  ???:0
55 0x0000000000177c30 _PyEval_EvalFrameDefault()  ???:0
56 0x00000000001984d1 PyMethod_New()  ???:0
=================================
Fatal Python error: Segmentation fault

Thread 0x00007f6f51fff640 (most recent call first):
  File "/usr/lib/python3.10/threading.py", line 324 in wait
  File "/usr/lib/python3.10/threading.py", line 607 in wait
  File "/usr/local/lib/python3.10/dist-packages/tqdm/_monitor.py", line 60 in run
  File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
  File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap

Thread 0x00007f6fb0a80640 (most recent call first):
  File "/usr/lib/python3.10/threading.py", line 324 in wait
  File "/usr/lib/python3.10/threading.py", line 607 in wait
  File "/usr/local/lib/python3.10/dist-packages/tqdm/_monitor.py", line 60 in run
  File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
  File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap

Current thread 0x00007f76a8c0b740 (most recent call first):
  File "/sgl-workspace/sglang/python/sglang/srt/distributed/device_communicators/pynccl_wrapper.py", line 413 in ncclAllGather
  File "/sgl-workspace/sglang/python/sglang/srt/distributed/device_communicators/pynccl.py", line 162 in all_gather
  File "/sgl-workspace/sglang/python/sglang/srt/distributed/parallel_state.py", line 460 in _all_gather_into_tensor
  File "/sgl-workspace/sglang/python/sglang/srt/distributed/parallel_state.py", line 149 in reg_all_gather_into_tensor
  File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 1123 in __call__
  File "/sgl-workspace/sglang/python/sglang/srt/distributed/parallel_state.py", line 470 in all_gather_into_tensor
  File "/sgl-workspace/sglang/python/sglang/srt/distributed/parallel_state.py", line 513 in all_gather
  File "/sgl-workspace/sglang/python/sglang/srt/distributed/communication_op.py", line 20 in tensor_model_parallel_all_gather
  File "/sgl-workspace/sglang/python/sglang/srt/layers/logits_processor.py", line 445 in _get_logits
  File "/sgl-workspace/sglang/python/sglang/srt/layers/logits_processor.py", line 311 in forward
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1750 in _call_impl
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1739 in _wrapped_call_impl
  File "/sgl-workspace/sglang/python/sglang/srt/models/qwen2.py", line 385 in forward
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116 in decorate_context
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 445 in run_once
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 452 in capture_one_batch_size
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 360 in capture
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 276 in __init__
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 965 in init_cuda_graphs
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 219 in initialize
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 181 in __init__
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 75 in __init__
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 261 in __init__
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 2012 in run_scheduler_process
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108 in run
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314 in _bootstrap
  File "/usr/lib/python3.10/multiprocessing/spawn.py", line 129 in _main
  File "/usr/lib/python3.10/multiprocessing/spawn.py", line 116 in spawn_main
  File "<string>", line 1 in <module>

Extension modules: numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, charset_normalizer.md, requests.packages.charset_normalizer.md, requests.packages.chardet.md, multidict._multidict, yarl._quoting_c, propcache._helpers_c, aiohttp._http_writer, aiohttp._http_parser, aiohttp._websocket.mask, aiohttp._websocket.reader_c, frozenlist._frozenlist, uvloop.loop, zmq.backend.cython._zmq, PIL._imaging, torch._C, torch._C._dynamo.autograd_compiler, torch._C._dynamo.eval_frame, torch._C._dynamo.guards, torch._C._dynamo.utils, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, psutil._psutil_linux, psutil._psutil_posix, setproctitle, yaml._yaml, markupsafe._speedups, PIL._imagingft, sklearn.__check_build._check_build, scipy._lib._ccallback_c, scipy.sparse._sparsetools, _csparsetools, scipy.sparse._csparsetools, scipy.linalg._fblas, scipy.linalg._flapack, scipy.linalg.cython_lapack, scipy.linalg._cythonized_array_utils, scipy.linalg._solve_toeplitz, scipy.linalg._decomp_lu_cython, scipy.linalg._matfuncs_sqrtm_triu, scipy.linalg._matfuncs_expm, scipy.linalg._linalg_pythran, scipy.linalg.cython_blas, scipy.linalg._decomp_update, scipy.sparse.linalg._dsolve._superlu, scipy.sparse.linalg._eigen.arpack._arpack, scipy.sparse.linalg._propack._spropack, scipy.sparse.linalg._propack._dpropack, scipy.sparse.linalg._propack._cpropack, scipy.sparse.linalg._propack._zpropack, scipy.sparse.csgraph._tools, scipy.sparse.csgraph._shortest_path, scipy.sparse.csgraph._traversal, scipy.sparse.csgraph._min_spanning_tree, scipy.sparse.csgraph._flow, scipy.sparse.csgraph._matching, scipy.sparse.csgraph._reordering, scipy.special._ufuncs_cxx, scipy.special._ufuncs, 
scipy.special._specfun, scipy.special._comb, scipy.special._ellip_harm_2, scipy.spatial._ckdtree, scipy._lib.messagestream, scipy.spatial._qhull, scipy.spatial._voronoi, scipy.spatial._distance_wrap, scipy.spatial._hausdorff, scipy.spatial.transform._rotation, scipy.optimize._group_columns, scipy.optimize._trlib._trlib, scipy.optimize._lbfgsb, _moduleTNC, scipy.optimize._moduleTNC, scipy.optimize._cobyla, scipy.optimize._slsqp, scipy.optimize._minpack, scipy.optimize._lsq.givens_elimination, scipy.optimize._zeros, scipy.optimize._cython_nnls, scipy._lib._uarray._uarray, scipy.linalg._decomp_interpolative, scipy.optimize._bglu_dense, scipy.optimize._lsap, scipy.optimize._direct, scipy.integrate._odepack, scipy.integrate._quadpack, scipy.integrate._vode, scipy.integrate._dop, scipy.integrate._lsoda, scipy.interpolate._fitpack, scipy.interpolate._dfitpack, scipy.interpolate._dierckx, scipy.interpolate._ppoly, scipy.interpolate._interpnd, scipy.interpolate._rbfinterp_pythran, scipy.interpolate._rgi_cython, scipy.interpolate._bspl, scipy.special.cython_special, scipy.stats._stats, scipy.stats._sobol, scipy.stats._qmc_cy, scipy.stats._biasedurn, scipy.stats._stats_pythran, scipy.stats._levy_stable.levyst, scipy.stats._ansari_swilk_statistics, scipy.stats._mvn, scipy.stats._rcont.rcont, scipy.ndimage._nd_image, scipy.ndimage._rank_filter_1d, _ni_label, scipy.ndimage._ni_label, pyarrow.lib, pandas._libs.tslibs.ccalendar, pandas._libs.tslibs.np_datetime, pandas._libs.tslibs.dtypes, pandas._libs.tslibs.base, pandas._libs.tslibs.nattype, pandas._libs.tslibs.timezones, pandas._libs.tslibs.fields, pandas._libs.tslibs.timedeltas, pandas._libs.tslibs.tzconversion, pandas._libs.tslibs.timestamps, pandas._libs.properties, pandas._libs.tslibs.offsets, pandas._libs.tslibs.strptime, pandas._libs.tslibs.parsing, pandas._libs.tslibs.conversion, pandas._libs.tslibs.period, pandas._libs.tslibs.vectorized, pandas._libs.ops_dispatch, pandas._libs.missing, pandas._libs.hashtable, 
pandas._libs.algos, pandas._libs.interval, pandas._libs.lib, pyarrow._compute, pandas._libs.ops, pandas._libs.hashing, pandas._libs.arrays, pandas._libs.tslib, pandas._libs.sparse, pandas._libs.internals, pandas._libs.indexing, pandas._libs.index, pandas._libs.writers, pandas._libs.join, pandas._libs.window.aggregations, pandas._libs.window.indexers, pandas._libs.reshape, pandas._libs.groupby, pandas._libs.json, pandas._libs.parsers, pandas._libs.testing, sklearn.utils._isfinite, sklearn.utils.sparsefuncs_fast, sklearn.utils.murmurhash, sklearn.utils._openmp_helpers, sklearn.metrics.cluster._expected_mutual_info_fast, sklearn.preprocessing._csr_polynomial_expansion, sklearn.preprocessing._target_encoder_fast, sklearn.metrics._dist_metrics, sklearn.metrics._pairwise_distances_reduction._datasets_pair, sklearn.utils._cython_blas, sklearn.metrics._pairwise_distances_reduction._base, sklearn.metrics._pairwise_distances_reduction._middle_term_computer, sklearn.utils._heap, sklearn.utils._sorting, sklearn.metrics._pairwise_distances_reduction._argkmin, sklearn.metrics._pairwise_distances_reduction._argkmin_classmode, sklearn.utils._vector_sentinel, sklearn.metrics._pairwise_distances_reduction._radius_neighbors, sklearn.metrics._pairwise_distances_reduction._radius_neighbors_classmode, sklearn.metrics._pairwise_fast, sentencepiece._sentencepiece, msgspec._core, _cffi_backend, msgpack._cmsgpack, google._upb._message, ray._raylet, cuda_utils (total: 197)
==== backtrace (tid: 293198) ====
 0 0x0000000000042520 __sigaction()  ???:0
 1 0x000000000005bb28 ncclGroupCommJoin()  /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/include/group.h:113
 2 0x000000000005bb28 taskAppend()  /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/enqueue.cc:2152
 3 0x000000000005bb28 ncclEnqueueCheck()  /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/enqueue.cc:2224
 4 0x000000000004e991 ncclAllGather()  /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/collectives.cc:88
 5 0x0000000000007e2e ffi_prep_go_closure()  ???:0
 6 0x0000000000004493 ???()  /lib/x86_64-linux-gnu/libffi.so.8:0
 7 0x000000000000a3e9 ???()  /usr/lib/python3.10/lib-dynload/_ctypes.cpython-310-x86_64-linux-gnu.so:0
 8 0x0000000000013302 ???()  /usr/lib/python3.10/lib-dynload/_ctypes.cpython-310-x86_64-linux-gnu.so:0
 9 0x000000000018139b _PyObject_MakeTpCall()  ???:0
10 0x000000000017aa97 _PyEval_EvalFrameDefault()  ???:0
11 0x000000000018b38c _PyFunction_Vectorcall()  ???:0
12 0x000000000017597f _PyEval_EvalFrameDefault()  ???:0
13 0x000000000018b38c _PyFunction_Vectorcall()  ???:0
14 0x000000000017597f _PyEval_EvalFrameDefault()  ???:0
15 0x000000000018b38c _PyFunction_Vectorcall()  ???:0
16 0x000000000017597f _PyEval_EvalFrameDefault()  ???:0
17 0x000000000018b38c _PyFunction_Vectorcall()  ???:0
18 0x0000000000a6d3a7 pybind11::detail::object_api<pybind11::handle>::operator()<(pybind11::return_value_policy)1, pybind11::detail::args_proxy, pybind11::detail::kwargs_proxy>()  :0
19 0x0000000000d96640 torch::impl::dispatch::PythonKernelHolder::operator()()  :0
20 0x00000000058bc27b c10::OperatorHandle::redispatchBoxed()  :0
21 0x00000000058b9af9 torch::autograd::basicAutogradNotImplementedFallbackImpl()  autograd_not_implemented_fallback.cpp:0
22 0x0000000001aca9f8 c10::BoxedKernel::make_boxed_function<&(anonymous namespace)::autograd_fallback>()  VariableFallbackKernel.cpp:0
23 0x0000000000da1457 c10::Dispatcher::callBoxed()  ???:0
24 0x0000000000b2c2e6 torch::jit::invokeOperatorFromPython()  ???:0
25 0x0000000000b2c647 torch::jit::_get_operation_for_overload_or_packet()  ???:0
26 0x0000000000a1b592 pybind11::cpp_function::initialize<torch::jit::initJITBindings(_object*)::{lambda(std::string const&)#217}::operator()(std::string const&) const::{lambda(pybind11::args const&, pybind11::kwargs const&)#1}, pybind11::object, pybind11::args const&, pybind11::kwargs const&, pybind11::name, pybind11::doc>(torch::jit::initJITBindings(_object*)::{lambda(std::string const&)#217}::operator()(std::string const&) const::{lambda(pybind11::args const&, pybind11::kwargs const&)#1}&&, pybind11::object (*)(pybind11::args const&, pybind11::kwargs const&), pybind11::name const&, pybind11::doc const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN()  init.cpp:0
27 0x0000000000518d37 pybind11::cpp_function::dispatcher()  :0
28 0x000000000018ab32 PyObject_CallFunctionObjArgs()  ???:0
29 0x000000000019910b PyObject_Call()  ???:0
30 0x000000000017b6ef _PyEval_EvalFrameDefault()  ???:0
31 0x000000000018b38c _PyFunction_Vectorcall()  ???:0
32 0x000000000018061d _PyObject_FastCallDictTstate()  ???:0
33 0x000000000019562c _PyObject_Call_Prepend()  ???:0
34 0x000000000029d464 PyInit__datetime()  ???:0
35 0x000000000018139b _PyObject_MakeTpCall()  ???:0
36 0x000000000017b99e _PyEval_EvalFrameDefault()  ???:0
37 0x000000000018b38c _PyFunction_Vectorcall()  ???:0
38 0x000000000017597f _PyEval_EvalFrameDefault()  ???:0
39 0x000000000018b38c _PyFunction_Vectorcall()  ???:0
40 0x000000000017597f _PyEval_EvalFrameDefault()  ???:0
41 0x000000000018b38c _PyFunction_Vectorcall()  ???:0
42 0x0000000000175790 _PyEval_EvalFrameDefault()  ???:0
43 0x00000000001984d1 PyMethod_New()  ???:0
44 0x000000000017a702 _PyEval_EvalFrameDefault()  ???:0
45 0x000000000019861e PyMethod_New()  ???:0
46 0x0000000000177c30 _PyEval_EvalFrameDefault()  ???:0
47 0x000000000019861e PyMethod_New()  ???:0
48 0x0000000000177c30 _PyEval_EvalFrameDefault()  ???:0
49 0x0000000000180574 _PyObject_FastCallDictTstate()  ???:0
50 0x000000000019562c _PyObject_Call_Prepend()  ???:0
51 0x000000000029d464 PyInit__datetime()  ???:0
52 0x000000000018139b _PyObject_MakeTpCall()  ???:0
53 0x000000000017b009 _PyEval_EvalFrameDefault()  ???:0
54 0x000000000018b38c _PyFunction_Vectorcall()  ???:0
55 0x0000000000177c30 _PyEval_EvalFrameDefault()  ???:0
56 0x00000000001984d1 PyMethod_New()  ???:0
=================================
Fatal Python error: Segmentation fault

Thread 0x00007fd7a70e2640 (most recent call first):
  File "/usr/lib/python3.10/threading.py", line 324 in wait
  File "/usr/lib/python3.10/threading.py", line 607 in wait
  File "/usr/local/lib/python3.10/dist-packages/tqdm/_monitor.py", line 60 in run
  File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
  File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap

Current thread 0x00007fdea6275740 (most recent call first):
  File "/sgl-workspace/sglang/python/sglang/srt/distributed/device_communicators/pynccl_wrapper.py", line 413 in ncclAllGather
  File "/sgl-workspace/sglang/python/sglang/srt/distributed/device_communicators/pynccl.py", line 162 in all_gather
  File "/sgl-workspace/sglang/python/sglang/srt/distributed/parallel_state.py", line 460 in _all_gather_into_tensor
  File "/sgl-workspace/sglang/python/sglang/srt/distributed/parallel_state.py", line 149 in reg_all_gather_into_tensor
  File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 1123 in __call__
  File "/sgl-workspace/sglang/python/sglang/srt/distributed/parallel_state.py", line 470 in all_gather_into_tensor
  File "/sgl-workspace/sglang/python/sglang/srt/distributed/parallel_state.py", line 513 in all_gather
  File "/sgl-workspace/sglang/python/sglang/srt/distributed/communication_op.py", line 20 in tensor_model_parallel_all_gather
  File "/sgl-workspace/sglang/python/sglang/srt/layers/logits_processor.py", line 445 in _get_logits
  File "/sgl-workspace/sglang/python/sglang/srt/layers/logits_processor.py", line 311 in forward
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1750 in _call_impl
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1739 in _wrapped_call_impl
  File "/sgl-workspace/sglang/python/sglang/srt/models/qwen2.py", line 385 in forward
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116 in decorate_context
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 445 in run_once
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 452 in capture_one_batch_size
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 360 in capture
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 276 in __init__
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 965 in init_cuda_graphs
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 219 in initialize
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 181 in __init__
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 75 in __init__
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 261 in __init__
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 2012 in run_scheduler_process
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108 in run
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314 in _bootstrap
  File "/usr/lib/python3.10/multiprocessing/spawn.py", line 129 in _main
  File "/usr/lib/python3.10/multiprocessing/spawn.py", line 116 in spawn_main
  File "<string>", line 1 in <module>

Extension modules: numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, charset_normalizer.md, requests.packages.charset_normalizer.md, requests.packages.chardet.md, multidict._multidict, yarl._quoting_c, propcache._helpers_c, aiohttp._http_writer, aiohttp._http_parser, aiohttp._websocket.mask, aiohttp._websocket.reader_c, frozenlist._frozenlist, uvloop.loop, zmq.backend.cython._zmq, PIL._imaging, torch._C, torch._C._dynamo.autograd_compiler, torch._C._dynamo.eval_frame, torch._C._dynamo.guards, torch._C._dynamo.utils, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, psutil._psutil_linux, psutil._psutil_posix, setproctitle, yaml._yaml, markupsafe._speedups, PIL._imagingft, sklearn.__check_build._check_build, scipy._lib._ccallback_c, scipy.sparse._sparsetools, _csparsetools, scipy.sparse._csparsetools, scipy.linalg._fblas, scipy.linalg._flapack, scipy.linalg.cython_lapack, scipy.linalg._cythonized_array_utils, scipy.linalg._solve_toeplitz, scipy.linalg._decomp_lu_cython, scipy.linalg._matfuncs_sqrtm_triu, scipy.linalg._matfuncs_expm, scipy.linalg._linalg_pythran, scipy.linalg.cython_blas, scipy.linalg._decomp_update, scipy.sparse.linalg._dsolve._superlu, scipy.sparse.linalg._eigen.arpack._arpack, scipy.sparse.linalg._propack._spropack, scipy.sparse.linalg._propack._dpropack, scipy.sparse.linalg._propack._cpropack, scipy.sparse.linalg._propack._zpropack, scipy.sparse.csgraph._tools, scipy.sparse.csgraph._shortest_path, scipy.sparse.csgraph._traversal, scipy.sparse.csgraph._min_spanning_tree, scipy.sparse.csgraph._flow, scipy.sparse.csgraph._matching, scipy.sparse.csgraph._reordering, scipy.special._ufuncs_cxx, scipy.special._ufuncs, 
scipy.special._specfun, scipy.special._comb, scipy.special._ellip_harm_2, scipy.spatial._ckdtree, scipy._lib.messagestream, scipy.spatial._qhull, scipy.spatial._voronoi, scipy.spatial._distance_wrap, scipy.spatial._hausdorff, scipy.spatial.transform._rotation, scipy.optimize._group_columns, scipy.optimize._trlib._trlib, scipy.optimize._lbfgsb, _moduleTNC, scipy.optimize._moduleTNC, scipy.optimize._cobyla, scipy.optimize._slsqp, scipy.optimize._minpack, scipy.optimize._lsq.givens_elimination, scipy.optimize._zeros, scipy.optimize._cython_nnls, scipy._lib._uarray._uarray, scipy.linalg._decomp_interpolative, scipy.optimize._bglu_dense, scipy.optimize._lsap, scipy.optimize._direct, scipy.integrate._odepack, scipy.integrate._quadpack, scipy.integrate._vode, scipy.integrate._dop, scipy.integrate._lsoda, scipy.interpolate._fitpack, scipy.interpolate._dfitpack, scipy.interpolate._dierckx, scipy.interpolate._ppoly, scipy.interpolate._interpnd, scipy.interpolate._rbfinterp_pythran, scipy.interpolate._rgi_cython, scipy.interpolate._bspl, scipy.special.cython_special, scipy.stats._stats, scipy.stats._sobol, scipy.stats._qmc_cy, scipy.stats._biasedurn, scipy.stats._stats_pythran, scipy.stats._levy_stable.levyst, scipy.stats._ansari_swilk_statistics, scipy.stats._mvn, scipy.stats._rcont.rcont, scipy.ndimage._nd_image, scipy.ndimage._rank_filter_1d, _ni_label, scipy.ndimage._ni_label, pyarrow.lib, pandas._libs.tslibs.ccalendar, pandas._libs.tslibs.np_datetime, pandas._libs.tslibs.dtypes, pandas._libs.tslibs.base, pandas._libs.tslibs.nattype, pandas._libs.tslibs.timezones, pandas._libs.tslibs.fields, pandas._libs.tslibs.timedeltas, pandas._libs.tslibs.tzconversion, pandas._libs.tslibs.timestamps, pandas._libs.properties, pandas._libs.tslibs.offsets, pandas._libs.tslibs.strptime, pandas._libs.tslibs.parsing, pandas._libs.tslibs.conversion, pandas._libs.tslibs.period, pandas._libs.tslibs.vectorized, pandas._libs.ops_dispatch, pandas._libs.missing, pandas._libs.hashtable, 
pandas._libs.algos, pandas._libs.interval, pandas._libs.lib, pyarrow._compute, pandas._libs.ops, pandas._libs.hashing, pandas._libs.arrays, pandas._libs.tslib, pandas._libs.sparse, pandas._libs.internals, pandas._libs.indexing, pandas._libs.index, pandas._libs.writers, pandas._libs.join, pandas._libs.window.aggregations, pandas._libs.window.indexers, pandas._libs.reshape, pandas._libs.groupby, pandas._libs.json, pandas._libs.parsers, pandas._libs.testing, sklearn.utils._isfinite, sklearn.utils.sparsefuncs_fast, sklearn.utils.murmurhash, sklearn.utils._openmp_helpers, sklearn.metrics.cluster._expected_mutual_info_fast, sklearn.preprocessing._csr_polynomial_expansion, sklearn.preprocessing._target_encoder_fast, sklearn.metrics._dist_metrics, sklearn.metrics._pairwise_distances_reduction._datasets_pair, sklearn.utils._cython_blas, sklearn.metrics._pairwise_distances_reduction._base, sklearn.metrics._pairwise_distances_reduction._middle_term_computer, sklearn.utils._heap, sklearn.utils._sorting, sklearn.metrics._pairwise_distances_reduction._argkmin, sklearn.metrics._pairwise_distances_reduction._argkmin_classmode, sklearn.utils._vector_sentinel, sklearn.metrics._pairwise_distances_reduction._radius_neighbors, sklearn.metrics._pairwise_distances_reduction._radius_neighbors_classmode, sklearn.metrics._pairwise_fast, sentencepiece._sentencepiece, msgspec._core, _cffi_backend, msgpack._cmsgpack, google._upb._message, ray._raylet, cuda_utils (total: 197)
==== backtrace (tid: 293197) ====
 0 0x0000000000042520 __sigaction()  ???:0
 1 0x000000000005bb28 ncclGroupCommJoin()  /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/include/group.h:113
 2 0x000000000005bb28 taskAppend()  /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/enqueue.cc:2152
 3 0x000000000005bb28 ncclEnqueueCheck()  /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/enqueue.cc:2224
 4 0x000000000004e991 ncclAllGather()  /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/collectives.cc:88
 5 0x0000000000007e2e ffi_prep_go_closure()  ???:0
 6 0x0000000000004493 ???()  /lib/x86_64-linux-gnu/libffi.so.8:0
 7 0x000000000000a3e9 ???()  /usr/lib/python3.10/lib-dynload/_ctypes.cpython-310-x86_64-linux-gnu.so:0
 8 0x0000000000013302 ???()  /usr/lib/python3.10/lib-dynload/_ctypes.cpython-310-x86_64-linux-gnu.so:0
 9 0x000000000018139b _PyObject_MakeTpCall()  ???:0
10 0x000000000017aa97 _PyEval_EvalFrameDefault()  ???:0
11 0x000000000018b38c _PyFunction_Vectorcall()  ???:0
12 0x000000000017597f _PyEval_EvalFrameDefault()  ???:0
13 0x000000000018b38c _PyFunction_Vectorcall()  ???:0
14 0x000000000017597f _PyEval_EvalFrameDefault()  ???:0
15 0x000000000018b38c _PyFunction_Vectorcall()  ???:0
16 0x000000000017597f _PyEval_EvalFrameDefault()  ???:0
17 0x000000000018b38c _PyFunction_Vectorcall()  ???:0
18 0x0000000000a6d3a7 pybind11::detail::object_api<pybind11::handle>::operator()<(pybind11::return_value_policy)1, pybind11::detail::args_proxy, pybind11::detail::kwargs_proxy>()  :0
19 0x0000000000d96640 torch::impl::dispatch::PythonKernelHolder::operator()()  :0
20 0x00000000058bc27b c10::OperatorHandle::redispatchBoxed()  :0
21 0x00000000058b9af9 torch::autograd::basicAutogradNotImplementedFallbackImpl()  autograd_not_implemented_fallback.cpp:0
22 0x0000000001aca9f8 c10::BoxedKernel::make_boxed_function<&(anonymous namespace)::autograd_fallback>()  VariableFallbackKernel.cpp:0
23 0x0000000000da1457 c10::Dispatcher::callBoxed()  ???:0
24 0x0000000000b2c2e6 torch::jit::invokeOperatorFromPython()  ???:0
25 0x0000000000b2c647 torch::jit::_get_operation_for_overload_or_packet()  ???:0
26 0x0000000000a1b592 pybind11::cpp_function::initialize<torch::jit::initJITBindings(_object*)::{lambda(std::string const&)#217}::operator()(std::string const&) const::{lambda(pybind11::args const&, pybind11::kwargs const&)#1}, pybind11::object, pybind11::args const&, pybind11::kwargs const&, pybind11::name, pybind11::doc>(torch::jit::initJITBindings(_object*)::{lambda(std::string const&)#217}::operator()(std::string const&) const::{lambda(pybind11::args const&, pybind11::kwargs const&)#1}&&, pybind11::object (*)(pybind11::args const&, pybind11::kwargs const&), pybind11::name const&, pybind11::doc const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN()  init.cpp:0
27 0x0000000000518d37 pybind11::cpp_function::dispatcher()  :0
28 0x000000000018ab32 PyObject_CallFunctionObjArgs()  ???:0
29 0x000000000019910b PyObject_Call()  ???:0
30 0x000000000017b6ef _PyEval_EvalFrameDefault()  ???:0
31 0x000000000018b38c _PyFunction_Vectorcall()  ???:0
32 0x000000000018061d _PyObject_FastCallDictTstate()  ???:0
33 0x000000000019562c _PyObject_Call_Prepend()  ???:0
34 0x000000000029d464 PyInit__datetime()  ???:0
35 0x000000000018139b _PyObject_MakeTpCall()  ???:0
36 0x000000000017b99e _PyEval_EvalFrameDefault()  ???:0
37 0x000000000018b38c _PyFunction_Vectorcall()  ???:0
38 0x000000000017597f _PyEval_EvalFrameDefault()  ???:0
39 0x000000000018b38c _PyFunction_Vectorcall()  ???:0
40 0x000000000017597f _PyEval_EvalFrameDefault()  ???:0
41 0x000000000018b38c _PyFunction_Vectorcall()  ???:0
42 0x0000000000175790 _PyEval_EvalFrameDefault()  ???:0
43 0x00000000001984d1 PyMethod_New()  ???:0
44 0x000000000017a702 _PyEval_EvalFrameDefault()  ???:0
45 0x000000000019861e PyMethod_New()  ???:0
46 0x0000000000177c30 _PyEval_EvalFrameDefault()  ???:0
47 0x000000000019861e PyMethod_New()  ???:0
48 0x0000000000177c30 _PyEval_EvalFrameDefault()  ???:0
49 0x0000000000180574 _PyObject_FastCallDictTstate()  ???:0
50 0x000000000019562c _PyObject_Call_Prepend()  ???:0
51 0x000000000029d464 PyInit__datetime()  ???:0
52 0x000000000018139b _PyObject_MakeTpCall()  ???:0
53 0x000000000017b009 _PyEval_EvalFrameDefault()  ???:0
54 0x000000000018b38c _PyFunction_Vectorcall()  ???:0
55 0x0000000000177c30 _PyEval_EvalFrameDefault()  ???:0
56 0x00000000001984d1 PyMethod_New()  ???:0
=================================
Fatal Python error: Segmentation fault

Thread 0x00007f0ab78df640 (most recent call first):
  File "/usr/lib/python3.10/threading.py", line 324 in wait
  File "/usr/lib/python3.10/threading.py", line 607 in wait
  File "/usr/local/lib/python3.10/dist-packages/tqdm/_monitor.py", line 60 in run
  File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
  File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap

Current thread 0x00007f1218674740 (most recent call first):
  File "/sgl-workspace/sglang/python/sglang/srt/distributed/device_communicators/pynccl_wrapper.py", line 413 in ncclAllGather
  File "/sgl-workspace/sglang/python/sglang/srt/distributed/device_communicators/pynccl.py", line 162 in all_gather
  File "/sgl-workspace/sglang/python/sglang/srt/distributed/parallel_state.py", line 460 in _all_gather_into_tensor
  File "/sgl-workspace/sglang/python/sglang/srt/distributed/parallel_state.py", line 149 in reg_all_gather_into_tensor
  File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 1123 in __call__
  File "/sgl-workspace/sglang/python/sglang/srt/distributed/parallel_state.py", line 470 in all_gather_into_tensor
  File "/sgl-workspace/sglang/python/sglang/srt/distributed/parallel_state.py", line 513 in all_gather
  File "/sgl-workspace/sglang/python/sglang/srt/distributed/communication_op.py", line 20 in tensor_model_parallel_all_gather
  File "/sgl-workspace/sglang/python/sglang/srt/layers/logits_processor.py", line 445 in _get_logits
  File "/sgl-workspace/sglang/python/sglang/srt/layers/logits_processor.py", line 311 in forward
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1750 in _call_impl
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1739 in _wrapped_call_impl
  File "/sgl-workspace/sglang/python/sglang/srt/models/qwen2.py", line 385 in forward
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116 in decorate_context
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 445 in run_once
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 452 in capture_one_batch_size
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 360 in capture
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 276 in __init__
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 965 in init_cuda_graphs
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 219 in initialize
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 181 in __init__
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 75 in __init__
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 261 in __init__
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 2012 in run_scheduler_process
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108 in run
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314 in _bootstrap
  File "/usr/lib/python3.10/multiprocessing/spawn.py", line 129 in _main
  File "/usr/lib/python3.10/multiprocessing/spawn.py", line 116 in spawn_main
  File "<string>", line 1 in <module>

Extension modules: numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, charset_normalizer.md, requests.packages.charset_normalizer.md, requests.packages.chardet.md, multidict._multidict, yarl._quoting_c, propcache._helpers_c, aiohttp._http_writer, aiohttp._http_parser, aiohttp._websocket.mask, aiohttp._websocket.reader_c, frozenlist._frozenlist, uvloop.loop, zmq.backend.cython._zmq, PIL._imaging, torch._C, torch._C._dynamo.autograd_compiler, torch._C._dynamo.eval_frame, torch._C._dynamo.guards, torch._C._dynamo.utils, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, psutil._psutil_linux, psutil._psutil_posix, setproctitle, yaml._yaml, markupsafe._speedups, PIL._imagingft, sklearn.__check_build._check_build, scipy._lib._ccallback_c, scipy.sparse._sparsetools, _csparsetools, scipy.sparse._csparsetools, scipy.linalg._fblas, scipy.linalg._flapack, scipy.linalg.cython_lapack, scipy.linalg._cythonized_array_utils, scipy.linalg._solve_toeplitz, scipy.linalg._decomp_lu_cython, scipy.linalg._matfuncs_sqrtm_triu, scipy.linalg._matfuncs_expm, scipy.linalg._linalg_pythran, scipy.linalg.cython_blas, scipy.linalg._decomp_update, scipy.sparse.linalg._dsolve._superlu, scipy.sparse.linalg._eigen.arpack._arpack, scipy.sparse.linalg._propack._spropack, scipy.sparse.linalg._propack._dpropack, scipy.sparse.linalg._propack._cpropack, scipy.sparse.linalg._propack._zpropack, scipy.sparse.csgraph._tools, scipy.sparse.csgraph._shortest_path, scipy.sparse.csgraph._traversal, scipy.sparse.csgraph._min_spanning_tree, scipy.sparse.csgraph._flow, scipy.sparse.csgraph._matching, scipy.sparse.csgraph._reordering, scipy.special._ufuncs_cxx, scipy.special._ufuncs, 
scipy.special._specfun, scipy.special._comb, scipy.special._ellip_harm_2, scipy.spatial._ckdtree, scipy._lib.messagestream, scipy.spatial._qhull, scipy.spatial._voronoi, scipy.spatial._distance_wrap, scipy.spatial._hausdorff, scipy.spatial.transform._rotation, scipy.optimize._group_columns, scipy.optimize._trlib._trlib, scipy.optimize._lbfgsb, _moduleTNC, scipy.optimize._moduleTNC, scipy.optimize._cobyla, scipy.optimize._slsqp, scipy.optimize._minpack, scipy.optimize._lsq.givens_elimination, scipy.optimize._zeros, scipy.optimize._cython_nnls, scipy._lib._uarray._uarray, scipy.linalg._decomp_interpolative, scipy.optimize._bglu_dense, scipy.optimize._lsap, scipy.optimize._direct, scipy.integrate._odepack, scipy.integrate._quadpack, scipy.integrate._vode, scipy.integrate._dop, scipy.integrate._lsoda, scipy.interpolate._fitpack, scipy.interpolate._dfitpack, scipy.interpolate._dierckx, scipy.interpolate._ppoly, scipy.interpolate._interpnd, scipy.interpolate._rbfinterp_pythran, scipy.interpolate._rgi_cython, scipy.interpolate._bspl, scipy.special.cython_special, scipy.stats._stats, scipy.stats._sobol, scipy.stats._qmc_cy, scipy.stats._biasedurn, scipy.stats._stats_pythran, scipy.stats._levy_stable.levyst, scipy.stats._ansari_swilk_statistics, scipy.stats._mvn, scipy.stats._rcont.rcont, scipy.ndimage._nd_image, scipy.ndimage._rank_filter_1d, _ni_label, scipy.ndimage._ni_label, pyarrow.lib, pandas._libs.tslibs.ccalendar, pandas._libs.tslibs.np_datetime, pandas._libs.tslibs.dtypes, pandas._libs.tslibs.base, pandas._libs.tslibs.nattype, pandas._libs.tslibs.timezones, pandas._libs.tslibs.fields, pandas._libs.tslibs.timedeltas, pandas._libs.tslibs.tzconversion, pandas._libs.tslibs.timestamps, pandas._libs.properties, pandas._libs.tslibs.offsets, pandas._libs.tslibs.strptime, pandas._libs.tslibs.parsing, pandas._libs.tslibs.conversion, pandas._libs.tslibs.period, pandas._libs.tslibs.vectorized, pandas._libs.ops_dispatch, pandas._libs.missing, pandas._libs.hashtable, 
pandas._libs.algos, pandas._libs.interval, pandas._libs.lib, pyarrow._compute, pandas._libs.ops, pandas._libs.hashing, pandas._libs.arrays, pandas._libs.tslib, pandas._libs.sparse, pandas._libs.internals, pandas._libs.indexing, pandas._libs.index, pandas._libs.writers, pandas._libs.join, pandas._libs.window.aggregations, pandas._libs.window.indexers, pandas._libs.reshape, pandas._libs.groupby, pandas._libs.json, pandas._libs.parsers, pandas._libs.testing, sklearn.utils._isfinite, sklearn.utils.sparsefuncs_fast, sklearn.utils.murmurhash, sklearn.utils._openmp_helpers, sklearn.metrics.cluster._expected_mutual_info_fast, sklearn.preprocessing._csr_polynomial_expansion, sklearn.preprocessing._target_encoder_fast, sklearn.metrics._dist_metrics, sklearn.metrics._pairwise_distances_reduction._datasets_pair, sklearn.utils._cython_blas, sklearn.metrics._pairwise_distances_reduction._base, sklearn.metrics._pairwise_distances_reduction._middle_term_computer, sklearn.utils._heap, sklearn.utils._sorting, sklearn.metrics._pairwise_distances_reduction._argkmin, sklearn.metrics._pairwise_distances_reduction._argkmin_classmode, sklearn.utils._vector_sentinel, sklearn.metrics._pairwise_distances_reduction._radius_neighbors, sklearn.metrics._pairwise_distances_reduction._radius_neighbors_classmode, sklearn.metrics._pairwise_fast, sentencepiece._sentencepiece, msgspec._core, _cffi_backend, msgpack._cmsgpack, google._upb._message, ray._raylet, cuda_utils (total: 197)
==== backtrace (tid: 293196) ====
 0 0x0000000000042520 __sigaction()  ???:0
 1 0x000000000005bb28 ncclGroupCommJoin()  /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/include/group.h:113
 2 0x000000000005bb28 taskAppend()  /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/enqueue.cc:2152
 3 0x000000000005bb28 ncclEnqueueCheck()  /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/enqueue.cc:2224
 4 0x000000000004e991 ncclAllGather()  /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/collectives.cc:88
 5 0x0000000000007e2e ffi_prep_go_closure()  ???:0
 6 0x0000000000004493 ???()  /lib/x86_64-linux-gnu/libffi.so.8:0
 7 0x000000000000a3e9 ???()  /usr/lib/python3.10/lib-dynload/_ctypes.cpython-310-x86_64-linux-gnu.so:0
 8 0x0000000000013302 ???()  /usr/lib/python3.10/lib-dynload/_ctypes.cpython-310-x86_64-linux-gnu.so:0
 9 0x000000000018139b _PyObject_MakeTpCall()  ???:0
10 0x000000000017aa97 _PyEval_EvalFrameDefault()  ???:0
11 0x000000000018b38c _PyFunction_Vectorcall()  ???:0
12 0x000000000017597f _PyEval_EvalFrameDefault()  ???:0
13 0x000000000018b38c _PyFunction_Vectorcall()  ???:0
14 0x000000000017597f _PyEval_EvalFrameDefault()  ???:0
15 0x000000000018b38c _PyFunction_Vectorcall()  ???:0
16 0x000000000017597f _PyEval_EvalFrameDefault()  ???:0
17 0x000000000018b38c _PyFunction_Vectorcall()  ???:0
18 0x0000000000a6d3a7 pybind11::detail::object_api<pybind11::handle>::operator()<(pybind11::return_value_policy)1, pybind11::detail::args_proxy, pybind11::detail::kwargs_proxy>()  :0
19 0x0000000000d96640 torch::impl::dispatch::PythonKernelHolder::operator()()  :0
20 0x00000000058bc27b c10::OperatorHandle::redispatchBoxed()  :0
21 0x00000000058b9af9 torch::autograd::basicAutogradNotImplementedFallbackImpl()  autograd_not_implemented_fallback.cpp:0
22 0x0000000001aca9f8 c10::BoxedKernel::make_boxed_function<&(anonymous namespace)::autograd_fallback>()  VariableFallbackKernel.cpp:0
23 0x0000000000da1457 c10::Dispatcher::callBoxed()  ???:0
24 0x0000000000b2c2e6 torch::jit::invokeOperatorFromPython()  ???:0
25 0x0000000000b2c647 torch::jit::_get_operation_for_overload_or_packet()  ???:0
26 0x0000000000a1b592 pybind11::cpp_function::initialize<torch::jit::initJITBindings(_object*)::{lambda(std::string const&)#217}::operator()(std::string const&) const::{lambda(pybind11::args const&, pybind11::kwargs const&)#1}, pybind11::object, pybind11::args const&, pybind11::kwargs const&, pybind11::name, pybind11::doc>(torch::jit::initJITBindings(_object*)::{lambda(std::string const&)#217}::operator()(std::string const&) const::{lambda(pybind11::args const&, pybind11::kwargs const&)#1}&&, pybind11::object (*)(pybind11::args const&, pybind11::kwargs const&), pybind11::name const&, pybind11::doc const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN()  init.cpp:0
27 0x0000000000518d37 pybind11::cpp_function::dispatcher()  :0
28 0x000000000018ab32 PyObject_CallFunctionObjArgs()  ???:0
29 0x000000000019910b PyObject_Call()  ???:0
30 0x000000000017b6ef _PyEval_EvalFrameDefault()  ???:0
31 0x000000000018b38c _PyFunction_Vectorcall()  ???:0
32 0x000000000018061d _PyObject_FastCallDictTstate()  ???:0
33 0x000000000019562c _PyObject_Call_Prepend()  ???:0
34 0x000000000029d464 PyInit__datetime()  ???:0
35 0x000000000018139b _PyObject_MakeTpCall()  ???:0
36 0x000000000017b99e _PyEval_EvalFrameDefault()  ???:0
37 0x000000000018b38c _PyFunction_Vectorcall()  ???:0
38 0x000000000017597f _PyEval_EvalFrameDefault()  ???:0
39 0x000000000018b38c _PyFunction_Vectorcall()  ???:0
40 0x000000000017597f _PyEval_EvalFrameDefault()  ???:0
41 0x000000000018b38c _PyFunction_Vectorcall()  ???:0
42 0x0000000000175790 _PyEval_EvalFrameDefault()  ???:0
43 0x00000000001984d1 PyMethod_New()  ???:0
44 0x000000000017a702 _PyEval_EvalFrameDefault()  ???:0
45 0x000000000019861e PyMethod_New()  ???:0
46 0x0000000000177c30 _PyEval_EvalFrameDefault()  ???:0
47 0x000000000019861e PyMethod_New()  ???:0
48 0x0000000000177c30 _PyEval_EvalFrameDefault()  ???:0
49 0x0000000000180574 _PyObject_FastCallDictTstate()  ???:0
50 0x000000000019562c _PyObject_Call_Prepend()  ???:0
51 0x000000000029d464 PyInit__datetime()  ???:0
52 0x000000000018139b _PyObject_MakeTpCall()  ???:0
53 0x000000000017b009 _PyEval_EvalFrameDefault()  ???:0
54 0x000000000018b38c _PyFunction_Vectorcall()  ???:0
55 0x0000000000177c30 _PyEval_EvalFrameDefault()  ???:0
56 0x00000000001984d1 PyMethod_New()  ???:0
=================================
Fatal Python error: Segmentation fault

Thread 0x00007f7dc08e1640 (most recent call first):
  File "/usr/lib/python3.10/threading.py", line 324 in wait
  File "/usr/lib/python3.10/threading.py", line 607 in wait
  File "/usr/local/lib/python3.10/dist-packages/tqdm/_monitor.py", line 60 in run
  File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
  File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap

Current thread 0x00007f84c807c740 (most recent call first):
  File "/sgl-workspace/sglang/python/sglang/srt/distributed/device_communicators/pynccl_wrapper.py", line 413 in ncclAllGather
  File "/sgl-workspace/sglang/python/sglang/srt/distributed/device_communicators/pynccl.py", line 162 in all_gather
  File "/sgl-workspace/sglang/python/sglang/srt/distributed/parallel_state.py", line 460 in _all_gather_into_tensor
  File "/sgl-workspace/sglang/python/sglang/srt/distributed/parallel_state.py", line 149 in reg_all_gather_into_tensor
  File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 1123 in __call__
  File "/sgl-workspace/sglang/python/sglang/srt/distributed/parallel_state.py", line 470 in all_gather_into_tensor
  File "/sgl-workspace/sglang/python/sglang/srt/distributed/parallel_state.py", line 513 in all_gather
  File "/sgl-workspace/sglang/python/sglang/srt/distributed/communication_op.py", line 20 in tensor_model_parallel_all_gather
  File "/sgl-workspace/sglang/python/sglang/srt/layers/logits_processor.py", line 445 in _get_logits
  File "/sgl-workspace/sglang/python/sglang/srt/layers/logits_processor.py", line 311 in forward
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1750 in _call_impl
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1739 in _wrapped_call_impl
  File "/sgl-workspace/sglang/python/sglang/srt/models/qwen2.py", line 385 in forward
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116 in decorate_context
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 445 in run_once
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 452 in capture_one_batch_size
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 360 in capture
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 276 in __init__
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 965 in init_cuda_graphs
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 219 in initialize
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 181 in __init__
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 75 in __init__
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 261 in __init__
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 2012 in run_scheduler_process
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108 in run
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314 in _bootstrap
  File "/usr/lib/python3.10/multiprocessing/spawn.py", line 129 in _main
  File "/usr/lib/python3.10/multiprocessing/spawn.py", line 116 in spawn_main
  File "<string>", line 1 in <module>

Extension modules: numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, charset_normalizer.md, requests.packages.charset_normalizer.md, requests.packages.chardet.md, multidict._multidict, yarl._quoting_c, propcache._helpers_c, aiohttp._http_writer, aiohttp._http_parser, aiohttp._websocket.mask, aiohttp._websocket.reader_c, frozenlist._frozenlist, uvloop.loop, zmq.backend.cython._zmq, PIL._imaging, torch._C, torch._C._dynamo.autograd_compiler, torch._C._dynamo.eval_frame, torch._C._dynamo.guards, torch._C._dynamo.utils, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, psutil._psutil_linux, psutil._psutil_posix, setproctitle, yaml._yaml, markupsafe._speedups, PIL._imagingft, sklearn.__check_build._check_build, scipy._lib._ccallback_c, scipy.sparse._sparsetools, _csparsetools, scipy.sparse._csparsetools, scipy.linalg._fblas, scipy.linalg._flapack, scipy.linalg.cython_lapack, scipy.linalg._cythonized_array_utils, scipy.linalg._solve_toeplitz, scipy.linalg._decomp_lu_cython, scipy.linalg._matfuncs_sqrtm_triu, scipy.linalg._matfuncs_expm, scipy.linalg._linalg_pythran, scipy.linalg.cython_blas, scipy.linalg._decomp_update, scipy.sparse.linalg._dsolve._superlu, scipy.sparse.linalg._eigen.arpack._arpack, scipy.sparse.linalg._propack._spropack, scipy.sparse.linalg._propack._dpropack, scipy.sparse.linalg._propack._cpropack, scipy.sparse.linalg._propack._zpropack, scipy.sparse.csgraph._tools, scipy.sparse.csgraph._shortest_path, scipy.sparse.csgraph._traversal, scipy.sparse.csgraph._min_spanning_tree, scipy.sparse.csgraph._flow, scipy.sparse.csgraph._matching, scipy.sparse.csgraph._reordering, scipy.special._ufuncs_cxx, scipy.special._ufuncs, 
scipy.special._specfun, scipy.special._comb, scipy.special._ellip_harm_2, scipy.spatial._ckdtree, scipy._lib.messagestream, scipy.spatial._qhull, scipy.spatial._voronoi, scipy.spatial._distance_wrap, scipy.spatial._hausdorff, scipy.spatial.transform._rotation, scipy.optimize._group_columns, scipy.optimize._trlib._trlib, scipy.optimize._lbfgsb, _moduleTNC, scipy.optimize._moduleTNC, scipy.optimize._cobyla, scipy.optimize._slsqp, scipy.optimize._minpack, scipy.optimize._lsq.givens_elimination, scipy.optimize._zeros, scipy.optimize._cython_nnls, scipy._lib._uarray._uarray, scipy.linalg._decomp_interpolative, scipy.optimize._bglu_dense, scipy.optimize._lsap, scipy.optimize._direct, scipy.integrate._odepack, scipy.integrate._quadpack, scipy.integrate._vode, scipy.integrate._dop, scipy.integrate._lsoda, scipy.interpolate._fitpack, scipy.interpolate._dfitpack, scipy.interpolate._dierckx, scipy.interpolate._ppoly, scipy.interpolate._interpnd, scipy.interpolate._rbfinterp_pythran, scipy.interpolate._rgi_cython, scipy.interpolate._bspl, scipy.special.cython_special, scipy.stats._stats, scipy.stats._sobol, scipy.stats._qmc_cy, scipy.stats._biasedurn, scipy.stats._stats_pythran, scipy.stats._levy_stable.levyst, scipy.stats._ansari_swilk_statistics, scipy.stats._mvn, scipy.stats._rcont.rcont, scipy.ndimage._nd_image, scipy.ndimage._rank_filter_1d, _ni_label, scipy.ndimage._ni_label, pyarrow.lib, pandas._libs.tslibs.ccalendar, pandas._libs.tslibs.np_datetime, pandas._libs.tslibs.dtypes, pandas._libs.tslibs.base, pandas._libs.tslibs.nattype, pandas._libs.tslibs.timezones, pandas._libs.tslibs.fields, pandas._libs.tslibs.timedeltas, pandas._libs.tslibs.tzconversion, pandas._libs.tslibs.timestamps, pandas._libs.properties, pandas._libs.tslibs.offsets, pandas._libs.tslibs.strptime, pandas._libs.tslibs.parsing, pandas._libs.tslibs.conversion, pandas._libs.tslibs.period, pandas._libs.tslibs.vectorized, pandas._libs.ops_dispatch, pandas._libs.missing, pandas._libs.hashtable, 
pandas._libs.algos, pandas._libs.interval, pandas._libs.lib, pyarrow._compute, pandas._libs.ops, pandas._libs.hashing, pandas._libs.arrays, pandas._libs.tslib, pandas._libs.sparse, pandas._libs.internals, pandas._libs.indexing, pandas._libs.index, pandas._libs.writers, pandas._libs.join, pandas._libs.window.aggregations, pandas._libs.window.indexers, pandas._libs.reshape, pandas._libs.groupby, pandas._libs.json, pandas._libs.parsers, pandas._libs.testing, sklearn.utils._isfinite, sklearn.utils.sparsefuncs_fast, sklearn.utils.murmurhash, sklearn.utils._openmp_helpers, sklearn.metrics.cluster._expected_mutual_info_fast, sklearn.preprocessing._csr_polynomial_expansion, sklearn.preprocessing._target_encoder_fast, sklearn.metrics._dist_metrics, sklearn.metrics._pairwise_distances_reduction._datasets_pair, sklearn.utils._cython_blas, sklearn.metrics._pairwise_distances_reduction._base, sklearn.metrics._pairwise_distances_reduction._middle_term_computer, sklearn.utils._heap, sklearn.utils._sorting, sklearn.metrics._pairwise_distances_reduction._argkmin, sklearn.metrics._pairwise_distances_reduction._argkmin_classmode, sklearn.utils._vector_sentinel, sklearn.metrics._pairwise_distances_reduction._radius_neighbors, sklearn.metrics._pairwise_distances_reduction._radius_neighbors_classmode, sklearn.metrics._pairwise_fast, sentencepiece._sentencepiece, msgspec._core, _cffi_backend, msgpack._cmsgpack, google._upb._message, ray._raylet, cuda_utils (total: 197)
[2025-04-26 20:55:31] Rank 0 scheduler is dead. Please check if there are relevant logs.
[2025-04-26 20:55:32] Child process unexpectedly failed with an exit code 11. pid=293198
[2025-04-26 20:55:32] Child process unexpectedly failed with an exit code 11. pid=293197
[2025-04-26 20:55:32] Child process unexpectedly failed with an exit code 11. pid=293196
[2025-04-26 20:55:32] Exit code: -11
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/sgl-workspace/sglang/python/sglang/launch_server.py", line 14, in <module>
    launch_server(server_args)
  File "/sgl-workspace/sglang/python/sglang/srt/entrypoints/http_server.py", line 700, in launch_server
    tokenizer_manager, scheduler_info = _launch_subprocesses(server_args=server_args)
  File "/sgl-workspace/sglang/python/sglang/srt/entrypoints/engine.py", line 586, in _launch_subprocesses
    data = scheduler_pipe_readers[i].recv()
  File "/usr/lib/python3.10/multiprocessing/connection.py", line 250, in recv
    buf = self._recv_bytes()
  File "/usr/lib/python3.10/multiprocessing/connection.py", line 414, in _recv_bytes
    buf = self._recv(4)
  File "/usr/lib/python3.10/multiprocessing/connection.py", line 383, in _recv
    raise EOFError
EOFError
```

### Reproduction

```
2025-04-26 20:54:53,631 - pdutils - INFO - runCommand remotely: ssh -o StrictHostKeyChecking=no  ytn0 "PS1=[] source ~/.bashrc  && env && ( CUDA_VISIBLE_DEVICES=0,1,2,3 UCX_TLS=rc,gdr_copy,rc_x,cuda_copy,cuda_ipc UCX_NET_DEVICES=mlx5_bond_1:1,mlx5_bond_2:1,mlx5_bond_3:1,mlx5_bond_4:1,mlx5_bond_5:1,mlx5_bond_6:1,mlx5_bond_7:1,mlx5_bond_8:1 UCX_LOG_LEVEL=info NCCL_DEBUG=WARN SGLANG_PD_NIXL_DEBUG_TRANSFER_TIME=1 SGL_ENABLE_JIT_DEEPGEMM=0 python3.10 -m sglang.launch_server --host 0.0.0.0 --nnodes 1 --node-rank 0 --dist-init-addr ytn0:7010 --model-path /home/qspace/upload/luban_cache/model/luban-llm_deepseek_r1_distill_qwen_1_5b-model_path/DeepSeek-R1-Distill-Qwen-1.5B --trust-remote-code --disable-radix-cache --schedule-policy fcfs --mem-fraction-static 0.70 --disable-overlap-schedule --chunked-prefill-size 32768 --allow-auto-truncate --tp 4 --log-level debug --enable-metrics --page-size 64 --disaggregation-mode prefill --disaggregation-transfer-backend nixl --disaggregation-bootstrap-port 7100 --max-running-requests 32 --port 8080 )"
2025-04-26 20:54:53,632 - pdutils - INFO - runCommand remotely: ssh -o StrictHostKeyChecking=no  ytn0 "PS1=[] source ~/.bashrc  && env && ( CUDA_VISIBLE_DEVICES=4,5,6,7 UCX_TLS=rc,gdr_copy,rc_x,cuda_copy,cuda_ipc UCX_NET_DEVICES=mlx5_bond_1:1,mlx5_bond_2:1,mlx5_bond_3:1,mlx5_bond_4:1,mlx5_bond_5:1,mlx5_bond_6:1,mlx5_bond_7:1,mlx5_bond_8:1 UCX_LOG_LEVEL=info NCCL_DEBUG=WARN SGLANG_PD_NIXL_DEBUG_TRANSFER_TIME=1 SGL_ENABLE_JIT_DEEPGEMM=0 python3.10 -m sglang.launch_server --host 0.0.0.0 --nnodes 1 --node-rank 0 --dist-init-addr ytn0:7020 --model-path /home/qspace/upload/luban_cache/model/luban-llm_deepseek_r1_distill_qwen_1_5b-model_path/DeepSeek-R1-Distill-Qwen-1.5B --trust-remote-code --disable-radix-cache --schedule-policy fcfs --mem-fraction-static 0.70 --disable-overlap-schedule --chunked-prefill-size 32768 --allow-auto-truncate --tp 4 --log-level debug --enable-metrics --page-size 64 --disaggregation-mode decode --disaggregation-transfer-backend nixl --disaggregation-bootstrap-port 7100 --max-running-requests 32 --port 9080 )"
```
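The two launch commands above are long and nearly identical, so it can help to see only what differs between the prefill and decode servers. Below is a minimal sketch that parses two command strings and reports the differing environment variables and flags. The helper names `parse_launch` and `diff_launch` are hypothetical (not part of sglang), and the two command strings are abbreviated copies of the commands above, with only a subset of the variables and flags kept.

```python
import shlex


def parse_launch(cmd: str):
    """Split a launch command into leading NAME=value env assignments and --flag values."""
    tokens = shlex.split(cmd)
    env, flags = {}, {}
    i = 0
    # Leading NAME=value tokens are treated as environment assignments.
    while i < len(tokens) and "=" in tokens[i] and not tokens[i].startswith("-"):
        name, value = tokens[i].split("=", 1)
        env[name] = value
        i += 1
    # Remaining tokens: the program and its arguments; only --flags are recorded.
    while i < len(tokens):
        tok = tokens[i]
        if tok.startswith("--"):
            nxt = tokens[i + 1] if i + 1 < len(tokens) else None
            if nxt is not None and not nxt.startswith("--"):
                flags[tok] = nxt  # flag with a value
                i += 2
                continue
            flags[tok] = True  # boolean flag
        i += 1
    return env, flags


def diff_launch(cmd_a: str, cmd_b: str):
    """Return {name: (value_in_a, value_in_b)} for settings that differ."""
    env_a, flags_a = parse_launch(cmd_a)
    env_b, flags_b = parse_launch(cmd_b)
    a = {**env_a, **flags_a}
    b = {**env_b, **flags_b}
    return {k: (a.get(k), b.get(k)) for k in sorted(set(a) | set(b)) if a.get(k) != b.get(k)}


# Abbreviated versions of the two commands in this report (most flags omitted).
PREFILL = (
    "CUDA_VISIBLE_DEVICES=0,1,2,3 NCCL_DEBUG=WARN python3.10 -m sglang.launch_server "
    "--dist-init-addr ytn0:7010 --tp 4 --disaggregation-mode prefill "
    "--disaggregation-bootstrap-port 7100 --port 8080"
)
DECODE = (
    "CUDA_VISIBLE_DEVICES=4,5,6,7 NCCL_DEBUG=WARN python3.10 -m sglang.launch_server "
    "--dist-init-addr ytn0:7020 --tp 4 --disaggregation-mode decode "
    "--disaggregation-bootstrap-port 7100 --port 9080"
)

if __name__ == "__main__":
    for name, (prefill_value, decode_value) in diff_launch(PREFILL, DECODE).items():
        print(f"{name}: prefill={prefill_value} decode={decode_value}")
```

For the abbreviated commands, the differing settings are `CUDA_VISIBLE_DEVICES`, `--dist-init-addr`, `--disaggregation-mode` and `--port`; both servers run on the same host (`ytn0`) with `--tp 4` and the same `--disaggregation-bootstrap-port`.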

### Environment

NCCL 2.25.1
