Closed
Description
System Info
INFO 07-01 03:29:45 [__init__.py:244] Automatically detected platform cuda.
Collecting environment information...
==============================
System Info
==============================
OS : Ubuntu 22.04.4 LTS (x86_64)
GCC version : (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version : Could not collect
CMake version : version 3.30.2
Libc version : glibc-2.35
==============================
PyTorch Info
==============================
PyTorch version : 2.7.1+cu126
Is debug build : False
CUDA used to build PyTorch : 12.6
ROCM used to build PyTorch : N/A
==============================
Python Environment
==============================
Python version : 3.10.12 (main, Jul 29 2024, 16:56:48) [GCC 11.4.0] (64-bit runtime)
Python platform : Linux-5.4.0-169-generic-x86_64-with-glibc2.35
==============================
CUDA / GPU Info
==============================
Is CUDA available : True
CUDA runtime version : 12.6.20
CUDA_MODULE_LOADING set to : LAZY
GPU models and configuration : GPU 0: NVIDIA A100-SXM4-80GB
Nvidia driver version : 535.216.03
cuDNN version : Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.9.3.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv.so.9.3.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn.so.9.3.0
/usr/lib/x86_64-linux-gnu/libcudnn_engines_precompiled.so.9.3.0
/usr/lib/x86_64-linux-gnu/libcudnn_engines_runtime_compiled.so.9.3.0
/usr/lib/x86_64-linux-gnu/libcudnn_graph.so.9.3.0
/usr/lib/x86_64-linux-gnu/libcudnn_heuristic.so.9.3.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops.so.9.3.0
HIP runtime version : N/A
MIOpen runtime version : N/A
Is XNNPACK available : True
==============================
CPU Info
==============================
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 43 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 128
On-line CPU(s) list: 0-127
Vendor ID: AuthenticAMD
Model name: AMD EPYC 7763 64-Core Processor
CPU family: 25
Model: 1
Thread(s) per core: 1
Core(s) per socket: 64
Socket(s): 2
Stepping: 1
Frequency boost: enabled
CPU max MHz: 2450.0000
CPU min MHz: 1500.0000
BogoMIPS: 4890.68
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 invpcid_single hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold v_vmsave_vmload vgif umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca sme sev sev_es
Virtualization: AMD-V
L1d cache: 4 MiB (128 instances)
L1i cache: 4 MiB (128 instances)
L2 cache: 64 MiB (128 instances)
L3 cache: 512 MiB (16 instances)
NUMA node(s): 4
NUMA node0 CPU(s): 0-31
NUMA node1 CPU(s): 32-63
NUMA node2 CPU(s): 64-95
NUMA node3 CPU(s): 96-127
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec store bypass: Vulnerable
Vulnerability Spectre v1: Vulnerable: __user pointer sanitization and usercopy barriers only; no swapgs barriers
Vulnerability Spectre v2: Vulnerable, IBPB: disabled, STIBP: disabled, PBRSB-eIBRS: Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
==============================
Versions of relevant libraries
==============================
[pip3] flash_attn==2.8.0.post2
[pip3] flake8==7.1.1
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.6.4.1
[pip3] nvidia-cuda-cupti-cu12==12.6.80
[pip3] nvidia-cuda-nvrtc-cu12==12.6.77
[pip3] nvidia-cuda-runtime-cu12==12.6.77
[pip3] nvidia-cudnn-cu12==9.5.1.17
[pip3] nvidia-cudnn-frontend==1.5.2
[pip3] nvidia-cufft-cu12==11.3.0.4
[pip3] nvidia-cufile-cu12==1.11.1.6
[pip3] nvidia-curand-cu12==10.3.7.77
[pip3] nvidia-cusolver-cu12==11.7.1.2
[pip3] nvidia-cusparse-cu12==12.5.4.2
[pip3] nvidia-cusparselt-cu12==0.6.3
[pip3] nvidia-dali-cuda120==1.40.0
[pip3] nvidia-ml-py==12.575.51
[pip3] nvidia-ml-py3==7.352.0
[pip3] nvidia-modelopt==0.15.0
[pip3] nvidia-nccl-cu12==2.26.2
[pip3] nvidia-nvimgcodec-cu12==0.3.0.5
[pip3] nvidia-nvjitlink-cu12==12.6.85
[pip3] nvidia-nvtx-cu12==12.6.77
[pip3] nvidia-pyindex==1.0.9
[pip3] nvidia-smi==0.1.3
[pip3] onnx==1.16.1
[pip3] onnxruntime-gpu==1.17.1
[pip3] onnxsim==0.4.36
[pip3] open-clip-torch==2.24.0
[pip3] optree==0.13.0
[pip3] pynvml==12.0.0
[pip3] pytorch-lightning==2.2.4
[pip3] pytorch-triton==3.0.0+dedb7bdf3
[pip3] pyzmq==26.2.0
[pip3] sentence-transformers==4.1.0
[pip3] torch==2.7.1
[pip3] torchaudio==2.7.0
[pip3] torchmetrics==1.4.0.post0
[pip3] torchpack==0.3.1
[pip3] torchprofile==0.0.4
[pip3] torchvision==0.22.1
[pip3] transformers==4.52.4
[pip3] transformers-stream-generator==0.0.5
[pip3] triton==3.3.1
[conda] Could not collect
==============================
vLLM Info
==============================
ROCM Version : Could not collect
Neuron SDK Version : N/A
vLLM Version : 0.9.1
vLLM Build Flags:
CUDA Archs: 5.2 6.0 6.1 7.0 7.2 7.5 8.0 8.6 8.7 9.0+PTX; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0 NIC0 NIC1 NIC2 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X SYS SYS SYS 64-95 2 N/A
NIC0 SYS X SYS SYS
NIC1 SYS SYS X SYS
NIC2 SYS SYS SYS X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_0
NIC1: mlx5_1
NIC2: mlx5_2
==============================
Environment Variables
==============================
NVIDIA_VISIBLE_DEVICES=GPU-84d7af4f-bb6d-9c62-0358-bcf0488cbbe5
CUBLAS_VERSION=12.6.0.22
NVIDIA_REQUIRE_CUDA=cuda>=9.0
CUDA_CACHE_DISABLE=1
TORCH_CUDA_ARCH_LIST=5.2 6.0 6.1 7.0 7.2 7.5 8.0 8.6 8.7 9.0+PTX
NCCL_VERSION=2.22.3
NVIDIA_DRIVER_CAPABILITIES=video,compute,utility,graphics
NVIDIA_PRODUCT_NAME=PyTorch
CUDA_VERSION=12.6.0.022
PYTORCH_VERSION=2.5.0a0+872d972
PYTORCH_BUILD_NUMBER=0
CUDNN_FRONTEND_VERSION=1.5.2
CUDNN_VERSION=9.3.0.75
PYTORCH_HOME=/opt/pytorch/pytorch
LD_LIBRARY_PATH=/usr/local/lib/python3.10/dist-packages/torch/lib:/usr/local/lib/python3.10/dist-packages/torch_tensorrt/lib:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
NVIDIA_BUILD_ID=107063150
CUDA_DRIVER_VERSION=560.35.03
PYTORCH_BUILD_VERSION=2.5.0a0+872d972
CUDA_HOME=/usr/local/cuda
CUDA_HOME=/usr/local/cuda
CUDA_MODULE_LOADING=LAZY
NVIDIA_REQUIRE_JETPACK_HOST_MOUNTS=
NVIDIA_PYTORCH_VERSION=24.08
TORCH_ALLOW_TF32_CUBLAS_OVERRIDE=1
NCCL_CUMEM_ENABLE=0
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
Who can help?
No response
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
When I use the following command to deploy the vLLM service:
VLLM_USE_V1=0 \
vllm serve "Qwen/Qwen2.5-Omni-3B" \
--port "8080" \
--dtype bfloat16 \
--allowed-local-media-path / \
--served-model-name "Qwen2.5-Omni-3B" \
--limit-mm-per-prompt "image=12"
With transformers==4.53.0 plus #39125, it reports this error:
ERROR 07-01 03:20:42 [engine.py:458] cu_seqlens_q must have shape (batch_size + 1)
ERROR 07-01 03:20:42 [engine.py:458] Traceback (most recent call last):
ERROR 07-01 03:20:42 [engine.py:458] File "/home/jun.zhou10/.local/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 446, in run_mp_engine
ERROR 07-01 03:20:42 [engine.py:458] engine = MQLLMEngine.from_vllm_config(
ERROR 07-01 03:20:42 [engine.py:458] File "/home/jun.zhou10/.local/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 133, in from_vllm_config
ERROR 07-01 03:20:42 [engine.py:458] return cls(
ERROR 07-01 03:20:42 [engine.py:458] File "/home/jun.zhou10/.local/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 87, in __init__
ERROR 07-01 03:20:42 [engine.py:458] self.engine = LLMEngine(*args, **kwargs)
ERROR 07-01 03:20:42 [engine.py:458] File "/home/jun.zhou10/.local/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 268, in __init__
ERROR 07-01 03:20:42 [engine.py:458] self._initialize_kv_caches()
ERROR 07-01 03:20:42 [engine.py:458] File "/home/jun.zhou10/.local/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 413, in _initialize_kv_caches
ERROR 07-01 03:20:42 [engine.py:458] self.model_executor.determine_num_available_blocks())
ERROR 07-01 03:20:42 [engine.py:458] File "/home/jun.zhou10/.local/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 104, in determine_num_available_blocks
ERROR 07-01 03:20:42 [engine.py:458] results = self.collective_rpc("determine_num_available_blocks")
ERROR 07-01 03:20:42 [engine.py:458] File "/home/jun.zhou10/.local/lib/python3.10/site-packages/vllm/executor/uniproc_executor.py", line 57, in collective_rpc
ERROR 07-01 03:20:42 [engine.py:458] answer = run_method(self.driver_worker, method, args, kwargs)
ERROR 07-01 03:20:42 [engine.py:458] File "/home/jun.zhou10/.local/lib/python3.10/site-packages/vllm/utils.py", line 2671, in run_method
ERROR 07-01 03:20:42 [engine.py:458] return func(*args, **kwargs)
ERROR 07-01 03:20:42 [engine.py:458] File "/home/jun.zhou10/.local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 07-01 03:20:42 [engine.py:458] return func(*args, **kwargs)
ERROR 07-01 03:20:42 [engine.py:458] File "/home/jun.zhou10/.local/lib/python3.10/site-packages/vllm/worker/worker.py", line 256, in determine_num_available_blocks
ERROR 07-01 03:20:42 [engine.py:458] self.model_runner.profile_run()
ERROR 07-01 03:20:42 [engine.py:458] File "/home/jun.zhou10/.local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 07-01 03:20:42 [engine.py:458] return func(*args, **kwargs)
ERROR 07-01 03:20:42 [engine.py:458] File "/home/jun.zhou10/.local/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 1300, in profile_run
ERROR 07-01 03:20:42 [engine.py:458] self._dummy_run(max_num_batched_tokens, max_num_seqs)
ERROR 07-01 03:20:42 [engine.py:458] File "/home/jun.zhou10/.local/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 1426, in _dummy_run
ERROR 07-01 03:20:42 [engine.py:458] self.execute_model(model_input, kv_caches, intermediate_tensors)
ERROR 07-01 03:20:42 [engine.py:458] File "/home/jun.zhou10/.local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 07-01 03:20:42 [engine.py:458] return func(*args, **kwargs)
ERROR 07-01 03:20:42 [engine.py:458] File "/home/jun.zhou10/.local/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 1844, in execute_model
ERROR 07-01 03:20:42 [engine.py:458] hidden_or_intermediate_states = model_executable(
ERROR 07-01 03:20:42 [engine.py:458] File "/home/jun.zhou10/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
ERROR 07-01 03:20:42 [engine.py:458] return self._call_impl(*args, **kwargs)
ERROR 07-01 03:20:42 [engine.py:458] File "/home/jun.zhou10/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
ERROR 07-01 03:20:42 [engine.py:458] return forward_call(*args, **kwargs)
ERROR 07-01 03:20:42 [engine.py:458] File "/home/jun.zhou10/.local/lib/python3.10/site-packages/vllm/model_executor/models/qwen2_5_omni_thinker.py", line 875, in forward
ERROR 07-01 03:20:42 [engine.py:458] multimodal_embeddings = self.get_multimodal_embeddings_v0(**kwargs)
ERROR 07-01 03:20:42 [engine.py:458] File "/home/jun.zhou10/.local/lib/python3.10/site-packages/vllm/model_executor/models/qwen2_5_omni_thinker.py", line 831, in get_multimodal_embeddings_v0
ERROR 07-01 03:20:42 [engine.py:458] audio_embeds = self._process_audio_input(audio_input)
ERROR 07-01 03:20:42 [engine.py:458] File "/home/jun.zhou10/.local/lib/python3.10/site-packages/vllm/model_executor/models/qwen2_5_omni_thinker.py", line 652, in _process_audio_input
ERROR 07-01 03:20:42 [engine.py:458] audio_outputs = self.audio_tower(
ERROR 07-01 03:20:42 [engine.py:458] File "/home/jun.zhou10/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
ERROR 07-01 03:20:42 [engine.py:458] return self._call_impl(*args, **kwargs)
ERROR 07-01 03:20:42 [engine.py:458] File "/home/jun.zhou10/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
ERROR 07-01 03:20:42 [engine.py:458] return forward_call(*args, **kwargs)
ERROR 07-01 03:20:42 [engine.py:458] File "/home/jun.zhou10/.local/lib/python3.10/site-packages/transformers/models/qwen2_5_omni/modeling_qwen2_5_omni.py", line 838, in forward
ERROR 07-01 03:20:42 [engine.py:458] layer_outputs = encoder_layer(hidden_states, cu_seqlens, **kwargs)
ERROR 07-01 03:20:42 [engine.py:458] File "/home/jun.zhou10/.local/lib/python3.10/site-packages/transformers/modeling_layers.py", line 83, in __call__
ERROR 07-01 03:20:42 [engine.py:458] return super().__call__(*args, **kwargs)
ERROR 07-01 03:20:42 [engine.py:458] File "/home/jun.zhou10/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
ERROR 07-01 03:20:42 [engine.py:458] return self._call_impl(*args, **kwargs)
ERROR 07-01 03:20:42 [engine.py:458] File "/home/jun.zhou10/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
ERROR 07-01 03:20:42 [engine.py:458] return forward_call(*args, **kwargs)
ERROR 07-01 03:20:42 [engine.py:458] File "/home/jun.zhou10/.local/lib/python3.10/site-packages/transformers/models/qwen2_5_omni/modeling_qwen2_5_omni.py", line 704, in forward
ERROR 07-01 03:20:42 [engine.py:458] hidden_states = self.self_attn(
ERROR 07-01 03:20:42 [engine.py:458] File "/home/jun.zhou10/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
ERROR 07-01 03:20:42 [engine.py:458] return self._call_impl(*args, **kwargs)
ERROR 07-01 03:20:42 [engine.py:458] File "/home/jun.zhou10/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
ERROR 07-01 03:20:42 [engine.py:458] return forward_call(*args, **kwargs)
ERROR 07-01 03:20:42 [engine.py:458] File "/home/jun.zhou10/.local/lib/python3.10/site-packages/transformers/models/qwen2_5_omni/modeling_qwen2_5_omni.py", line 650, in forward
ERROR 07-01 03:20:42 [engine.py:458] attn_output, _ = attention_interface(
ERROR 07-01 03:20:42 [engine.py:458] File "/home/jun.zhou10/.local/lib/python3.10/site-packages/transformers/integrations/flash_attention.py", line 65, in flash_attention_forward
ERROR 07-01 03:20:42 [engine.py:458] attn_output = _flash_attention_forward(
ERROR 07-01 03:20:42 [engine.py:458] File "/home/jun.zhou10/.local/lib/python3.10/site-packages/transformers/modeling_flash_attention_utils.py", line 520, in _flash_attention_forward
ERROR 07-01 03:20:42 [engine.py:458] attn_output_unpad = _flash_attn_varlen_func(
ERROR 07-01 03:20:42 [engine.py:458] File "/home/jun.zhou10/.local/lib/python3.10/site-packages/flash_attn/flash_attn_interface.py", line 1443, in flash_attn_varlen_func
ERROR 07-01 03:20:42 [engine.py:458] return FlashAttnVarlenFunc.apply(
ERROR 07-01 03:20:42 [engine.py:458] File "/home/jun.zhou10/.local/lib/python3.10/site-packages/torch/autograd/function.py", line 575, in apply
ERROR 07-01 03:20:42 [engine.py:458] return super().apply(*args, **kwargs) # type: ignore[misc]
ERROR 07-01 03:20:42 [engine.py:458] File "/home/jun.zhou10/.local/lib/python3.10/site-packages/flash_attn/flash_attn_interface.py", line 925, in forward
ERROR 07-01 03:20:42 [engine.py:458] out_padded, softmax_lse, S_dmask, rng_state = _wrapped_flash_attn_varlen_forward(
ERROR 07-01 03:20:42 [engine.py:458] File "/home/jun.zhou10/.local/lib/python3.10/site-packages/torch/_ops.py", line 1158, in __call__
ERROR 07-01 03:20:42 [engine.py:458] return self._op(*args, **(kwargs or {}))
ERROR 07-01 03:20:42 [engine.py:458] File "/home/jun.zhou10/.local/lib/python3.10/site-packages/torch/_library/custom_ops.py", line 335, in backend_impl
ERROR 07-01 03:20:42 [engine.py:458] result = self._backend_fns[device_type](*args, **kwargs)
ERROR 07-01 03:20:42 [engine.py:458] File "/home/jun.zhou10/.local/lib/python3.10/site-packages/torch/_compile.py", line 51, in inner
ERROR 07-01 03:20:42 [engine.py:458] return disable_fn(*args, **kwargs)
ERROR 07-01 03:20:42 [engine.py:458] File "/home/jun.zhou10/.local/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 838, in _fn
ERROR 07-01 03:20:42 [engine.py:458] return fn(*args, **kwargs)
ERROR 07-01 03:20:42 [engine.py:458] File "/home/jun.zhou10/.local/lib/python3.10/site-packages/torch/_library/custom_ops.py", line 367, in wrapped_fn
ERROR 07-01 03:20:42 [engine.py:458] return fn(*args, **kwargs)
ERROR 07-01 03:20:42 [engine.py:458] File "/home/jun.zhou10/.local/lib/python3.10/site-packages/flash_attn/flash_attn_interface.py", line 165, in _flash_attn_varlen_forward
ERROR 07-01 03:20:42 [engine.py:458] out, softmax_lse, S_dmask, rng_state = flash_attn_gpu.varlen_fwd(
ERROR 07-01 03:20:42 [engine.py:458] RuntimeError: cu_seqlens_q must have shape (batch_size + 1)
The full error log is in log.txt.
The same command runs normally under transformers 4.52.4.
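For readers unfamiliar with the constraint named in the error message, here is a minimal sketch of what a variable-length attention kernel generally expects from `cu_seqlens`: a list of cumulative sequence boundaries with one more entry than there are sequences. This is only an illustration in plain Python, not the transformers or flash_attn code, and it does not claim to identify the cause of this regression. The helper names `build_cu_seqlens` and `check_cu_seqlens` and the example lengths are hypothetical.

```python
from itertools import accumulate


def build_cu_seqlens(seq_lens):
    """Return cumulative boundaries [0, l0, l0 + l1, ...] for packed sequences."""
    return [0, *accumulate(seq_lens)]


def check_cu_seqlens(cu_seqlens, batch_size):
    """Mirror the shape rule from the error message: batch_size + 1 entries."""
    return len(cu_seqlens) == batch_size + 1


# Hypothetical example: three packed sequences of lengths 3, 5 and 2.
seq_lens = [3, 5, 2]
cu_seqlens = build_cu_seqlens(seq_lens)

print(cu_seqlens)                                  # [0, 3, 8, 10]
print(check_cu_seqlens(cu_seqlens, len(seq_lens)))  # True
```

If the batch size the kernel infers does not match the number of boundaries it is given, a check of this kind fails, which is the condition the `RuntimeError` above describes.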
Expected behavior
The server should start without errors.