vLLM + FlashAttention 2 fails to start: `ImportError: undefined symbol` when importing `flash_attn_2_cuda`

### Description

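Starting `qwen-asr-serve` with the vLLM backend (vLLM 0.14.0) and flash-attn 2.8.3 installed fails during engine initialization. While building the rotary embedding layer, vLLM imports `flash_attn`, and loading `flash_attn_2_cuda` raises `ImportError: ... undefined symbol: _ZNK3c106SymInt6sym_neERKS0_`. The engine core never starts and the API server exits with `RuntimeError: Engine core initialization failed`.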

### Reproduction

qwen-asr-serve /pubdata/asr/Qwen/Qwen3-ASR-0.6B --gpu-memory-utilization 0.3 --host 0.0.0.0 --port 20220

### Logs

```shell
(qwen-asr) (base) root@f4c2b35b12bb /workspace/qwen-asr # qwen-asr-serve /pubdata/asr/Qwen/Qwen3-ASR-0.6B --gpu-memory-utilization 0.3 --host 0.0.0.0 --port 20220
WARNING 01-30 10:43:05 [registry.py:801] Model architecture Qwen3ASRForConditionalGeneration is already registered, and will be overwritten by the new model class <class 'qwen_asr.core.vllm_backend.qwen3_asr.Qwen3ASRForConditionalGeneration'>.
(APIServer pid=3359291) INFO 01-30 10:43:06 [api_server.py:1272] vLLM API server version 0.14.0
(APIServer pid=3359291) INFO 01-30 10:43:06 [utils.py:263] non-default args: {'model_tag': '/pubdata/asr/Qwen/Qwen3-ASR-0.6B', 'host': '0.0.0.0', 'port': 20220, 'model': '/pubdata/asr/Qwen/Qwen3-ASR-0.6B', 'gpu_memory_utilization': 0.1}
(APIServer pid=3359291) INFO 01-30 10:43:06 [model.py:530] Resolved architecture: Qwen3ASRForConditionalGeneration
(APIServer pid=3359291) ERROR 01-30 10:43:06 [repo_utils.py:65] Error retrieving safetensors: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/pubdata/asr/Qwen/Qwen3-ASR-0.6B'. Use `repo_type` argument if needed., retrying 1 of 2
(APIServer pid=3359291) ERROR 01-30 10:43:08 [repo_utils.py:63] Error retrieving safetensors: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/pubdata/asr/Qwen/Qwen3-ASR-0.6B'. Use `repo_type` argument if needed.
(APIServer pid=3359291) INFO 01-30 10:43:08 [model.py:1866] Downcasting torch.float32 to torch.bfloat16.
(APIServer pid=3359291) INFO 01-30 10:43:08 [model.py:1545] Using max model len 65536
(APIServer pid=3359291) INFO 01-30 10:43:08 [scheduler.py:229] Chunked prefill is enabled with max_num_batched_tokens=2048.
(APIServer pid=3359291) INFO 01-30 10:43:08 [vllm.py:630] Asynchronous scheduling is enabled.
(APIServer pid=3359291) INFO 01-30 10:43:08 [vllm.py:637] Disabling NCCL for DP synchronization when using async scheduling.
(APIServer pid=3359291) The tokenizer you are loading from '/pubdata/asr/Qwen/Qwen3-ASR-0.6B' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.
(APIServer pid=3359291) The following generation flags are not valid and may be ignored: ['temperature']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
WARNING 01-30 10:43:17 [registry.py:801] Model architecture Qwen3ASRForConditionalGeneration is already registered, and will be overwritten by the new model class <class 'qwen_asr.core.vllm_backend.qwen3_asr.Qwen3ASRForConditionalGeneration'>.
(EngineCore_DP0 pid=3359555) INFO 01-30 10:43:18 [core.py:97] Initializing a V1 LLM engine (v0.14.0) with config: model='/pubdata/asr/Qwen/Qwen3-ASR-0.6B', speculative_config=None, tokenizer='/pubdata/asr/Qwen/Qwen3-ASR-0.6B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=65536, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=/pubdata/asr/Qwen/Qwen3-ASR-0.6B, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer'], 'compile_mm_encoder': False, 'compile_sizes': [], 
'compile_ranges_split_points': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'eliminate_noops': True, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': True}, 'local_cache_dir': None}
(EngineCore_DP0 pid=3359555) The tokenizer you are loading from '/pubdata/asr/Qwen/Qwen3-ASR-0.6B' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.
(EngineCore_DP0 pid=3359555) INFO 01-30 10:43:19 [parallel_state.py:1214] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://172.20.0.2:43935 backend=nccl
(EngineCore_DP0 pid=3359555) INFO 01-30 10:43:19 [parallel_state.py:1425] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A
(EngineCore_DP0 pid=3359555) The tokenizer you are loading from '/pubdata/asr/Qwen/Qwen3-ASR-0.6B' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.
(EngineCore_DP0 pid=3359555) The tokenizer you are loading from '/pubdata/asr/Qwen/Qwen3-ASR-0.6B' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.
(EngineCore_DP0 pid=3359555) INFO 01-30 10:43:22 [gpu_model_runner.py:3808] Starting to load model /pubdata/asr/Qwen/Qwen3-ASR-0.6B...
(EngineCore_DP0 pid=3359555) INFO 01-30 10:43:23 [mm_encoder_attention.py:86] Using AttentionBackendEnum.FLASH_ATTN for MMEncoderAttention.
(EngineCore_DP0 pid=3359555) INFO 01-30 10:43:23 [vllm.py:630] Asynchronous scheduling is enabled.
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936] EngineCore failed to start.
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936] Traceback (most recent call last):
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936]   File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 927, in run_engine_core
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936]     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936]   File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 692, in __init__
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936]     super().__init__(
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936]   File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 106, in __init__
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936]     self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936]                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936]   File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/vllm/v1/executor/abstract.py", line 101, in __init__
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936]     self._init_executor()
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936]   File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/vllm/v1/executor/uniproc_executor.py", line 48, in _init_executor
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936]     self.driver_worker.load_model()
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936]   File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 274, in load_model
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936]     self.model_runner.load_model(eep_scale_up=eep_scale_up)
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936]   File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 3827, in load_model
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936]     self.model = model_loader.load_model(
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936]                  ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936]   File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/vllm/model_executor/model_loader/base_loader.py", line 50, in load_model
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936]     model = initialize_model(
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936]             ^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936]   File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/vllm/model_executor/model_loader/utils.py", line 48, in initialize_model
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936]     return model_class(vllm_config=vllm_config, prefix=prefix)
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936]   File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/qwen_asr/core/vllm_backend/qwen3_asr.py", line 734, in __init__
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936]     self.language_model = Qwen3ForCausalLM(
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936]                           ^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936]   File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/vllm/model_executor/models/qwen3.py", line 274, in __init__
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936]     self.model = Qwen3Model(
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936]                  ^^^^^^^^^^^
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936]   File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/vllm/compilation/decorators.py", line 305, in __init__
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936]     old_init(self, **kwargs)
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936]   File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/vllm/model_executor/models/qwen3.py", line 248, in __init__
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936]     super().__init__(
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936]   File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/vllm/compilation/decorators.py", line 305, in __init__
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936]     old_init(self, **kwargs)
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936]   File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/vllm/model_executor/models/qwen2.py", line 394, in __init__
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936]     self.start_layer, self.end_layer, self.layers = make_layers(
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936]                                                     ^^^^^^^^^^^^
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936]   File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 606, in make_layers
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936]     maybe_offload_to_cpu(layer_fn(prefix=f"{prefix}.{idx}"))
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936]                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936]   File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/vllm/model_executor/models/qwen2.py", line 396, in <lambda>
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936]     lambda prefix: decoder_layer_type(
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936]                    ^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936]   File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/vllm/model_executor/models/qwen3.py", line 181, in __init__
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936]     self.self_attn = Qwen3Attention(
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936]                      ^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936]   File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/vllm/model_executor/models/qwen3.py", line 112, in __init__
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936]     self.rotary_emb = get_rope(
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936]                       ^^^^^^^^^
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936]   File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/vllm/model_executor/layers/rotary_embedding/__init__.py", line 96, in get_rope
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936]     rotary_emb = MRotaryEmbedding(
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936]                  ^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936]   File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/vllm/model_executor/layers/rotary_embedding/mrope.py", line 237, in __init__
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936]     super().__init__(
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936]   File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/vllm/model_executor/layers/rotary_embedding/base.py", line 58, in __init__
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936]     self.apply_rotary_emb = ApplyRotaryEmb(
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936]                             ^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936]   File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/vllm/model_executor/layers/rotary_embedding/common.py", line 138, in __init__
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936]     from flash_attn.ops.triton.rotary import apply_rotary
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936]   File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/flash_attn/__init__.py", line 3, in <module>
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936]     from flash_attn.flash_attn_interface import (
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936]   File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/flash_attn/flash_attn_interface.py", line 15, in <module>
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936]     import flash_attn_2_cuda as flash_attn_gpu
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936] ImportError: /workspace/qwen-asr/.venv/lib/python3.12/site-packages/flash_attn_2_cuda.cpython-312-x86_64-linux-gnu.so: undefined symbol: _ZNK3c106SymInt6sym_neERKS0_
(EngineCore_DP0 pid=3359555) Process EngineCore_DP0:
(EngineCore_DP0 pid=3359555) Traceback (most recent call last):
(EngineCore_DP0 pid=3359555)   File "/opt/conda/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_DP0 pid=3359555)     self.run()
(EngineCore_DP0 pid=3359555)   File "/opt/conda/lib/python3.12/multiprocessing/process.py", line 108, in run
(EngineCore_DP0 pid=3359555)     self._target(*self._args, **self._kwargs)
(EngineCore_DP0 pid=3359555)   File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 940, in run_engine_core
(EngineCore_DP0 pid=3359555)     raise e
(EngineCore_DP0 pid=3359555)   File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 927, in run_engine_core
(EngineCore_DP0 pid=3359555)     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore_DP0 pid=3359555)                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=3359555)   File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 692, in __init__
(EngineCore_DP0 pid=3359555)     super().__init__(
(EngineCore_DP0 pid=3359555)   File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 106, in __init__
(EngineCore_DP0 pid=3359555)     self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=3359555)                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=3359555)   File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/vllm/v1/executor/abstract.py", line 101, in __init__
(EngineCore_DP0 pid=3359555)     self._init_executor()
(EngineCore_DP0 pid=3359555)   File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/vllm/v1/executor/uniproc_executor.py", line 48, in _init_executor
(EngineCore_DP0 pid=3359555)     self.driver_worker.load_model()
(EngineCore_DP0 pid=3359555)   File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 274, in load_model
(EngineCore_DP0 pid=3359555)     self.model_runner.load_model(eep_scale_up=eep_scale_up)
(EngineCore_DP0 pid=3359555)   File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 3827, in load_model
(EngineCore_DP0 pid=3359555)     self.model = model_loader.load_model(
(EngineCore_DP0 pid=3359555)                  ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=3359555)   File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/vllm/model_executor/model_loader/base_loader.py", line 50, in load_model
(EngineCore_DP0 pid=3359555)     model = initialize_model(
(EngineCore_DP0 pid=3359555)             ^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=3359555)   File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/vllm/model_executor/model_loader/utils.py", line 48, in initialize_model
(EngineCore_DP0 pid=3359555)     return model_class(vllm_config=vllm_config, prefix=prefix)
(EngineCore_DP0 pid=3359555)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=3359555)   File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/qwen_asr/core/vllm_backend/qwen3_asr.py", line 734, in __init__
(EngineCore_DP0 pid=3359555)     self.language_model = Qwen3ForCausalLM(
(EngineCore_DP0 pid=3359555)                           ^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=3359555)   File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/vllm/model_executor/models/qwen3.py", line 274, in __init__
(EngineCore_DP0 pid=3359555)     self.model = Qwen3Model(
(EngineCore_DP0 pid=3359555)                  ^^^^^^^^^^^
(EngineCore_DP0 pid=3359555)   File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/vllm/compilation/decorators.py", line 305, in __init__
(EngineCore_DP0 pid=3359555)     old_init(self, **kwargs)
(EngineCore_DP0 pid=3359555)   File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/vllm/model_executor/models/qwen3.py", line 248, in __init__
(EngineCore_DP0 pid=3359555)     super().__init__(
(EngineCore_DP0 pid=3359555)   File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/vllm/compilation/decorators.py", line 305, in __init__
(EngineCore_DP0 pid=3359555)     old_init(self, **kwargs)
(EngineCore_DP0 pid=3359555)   File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/vllm/model_executor/models/qwen2.py", line 394, in __init__
(EngineCore_DP0 pid=3359555)     self.start_layer, self.end_layer, self.layers = make_layers(
(EngineCore_DP0 pid=3359555)                                                     ^^^^^^^^^^^^
(EngineCore_DP0 pid=3359555)   File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 606, in make_layers
(EngineCore_DP0 pid=3359555)     maybe_offload_to_cpu(layer_fn(prefix=f"{prefix}.{idx}"))
(EngineCore_DP0 pid=3359555)                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=3359555)   File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/vllm/model_executor/models/qwen2.py", line 396, in <lambda>
(EngineCore_DP0 pid=3359555)     lambda prefix: decoder_layer_type(
(EngineCore_DP0 pid=3359555)                    ^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=3359555)   File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/vllm/model_executor/models/qwen3.py", line 181, in __init__
(EngineCore_DP0 pid=3359555)     self.self_attn = Qwen3Attention(
(EngineCore_DP0 pid=3359555)                      ^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=3359555)   File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/vllm/model_executor/models/qwen3.py", line 112, in __init__
(EngineCore_DP0 pid=3359555)     self.rotary_emb = get_rope(
(EngineCore_DP0 pid=3359555)                       ^^^^^^^^^
(EngineCore_DP0 pid=3359555)   File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/vllm/model_executor/layers/rotary_embedding/__init__.py", line 96, in get_rope
(EngineCore_DP0 pid=3359555)     rotary_emb = MRotaryEmbedding(
(EngineCore_DP0 pid=3359555)                  ^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=3359555)   File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/vllm/model_executor/layers/rotary_embedding/mrope.py", line 237, in __init__
(EngineCore_DP0 pid=3359555)     super().__init__(
(EngineCore_DP0 pid=3359555)   File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/vllm/model_executor/layers/rotary_embedding/base.py", line 58, in __init__
(EngineCore_DP0 pid=3359555)     self.apply_rotary_emb = ApplyRotaryEmb(
(EngineCore_DP0 pid=3359555)                             ^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=3359555)   File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/vllm/model_executor/layers/rotary_embedding/common.py", line 138, in __init__
(EngineCore_DP0 pid=3359555)     from flash_attn.ops.triton.rotary import apply_rotary
(EngineCore_DP0 pid=3359555)   File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/flash_attn/__init__.py", line 3, in <module>
(EngineCore_DP0 pid=3359555)     from flash_attn.flash_attn_interface import (
(EngineCore_DP0 pid=3359555)   File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/flash_attn/flash_attn_interface.py", line 15, in <module>
(EngineCore_DP0 pid=3359555)     import flash_attn_2_cuda as flash_attn_gpu
(EngineCore_DP0 pid=3359555) ImportError: /workspace/qwen-asr/.venv/lib/python3.12/site-packages/flash_attn_2_cuda.cpython-312-x86_64-linux-gnu.so: undefined symbol: _ZNK3c106SymInt6sym_neERKS0_
[rank0]:[W130 10:43:24.502221646 ProcessGroupNCCL.cpp:1524] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
(APIServer pid=3359291) Traceback (most recent call last):
(APIServer pid=3359291)   File "/workspace/qwen-asr/.venv/bin/qwen-asr-serve", line 10, in <module>
(APIServer pid=3359291)     sys.exit(main())
(APIServer pid=3359291)              ^^^^^^
(APIServer pid=3359291)   File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/qwen_asr/cli/serve.py", line 42, in main
(APIServer pid=3359291)     vllm_main()
(APIServer pid=3359291)   File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/vllm/entrypoints/cli/main.py", line 73, in main
(APIServer pid=3359291)     args.dispatch_function(args)
(APIServer pid=3359291)   File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/vllm/entrypoints/cli/serve.py", line 60, in cmd
(APIServer pid=3359291)     uvloop.run(run_server(args))
(APIServer pid=3359291)   File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/uvloop/__init__.py", line 96, in run
(APIServer pid=3359291)     return __asyncio.run(
(APIServer pid=3359291)            ^^^^^^^^^^^^^^
(APIServer pid=3359291)   File "/opt/conda/lib/python3.12/asyncio/runners.py", line 195, in run
(APIServer pid=3359291)     return runner.run(main)
(APIServer pid=3359291)            ^^^^^^^^^^^^^^^^
(APIServer pid=3359291)   File "/opt/conda/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=3359291)     return self._loop.run_until_complete(task)
(APIServer pid=3359291)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=3359291)   File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=3359291)   File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=3359291)     return await main
(APIServer pid=3359291)            ^^^^^^^^^^
(APIServer pid=3359291)   File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 1319, in run_server
(APIServer pid=3359291)     await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=3359291)   File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 1338, in run_server_worker
(APIServer pid=3359291)     async with build_async_engine_client(
(APIServer pid=3359291)                ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=3359291)   File "/opt/conda/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=3359291)     return await anext(self.gen)
(APIServer pid=3359291)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=3359291)   File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 173, in build_async_engine_client
(APIServer pid=3359291)     async with build_async_engine_client_from_engine_args(
(APIServer pid=3359291)                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=3359291)   File "/opt/conda/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=3359291)     return await anext(self.gen)
(APIServer pid=3359291)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=3359291)   File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 214, in build_async_engine_client_from_engine_args
(APIServer pid=3359291)     async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=3359291)                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=3359291)   File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 205, in from_vllm_config
(APIServer pid=3359291)     return cls(
(APIServer pid=3359291)            ^^^^
(APIServer pid=3359291)   File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 132, in __init__
(APIServer pid=3359291)     self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=3359291)                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=3359291)   File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 122, in make_async_mp_client
(APIServer pid=3359291)     return AsyncMPClient(*client_args)
(APIServer pid=3359291)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=3359291)   File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 824, in __init__
(APIServer pid=3359291)     super().__init__(
(APIServer pid=3359291)   File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 479, in __init__
(APIServer pid=3359291)     with launch_core_engines(vllm_config, executor_class, log_stats) as (
(APIServer pid=3359291)          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=3359291)   File "/opt/conda/lib/python3.12/contextlib.py", line 144, in __exit__
(APIServer pid=3359291)     next(self.gen)
(APIServer pid=3359291)   File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/vllm/v1/engine/utils.py", line 921, in launch_core_engines
(APIServer pid=3359291)     wait_for_engine_startup(
(APIServer pid=3359291)   File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/vllm/v1/engine/utils.py", line 980, in wait_for_engine_startup
(APIServer pid=3359291)     raise RuntimeError(
(APIServer pid=3359291) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
```

### Environment Information

GPU: NVIDIA RTX 4090, 48 GB
NVIDIA-SMI 570.172.08, Driver Version: 570.172.08, CUDA Version: 12.8
(qwen-asr) (base) root@f4c2b35b12bb /workspace/qwen-asr # uv pip list
Package                           Version
--------------------------------- -------------
accelerate                        1.12.0
aiofiles                          24.1.0
aiohappyeyeballs                  2.6.1
aiohttp                           3.13.3
aiosignal                         1.4.0
annotated-doc                     0.0.4
annotated-types                   0.7.0
anthropic                         0.71.0
anyio                             4.12.1
apache-tvm-ffi                    0.1.8.post2
astor                             0.8.1
attrs                             25.4.0
audioread                         3.1.0
av                                16.1.0
blake3                            1.0.8
blinker                           1.9.0
brotli                            1.2.0
cachetools                        6.2.6
cbor2                             5.8.0
certifi                           2026.1.4
cffi                              2.0.0
charset-normalizer                3.4.4
click                             8.3.1
cloudpickle                       3.1.2
compressed-tensors                0.13.0
cryptography                      46.0.4
cuda-bindings                     13.1.1
cuda-pathfinder                   1.3.3
cuda-python                       13.1.1
cupy-cuda12x                      13.6.0
cython                            3.2.4
decorator                         5.2.1
depyf                             0.20.0
dill                              0.4.1
diskcache                         5.6.3
distro                            1.9.0
dnspython                         2.8.0
docstring-parser                  0.17.0
dynet38                           2.2
einops                            0.8.2
email-validator                   2.3.0
fastapi                           0.128.0
fastapi-cli                       0.0.20
fastapi-cloud-cli                 0.11.0
fastar                            0.8.0
fastrlock                         0.8.3
ffmpy                             1.0.0
filelock                          3.20.3
flash-attn                        2.8.3
flashinfer-python                 0.5.3
flask                             3.1.2
frozenlist                        1.8.0
fsspec                            2026.1.0
gguf                              0.17.1
gradio                            6.5.1
gradio-client                     2.0.3
groovy                            0.1.2
grpcio                            1.76.0
grpcio-reflection                 1.76.0
h11                               0.16.0
hf-xet                            1.2.0
httpcore                          1.0.9
httptools                         0.7.1
httpx                             0.28.1
httpx-sse                         0.4.3
huggingface-hub                   0.36.0
idna                              3.11
ijson                             3.4.0.post0
interegular                       0.3.3
itsdangerous                      2.2.0
jinja2                            3.1.6
jiter                             0.12.0
jmespath                          1.1.0
joblib                            1.5.3
jsonschema                        4.26.0
jsonschema-specifications         2025.9.1
lark                              1.2.2
lazy-loader                       0.4
librosa                           0.11.0
llguidance                        1.3.0
llvmlite                          0.44.0
lm-format-enforcer                0.11.3
loguru                            0.7.3
markdown-it-py                    4.0.0
markupsafe                        3.0.3
mcp                               1.26.0
mdurl                             0.1.2
mistral-common                    1.9.0
model-hosting-container-standards 0.1.13
mpmath                            1.3.0
msgpack                           1.1.2
msgspec                           0.20.0
multidict                         6.7.1
nagisa                            0.2.11
networkx                          3.6.1
ninja                             1.13.0
numba                             0.61.2
numpy                             2.2.6
nvidia-cublas-cu12                12.8.4.1
nvidia-cuda-cupti-cu12            12.8.90
nvidia-cuda-nvrtc-cu12            12.8.93
nvidia-cuda-runtime-cu12          12.8.90
nvidia-cudnn-cu12                 9.10.2.21
nvidia-cudnn-frontend             1.18.0
nvidia-cufft-cu12                 11.3.3.83
nvidia-cufile-cu12                1.13.1.3
nvidia-curand-cu12                10.3.9.90
nvidia-cusolver-cu12              11.7.3.90
nvidia-cusparse-cu12              12.5.8.93
nvidia-cusparselt-cu12            0.7.1
nvidia-cutlass-dsl                4.3.5
nvidia-ml-py                      13.590.48
nvidia-nccl-cu12                  2.27.5
nvidia-nvjitlink-cu12             12.8.93
nvidia-nvshmem-cu12               3.3.20
nvidia-nvtx-cu12                  12.8.90
openai                            2.16.0
openai-harmony                    0.0.8
opencv-python-headless            4.13.0.90
orjson                            3.11.6
outlines-core                     0.2.11
packaging                         26.0
pandas                            3.0.0
partial-json-parser               0.2.1.1.post7
pillow                            12.1.0
platformdirs                      4.5.1
pooch                             1.8.2
prometheus-client                 0.24.1
prometheus-fastapi-instrumentator 7.1.0
propcache                         0.4.1
protobuf                          6.33.5
psutil                            7.2.2
py-cpuinfo                        9.0.0
pybase64                          1.4.3
pycountry                         24.6.1
pycparser                         3.0
pydantic                          2.12.5
pydantic-core                     2.41.5
pydantic-extra-types              2.11.0
pydantic-settings                 2.12.0
pydub                             0.25.1
pygments                          2.19.2
pyjwt                             2.10.1
python-dateutil                   2.9.0.post0
python-dotenv                     1.2.1
python-json-logger                4.0.0
python-multipart                  0.0.22
pytz                              2025.2
pyyaml                            6.0.3
pyzmq                             27.1.0
qwen-asr                          0.0.4
qwen-omni-utils                   0.0.8
ray                               2.53.0
referencing                       0.37.0
regex                             2026.1.15
requests                          2.32.5
rich                              14.3.1
rich-toolkit                      0.17.2
rignore                           0.7.6
rpds-py                           0.30.0
safehttpx                         0.1.7
safetensors                       0.7.0
scikit-learn                      1.8.0
scipy                             1.17.0
semantic-version                  2.10.0
sentencepiece                     0.2.1
sentry-sdk                        2.51.0
setproctitle                      1.3.7
setuptools                        80.10.2
shellingham                       1.5.4
six                               1.17.0
sniffio                           1.3.1
soundfile                         0.13.1
sox                               1.5.0
soxr                              1.0.0
soynlp                            0.0.493
sse-starlette                     3.2.0
starlette                         0.50.0
supervisor                        4.3.0
sympy                             1.14.0
tabulate                          0.9.0
threadpoolctl                     3.6.0
tiktoken                          0.12.0
tokenizers                        0.22.2
tomlkit                           0.13.3
torch                             2.9.1
torchaudio                        2.9.1
torchvision                       0.24.1
tqdm                              4.67.1
transformers                      4.57.6
triton                            3.5.1
typer                             0.21.1
typing-extensions                 4.15.0
typing-inspection                 0.4.2
urllib3                           2.6.3
uvicorn                           0.40.0
uvloop                            0.22.1
vllm                              0.14.0
watchfiles                        1.1.1
websockets                        16.0
werkzeug                          3.1.5
xgrammar                          0.1.29
yarl                              1.22.0

### Known Issue

- [x] The issue hasn't already been addressed in Documentation, Issues, and Discussions.
