qwen-asr-serve /pubdata/asr/Qwen/Qwen3-ASR-0.6B --gpu-memory-utilization 0.3 --host 0.0.0.0 --port 20220
(qwen-asr) (base) root@f4c2b35b12bb /workspace/qwen-asr # qwen-asr-serve /pubdata/asr/Qwen/Qwen3-ASR-0.6B --gpu-memory-utilization 0.3 --host 0.0.0.0 --port 20220
WARNING 01-30 10:43:05 [registry.py:801] Model architecture Qwen3ASRForConditionalGeneration is already registered, and will be overwritten by the new model class <class 'qwen_asr.core.vllm_backend.qwen3_asr.Qwen3ASRForConditionalGeneration'>.
(APIServer pid=3359291) INFO 01-30 10:43:06 [api_server.py:1272] vLLM API server version 0.14.0
(APIServer pid=3359291) INFO 01-30 10:43:06 [utils.py:263] non-default args: {'model_tag': '/pubdata/asr/Qwen/Qwen3-ASR-0.6B', 'host': '0.0.0.0', 'port': 20220, 'model': '/pubdata/asr/Qwen/Qwen3-ASR-0.6B', 'gpu_memory_utilization': 0.1}
(APIServer pid=3359291) INFO 01-30 10:43:06 [model.py:530] Resolved architecture: Qwen3ASRForConditionalGeneration
(APIServer pid=3359291) ERROR 01-30 10:43:06 [repo_utils.py:65] Error retrieving safetensors: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/pubdata/asr/Qwen/Qwen3-ASR-0.6B'. Use `repo_type` argument if needed., retrying 1 of 2
(APIServer pid=3359291) ERROR 01-30 10:43:08 [repo_utils.py:63] Error retrieving safetensors: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/pubdata/asr/Qwen/Qwen3-ASR-0.6B'. Use `repo_type` argument if needed.
(APIServer pid=3359291) INFO 01-30 10:43:08 [model.py:1866] Downcasting torch.float32 to torch.bfloat16.
(APIServer pid=3359291) INFO 01-30 10:43:08 [model.py:1545] Using max model len 65536
(APIServer pid=3359291) INFO 01-30 10:43:08 [scheduler.py:229] Chunked prefill is enabled with max_num_batched_tokens=2048.
(APIServer pid=3359291) INFO 01-30 10:43:08 [vllm.py:630] Asynchronous scheduling is enabled.
(APIServer pid=3359291) INFO 01-30 10:43:08 [vllm.py:637] Disabling NCCL for DP synchronization when using async scheduling.
(APIServer pid=3359291) The tokenizer you are loading from '/pubdata/asr/Qwen/Qwen3-ASR-0.6B' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.
(APIServer pid=3359291) The following generation flags are not valid and may be ignored: ['temperature']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
WARNING 01-30 10:43:17 [registry.py:801] Model architecture Qwen3ASRForConditionalGeneration is already registered, and will be overwritten by the new model class <class 'qwen_asr.core.vllm_backend.qwen3_asr.Qwen3ASRForConditionalGeneration'>.
(EngineCore_DP0 pid=3359555) INFO 01-30 10:43:18 [core.py:97] Initializing a V1 LLM engine (v0.14.0) with config: model='/pubdata/asr/Qwen/Qwen3-ASR-0.6B', speculative_config=None, tokenizer='/pubdata/asr/Qwen/Qwen3-ASR-0.6B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=65536, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=/pubdata/asr/Qwen/Qwen3-ASR-0.6B, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer'], 'compile_mm_encoder': False, 'compile_sizes': [], 
'compile_ranges_split_points': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'eliminate_noops': True, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': True}, 'local_cache_dir': None}
(EngineCore_DP0 pid=3359555) The tokenizer you are loading from '/pubdata/asr/Qwen/Qwen3-ASR-0.6B' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.
(EngineCore_DP0 pid=3359555) INFO 01-30 10:43:19 [parallel_state.py:1214] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://172.20.0.2:43935 backend=nccl
(EngineCore_DP0 pid=3359555) INFO 01-30 10:43:19 [parallel_state.py:1425] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A
(EngineCore_DP0 pid=3359555) The tokenizer you are loading from '/pubdata/asr/Qwen/Qwen3-ASR-0.6B' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.
(EngineCore_DP0 pid=3359555) The tokenizer you are loading from '/pubdata/asr/Qwen/Qwen3-ASR-0.6B' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.
(EngineCore_DP0 pid=3359555) INFO 01-30 10:43:22 [gpu_model_runner.py:3808] Starting to load model /pubdata/asr/Qwen/Qwen3-ASR-0.6B...
(EngineCore_DP0 pid=3359555) INFO 01-30 10:43:23 [mm_encoder_attention.py:86] Using AttentionBackendEnum.FLASH_ATTN for MMEncoderAttention.
(EngineCore_DP0 pid=3359555) INFO 01-30 10:43:23 [vllm.py:630] Asynchronous scheduling is enabled.
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936] EngineCore failed to start.
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936] Traceback (most recent call last):
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936] File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 927, in run_engine_core
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936] engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936] File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 692, in __init__
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936] super().__init__(
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936] File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 106, in __init__
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936] self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936] File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/vllm/v1/executor/abstract.py", line 101, in __init__
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936] self._init_executor()
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936] File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/vllm/v1/executor/uniproc_executor.py", line 48, in _init_executor
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936] self.driver_worker.load_model()
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936] File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 274, in load_model
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936] self.model_runner.load_model(eep_scale_up=eep_scale_up)
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936] File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 3827, in load_model
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936] self.model = model_loader.load_model(
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936] ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936] File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/vllm/model_executor/model_loader/base_loader.py", line 50, in load_model
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936] model = initialize_model(
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936] ^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936] File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/vllm/model_executor/model_loader/utils.py", line 48, in initialize_model
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936] return model_class(vllm_config=vllm_config, prefix=prefix)
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936] File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/qwen_asr/core/vllm_backend/qwen3_asr.py", line 734, in __init__
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936] self.language_model = Qwen3ForCausalLM(
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936] ^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936] File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/vllm/model_executor/models/qwen3.py", line 274, in __init__
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936] self.model = Qwen3Model(
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936] ^^^^^^^^^^^
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936] File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/vllm/compilation/decorators.py", line 305, in __init__
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936] old_init(self, **kwargs)
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936] File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/vllm/model_executor/models/qwen3.py", line 248, in __init__
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936] super().__init__(
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936] File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/vllm/compilation/decorators.py", line 305, in __init__
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936] old_init(self, **kwargs)
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936] File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/vllm/model_executor/models/qwen2.py", line 394, in __init__
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936] self.start_layer, self.end_layer, self.layers = make_layers(
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936] ^^^^^^^^^^^^
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936] File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 606, in make_layers
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936] maybe_offload_to_cpu(layer_fn(prefix=f"{prefix}.{idx}"))
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936] File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/vllm/model_executor/models/qwen2.py", line 396, in <lambda>
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936] lambda prefix: decoder_layer_type(
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936] ^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936] File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/vllm/model_executor/models/qwen3.py", line 181, in __init__
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936] self.self_attn = Qwen3Attention(
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936] ^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936] File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/vllm/model_executor/models/qwen3.py", line 112, in __init__
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936] self.rotary_emb = get_rope(
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936] ^^^^^^^^^
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936] File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/vllm/model_executor/layers/rotary_embedding/__init__.py", line 96, in get_rope
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936] rotary_emb = MRotaryEmbedding(
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936] ^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936] File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/vllm/model_executor/layers/rotary_embedding/mrope.py", line 237, in __init__
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936] super().__init__(
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936] File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/vllm/model_executor/layers/rotary_embedding/base.py", line 58, in __init__
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936] self.apply_rotary_emb = ApplyRotaryEmb(
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936] ^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936] File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/vllm/model_executor/layers/rotary_embedding/common.py", line 138, in __init__
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936] from flash_attn.ops.triton.rotary import apply_rotary
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936] File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/flash_attn/__init__.py", line 3, in <module>
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936] from flash_attn.flash_attn_interface import (
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936] File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/flash_attn/flash_attn_interface.py", line 15, in <module>
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936] import flash_attn_2_cuda as flash_attn_gpu
(EngineCore_DP0 pid=3359555) ERROR 01-30 10:43:23 [core.py:936] ImportError: /workspace/qwen-asr/.venv/lib/python3.12/site-packages/flash_attn_2_cuda.cpython-312-x86_64-linux-gnu.so: undefined symbol: _ZNK3c106SymInt6sym_neERKS0_
(EngineCore_DP0 pid=3359555) Process EngineCore_DP0:
(EngineCore_DP0 pid=3359555) Traceback (most recent call last):
(EngineCore_DP0 pid=3359555) File "/opt/conda/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_DP0 pid=3359555) self.run()
(EngineCore_DP0 pid=3359555) File "/opt/conda/lib/python3.12/multiprocessing/process.py", line 108, in run
(EngineCore_DP0 pid=3359555) self._target(*self._args, **self._kwargs)
(EngineCore_DP0 pid=3359555) File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 940, in run_engine_core
(EngineCore_DP0 pid=3359555) raise e
(EngineCore_DP0 pid=3359555) File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 927, in run_engine_core
(EngineCore_DP0 pid=3359555) engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore_DP0 pid=3359555) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=3359555) File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 692, in __init__
(EngineCore_DP0 pid=3359555) super().__init__(
(EngineCore_DP0 pid=3359555) File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 106, in __init__
(EngineCore_DP0 pid=3359555) self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=3359555) ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=3359555) File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/vllm/v1/executor/abstract.py", line 101, in __init__
(EngineCore_DP0 pid=3359555) self._init_executor()
(EngineCore_DP0 pid=3359555) File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/vllm/v1/executor/uniproc_executor.py", line 48, in _init_executor
(EngineCore_DP0 pid=3359555) self.driver_worker.load_model()
(EngineCore_DP0 pid=3359555) File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 274, in load_model
(EngineCore_DP0 pid=3359555) self.model_runner.load_model(eep_scale_up=eep_scale_up)
(EngineCore_DP0 pid=3359555) File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 3827, in load_model
(EngineCore_DP0 pid=3359555) self.model = model_loader.load_model(
(EngineCore_DP0 pid=3359555) ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=3359555) File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/vllm/model_executor/model_loader/base_loader.py", line 50, in load_model
(EngineCore_DP0 pid=3359555) model = initialize_model(
(EngineCore_DP0 pid=3359555) ^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=3359555) File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/vllm/model_executor/model_loader/utils.py", line 48, in initialize_model
(EngineCore_DP0 pid=3359555) return model_class(vllm_config=vllm_config, prefix=prefix)
(EngineCore_DP0 pid=3359555) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=3359555) File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/qwen_asr/core/vllm_backend/qwen3_asr.py", line 734, in __init__
(EngineCore_DP0 pid=3359555) self.language_model = Qwen3ForCausalLM(
(EngineCore_DP0 pid=3359555) ^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=3359555) File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/vllm/model_executor/models/qwen3.py", line 274, in __init__
(EngineCore_DP0 pid=3359555) self.model = Qwen3Model(
(EngineCore_DP0 pid=3359555) ^^^^^^^^^^^
(EngineCore_DP0 pid=3359555) File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/vllm/compilation/decorators.py", line 305, in __init__
(EngineCore_DP0 pid=3359555) old_init(self, **kwargs)
(EngineCore_DP0 pid=3359555) File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/vllm/model_executor/models/qwen3.py", line 248, in __init__
(EngineCore_DP0 pid=3359555) super().__init__(
(EngineCore_DP0 pid=3359555) File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/vllm/compilation/decorators.py", line 305, in __init__
(EngineCore_DP0 pid=3359555) old_init(self, **kwargs)
(EngineCore_DP0 pid=3359555) File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/vllm/model_executor/models/qwen2.py", line 394, in __init__
(EngineCore_DP0 pid=3359555) self.start_layer, self.end_layer, self.layers = make_layers(
(EngineCore_DP0 pid=3359555) ^^^^^^^^^^^^
(EngineCore_DP0 pid=3359555) File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 606, in make_layers
(EngineCore_DP0 pid=3359555) maybe_offload_to_cpu(layer_fn(prefix=f"{prefix}.{idx}"))
(EngineCore_DP0 pid=3359555) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=3359555) File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/vllm/model_executor/models/qwen2.py", line 396, in <lambda>
(EngineCore_DP0 pid=3359555) lambda prefix: decoder_layer_type(
(EngineCore_DP0 pid=3359555) ^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=3359555) File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/vllm/model_executor/models/qwen3.py", line 181, in __init__
(EngineCore_DP0 pid=3359555) self.self_attn = Qwen3Attention(
(EngineCore_DP0 pid=3359555) ^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=3359555) File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/vllm/model_executor/models/qwen3.py", line 112, in __init__
(EngineCore_DP0 pid=3359555) self.rotary_emb = get_rope(
(EngineCore_DP0 pid=3359555) ^^^^^^^^^
(EngineCore_DP0 pid=3359555) File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/vllm/model_executor/layers/rotary_embedding/__init__.py", line 96, in get_rope
(EngineCore_DP0 pid=3359555) rotary_emb = MRotaryEmbedding(
(EngineCore_DP0 pid=3359555) ^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=3359555) File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/vllm/model_executor/layers/rotary_embedding/mrope.py", line 237, in __init__
(EngineCore_DP0 pid=3359555) super().__init__(
(EngineCore_DP0 pid=3359555) File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/vllm/model_executor/layers/rotary_embedding/base.py", line 58, in __init__
(EngineCore_DP0 pid=3359555) self.apply_rotary_emb = ApplyRotaryEmb(
(EngineCore_DP0 pid=3359555) ^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=3359555) File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/vllm/model_executor/layers/rotary_embedding/common.py", line 138, in __init__
(EngineCore_DP0 pid=3359555) from flash_attn.ops.triton.rotary import apply_rotary
(EngineCore_DP0 pid=3359555) File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/flash_attn/__init__.py", line 3, in <module>
(EngineCore_DP0 pid=3359555) from flash_attn.flash_attn_interface import (
(EngineCore_DP0 pid=3359555) File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/flash_attn/flash_attn_interface.py", line 15, in <module>
(EngineCore_DP0 pid=3359555) import flash_attn_2_cuda as flash_attn_gpu
(EngineCore_DP0 pid=3359555) ImportError: /workspace/qwen-asr/.venv/lib/python3.12/site-packages/flash_attn_2_cuda.cpython-312-x86_64-linux-gnu.so: undefined symbol: _ZNK3c106SymInt6sym_neERKS0_
[rank0]:[W130 10:43:24.502221646 ProcessGroupNCCL.cpp:1524] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
(APIServer pid=3359291) Traceback (most recent call last):
(APIServer pid=3359291) File "/workspace/qwen-asr/.venv/bin/qwen-asr-serve", line 10, in <module>
(APIServer pid=3359291) sys.exit(main())
(APIServer pid=3359291) ^^^^^^
(APIServer pid=3359291) File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/qwen_asr/cli/serve.py", line 42, in main
(APIServer pid=3359291) vllm_main()
(APIServer pid=3359291) File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/vllm/entrypoints/cli/main.py", line 73, in main
(APIServer pid=3359291) args.dispatch_function(args)
(APIServer pid=3359291) File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/vllm/entrypoints/cli/serve.py", line 60, in cmd
(APIServer pid=3359291) uvloop.run(run_server(args))
(APIServer pid=3359291) File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/uvloop/__init__.py", line 96, in run
(APIServer pid=3359291) return __asyncio.run(
(APIServer pid=3359291) ^^^^^^^^^^^^^^
(APIServer pid=3359291) File "/opt/conda/lib/python3.12/asyncio/runners.py", line 195, in run
(APIServer pid=3359291) return runner.run(main)
(APIServer pid=3359291) ^^^^^^^^^^^^^^^^
(APIServer pid=3359291) File "/opt/conda/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=3359291) return self._loop.run_until_complete(task)
(APIServer pid=3359291) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=3359291) File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=3359291) File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=3359291) return await main
(APIServer pid=3359291) ^^^^^^^^^^
(APIServer pid=3359291) File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 1319, in run_server
(APIServer pid=3359291) await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=3359291) File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 1338, in run_server_worker
(APIServer pid=3359291) async with build_async_engine_client(
(APIServer pid=3359291) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=3359291) File "/opt/conda/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=3359291) return await anext(self.gen)
(APIServer pid=3359291) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=3359291) File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 173, in build_async_engine_client
(APIServer pid=3359291) async with build_async_engine_client_from_engine_args(
(APIServer pid=3359291) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=3359291) File "/opt/conda/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=3359291) return await anext(self.gen)
(APIServer pid=3359291) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=3359291) File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 214, in build_async_engine_client_from_engine_args
(APIServer pid=3359291) async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=3359291) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=3359291) File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 205, in from_vllm_config
(APIServer pid=3359291) return cls(
(APIServer pid=3359291) ^^^^
(APIServer pid=3359291) File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 132, in __init__
(APIServer pid=3359291) self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=3359291) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=3359291) File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 122, in make_async_mp_client
(APIServer pid=3359291) return AsyncMPClient(*client_args)
(APIServer pid=3359291) ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=3359291) File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 824, in __init__
(APIServer pid=3359291) super().__init__(
(APIServer pid=3359291) File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 479, in __init__
(APIServer pid=3359291) with launch_core_engines(vllm_config, executor_class, log_stats) as (
(APIServer pid=3359291) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=3359291) File "/opt/conda/lib/python3.12/contextlib.py", line 144, in __exit__
(APIServer pid=3359291) next(self.gen)
(APIServer pid=3359291) File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/vllm/v1/engine/utils.py", line 921, in launch_core_engines
(APIServer pid=3359291) wait_for_engine_startup(
(APIServer pid=3359291) File "/workspace/qwen-asr/.venv/lib/python3.12/site-packages/vllm/v1/engine/utils.py", line 980, in wait_for_engine_startup
(APIServer pid=3359291) raise RuntimeError(
(APIServer pid=3359291) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
Description
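vLLM with FlashAttention 2 fails to start. While building the Qwen3 rotary embedding, vLLM imports flash_attn, and loading flash_attn_2_cuda raises ImportError: undefined symbol: _ZNK3c106SymInt6sym_neERKS0_. The API server then exits with RuntimeError: Engine core initialization failed.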
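For readers unfamiliar with mangled C++ names, the sketch below decodes the nested-name part of the undefined symbol reported in the ImportError. It is a minimal illustration, not a full Itanium ABI demangler: the helper name `nested_name` is hypothetical, and it deliberately ignores the argument types that follow the `E` terminator.

```python
def nested_name(mangled: str) -> str:
    """Return the '::'-joined nested name of an Itanium-mangled symbol.

    Only handles the simple form _ZN[qualifiers]<len><name>...E; templates,
    substitutions and argument types are out of scope for this sketch.
    """
    if not mangled.startswith("_ZN"):
        raise ValueError("not a nested-name Itanium symbol")
    i = 3
    # Skip CV-qualifiers of a member function (r=restrict, V=volatile, K=const).
    while i < len(mangled) and mangled[i] in "rVK":
        i += 1
    parts = []
    while i < len(mangled) and mangled[i] != "E":
        j = i
        while j < len(mangled) and mangled[j].isdigit():
            j += 1
        if j == i:
            raise ValueError(f"unsupported component at offset {i}")
        length = int(mangled[i:j])          # each component is length-prefixed
        parts.append(mangled[j:j + length])
        i = j + length
    return "::".join(parts)


print(nested_name("_ZNK3c106SymInt6sym_neERKS0_"))  # prints c10::SymInt::sym_ne
```

The `c10` namespace belongs to PyTorch's core library, so the missing symbol is one that the flash-attn extension expects PyTorch to export. An undefined symbol of this kind typically suggests that the flash-attn binary was built against a different torch version than the one installed, although the log alone does not prove that.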
Reproduction
qwen-asr-serve /pubdata/asr/Qwen/Qwen3-ASR-0.6B --gpu-memory-utilization 0.3 --host 0.0.0.0 --port 20220
Logs
Environment Information
GPU: NVIDIA RTX 4090 (48 GB)
NVIDIA-SMI 570.172.08 Driver Version: 570.172.08 CUDA Version: 12.8
(qwen-asr) (base) root@f4c2b35b12bb /workspace/qwen-asr # uv pip list
Package Version
accelerate 1.12.0
aiofiles 24.1.0
aiohappyeyeballs 2.6.1
aiohttp 3.13.3
aiosignal 1.4.0
annotated-doc 0.0.4
annotated-types 0.7.0
anthropic 0.71.0
anyio 4.12.1
apache-tvm-ffi 0.1.8.post2
astor 0.8.1
attrs 25.4.0
audioread 3.1.0
av 16.1.0
blake3 1.0.8
blinker 1.9.0
brotli 1.2.0
cachetools 6.2.6
cbor2 5.8.0
certifi 2026.1.4
cffi 2.0.0
charset-normalizer 3.4.4
click 8.3.1
cloudpickle 3.1.2
compressed-tensors 0.13.0
cryptography 46.0.4
cuda-bindings 13.1.1
cuda-pathfinder 1.3.3
cuda-python 13.1.1
cupy-cuda12x 13.6.0
cython 3.2.4
decorator 5.2.1
depyf 0.20.0
dill 0.4.1
diskcache 5.6.3
distro 1.9.0
dnspython 2.8.0
docstring-parser 0.17.0
dynet38 2.2
einops 0.8.2
email-validator 2.3.0
fastapi 0.128.0
fastapi-cli 0.0.20
fastapi-cloud-cli 0.11.0
fastar 0.8.0
fastrlock 0.8.3
ffmpy 1.0.0
filelock 3.20.3
flash-attn 2.8.3
flashinfer-python 0.5.3
flask 3.1.2
frozenlist 1.8.0
fsspec 2026.1.0
gguf 0.17.1
gradio 6.5.1
gradio-client 2.0.3
groovy 0.1.2
grpcio 1.76.0
grpcio-reflection 1.76.0
h11 0.16.0
hf-xet 1.2.0
httpcore 1.0.9
httptools 0.7.1
httpx 0.28.1
httpx-sse 0.4.3
huggingface-hub 0.36.0
idna 3.11
ijson 3.4.0.post0
interegular 0.3.3
itsdangerous 2.2.0
jinja2 3.1.6
jiter 0.12.0
jmespath 1.1.0
joblib 1.5.3
jsonschema 4.26.0
jsonschema-specifications 2025.9.1
lark 1.2.2
lazy-loader 0.4
librosa 0.11.0
llguidance 1.3.0
llvmlite 0.44.0
lm-format-enforcer 0.11.3
loguru 0.7.3
markdown-it-py 4.0.0
markupsafe 3.0.3
mcp 1.26.0
mdurl 0.1.2
mistral-common 1.9.0
model-hosting-container-standards 0.1.13
mpmath 1.3.0
msgpack 1.1.2
msgspec 0.20.0
multidict 6.7.1
nagisa 0.2.11
networkx 3.6.1
ninja 1.13.0
numba 0.61.2
numpy 2.2.6
nvidia-cublas-cu12 12.8.4.1
nvidia-cuda-cupti-cu12 12.8.90
nvidia-cuda-nvrtc-cu12 12.8.93
nvidia-cuda-runtime-cu12 12.8.90
nvidia-cudnn-cu12 9.10.2.21
nvidia-cudnn-frontend 1.18.0
nvidia-cufft-cu12 11.3.3.83
nvidia-cufile-cu12 1.13.1.3
nvidia-curand-cu12 10.3.9.90
nvidia-cusolver-cu12 11.7.3.90
nvidia-cusparse-cu12 12.5.8.93
nvidia-cusparselt-cu12 0.7.1
nvidia-cutlass-dsl 4.3.5
nvidia-ml-py 13.590.48
nvidia-nccl-cu12 2.27.5
nvidia-nvjitlink-cu12 12.8.93
nvidia-nvshmem-cu12 3.3.20
nvidia-nvtx-cu12 12.8.90
openai 2.16.0
openai-harmony 0.0.8
opencv-python-headless 4.13.0.90
orjson 3.11.6
outlines-core 0.2.11
packaging 26.0
pandas 3.0.0
partial-json-parser 0.2.1.1.post7
pillow 12.1.0
platformdirs 4.5.1
pooch 1.8.2
prometheus-client 0.24.1
prometheus-fastapi-instrumentator 7.1.0
propcache 0.4.1
protobuf 6.33.5
psutil 7.2.2
py-cpuinfo 9.0.0
pybase64 1.4.3
pycountry 24.6.1
pycparser 3.0
pydantic 2.12.5
pydantic-core 2.41.5
pydantic-extra-types 2.11.0
pydantic-settings 2.12.0
pydub 0.25.1
pygments 2.19.2
pyjwt 2.10.1
python-dateutil 2.9.0.post0
python-dotenv 1.2.1
python-json-logger 4.0.0
python-multipart 0.0.22
pytz 2025.2
pyyaml 6.0.3
pyzmq 27.1.0
qwen-asr 0.0.4
qwen-omni-utils 0.0.8
ray 2.53.0
referencing 0.37.0
regex 2026.1.15
requests 2.32.5
rich 14.3.1
rich-toolkit 0.17.2
rignore 0.7.6
rpds-py 0.30.0
safehttpx 0.1.7
safetensors 0.7.0
scikit-learn 1.8.0
scipy 1.17.0
semantic-version 2.10.0
sentencepiece 0.2.1
sentry-sdk 2.51.0
setproctitle 1.3.7
setuptools 80.10.2
shellingham 1.5.4
six 1.17.0
sniffio 1.3.1
soundfile 0.13.1
sox 1.5.0
soxr 1.0.0
soynlp 0.0.493
sse-starlette 3.2.0
starlette 0.50.0
supervisor 4.3.0
sympy 1.14.0
tabulate 0.9.0
threadpoolctl 3.6.0
tiktoken 0.12.0
tokenizers 0.22.2
tomlkit 0.13.3
torch 2.9.1
torchaudio 2.9.1
torchvision 0.24.1
tqdm 4.67.1
transformers 4.57.6
triton 3.5.1
typer 0.21.1
typing-extensions 4.15.0
typing-inspection 0.4.2
urllib3 2.6.3
uvicorn 0.40.0
uvloop 0.22.1
vllm 0.14.0
watchfiles 1.1.1
websockets 16.0
werkzeug 3.1.5
xgrammar 0.1.29
yarl 1.22.0
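Since the failure looks like a binary mismatch between two of the packages listed above, a short check of the relevant versions plus a direct import attempt can narrow it down without starting the server. The sketch below uses only the standard library; the helper names `installed_versions` and `try_import` are hypothetical, and the distribution names are taken from the package list in this report.

```python
import importlib
from importlib import metadata


def installed_versions(distributions):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in distributions:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def try_import(module_name):
    """Import a module and return (ok, error_text) instead of raising."""
    try:
        importlib.import_module(module_name)
    except ImportError as exc:
        return False, str(exc)
    return True, ""


if __name__ == "__main__":
    # Distribution names as they appear in the `uv pip list` output.
    for name, version in installed_versions(["torch", "flash-attn", "vllm"]).items():
        print(f"{name}: {version or 'not installed'}")

    # Importing flash_attn follows the same path as the traceback:
    # flash_attn -> flash_attn_interface -> flash_attn_2_cuda.
    ok, error = try_import("flash_attn")
    print("flash_attn import:", "ok" if ok else f"failed: {error}")
```

If the import fails here with the same undefined-symbol message, the problem is reproducible outside vLLM, which points at the flash-attn build rather than at qwen-asr or vLLM. The usual remedy in that situation is to install a flash-attn build that matches the installed torch version, but that is an assumption this report has not yet confirmed.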
Known Issue