INFO 07-01 03:18:11 [__init__.py:244] Automatically detected platform cuda.
INFO 07-01 03:18:23 [api_server.py:1287] vLLM API server version 0.9.1
INFO 07-01 03:18:24 [cli_args.py:309] non-default args: {'port': 8080, 'model': 'Qwen/Qwen2.5-Omni-3B', 'dtype': 'bfloat16', 'allowed_local_media_path': '/', 'served_model_name': ['Qwen2.5-Omni-3B'], 'limit_mm_per_prompt': {'image': 12}}
Unrecognized keys in `rope_scaling` for 'rope_type'='default': {'mrope_section'}
INFO 07-01 03:19:13 [config.py:823] This model supports multiple tasks: {'generate', 'score', 'embed', 'classify', 'reward'}. Defaulting to 'generate'.
INFO 07-01 03:19:15 [api_server.py:265] Started engine process with PID 6927
WARNING 07-01 03:19:24 [env_override.py:17] NCCL_CUMEM_ENABLE is set to 0, skipping override. This may increase memory overhead with cudagraph+allreduce: https://github.com/NVIDIA/nccl/issues/1234
INFO 07-01 03:19:42 [__init__.py:244] Automatically detected platform cuda.
INFO 07-01 03:19:51 [llm_engine.py:230] Initializing a V0 LLM engine (v0.9.1) with config: model='Qwen/Qwen2.5-Omni-3B', speculative_config=None, tokenizer='Qwen/Qwen2.5-Omni-3B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='xgrammar', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=None, served_model_name=Qwen2.5-Omni-3B, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=True, pooler_config=None, compilation_config={"level":0,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":[],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":false,"cudagraph_num_of_warmups":0,"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":256,"local_cache_dir":null}, use_cached_outputs=True,
INFO 07-01 03:19:55 [cuda.py:327] Using Flash Attention backend.
INFO 07-01 03:19:56 [parallel_state.py:1065] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
INFO 07-01 03:19:56 [model_runner.py:1171] Starting to load model Qwen/Qwen2.5-Omni-3B...
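
For reference, the non-default args in the log above correspond to a launch command along these lines (my reconstruction, not the verbatim invocation; in particular the --limit-mm-per-prompt spelling is an assumption, since older vLLM CLIs take image=12 while newer ones take a JSON value):

    vllm serve Qwen/Qwen2.5-Omni-3B \
        --port 8080 \
        --dtype bfloat16 \
        --allowed-local-media-path / \
        --served-model-name Qwen2.5-Omni-3B \
        --limit-mm-per-prompt '{"image": 12}'
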
INFO 07-01 03:20:00 [weight_utils.py:292] Using model weights format ['*.safetensors']
INFO 07-01 03:20:00 [weight_utils.py:308] Time spent downloading weights for Qwen/Qwen2.5-Omni-3B: 0.591730 seconds
Loading safetensors checkpoint shards:   0% Completed | 0/3 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/jun.zhou10/.local/bin/vllm", line 8, in <module>
    sys.exit(main())
  File "/home/jun.zhou10/.local/lib/python3.10/site-packages/vllm/entrypoints/cli/main.py", line 59, in main
    args.dispatch_function(args)
  File "/home/jun.zhou10/.local/lib/python3.10/site-packages/vllm/entrypoints/cli/serve.py", line 58, in cmd
    uvloop.run(run_server(args))
  File "/home/jun.zhou10/.local/lib/python3.10/site-packages/uvloop/__init__.py", line 82, in run
    return loop.run_until_complete(wrapper())
  File "uvloop/loop.pyx", line 1517, in uvloop.loop.Loop.run_until_complete
  File "/home/jun.zhou10/.local/lib/python3.10/site-packages/uvloop/__init__.py", line 61, in wrapper
    return await main
  File "/home/jun.zhou10/.local/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 1323, in run_server
    await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
  File "/home/jun.zhou10/.local/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 1343, in run_server_worker
    async with build_async_engine_client(args, client_config) as engine_client:
  File "/usr/lib/python3.10/contextlib.py", line 199, in __aenter__
    return await anext(self.gen)
  File "/home/jun.zhou10/.local/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 155, in build_async_engine_client
    async with build_async_engine_client_from_engine_args(
  File "/usr/lib/python3.10/contextlib.py", line 199, in __aenter__
    return await anext(self.gen)
  File "/home/jun.zhou10/.local/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 288, in build_async_engine_client_from_engine_args
    raise RuntimeError(
RuntimeError: Engine process failed to start. See stack trace for the root cause.
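
The final RuntimeError is the API server's generic wrapper: the engine process it spawned (PID 6927) died while loading checkpoint shards, and the engine's own stack trace, which would name the actual root cause, is not captured above. One way to surface it is to load the model in-process with vLLM's offline LLM API, mirroring the logged args; a minimal sketch, assuming the same environment:

    # Loading in-process raises the engine's real exception directly,
    # instead of the parent's "Engine process failed to start" wrapper.
    from vllm import LLM

    llm = LLM(
        model="Qwen/Qwen2.5-Omni-3B",
        dtype="bfloat16",
        limit_mm_per_prompt={"image": 12},  # same multimodal limit as the server run
    )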