Namespace(subparser='bench', bench_type='serve', dispatch_function=<function BenchmarkServingSubcommand.cmd at 0x7fc96c1b79c0>, trust_remote_code=False, seed=0, num_prompts=1000, dataset_name='random', no_stream=False, dataset_path=None, no_oversample=False, skip_chat_template=False, enable_multimodal_chat=False, disable_shuffle=True, custom_output_len=256, spec_bench_output_len=256, spec_bench_category=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, blazedit_min_distance=0.0, blazedit_max_distance=1.0, asr_max_audio_len_sec=inf, asr_min_audio_len_sec=0.0, random_input_len=40000, random_output_len=300, random_range_ratio=0.0, random_prefix_len=0, random_batch_size=1, no_reranker=False, random_mm_base_items_per_request=1, random_mm_num_mm_items_range_ratio=0.0, random_mm_limit_mm_per_prompt={'image': 255, 'video': 1}, random_mm_bucket_config={(256, 256, 1): 0.5, (720, 1280, 1): 0.5, (720, 1280, 16): 0.0}, hf_subset=None, hf_split=None, hf_name=None, hf_output_len=None, prefix_repetition_prefix_len=256, prefix_repetition_suffix_len=256, prefix_repetition_num_prefixes=10, prefix_repetition_output_len=128, label=None, backend='vllm', base_url=None, host='127.0.0.1', port=8000, endpoint='/v1/completions', header=None, max_concurrency=None, model='/cloud/oss_checkpoints/zai-org/GLM-4.7-FP8', input_len=None, output_len=None, tokenizer=None, tokenizer_mode='auto', use_beam_search=False, logprobs=None, request_rate=1.0, burstiness=1.0, disable_tqdm=False, num_warmups=0, profile=False, save_result=True, save_detailed=True, append_result=False, metadata=None, result_dir='./vllm_bench_results/test/', result_filename=None, ignore_eos=False, percentile_metrics='ttft,tpot,e2el', metric_percentiles='50,90,95,99', goodput=None, request_id_prefix='bench-b8ace93c-', top_p=None, top_k=None, min_p=None, temperature=None, frequency_penalty=None, presence_penalty=None, repetition_penalty=None, served_model_name=None, lora_modules=None, lora_assignment='random', ramp_up_strategy=None, ramp_up_start_rps=None, ramp_up_end_rps=None, ready_check_timeout_sec=0, extra_body=None, skip_tokenizer_init=False, insecure=False, plot_timeline=False, timeline_itl_thresholds=[25.0, 50.0], plot_dataset_stats=False)
INFO 04-01 10:30:56 [datasets.py:700] Sampling input_len from [40000, 40000] and output_len from [300, 300]
WARNING: vllm bench serve no longer sets temperature==0 (greedy) in requests by default. The default will be determined on the server side and can be model/API specific. For the old behavior, include --temperature=0.
Starting initial single prompt test run...
Skipping endpoint ready check.
Starting main benchmark run...
Traffic request rate: 1.0
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: None
100%|██████████████████████████████████████████████████████████████████████████████████| 1000/1000 [56:59<00:00, 3.42s/it]
Failed requests during benchmark run detected (capping to 10):
Error 0: Traceback (most recent call last):
File "/opensource/guohong/vllm/vllm/benchmarks/lib/endpoint_request_func.py", line 205, in async_request_openai_completions
messages = handler.add_chunk(chunk_bytes)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opensource/guohong/vllm/vllm/benchmarks/lib/endpoint_request_func.py", line 32, in add_chunk
chunk_str = chunk_bytes.decode("utf-8")
^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe5 in position 32760: unexpected end of data
Error 1: Traceback (most recent call last):
File "/opensource/guohong/vllm/vllm/benchmarks/lib/endpoint_request_func.py", line 205, in async_request_openai_completions
messages = handler.add_chunk(chunk_bytes)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opensource/guohong/vllm/vllm/benchmarks/lib/endpoint_request_func.py", line 32, in add_chunk
chunk_str = chunk_bytes.decode("utf-8")
^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 32760: unexpected end of data
Error 2: Traceback (most recent call last):
File "/opensource/guohong/vllm/vllm/benchmarks/lib/endpoint_request_func.py", line 205, in async_request_openai_completions
messages = handler.add_chunk(chunk_bytes)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opensource/guohong/vllm/vllm/benchmarks/lib/endpoint_request_func.py", line 32, in add_chunk
chunk_str = chunk_bytes.decode("utf-8")
^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe4 in position 65482: unexpected end of data
Error 3: Traceback (most recent call last):
File "/opensource/guohong/vllm/vllm/benchmarks/lib/endpoint_request_func.py", line 205, in async_request_openai_completions
messages = handler.add_chunk(chunk_bytes)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opensource/guohong/vllm/vllm/benchmarks/lib/endpoint_request_func.py", line 32, in add_chunk
chunk_str = chunk_bytes.decode("utf-8")
^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 65527-65528: unexpected end of data
tip: install termplotlib and gnuplot to plot the metrics
============ Serving Benchmark Result ============
Successful requests: 996
Failed requests: 4
Request rate configured (RPS): 1.00
Benchmark duration (s): 3419.64
Total input tokens: 39840000
Total generated tokens: 298800
Request throughput (req/s): 0.29
Output token throughput (tok/s): 87.38
Peak output token throughput (tok/s): 296.00
Peak concurrent requests: 711.00
Total token throughput (tok/s): 11737.74
---------------Time to First Token----------------
Mean TTFT (ms): 1206448.85
Median TTFT (ms): 1200351.31
P50 TTFT (ms): 1200351.31
P90 TTFT (ms): 2173032.14
P95 TTFT (ms): 2298578.44
P99 TTFT (ms): 2386296.33
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 81.88
Median TPOT (ms): 83.55
P50 TPOT (ms): 83.55
P90 TPOT (ms): 101.78
P95 TPOT (ms): 103.87
P99 TPOT (ms): 111.89
----------------End-to-end Latency----------------
Mean E2EL (ms): 1230931.18
Median E2EL (ms): 1225871.19
P50 E2EL (ms): 1225871.19
P90 E2EL (ms): 2197145.25
P95 E2EL (ms): 2324417.13
P99 E2EL (ms): 2410427.41
---------------Speculative Decoding---------------
Acceptance rate (%): 19.66
Acceptance length: 1.20
Drafts: 249112
Draft tokens: 249112
Accepted tokens: 48984
Per-position acceptance (%):
Position 0: 19.66
==================================================
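Taking the summary above at face value, the headline figures are mutually consistent. A quick arithmetic check (my reading of the printed fields, not vLLM's reporting code):

```python
# All inputs copied from the result block above.
ok, duration = 996, 3419.64                 # successful requests, seconds
in_tok, out_tok = ok * 40000, ok * 300      # 39,840,000 / 298,800 tokens
print(ok / duration)                        # ~0.29 req/s
print(out_tok / duration)                   # ~87.4 output tok/s
print((in_tok + out_tok) / duration)        # ~11,737.7 total tok/s

# Speculative decoding: acceptance rate = accepted / draft tokens.
drafts, accepted = 249112, 48984
print(accepted / drafts)                    # ~0.1966 -> 19.66 %
print(1 + accepted / drafts)                # ~1.20 acceptance length
```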
Your current environment

The output of `python collect_env.py`

🐛 Describe the bug
Serve GLM-4.7-FP8 with:
Run `vllm bench serve` in another terminal.
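The exact command line wasn't pasted; a plausible reconstruction from the Namespace dump at the top of this report (the flag mapping is inferred from the parsed arguments, so treat it as approximate):

```bash
vllm bench serve \
  --model /cloud/oss_checkpoints/zai-org/GLM-4.7-FP8 \
  --dataset-name random \
  --random-input-len 40000 \
  --random-output-len 300 \
  --num-prompts 1000 \
  --request-rate 1.0 \
  --save-result --save-detailed \
  --result-dir ./vllm_bench_results/test/
```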
Got results with 4 UnicodeDecodeError failures (full benchmark output shown above).
Failed to reproduce #37587 and #37599.
This error was reproduced several times with the same `vllm bench serve` command.
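The failing offsets (32760, 65482, 65527-65528) sit just under the 32 KiB / 64 KiB marks, and 0xe4/0xe5/0xe9 are lead bytes of three-byte UTF-8 sequences (the CJK range), which points at a multi-byte character being split across stream chunks before `chunk_bytes.decode("utf-8")` runs in `add_chunk` (endpoint_request_func.py, line 32). A minimal sketch of a boundary-safe alternative; the names mirror the traceback, but the implementation is my assumption, not vLLM's actual code:

```python
import codecs

class ChunkHandler:
    """Sketch: decode a streamed byte sequence without assuming each
    chunk ends on a UTF-8 character boundary."""

    def __init__(self) -> None:
        # The incremental decoder buffers a trailing partial character
        # instead of raising "unexpected end of data".
        self._decoder = codecs.getincrementaldecoder("utf-8")()

    def add_chunk(self, chunk_bytes: bytes) -> str:
        # final=False keeps incomplete trailing bytes for the next call.
        return self._decoder.decode(chunk_bytes, final=False)

handler = ChunkHandler()
# A two-byte character ("é", 0xC3 0xA9) split across chunks decodes cleanly:
assert handler.add_chunk(b"caf\xc3") == "caf"
assert handler.add_chunk(b"\xa9") == "\u00e9"
```

The decoder buffers at most a few trailing bytes between calls, so this stays O(1) in memory regardless of chunk size.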