Commit 20bcf6f

add Speculative Decoding BenchMark of VLM & Audio Models (#193)
1 parent 5dfc30d commit 20bcf6f

File tree: 3 files changed, +516 -11 lines changed
README.md

Lines changed: 259 additions & 8 deletions
@@ -180,7 +180,7 @@ A more accessible, comprehensive, and efficient toolkit for large model compress
</td>
<td>
<ul style="padding-left: 0; list-style-position: inside;">
-<li>Under Development</li>
<li><a href="https://angelslim.readthedocs.io/zh-cn/latest/features/speculative_decoding/eagle.html">Eagle3</a></li>
</ul>
</td>
<td>
@@ -341,13 +341,19 @@ For more details, please refer to the [Deployment Documentation](https://angel

### 1. Speculative Decoding

-#### 1.1 Qwen3 Series Models
We evaluated the Eagle3 draft models trained by AngelSlim on tasks covering code generation, mathematical reasoning, instruction following, text generation, and multimodal understanding, using vLLM. The inference speedup and average accepted length of the trained models with num_speculative_tokens = 2 or 4 are presented below.

<p align="center">
<picture>
<source media="(prefers-color-scheme: dark)" srcset="./docs/source/assets/speculative_decoding/eagle3_speedup_and_accepted_length.png">
<img alt="AngelSlim" src="./docs/source/assets/speculative_decoding/eagle3_speedup_and_accepted_length.png" width=100%>
</picture>
</p>

-**vLLM v0.11.2 Benchmark Results**
#### 1.1 Qwen3 Series Models

-We report benchmark results of the Qwen3 series models using the Eagle3 speculative decoding algorithm across multiple evaluation suites, including **MT-bench**, **HumanEval**, **GSM8K**, and **Alpaca**.
-All experiments were conducted on a single NVIDIA H20 GPU with the configuration:
-**tp=1, ep=1, num_speculative_tokens=2, batch_size=1, output_len=1024**.
Benchmark results for Qwen3 series models using Eagle3 speculative decoding on vLLM (v0.11.2) across **MT-bench**, **HumanEval**, **GSM8K**, and **Alpaca**, on a single NVIDIA H20 GPU (**tp=1, ep=1, num_speculative_tokens=2, batch_size=1, output_len=1024**); a configuration sketch follows the table.

<table>
<thead>
@@ -379,15 +385,15 @@ All experiments were conducted on a single NVIDIA H20 GPU with the configuration
<td>378.86</td><td>1</td>
<td>378.38</td><td>1</td>
<td>390.53</td><td>1</td>
-<td>318.05</td><td>1</td>
<td>381.05</td><td>1</td>
</tr>
<tr>
<td>Eagle3</td>
<td>616.9</td><td>2.13</td>
<td>653.29</td><td>2.19</td>
<td>680.1</td><td>2.2</td>
<td>621.44</td><td>2.17</td>
-<td>642.93</td><td>2.18</td>
<td>642.93</td><td>2.17</td>
</tr>
<!-- Qwen3-4B -->
<tr>
@@ -483,6 +489,251 @@ All experiments were conducted on a single NVIDIA H20 GPU with the configuration
</tbody>
</table>

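For orientation, runs like these are driven by vLLM's `speculative_config` engine argument. Below is a minimal, illustrative sketch of such a setup, not the commit's own benchmark harness; the target model name and draft-model path are placeholder assumptions:

```python
# Minimal sketch: Eagle3 speculative decoding via vLLM's offline API.
# The target model and draft-model path below are placeholder assumptions.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3-8B",                 # target model (assumption)
    tensor_parallel_size=1,                # tp=1, matching the setup above
    speculative_config={
        "method": "eagle3",
        "model": "/path/to/eagle3-draft",  # trained draft head (placeholder)
        "num_speculative_tokens": 2,       # as in the Qwen3 runs above
    },
)

outputs = llm.generate(
    ["Write a Python function that checks whether a number is prime."],
    SamplingParams(temperature=0.0, max_tokens=1024),  # output_len=1024
)
print(outputs[0].outputs[0].text)
```

In these tables, throughput is generated tokens per second of wall time, and accept length is the average number of tokens committed per target-model verification step (1 for vanilla decoding, since there are no draft tokens to accept).
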
#### 1.2 VLM Models

##### 1.2.1 Qwen3-VL Series Models

Benchmark results for Qwen3-VL series models using Eagle3 speculative decoding on vLLM (v0.12.0) across language and multimodal tasks, on a single NVIDIA H20 GPU (**tp=1, ep=1, num_speculative_tokens=4, batch_size=1, output_len=1024**); a request sketch follows the table.

<table>
<thead>
<tr>
<th rowspan="2">Model</th><th rowspan="2">Method</th>
<th colspan="2">GSM8K</th><th colspan="2">Alpaca</th><th colspan="2">HumanEval</th><th colspan="2">MT-bench</th><th colspan="2">MATH-500</th><th colspan="2">MMMU</th><th colspan="2">MMStar</th>
</tr>
<tr>
<th>throughput (tokens/s)</th><th>accept length</th>
<th>throughput (tokens/s)</th><th>accept length</th>
<th>throughput (tokens/s)</th><th>accept length</th>
<th>throughput (tokens/s)</th><th>accept length</th>
<th>throughput (tokens/s)</th><th>accept length</th>
<th>throughput (tokens/s)</th><th>accept length</th>
<th>throughput (tokens/s)</th><th>accept length</th>
</tr>
</thead>
<tbody>
<!-- Qwen3-VL-2B-Instruct -->
<tr>
<td rowspan="2">Qwen3-VL-2B-Instruct</td>
<td>Vanilla</td>
<td>348.55</td><td>1</td>
<td>350.9</td><td>1</td>
<td>346.07</td><td>1</td>
<td>346.31</td><td>1</td>
<td>82.96</td><td>1</td>
<td>83.27</td><td>1</td>
<td>81.63</td><td>1</td>
</tr>
<tr>
<td>Eagle3</td>
<td>511.52</td><td>2.11</td>
<td>560.55</td><td>2.26</td>
<td>826.01</td><td>3.39</td>
<td>555.22</td><td>2.29</td>
<td>163.09</td><td>2.57</td>
<td>154.18</td><td>2.55</td>
<td>139.73</td><td>2.31</td>
</tr>
<!-- Qwen3-VL-4B-Instruct -->
<tr>
<td rowspan="2">Qwen3-VL-4B-Instruct</td>
<td>Vanilla</td>
<td>212.87</td><td>1</td>
<td>213.24</td><td>1</td>
<td>211.69</td><td>1</td>
<td>212.1</td><td>1</td>
<td>67.96</td><td>1</td>
<td>65.88</td><td>1</td>
<td>67.75</td><td>1</td>
</tr>
<tr>
<td>Eagle3</td>
<td>415.29</td><td>2.57</td>
<td>372.89</td><td>2.26</td>
<td>459.37</td><td>2.82</td>
<td>382.33</td><td>2.34</td>
<td>141.87</td><td>2.72</td>
<td>104.44</td><td>2.05</td>
<td>107.07</td><td>2.1</td>
</tr>
<!-- Qwen3-VL-30B-A3B-Instruct -->
<tr>
<td rowspan="2">Qwen3-VL-30B-A3B-Instruct</td>
<td>Vanilla</td>
<td>179.94</td><td>1</td>
<td>184.6</td><td>1</td>
<td>168.68</td><td>1</td>
<td>180.57</td><td>1</td>
<td>31.08</td><td>1</td>
<td>31.51</td><td>1</td>
<td>30.93</td><td>1</td>
</tr>
<tr>
<td>Eagle3</td>
<td>281.93</td><td>2.82</td>
<td>241.42</td><td>2.13</td>
<td>223.05</td><td>2.57</td>
<td>240.47</td><td>2.19</td>
<td>75.31</td><td>2.79</td>
<td>48.47</td><td>1.78</td>
<td>52.57</td><td>1.94</td>
</tr>
</tbody>
</table>
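
The multimodal columns (MMMU, MMStar) differ from the text suites only in the request payload, which carries an image. A minimal, illustrative sketch of one such request via `LLM.chat`; the model name, draft path, and image URL are placeholder assumptions, not values from this commit:

```python
# Minimal sketch: one image-bearing request for the MMMU/MMStar-style rows.
# Model name, draft path, and image URL are placeholder assumptions.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3-VL-4B-Instruct",     # target model (assumption)
    tensor_parallel_size=1,
    speculative_config={
        "method": "eagle3",
        "model": "/path/to/eagle3-draft",  # trained draft head (placeholder)
        "num_speculative_tokens": 4,       # as in the VLM runs above
    },
)

# LLM.chat applies the model's chat template and fetches the image.
messages = [{
    "role": "user",
    "content": [
        {"type": "image_url", "image_url": {"url": "https://example.com/demo.jpg"}},
        {"type": "text", "text": "Describe this image."},
    ],
}]
outputs = llm.chat(messages, SamplingParams(temperature=0.0, max_tokens=1024))
print(outputs[0].outputs[0].text)
```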

##### 1.2.2 HunyuanOCR Model

Benchmark results for HunyuanOCR using Eagle3 speculative decoding on vLLM (v0.13.0) across OCR tasks, on a single NVIDIA H20 GPU (**tp=1, ep=1, num_speculative_tokens=4, batch_size=1, output_len=1024**).

<table>
<thead>
<tr>
<th rowspan="2">Model</th><th rowspan="2">Method</th>
<th colspan="2">OCR-Bench-Internal</th>
</tr>
<tr>
<th>throughput (tokens/s)</th><th>accept length</th>
</tr>
</thead>
<tbody>
<tr>
<td rowspan="2">Hunyuan-OCR</td>
<td>Vanilla</td>
<td>71.21</td><td>1</td>
</tr>
<tr>
<td>Eagle3</td>
<td>120.75</td><td>2.2</td>
</tr>
</tbody>
</table>
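
Reading the two metrics together: Eagle3 lifts throughput from 71.21 to 120.75 tokens/s, i.e. a 120.75 / 71.21 ≈ 1.70× end-to-end speedup against an average accept length of 2.2; the gap between the two figures reflects the cost of drafting and verification.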

#### 1.3 Audio Models

##### 1.3.1 Qwen2-Audio Model

Benchmark results for Qwen2-Audio using Eagle3 speculative decoding on vLLM (v0.12.0) on the **[LibriSpeech](https://www.openslr.org/12)** dataset, on a single NVIDIA H20 GPU (**tp=1, ep=1, num_speculative_tokens=4, batch_size=1, output_len=1024**); a request sketch follows the table.

<table>
<thead>
<tr>
<th rowspan="2">Model</th><th rowspan="2">Method</th>
<th colspan="2">LibriSpeech</th>
</tr>
<tr>
<th>throughput (tokens/s)</th><th>accept length</th>
</tr>
</thead>
<tbody>
<tr>
<td rowspan="2">Qwen2-Audio-7B-Instruct</td>
<td>Vanilla</td>
<td>78.76</td><td>1</td>
</tr>
<tr>
<td>Eagle3</td>
<td>146.66</td><td>3.51</td>
</tr>
</tbody>
</table>
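
Audio requests follow the same pattern, with a waveform in the multimodal payload. A minimal, illustrative sketch; the model name, draft path, prompt string, and audio file are placeholder assumptions, not values from this commit:

```python
# Minimal sketch: one ASR-style request for the LibriSpeech rows.
# Model name, draft path, prompt string, and audio file are placeholders.
import librosa
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2-Audio-7B-Instruct",  # target model (assumption)
    tensor_parallel_size=1,
    speculative_config={
        "method": "eagle3",
        "model": "/path/to/eagle3-draft",  # trained draft head (placeholder)
        "num_speculative_tokens": 4,
    },
)

# vLLM takes audio as a (waveform, sampling_rate) tuple; load 16 kHz mono.
audio, sr = librosa.load("sample.flac", sr=16000)
prompt = (
    "<|im_start|>user\nAudio 1: <|audio_bos|><|AUDIO|><|audio_eos|>\n"
    "Transcribe the speech.<|im_end|>\n<|im_start|>assistant\n"
)
outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"audio": (audio, sr)}},
    SamplingParams(temperature=0.0, max_tokens=1024),
)
print(outputs[0].outputs[0].text)
```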

##### 1.3.2 Fun-CosyVoice3 Model

Benchmark results for Fun-CosyVoice3 using Eagle3 speculative decoding on the **[LibriTTS](https://www.openslr.org/60/)** dataset, on a single NVIDIA H20 GPU (**tp=1, ep=1, num_speculative_tokens=4, batch_size=1, output_len=1024**).

<table>
<thead>
<tr>
<th rowspan="2">Model</th><th rowspan="2">Method</th>
<th colspan="2">LibriTTS</th>
</tr>
<tr>
<th>throughput (tokens/s)</th><th>accept length</th>
</tr>
</thead>
<tbody>
<tr>
<td rowspan="2">Fun-CosyVoice3</td>
<td>Vanilla</td>
<td>-</td><td>1</td>
</tr>
<tr>
<td>Eagle3</td>
<td>-</td><td>1.96</td>
</tr>
</tbody>
</table>

> Fun-CosyVoice3 is adapted for Transformers-backend inference, so only the accept length is reported.
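As a rough guide to reading an accept-length-only result: an accept length of 1.96 means each verification step of the target model commits 1.96 tokens on average, so roughly 1.96× is the ceiling on end-to-end speedup before draft-model overhead is accounted for.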

### 2. Quantization

The performance test results for selected models are shown below. For the complete benchmark, refer to the [Benchmark documentation](https://angelslim.readthedocs.io/zh-cn/latest/performance/quantization/benchmarks.html).
