### 1. Speculative Decoding
#### 1.1 Qwen3 Series Models
Using vLLM, we evaluated the Eagle3 draft models trained by AngelSlim on tasks including code generation, mathematical reasoning, instruction following, text generation, and multimodal understanding. The inference speedup and accept length of our trained models under num_speculative_tokens = 2 or 4 are reported below.
Benchmark results for Qwen3 series models using Eagle3 speculative decoding on vLLM (v0.11.2) across **MT-bench**, **HumanEval**, **GSM8K** and **Alpaca**, using a single NVIDIA H20 GPU (**tp=1, ep=1, num_speculative_tokens=2, batch_size=1, output_len=1024**).
<table>
<thead>
<td>378.86</td><td>1</td>
<td>378.38</td><td>1</td>
<td>390.53</td><td>1</td>
<td>381.05</td><td>1</td>
</tr>
<tr>
<td>Eagle3</td>
<td>616.9</td><td>2.13</td>
<td>653.29</td><td>2.19</td>
<td>680.1</td><td>2.2</td>
<td>621.44</td><td>2.17</td>
<td>642.93</td><td>2.17</td>
</tr>
<!-- Qwen3-4B -->
<tr>
</tbody>
</table>
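The benchmark settings above (tp=1, num_speculative_tokens=2) map directly onto a vLLM server launch. A minimal sketch follows; the target model name and the Eagle3 draft-model path are placeholders, and flag spellings may differ slightly across vLLM releases:

```shell
# Sketch: serve a Qwen3 target model with an AngelSlim-trained Eagle3 draft.
# Model name and draft path are placeholders; adjust to your checkpoints.
vllm serve Qwen/Qwen3-8B \
  --tensor-parallel-size 1 \
  --speculative-config '{
    "method": "eagle3",
    "model": "/path/to/angelslim-eagle3-draft",
    "num_speculative_tokens": 2
  }'
```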

#### 1.2 VLM Models

##### 1.2.1 Qwen3-VL Series Models

Benchmark results for Qwen3-VL series models using Eagle3 speculative decoding on vLLM (v0.12.0) across language and multimodal tasks, using a single NVIDIA H20 GPU (**tp=1, ep=1, num_speculative_tokens=4, batch_size=1, output_len=1024**).

<table><thead>
<tr>
<th>Model</th>
<th>Method</th>
<th colspan="2">GSM8K</th>
<th colspan="2">Alpaca</th>
<th colspan="2">HumanEval</th>
<th colspan="2">MT-bench</th>
<th colspan="2">MATH-500</th>
<th colspan="2">MMMU</th>
<th colspan="2">MMStar</th>
</tr></thead>
<tbody>
<tr>
<td></td>
<td></td>
<td>throughput (tokens/s)</td><td>accept length</td>
<td>throughput (tokens/s)</td><td>accept length</td>
<td>throughput (tokens/s)</td><td>accept length</td>
<td>throughput (tokens/s)</td><td>accept length</td>
<td>throughput (tokens/s)</td><td>accept length</td>
<td>throughput (tokens/s)</td><td>accept length</td>
<td>throughput (tokens/s)</td><td>accept length</td>
</tr>
<tr>
<td rowspan="2">Qwen3-VL-2B-Instruct</td>
<td>Vanilla</td>
<td>348.55</td><td>1</td>
<td>350.9</td><td>1</td>
<td>346.07</td><td>1</td>
<td>346.31</td><td>1</td>
<td>82.96</td><td>1</td>
<td>83.27</td><td>1</td>
<td>81.63</td><td>1</td>
</tr>
<tr>
<td>Eagle3</td>
<td>511.52</td><td>2.11</td>
<td>560.55</td><td>2.26</td>
<td>826.01</td><td>3.39</td>
<td>555.22</td><td>2.29</td>
<td>163.09</td><td>2.57</td>
<td>154.18</td><td>2.55</td>
<td>139.73</td><td>2.31</td>
</tr>
<tr>
<td rowspan="2">Qwen3-VL-4B-Instruct</td>
<td>Vanilla</td>
<td>212.87</td><td>1</td>
<td>213.24</td><td>1</td>
<td>211.69</td><td>1</td>
<td>212.1</td><td>1</td>
<td>67.96</td><td>1</td>
<td>65.88</td><td>1</td>
<td>67.75</td><td>1</td>
</tr>
<tr>
<td>Eagle3</td>
<td>415.29</td><td>2.57</td>
<td>372.89</td><td>2.26</td>
<td>459.37</td><td>2.82</td>
<td>382.33</td><td>2.34</td>
<td>141.87</td><td>2.72</td>
<td>104.44</td><td>2.05</td>
<td>107.07</td><td>2.1</td>
</tr>
<tr>
<td rowspan="2">Qwen3-VL-30B-A3B-Instruct</td>
<td>Vanilla</td>
<td>179.94</td><td>1</td>
<td>184.6</td><td>1</td>
<td>168.68</td><td>1</td>
<td>180.57</td><td>1</td>
<td>31.08</td><td>1</td>
<td>31.51</td><td>1</td>
<td>30.93</td><td>1</td>
</tr>
<tr>
<td>Eagle3</td>
<td>281.93</td><td>2.82</td>
<td>241.42</td><td>2.13</td>
<td>223.05</td><td>2.57</td>
<td>240.47</td><td>2.19</td>
<td>75.31</td><td>2.79</td>
<td>48.47</td><td>1.78</td>
<td>52.57</td><td>1.94</td>
</tr>
</tbody></table>
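As a quick sanity check on these tables, the end-to-end speedup implied by a row is simply the ratio of Eagle3 to vanilla throughput. The helper below is illustrative, with values copied from the Qwen3-VL-2B-Instruct HumanEval column:

```python
# Illustrative helper: decoding speedup implied by two throughput figures.
def speedup(eagle3_tps: float, vanilla_tps: float) -> float:
    """Ratio of Eagle3 throughput to vanilla (autoregressive) throughput."""
    return eagle3_tps / vanilla_tps

# Qwen3-VL-2B-Instruct, HumanEval column from the table above.
print(f"{speedup(826.01, 346.07):.2f}x")  # → 2.39x
```

Note that the observed speedup (about 2.39x here) is lower than the accept length for the same cell (3.39), since running the draft model and verifying its tokens add per-step overhead.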

##### 1.2.2 HunyuanOCR Model

Benchmark results for HunyuanOCR using Eagle3 speculative decoding on vLLM (v0.13.0) across OCR tasks, using a single NVIDIA H20 GPU (**tp=1, ep=1, num_speculative_tokens=4, batch_size=1, output_len=1024**).

<table><thead>
<tr>
<th>Model</th>
<th>Method</th>
<th colspan="2">OCR-Bench-Internal</th>
</tr></thead>
<tbody>
<tr>
<td></td>
<td></td>
<td>throughput (tokens/s)</td><td>accept length</td>
</tr>
<tr>
<td rowspan="2">Hunyuan-OCR</td>
<td>Vanilla</td>
<td>71.21</td><td>1</td>
</tr>
<tr>
<td>Eagle3</td>
<td>120.75</td><td>2.2</td>
</tr>
</tbody></table>

#### 1.3 Audio Models

##### 1.3.1 Qwen2-Audio Model

Benchmark results for Qwen2-Audio using Eagle3 speculative decoding on vLLM (v0.12.0) on the **[LibriSpeech](https://www.openslr.org/12)** dataset, using a single NVIDIA H20 GPU (**tp=1, ep=1, num_speculative_tokens=4, batch_size=1, output_len=1024**).

<table><thead>
<tr>
<th>Model</th>
<th>Method</th>
<th colspan="2">LibriSpeech</th>
</tr></thead>
<tbody>
<tr>
<td></td>
<td></td>
<td>throughput (tokens/s)</td><td>accept length</td>
</tr>
<tr>
<td rowspan="2">Qwen2-Audio-7B-Instruct</td>
<td>Vanilla</td>
<td>78.76</td><td>1</td>
</tr>
<tr>
<td>Eagle3</td>
<td>146.66</td><td>3.51</td>
</tr>
</tbody></table>

##### 1.3.2 Fun-CosyVoice3 Model

Benchmark results for Fun-CosyVoice3 using Eagle3 speculative decoding on the **[LibriTTS](https://www.openslr.org/60/)** dataset, using a single NVIDIA H20 GPU (**tp=1, ep=1, num_speculative_tokens=4, batch_size=1, output_len=1024**).

<table><thead>
<tr>
<th>Model</th>
<th>Method</th>
<th colspan="2">LibriTTS</th>
</tr></thead>
<tbody>
<tr>
<td></td>
<td></td>
<td>throughput (tokens/s)</td><td>accept length</td>
</tr>
<tr>
<td rowspan="2">Fun-CosyVoice3</td>
<td>Vanilla</td>
<td>-</td><td>1</td>
</tr>
<tr>
<td>Eagle3</td>
<td>-</td><td>1.96</td>
</tr>
</tbody></table>

> Fun-CosyVoice3 is adapted for Transformers backend inference, so only the accept length is reported.
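Even when only the accept length is available, it indicates how much work speculative decoding saves: each target-model verification step emits `accept length` tokens on average instead of one. The sketch below is illustrative, using the Fun-CosyVoice3 accept length of 1.96 from the table:

```python
import math

# Rough sketch: target-model forward passes needed to emit `n_tokens`,
# given an average accept length per verification step.
def target_passes(n_tokens: int, accept_length: float) -> int:
    return math.ceil(n_tokens / accept_length)

# Vanilla decoding needs one pass per token (accept length 1.0);
# an accept length of 1.96 roughly halves the target-model passes.
print(target_passes(1024, 1.0))   # → 1024
print(target_passes(1024, 1.96))  # → 523
```

This is a best-case view of the saving; the actual end-to-end speedup is lower because the draft model itself consumes compute.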
### 2. Quantization
The performance test results for selected models are shown below. For the complete benchmark results, refer to the [Benchmark documentation](https://angelslim.readthedocs.io/zh-cn/latest/performance/quantization/benchmarks.html).