Commit d40fe1c
Add torchao quant for mixtral
Summary:
Similar to sgl-project#1341, this adds torchao quantization to the Mixtral model.
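For illustration, here is a minimal sketch of how a config string such as `int4wo-128` or `fp8wo` (the values passed to `--torchao-config` in the test plan) could be mapped onto torchao weight-only quantization. The helper names `parse_torchao_config` and `apply_torchao_config` are hypothetical and are not taken from this commit; the sketch also assumes a torchao release that exports `quantize_`, `int4_weight_only`, and `float8_weight_only`.

```python
from typing import Optional, Tuple


def parse_torchao_config(config: str) -> Tuple[str, Optional[int]]:
    """Split "int4wo-128" into ("int4wo", 128); "fp8wo" gives ("fp8wo", None).

    Hypothetical helper for illustration only.
    """
    scheme, sep, group = config.partition("-")
    if not scheme:
        raise ValueError("empty torchao config string")
    return scheme, int(group) if sep else None


def apply_torchao_config(model, config: str):
    """Quantize the model's linear weights in place (assumed mapping, not the commit's code)."""
    # Imported lazily so the parser above can be used without torchao installed.
    from torchao.quantization import (
        float8_weight_only,
        int4_weight_only,
        quantize_,
    )

    scheme, group_size = parse_torchao_config(config)
    if scheme == "int4wo":
        if group_size is None:
            raise ValueError("int4wo needs a group size, e.g. int4wo-128")
        quantize_(model, int4_weight_only(group_size=group_size))
    elif scheme == "fp8wo":
        quantize_(model, float8_weight_only())
    else:
        raise ValueError(f"unsupported torchao config: {config}")
    return model
```

Keeping the parsing separate from the torchao call means the string handling can be checked without a GPU or a torchao install.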
Test Plan:
Note: torch.compile is not working yet, and I can't install a PyTorch nightly locally to make it work either.
I'll wait for the PyTorch 2.5 release, which is expected in mid-October, or check again later.
python3 -m sglang.bench_latency --model Qwen/Qwen1.5-MoE-A2.7B --batch-size 1 --input 128 --output 8
Warmup ...
Prefill. latency: 0.05532 s, throughput: 2313.73 token/s
Decode. latency: 0.00896 s, throughput: 111.65 token/s
Decode. latency: 0.00833 s, throughput: 120.04 token/s
Decode. latency: 0.00869 s, throughput: 115.06 token/s
Decode. latency: 0.00842 s, throughput: 118.79 token/s
Decode. median latency: 0.00855 s, median throughput: 116.89 token/s
Total. latency: 0.090 s, throughput: 1471.26 token/s
Benchmark ...
Prefill. latency: 0.04294 s, throughput: 2980.61 token/s
Decode. latency: 0.00839 s, throughput: 119.12 token/s
Decode. latency: 0.00828 s, throughput: 120.78 token/s
Decode. latency: 0.00857 s, throughput: 116.64 token/s
Decode. latency: 0.00853 s, throughput: 117.19 token/s
Decode. latency: 0.00859 s, throughput: 116.39 token/s
Decode. median latency: 0.00853 s, median throughput: 117.17 token/s
Total. latency: 0.111 s, throughput: 1226.84 token/s
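The throughput figures in the log appear to be derived directly from the latencies: tokens processed in a step divided by the time that step took. The sketch below reproduces that arithmetic. The function names are hypothetical, and the formulas are an assumption inferred from the printed numbers rather than read from the `sglang.bench_latency` source.

```python
def prefill_throughput(batch_size: int, input_len: int, latency_s: float) -> float:
    # Prefill processes every input token of every sequence in one step.
    return batch_size * input_len / latency_s


def decode_throughput(batch_size: int, latency_s: float) -> float:
    # Each decode step emits one token per sequence.
    return batch_size / latency_s


def total_throughput(
    batch_size: int, input_len: int, output_len: int, latency_s: float
) -> float:
    # End to end: all input and output tokens over the total wall time.
    return batch_size * (input_len + output_len) / latency_s


# Settings from the command above: --batch-size 1 --input 128 --output 8
print(prefill_throughput(1, 128, 0.04294))
print(decode_throughput(1, 0.00839))
print(total_throughput(1, 128, 8, 0.111))
```

The results land close to the logged 2980.61, 119.12, and 1226.84 token/s. The small differences are expected because the log prints rounded latencies.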
python3 -m sglang.bench_latency --model Qwen/Qwen1.5-MoE-A2.7B --batch-size 1 --input 128 --output 8 --torchao-config int4wo-128
Warmup ...
Prefill. latency: 0.06413 s, throughput: 1996.05 token/s
Decode. latency: 0.00764 s, throughput: 130.84 token/s
Decode. latency: 0.00748 s, throughput: 133.73 token/s
Decode. latency: 0.00725 s, throughput: 137.84 token/s
Decode. latency: 0.00721 s, throughput: 138.74 token/s
Decode. median latency: 0.00737 s, median throughput: 135.76 token/s
Total. latency: 0.094 s, throughput: 1408.61 token/s
Benchmark ...
Prefill. latency: 0.05239 s, throughput: 2443.43 token/s
Decode. latency: 0.00739 s, throughput: 135.25 token/s
Decode. latency: 0.00720 s, throughput: 138.90 token/s
Decode. latency: 0.00718 s, throughput: 139.21 token/s
Decode. latency: 0.00722 s, throughput: 138.42 token/s
Decode. latency: 0.00745 s, throughput: 134.30 token/s
Decode. median latency: 0.00731 s, median throughput: 136.82 token/s
Total. latency: 0.111 s, throughput: 1223.51 token/s
A100, no compile
python3 -m sglang.bench_latency --model Qwen/Qwen1.5-MoE-A2.7B --batch-size 1 --input 128 --output 8 --torchao-config fp8wo
max_total_num_tokens=199454
Warmup ...
Prefill. latency: 0.06958 s, throughput: 1839.60 token/s
Decode. latency: 0.02343 s, throughput: 42.68 token/s
Decode. latency: 0.02342 s, throughput: 42.70 token/s
Decode. latency: 0.02368 s, throughput: 42.23 token/s
Decode. latency: 0.02337 s, throughput: 42.80 token/s
Decode. median latency: 0.02342 s, median throughput: 42.69 token/s
Total. latency: 0.163 s, throughput: 807.48 token/s
Benchmark ...
Prefill. latency: 0.05767 s, throughput: 2219.36 token/s
Decode. latency: 0.02293 s, throughput: 43.61 token/s
Decode. latency: 0.02026 s, throughput: 49.36 token/s
Decode. latency: 0.02029 s, throughput: 49.29 token/s
Decode. latency: 0.02024 s, throughput: 49.41 token/s
Decode. latency: 0.02026 s, throughput: 49.36 token/s
Decode. median latency: 0.02025 s, median throughput: 49.39 token/s
Total. latency: 0.222 s, throughput: 611.87 token/s
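To compare the three configurations, the median decode throughput of each benchmark pass can be pulled out of the log and normalized against the unquantized run. The snippet below is a small sketch of that comparison. The helper `last_decode_median` is hypothetical, and the comparison assumes the three runs used the same hardware and settings, which the log does not confirm.

```python
import re

MEDIAN_RE = re.compile(
    r"Decode\. median latency: ([\d.]+) s, median throughput: ([\d.]+) token/s"
)


def last_decode_median(log: str):
    """Return (latency_s, tokens_per_s) from the last median line in a log.

    The last match is the benchmark pass, which follows the warmup pass.
    """
    matches = MEDIAN_RE.findall(log)
    if not matches:
        raise ValueError("no decode median line found")
    latency, throughput = matches[-1]
    return float(latency), float(throughput)


# Median lines copied from the benchmark passes in the test plan.
runs = {
    "baseline": "Decode. median latency: 0.00853 s, median throughput: 117.17 token/s",
    "int4wo-128": "Decode. median latency: 0.00731 s, median throughput: 136.82 token/s",
    "fp8wo": "Decode. median latency: 0.02025 s, median throughput: 49.39 token/s",
}

baseline_tps = last_decode_median(runs["baseline"])[1]
speedups = {
    name: last_decode_median(log)[1] / baseline_tps for name, log in runs.items()
}
for name, ratio in speedups.items():
    print(f"{name}: {ratio:.2f}x baseline decode throughput")
```

On these single runs at batch size 1, int4wo-128 comes out at about 1.17x the baseline decode throughput and fp8wo at about 0.42x.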
Reviewers:
Subscribers:
Tasks:
Tags:
1 parent 70b6802 commit d40fe1c
3 files changed, +45 -2 lines changed

| File (in diff order) | New line numbers | Change |
|---|---|---|
| 1 | 419-420 | +2, -2 (replaces original lines 419-420) |
| 2 | 44-45 | +2 |
| 2 | 301 | +1 |
| 2 | 381-399 | +19 |
| 3 | 50-51 | +2 |
| 3 | 364 | +1 |
| 3 | 456-473 | +18 |