Closed
Description
Feature request
https://github.com/turboderp/exllamav2
Motivation
Overview of differences compared to V1:
- Faster, better kernels
- Cleaner and more versatile codebase
- Support for a new quant format
| Model | Mode | Size | grpsz | act | V1: 3090Ti | V1: 4090 | V2: 3090Ti | V2: 4090 |
|---|---|---|---|---|---|---|---|---|
| Llama | GPTQ | 7B | 128 | no | 143 t/s | 173 t/s | 175 t/s | 195 t/s |
| Llama | GPTQ | 13B | 128 | no | 84 t/s | 102 t/s | 105 t/s | 110 t/s |
| Llama | GPTQ | 33B | 128 | yes | 37 t/s | 45 t/s | 45 t/s | 48 t/s |
| OpenLlama | GPTQ | 3B | 128 | yes | 194 t/s | 226 t/s | 295 t/s | 321 t/s |
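
To make the table easier to compare at a glance, here is a minimal sketch that computes the V2/V1 throughput ratio per GPU. The figures are copied from the table above; the names `THROUGHPUT` and `speedups` are made up for this illustration and are not part of exllamav2.

```python
# Throughput in tokens/second, copied from the benchmark table above.
# Keys: (model, size); values: (V1 3090Ti, V1 4090, V2 3090Ti, V2 4090).
THROUGHPUT = {
    ("Llama", "7B"): (143, 173, 175, 195),
    ("Llama", "13B"): (84, 102, 105, 110),
    ("Llama", "33B"): (37, 45, 45, 48),
    ("OpenLlama", "3B"): (194, 226, 295, 321),
}


def speedups(row):
    """Return the V2/V1 throughput ratio for (3090Ti, 4090)."""
    v1_3090ti, v1_4090, v2_3090ti, v2_4090 = row
    return v2_3090ti / v1_3090ti, v2_4090 / v1_4090


for (model, size), row in THROUGHPUT.items():
    ratio_3090ti, ratio_4090 = speedups(row)
    print(f"{model} {size}: 3090Ti x{ratio_3090ti:.2f}, 4090 x{ratio_4090:.2f}")
```

On these numbers the gain ranges from roughly 1.07x (Llama 33B on the 4090) to roughly 1.52x (OpenLlama 3B on the 3090Ti).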
Your contribution
I could take a look at the current exllama implementation and what it would take to upgrade, if wanted.