metal : implement q5_0 and q5_1 kernels #3648
Conversation
Force-pushed from e924f6c to 4f87b24
It works, but it is a little slow. Besides the cost of the 5-bit extraction, there should be room for some small optimization of the operations. Updated the description with bench results.
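To illustrate the extraction cost: compared with Q4_0, each Q5_0/Q5_1 value has to merge in a 5th bit fetched from the separately packed `qh` word, which adds shift/mask work per value. A minimal scalar sketch of the difference (illustrative C, not the Metal code in this PR):

```c
#include <stdint.h>

// Q4_0-style: a quant is just the low nibble of a byte, re-centered around 0.
static inline int8_t q4_lo(const uint8_t *qs, int j) {
    return (int8_t)((qs[j] & 0x0F) - 8);
}

// Q5_0-style: the same nibble, plus a 5th bit pulled out of the packed 32-bit
// `qh` word, which costs extra shifts and a mask per value before re-centering.
static inline int8_t q5_lo(const uint8_t *qs, uint32_t qh, int j) {
    const uint8_t hi = ((qh >> j) << 4) & 0x10; // 5th bit of quant j
    return (int8_t)(((qs[j] & 0x0F) | hi) - 16);
}
```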
Thank you, I think even if the performance is not there, we can improve it from
Looks good.
Here is a bench on M2 Ultra:
| model | size | params | backend | ngl | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | Metal | 1 | 4 | pp 512 | 1235.45 ± 0.51 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | Metal | 1 | 4 | tg 128 | 98.16 ± 0.06 |
| llama 7B mostly Q4_1 | 3.95 GiB | 6.74 B | Metal | 1 | 4 | pp 512 | 1236.97 ± 1.04 |
| llama 7B mostly Q4_1 | 3.95 GiB | 6.74 B | Metal | 1 | 4 | tg 128 | 90.27 ± 0.04 |
| llama 7B mostly Q5_0 | 4.33 GiB | 6.74 B | Metal | 1 | 4 | pp 512 | 1067.33 ± 0.74 |
| llama 7B mostly Q5_0 | 4.33 GiB | 6.74 B | Metal | 1 | 4 | tg 128 | 78.74 ± 0.07 |
| llama 7B mostly Q5_1 | 4.72 GiB | 6.74 B | Metal | 1 | 4 | pp 512 | 1068.07 ± 1.02 |
| llama 7B mostly Q5_1 | 4.72 GiB | 6.74 B | Metal | 1 | 4 | tg 128 | 75.76 ± 0.19 |
Implement dequantize functions & mul_mv kernels for #3504.
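For context, here is a simplified sketch of the block layouts these kernels read (the real ggml structs store the scale/min as fp16; plain floats are used here only to keep the sketch self-contained):

```c
#include <stdint.h>

#define QK5_0 32
#define QK5_1 32

// Q5_0: symmetric, value = (q - 16) * d
typedef struct {
    float   d;              // scale (fp16 in the real ggml struct)
    uint8_t qh[4];          // 5th bit of each of the 32 quants, packed into 32 bits
    uint8_t qs[QK5_0 / 2];  // low 4 bits, two quants per byte
} block_q5_0;

// Q5_1: asymmetric, value = q * d + m
typedef struct {
    float   d;              // scale (fp16 in the real ggml struct)
    float   m;              // min   (fp16 in the real ggml struct)
    uint8_t qh[4];          // 5th bits of the 32 quants
    uint8_t qs[QK5_1 / 2];  // low 4 bits, two quants per byte
} block_q5_1;
```

The dequantize and mul_mv kernels unpack each nibble, merge in its 5th bit from `qh`, and then scale (Q5_0) or scale-and-offset (Q5_1) the result.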
M1 Max (32c GPU):
M2 (10c GPU):
build: ad800e8 (1358)
TODO