
metal : implement q5_0 and q5_1 kernels #3648


Merged: 7 commits merged into ggml-org:master on Oct 18, 2023

Conversation

@jhen0409 (Collaborator) commented on Oct 17, 2023

Implement dequantize functions & mul_mv kernels for #3504.
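For context, here is a rough CPU-side C sketch of what Q5_0 dequantization involves, based on my understanding of the Q5_0 block layout in ggml (the actual Metal kernels and the ggml reference code differ in details, so treat this as an illustration rather than the merged implementation):

```c
#include <stdint.h>
#include <string.h>

#define QK5_0 32

// Assumed Q5_0 block layout: an fp16 scale, the 5th bit of each of the 32
// quants packed into qh, and the low 4 bits packed two-per-byte in qs.
typedef struct {
    uint16_t d;             // scale (fp16 bits)
    uint8_t  qh[4];         // high (5th) bits, one per value
    uint8_t  qs[QK5_0 / 2]; // low nibbles, two values per byte
} block_q5_0;

// Minimal fp16 -> fp32 helper (normal numbers and zero only);
// ggml has its own conversion routines.
static float fp16_to_fp32(uint16_t h) {
    const uint32_t sign = (uint32_t)(h & 0x8000) << 16;
    const uint32_t exp  = (h >> 10) & 0x1F;
    const uint32_t man  = h & 0x03FF;
    const uint32_t bits = (exp == 0 && man == 0)
        ? sign                                       // +/- 0
        : sign | ((exp + 112) << 23) | (man << 13);  // rebias 15 -> 127
    float f;
    memcpy(&f, &bits, sizeof(f));
    return f;
}

// Dequantize one block of 32 values into y[0..31].
static void dequantize_block_q5_0(const block_q5_0 *x, float *y) {
    const float d = fp16_to_fp32(x->d);

    uint32_t qh;
    memcpy(&qh, x->qh, sizeof(qh));

    for (int j = 0; j < QK5_0 / 2; ++j) {
        // recover the 5th bit for the low-nibble and high-nibble halves
        const uint8_t xh_0 = ((qh >> (j +  0)) << 4) & 0x10;
        const uint8_t xh_1 =  (qh >> (j + 12))       & 0x10;

        const int32_t x0 = (int32_t)((x->qs[j] & 0x0F) | xh_0) - 16; // [-16, 15]
        const int32_t x1 = (int32_t)((x->qs[j] >>   4) | xh_1) - 16;

        y[j            ] = x0 * d;
        y[j + QK5_0 / 2] = x1 * d;
    }
}
```

Q5_1 should be analogous, with an additional fp16 offset `m` per block and `q * d + m` instead of the signed range around zero.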

./llama-bench -m models/7B/ggml-model-llama-2-7b-q4_0.gguf \
  -m models/7B/ggml-model-llama-2-7b-q4_1.gguf \
  -m models/7B/ggml-model-llama-2-7b-q5_0.gguf \
  -m models/7B/ggml-model-llama-2-7b-q5_1.gguf \
  -m models/7B/ggml-model-llama-2-7b-q8_0.gguf -t 4

M1 Max (32c GPU):

| model | size | params | backend | ngl | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | Metal | 99 | 4 | pp 512 | 514.94 ± 10.77 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | Metal | 99 | 4 | tg 128 | 61.87 ± 0.69 |
| llama 7B mostly Q4_1 | 3.95 GiB | 6.74 B | Metal | 99 | 4 | pp 512 | 515.85 ± 10.63 |
| llama 7B mostly Q4_1 | 3.95 GiB | 6.74 B | Metal | 99 | 4 | tg 128 | 57.82 ± 0.57 |
| llama 7B mostly Q5_0 | 4.33 GiB | 6.74 B | Metal | 99 | 4 | pp 512 | 450.06 ± 7.63 |
| llama 7B mostly Q5_0 | 4.33 GiB | 6.74 B | Metal | 99 | 4 | tg 128 | 42.72 ± 0.32 |
| llama 7B mostly Q5_1 | 4.72 GiB | 6.74 B | Metal | 99 | 4 | pp 512 | 440.58 ± 3.19 |
| llama 7B mostly Q5_1 | 4.72 GiB | 6.74 B | Metal | 99 | 4 | tg 128 | 42.23 ± 0.39 |
| llama 7B mostly Q8_0 | 6.67 GiB | 6.74 B | Metal | 99 | 4 | pp 512 | 522.44 ± 9.21 |
| llama 7B mostly Q8_0 | 6.67 GiB | 6.74 B | Metal | 99 | 4 | tg 128 | 40.29 ± 0.46 |

M2 (10c GPU):

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | Metal | 1 | pp 512 | 178.89 ± 0.08 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | Metal | 1 | tg 128 | 20.85 ± 0.07 |
| llama 7B mostly Q4_1 | 3.95 GiB | 6.74 B | Metal | 1 | pp 512 | 179.21 ± 0.07 |
| llama 7B mostly Q4_1 | 3.95 GiB | 6.74 B | Metal | 1 | tg 128 | 18.97 ± 0.06 |
| llama 7B mostly Q5_0 | 4.33 GiB | 6.74 B | Metal | 1 | pp 512 | 153.31 ± 0.06 |
| llama 7B mostly Q5_0 | 4.33 GiB | 6.74 B | Metal | 1 | tg 128 | 11.04 ± 2.70 |
| llama 7B mostly Q5_1 | 4.72 GiB | 6.74 B | Metal | 1 | pp 512 | 115.67 ± 9.05 |
| llama 7B mostly Q5_1 | 4.72 GiB | 6.74 B | Metal | 1 | tg 128 | 10.87 ± 0.28 |
| llama 7B mostly Q8_0 | 6.67 GiB | 6.74 B | Metal | 1 | pp 512 | 154.55 ± 19.90 |
| llama 7B mostly Q8_0 | 6.67 GiB | 6.74 B | Metal | 1 | tg 128 | 10.44 ± 0.55 |

I'm seeing issues like downclocking on my M2 MacBook Air, so these results may not be very accurate.

build: ad800e8 (1358)

TODO

  • dequantize_q5_0
  • kernel_mul_mv_q5_0_f32
  • dequantize_q5_1
  • kernel_mul_mv_q5_1_f32
  • bench result

@jhen0409 jhen0409 linked an issue Oct 17, 2023 that may be closed by this pull request
@jhen0409 jhen0409 force-pushed the metal-q5 branch 2 times, most recently from e924f6c to 4f87b24 Compare October 17, 2023 04:12
@jhen0409 (Collaborator, Author) commented on Oct 18, 2023

It works, but it's a bit slow. Besides the inherent cost of extracting the 5-bit values, there is probably still room for some small optimizations in the operations.

Updated the description with the bench results.
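For the curious, here is a hypothetical side-by-side of the per-value extraction work (same assumed layout as the sketch in the description, not the actual kernel code) showing where the extra cost relative to Q4_0 comes from:

```c
#include <stdint.h>

// Q4_0: each quant is a plain 4-bit nibble -- one mask plus a bias subtract.
static inline int32_t q4_0_low_nibble(const uint8_t *qs, int j) {
    return (int32_t)(qs[j] & 0x0F) - 8;
}

// Q5_0: same nibble work, plus shifting/masking the packed high-bit word qh
// and OR-ing the 5th bit back in before the bias subtract.
static inline int32_t q5_0_low_nibble(const uint8_t *qs, uint32_t qh, int j) {
    const uint8_t hi = (uint8_t)(((qh >> j) << 4) & 0x10);
    return (int32_t)((qs[j] & 0x0F) | hi) - 16;
}
```

So each Q5_x value needs a few extra shifts/masks and an OR to stitch the 5th bit back in, plus an extra load of `qh` per block; some of that can hopefully be hidden with further tuning.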

@jhen0409 jhen0409 marked this pull request as ready for review October 18, 2023 01:40
@jhen0409 jhen0409 changed the title metal : implement q5_0 / q5_1 kernels metal : implement q5_0 and q5_1 kernels Oct 18, 2023
@ggerganov (Member)

Thank you. I think even if the performance is not there yet, we can improve it from master.
Can you verify that the perplexity produces reasonable numbers - both with -b 512 and with -b 1?

@jhen0409 (Collaborator, Author)

./perplexity -m ./models/ggml-model-llama-2-7b-q5_0.gguf -f ./wikitext-2-raw/wiki.test.raw -t 4 -ngl 1 -b 512

[1]4.1671,[2]4.7427,[3]5.3947,[4]5.9889,[5]6.1096,[6]6.0134,[7]6.1760,[8]6.2610,[9]6.5658,[10]6.7533

./perplexity -m ./models/ggml-model-llama-2-7b-q5_0.gguf -f ./wikitext-2-raw/wiki.test.raw -t 4 -ngl 1 -b 1

[1]4.1672,[2]4.7428,[3]5.3947,[4]5.9890,[5]6.1096,[6]6.0134,[7]6.1761,[8]6.2611,[9]6.5659,[10]6.7534

./perplexity -m ./models/ggml-model-llama-2-7b-q5_1.gguf -f ./wikitext-2-raw/wiki.test.raw -t 4 -ngl 1 -b 512

[1]4.2048,[2]4.7178,[3]5.3836,[4]5.9347,[5]6.0646,[6]5.9762,[7]6.1550,[8]6.2462,[9]6.5789,[10]6.7625

./perplexity -m ./models/ggml-model-llama-2-7b-q5_1.gguf -f ./wikitext-2-raw/wiki.test.raw -t 4 -ngl 1 -b 1

[1]4.2047,[2]4.7177,[3]5.3835,[4]5.9346,[5]6.0644,[6]5.9761,[7]6.1548,[8]6.2461,[9]6.5788,[10]6.7625

Looks good.

@ggerganov (Member) left a comment

Here is a bench on M2 Ultra:

| model | size | params | backend | ngl | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | Metal | 1 | 4 | pp 512 | 1235.45 ± 0.51 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | Metal | 1 | 4 | tg 128 | 98.16 ± 0.06 |
| llama 7B mostly Q4_1 | 3.95 GiB | 6.74 B | Metal | 1 | 4 | pp 512 | 1236.97 ± 1.04 |
| llama 7B mostly Q4_1 | 3.95 GiB | 6.74 B | Metal | 1 | 4 | tg 128 | 90.27 ± 0.04 |
| llama 7B mostly Q5_0 | 4.33 GiB | 6.74 B | Metal | 1 | 4 | pp 512 | 1067.33 ± 0.74 |
| llama 7B mostly Q5_0 | 4.33 GiB | 6.74 B | Metal | 1 | 4 | tg 128 | 78.74 ± 0.07 |
| llama 7B mostly Q5_1 | 4.72 GiB | 6.74 B | Metal | 1 | 4 | pp 512 | 1068.07 ± 1.02 |
| llama 7B mostly Q5_1 | 4.72 GiB | 6.74 B | Metal | 1 | 4 | tg 128 | 75.76 ± 0.19 |

@ggerganov ggerganov merged commit c67fe68 into ggml-org:master Oct 18, 2023
@jhen0409 jhen0409 deleted the metal-q5 branch October 18, 2023 19:33

Successfully merging this pull request may close these issues.

metal : add Q5_0 and Q5_1 support