
metal : implement q5_0 and q5_1 kernels #3648


Merged: 7 commits merged into ggml-org:master on Oct 18, 2023

Conversation

@jhen0409 (Collaborator) commented on Oct 17, 2023

Implement dequantize functions & mul_mv kernels for #3504.
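For context, here is a rough CPU-side C sketch of what Q5_0 dequantization involves, based on my understanding of the Q5_0 block layout in ggml (the actual Metal kernels and the ggml reference code differ in details, so treat this as an illustration rather than the merged implementation):

```c
#include <stdint.h>
#include <string.h>

#define QK5_0 32

// Assumed Q5_0 block layout: an fp16 scale, the 5th bit of each of the 32
// quants packed into qh, and the low 4 bits packed two-per-byte in qs.
typedef struct {
    uint16_t d;             // scale (fp16 bits)
    uint8_t  qh[4];         // high (5th) bits, one per value
    uint8_t  qs[QK5_0 / 2]; // low nibbles, two values per byte
} block_q5_0;

// Minimal fp16 -> fp32 helper (normal numbers and zero only);
// ggml has its own conversion routines.
static float fp16_to_fp32(uint16_t h) {
    const uint32_t sign = (uint32_t)(h & 0x8000) << 16;
    const uint32_t exp  = (h >> 10) & 0x1F;
    const uint32_t man  = h & 0x03FF;
    const uint32_t bits = (exp == 0 && man == 0)
        ? sign                                       // +/- 0
        : sign | ((exp + 112) << 23) | (man << 13);  // rebias 15 -> 127
    float f;
    memcpy(&f, &bits, sizeof(f));
    return f;
}

// Dequantize one block of 32 values into y[0..31].
static void dequantize_block_q5_0(const block_q5_0 *x, float *y) {
    const float d = fp16_to_fp32(x->d);

    uint32_t qh;
    memcpy(&qh, x->qh, sizeof(qh));

    for (int j = 0; j < QK5_0 / 2; ++j) {
        // recover the 5th bit for the low-nibble and high-nibble halves
        const uint8_t xh_0 = ((qh >> (j +  0)) << 4) & 0x10;
        const uint8_t xh_1 =  (qh >> (j + 12))       & 0x10;

        const int32_t x0 = (int32_t)((x->qs[j] & 0x0F) | xh_0) - 16; // [-16, 15]
        const int32_t x1 = (int32_t)((x->qs[j] >>   4) | xh_1) - 16;

        y[j            ] = x0 * d;
        y[j + QK5_0 / 2] = x1 * d;
    }
}
```

Q5_1 should be analogous, with an additional fp16 offset `m` per block and `q * d + m` instead of the signed range around zero.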

./llama-bench -m models/7B/ggml-model-llama-2-7b-q4_0.gguf \
  -m models/7B/ggml-model-llama-2-7b-q4_1.gguf \
  -m models/7B/ggml-model-llama-2-7b-q5_0.gguf \
  -m models/7B/ggml-model-llama-2-7b-q5_1.gguf \
  -m models/7B/ggml-model-llama-2-7b-q8_0.gguf -t 4

M1 Max (32c GPU):

| model | size | params | backend | ngl | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | Metal | 99 | 4 | pp 512 | 514.94 ± 10.77 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | Metal | 99 | 4 | tg 128 | 61.87 ± 0.69 |
| llama 7B mostly Q4_1 | 3.95 GiB | 6.74 B | Metal | 99 | 4 | pp 512 | 515.85 ± 10.63 |
| llama 7B mostly Q4_1 | 3.95 GiB | 6.74 B | Metal | 99 | 4 | tg 128 | 57.82 ± 0.57 |
| llama 7B mostly Q5_0 | 4.33 GiB | 6.74 B | Metal | 99 | 4 | pp 512 | 450.06 ± 7.63 |
| llama 7B mostly Q5_0 | 4.33 GiB | 6.74 B | Metal | 99 | 4 | tg 128 | 42.72 ± 0.32 |
| llama 7B mostly Q5_1 | 4.72 GiB | 6.74 B | Metal | 99 | 4 | pp 512 | 440.58 ± 3.19 |
| llama 7B mostly Q5_1 | 4.72 GiB | 6.74 B | Metal | 99 | 4 | tg 128 | 42.23 ± 0.39 |
| llama 7B mostly Q8_0 | 6.67 GiB | 6.74 B | Metal | 99 | 4 | pp 512 | 522.44 ± 9.21 |
| llama 7B mostly Q8_0 | 6.67 GiB | 6.74 B | Metal | 99 | 4 | tg 128 | 40.29 ± 0.46 |

M2 (10c GPU):

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | Metal | 1 | pp 512 | 178.89 ± 0.08 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | Metal | 1 | tg 128 | 20.85 ± 0.07 |
| llama 7B mostly Q4_1 | 3.95 GiB | 6.74 B | Metal | 1 | pp 512 | 179.21 ± 0.07 |
| llama 7B mostly Q4_1 | 3.95 GiB | 6.74 B | Metal | 1 | tg 128 | 18.97 ± 0.06 |
| llama 7B mostly Q5_0 | 4.33 GiB | 6.74 B | Metal | 1 | pp 512 | 153.31 ± 0.06 |
| llama 7B mostly Q5_0 | 4.33 GiB | 6.74 B | Metal | 1 | tg 128 | 11.04 ± 2.70 |
| llama 7B mostly Q5_1 | 4.72 GiB | 6.74 B | Metal | 1 | pp 512 | 115.67 ± 9.05 |
| llama 7B mostly Q5_1 | 4.72 GiB | 6.74 B | Metal | 1 | tg 128 | 10.87 ± 0.28 |
| llama 7B mostly Q8_0 | 6.67 GiB | 6.74 B | Metal | 1 | pp 512 | 154.55 ± 19.90 |
| llama 7B mostly Q8_0 | 6.67 GiB | 6.74 B | Metal | 1 | tg 128 | 10.44 ± 0.55 |

I'm seeing issues like downclocking on my M2 MacBook Air, so these results may not be very accurate.

build: ad800e8 (1358)

TODO

  • dequantize_q5_0
  • kernel_mul_mv_q5_0_f32
  • dequantize_q5_1
  • kernel_mul_mv_q5_1_f32
  • bench result

@jhen0409 jhen0409 linked an issue Oct 17, 2023 that may be closed by this pull request
@jhen0409 jhen0409 force-pushed the metal-q5 branch 2 times, most recently from e924f6c to 4f87b24 Compare October 17, 2023 04:12
@jhen0409 (Collaborator, Author) commented on Oct 18, 2023

It works, but it's a bit slow. Besides the inherent cost of extracting the 5-bit values, there is probably still room for some small optimizations in the operations.

Updated the description with the bench results.
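For the curious, here is a hypothetical side-by-side of the per-value extraction work (same assumed layout as the sketch in the description, not the actual kernel code) showing where the extra cost relative to Q4_0 comes from:

```c
#include <stdint.h>

// Q4_0: each quant is a plain 4-bit nibble -- one mask plus a bias subtract.
static inline int32_t q4_0_low_nibble(const uint8_t *qs, int j) {
    return (int32_t)(qs[j] & 0x0F) - 8;
}

// Q5_0: same nibble work, plus shifting/masking the packed high-bit word qh
// and OR-ing the 5th bit back in before the bias subtract.
static inline int32_t q5_0_low_nibble(const uint8_t *qs, uint32_t qh, int j) {
    const uint8_t hi = (uint8_t)(((qh >> j) << 4) & 0x10);
    return (int32_t)((qs[j] & 0x0F) | hi) - 16;
}
```

So each Q5_x value needs a few extra shifts/masks and an OR to stitch the 5th bit back in, plus an extra load of `qh` per block; some of that can hopefully be hidden with further tuning.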

@jhen0409 jhen0409 marked this pull request as ready for review October 18, 2023 01:40
@jhen0409 jhen0409 changed the title metal : implement q5_0 / q5_1 kernels metal : implement q5_0 and q5_1 kernels Oct 18, 2023
@ggerganov (Member)

Thank you. I think even if the performance is not there yet, we can improve it from master.
Can you verify that the perplexity produces reasonable numbers - both with -b 512 and with -b 1?

@jhen0409 (Collaborator, Author)

./perplexity -m ./models/ggml-model-llama-2-7b-q5_0.gguf -f ./wikitext-2-raw/wiki.test.raw -t 4 -ngl 1 -b 512

[1]4.1671,[2]4.7427,[3]5.3947,[4]5.9889,[5]6.1096,[6]6.0134,[7]6.1760,[8]6.2610,[9]6.5658,[10]6.7533

./perplexity -m ./models/ggml-model-llama-2-7b-q5_0.gguf -f ./wikitext-2-raw/wiki.test.raw -t 4 -ngl 1 -b 1

[1]4.1672,[2]4.7428,[3]5.3947,[4]5.9890,[5]6.1096,[6]6.0134,[7]6.1761,[8]6.2611,[9]6.5659,[10]6.7534

./perplexity -m ./models/ggml-model-llama-2-7b-q5_1.gguf -f ./wikitext-2-raw/wiki.test.raw -t 4 -ngl 1 -b 512

[1]4.2048,[2]4.7178,[3]5.3836,[4]5.9347,[5]6.0646,[6]5.9762,[7]6.1550,[8]6.2462,[9]6.5789,[10]6.7625

./perplexity -m ./models/ggml-model-llama-2-7b-q5_1.gguf -f ./wikitext-2-raw/wiki.test.raw -t 4 -ngl 1 -b 1

[1]4.2047,[2]4.7177,[3]5.3835,[4]5.9346,[5]6.0644,[6]5.9761,[7]6.1548,[8]6.2461,[9]6.5788,[10]6.7625

Looks good.

@ggerganov (Member) left a comment

Here is a bench on M2 Ultra:

| model | size | params | backend | ngl | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | Metal | 1 | 4 | pp 512 | 1235.45 ± 0.51 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | Metal | 1 | 4 | tg 128 | 98.16 ± 0.06 |
| llama 7B mostly Q4_1 | 3.95 GiB | 6.74 B | Metal | 1 | 4 | pp 512 | 1236.97 ± 1.04 |
| llama 7B mostly Q4_1 | 3.95 GiB | 6.74 B | Metal | 1 | 4 | tg 128 | 90.27 ± 0.04 |
| llama 7B mostly Q5_0 | 4.33 GiB | 6.74 B | Metal | 1 | 4 | pp 512 | 1067.33 ± 0.74 |
| llama 7B mostly Q5_0 | 4.33 GiB | 6.74 B | Metal | 1 | 4 | tg 128 | 78.74 ± 0.07 |
| llama 7B mostly Q5_1 | 4.72 GiB | 6.74 B | Metal | 1 | 4 | pp 512 | 1068.07 ± 1.02 |
| llama 7B mostly Q5_1 | 4.72 GiB | 6.74 B | Metal | 1 | 4 | tg 128 | 75.76 ± 0.19 |

@ggerganov ggerganov merged commit c67fe68 into ggml-org:master Oct 18, 2023
@jhen0409 jhen0409 deleted the metal-q5 branch October 18, 2023 19:33

Successfully merging this pull request may close these issues.

metal : add Q5_0 and Q5_1 support