Commit b9fd7ee

ggerganov and sw authored
ggml : remove bit shuffling (#1405)
* ggml : remove Q4_0 bit shuffling (ARM NEON)
* ggml : remove Q4_1 bit shuffling (ARM NEON + reference)
* ggml : nibbles_from_floats() + bytes_from_nibbles() (ARM NEON)
* ggml : remove Q4_2 bit shuffling (WIP, BROKEN)
* ggml : remove Q5_0 bit shuffling (ARM NEON)
* ggml : 2x faster scalar implementations
* ggml : remove Q5_1 bit shuffling (ARM NEON + scalar)
* ggml : simplify scalar dot
* ggml : remove WASM SIMD bit shuffling + remove vzip for ARM 32-bit
* ggml : fix Q4_1 quantization
* ggml : update cuBLAS + normalize variable names
* ggml : remove Q4_2 mode
* ggml : minor formatting
* ggml : fix Q5_0 quantization
* scripts : add script for measuring the time per token
* AVX implementations (#1370)
* ggml : uniform 5th bit extraction
* llama : produce error upon loading old model files
* llama : fix model magic/version write
* ggml : speed-up Q5_0 + Q5_1 at 4 threads
* ggml : preserve old Q4 and Q5 formats
* ggml : simplify Q8_1 - no need for low / high sums anymore
* ggml : fix Q8_0 and Q8_1 rounding
* Revert "AVX implementations (#1370)" - this reverts commit 948d124
* ggml : fix AVX2 implementation
* sha : update hashes for 7B and 13B
* readme : update timings + remove warning banner
* llama : update v2 PR number to 1405
* ggml : fix WASM comments
* ggml : back to original bit order
* readme : add note that Q4 and Q5 have been changed
* llama : fix return for unknown version

Co-authored-by: Stephan Walter <[email protected]>
1 parent b608b55 commit b9fd7ee
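The heart of this commit is the in-block nibble layout. Previously, each byte packed two *adjacent* weights, which required bit shuffling to unpack efficiently with SIMD; now byte `j` of a block holds weight `j` in its low nibble and weight `j + qk/2` in its high nibble, so unpacking is a single linear pass. A minimal C++ sketch of the new Q4_0-style packing (the helper names `pack_nibbles`/`unpack_nibbles` are illustrative, not actual ggml functions):

```cpp
#include <cstdint>

// New (post-#1405) nibble layout for one 32-weight block:
// byte j stores weight j in its low nibble and weight j + 16 in its
// high nibble, so no shuffling is needed to unpack in order.
static const int QK = 32;

// Pack 32 signed 4-bit values (each in [-8, 7]) into 16 bytes.
void pack_nibbles(const int8_t x[32], uint8_t qs[16]) {
    for (int j = 0; j < QK/2; ++j) {
        const uint8_t lo = (uint8_t)(x[j]        + 8); // first half  -> low nibbles
        const uint8_t hi = (uint8_t)(x[j + QK/2] + 8); // second half -> high nibbles
        qs[j] = lo | (hi << 4);
    }
}

// Inverse: recover the 32 signed values from the 16 packed bytes.
void unpack_nibbles(const uint8_t qs[16], int8_t x[32]) {
    for (int j = 0; j < QK/2; ++j) {
        x[j]        = (int8_t)(qs[j] & 0xf) - 8;
        x[j + QK/2] = (int8_t)(qs[j] >> 4)  - 8;
    }
}
```

This is the same split-halves ordering the rewritten CUDA kernels below rely on when they write `y[i*qk + j]` and `y[i*qk + j + qk/2]`.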

File tree: 12 files changed (+669, -1706 lines)


.gitignore (1 addition, 0 deletions)

@@ -44,5 +44,6 @@ zig-cache/
 
 ppl-*.txt
 qnt-*.txt
+perf-*.txt
 
 examples/jeopardy/results.txt

README.md (13 additions, 21 deletions)

@@ -7,18 +7,10 @@
 Inference of [LLaMA](https://arxiv.org/abs/2302.13971) model in pure C/C++
 
-## ⚠️ TEMPORARY NOTICE ABOUT UPCOMING BREAKING CHANGE ⚠️
-
-**The quantization formats will soon be updated: https://github.com/ggerganov/llama.cpp/pull/1305**
-
-**All `ggml` model files using the old format will not work with the latest `llama.cpp` code after that change is merged**
-
----
-
 **Hot topics:**
 
+- Quantization formats `Q4` and `Q5` have changed - requantize any old models [(info)](https://github.com/ggerganov/llama.cpp/pull/1405)
 - [Roadmap May 2023](https://github.com/ggerganov/llama.cpp/discussions/1220)
-- [New quantization methods](https://github.com/ggerganov/llama.cpp#quantization)
 
 <details>
 <summary>Table of Contents</summary>

@@ -338,18 +330,18 @@ As the models are currently fully loaded into memory, you will need adequate dis
 Several quantization methods are supported. They differ in the resulting model disk size and inference speed.
 
-| Model | Measure      |    F16 |   Q4_0 |   Q4_1 |   Q4_2 |   Q5_0 |   Q5_1 |   Q8_0 |
-|------:|--------------|-------:|-------:|-------:|-------:|-------:|-------:|-------:|
-| 7B    | perplexity   | 5.9066 | 6.1620 | 6.0910 | 6.1466 | 5.9862 | 5.9481 | 5.9069 |
-| 7B    | file size    |  13.0G |   4.0G |   4.8G |   4.0G |   4.4G |   4.8G |   7.1G |
-| 7B    | ms/tok @ 4th |    128 |     56 |     61 |     84 |     91 |     95 |     75 |
-| 7B    | ms/tok @ 8th |    128 |     47 |     55 |     48 |     53 |     59 |     75 |
-| 7B    | bits/weight  |   16.0 |    5.0 |    6.0 |    5.0 |    5.5 |    6.0 |    9.0 |
-| 13B   | perplexity   | 5.2543 | 5.3863 | 5.3607 | 5.3513 | 5.2856 | 5.2706 | 5.2548 |
-| 13B   | file size    |  25.0G |   7.6G |   9.1G |   7.6G |   8.4G |   9.1G |    14G |
-| 13B   | ms/tok @ 4th |    239 |    104 |    113 |    160 |    176 |    185 |    141 |
-| 13B   | ms/tok @ 8th |    240 |     85 |     99 |     97 |    108 |    117 |    147 |
-| 13B   | bits/weight  |   16.0 |    5.0 |    6.0 |    5.0 |    5.5 |    6.0 |    9.0 |
+| Model | Measure      |    F16 |   Q4_0 |   Q4_1 |   Q5_0 |   Q5_1 |   Q8_0 |
+|------:|--------------|-------:|-------:|-------:|-------:|-------:|-------:|
+| 7B    | perplexity   | 5.9066 | 6.1620 | 6.0910 | 5.9862 | 5.9481 | 5.9069 |
+| 7B    | file size    |  13.0G |   4.0G |   4.8G |   4.4G |   4.8G |   7.1G |
+| 7B    | ms/tok @ 4th |    128 |     50 |     54 |     75 |     83 |     75 |
+| 7B    | ms/tok @ 8th |    123 |     44 |     52 |     53 |     58 |     72 |
+| 7B    | bits/weight  |   16.0 |    5.0 |    6.0 |    5.5 |    6.0 |    9.0 |
+| 13B   | perplexity   | 5.2543 | 5.3863 | 5.3607 | 5.2856 | 5.2706 | 5.2548 |
+| 13B   | file size    |  25.0G |   7.6G |   9.1G |   8.4G |   9.1G |    14G |
+| 13B   | ms/tok @ 4th |    239 |     93 |    101 |    150 |    164 |    141 |
+| 13B   | ms/tok @ 8th |    240 |     81 |     96 |     96 |    104 |    136 |
+| 13B   | bits/weight  |   16.0 |    5.0 |    6.0 |    5.5 |    6.0 |    9.0 |
 
 ### Perplexity (measuring model quality)
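The bits/weight column in the updated table follows directly from the block layouts (32 weights per block): Q4_0 stores a float32 scale plus 16 nibble bytes, so (4+16)·8/32 = 5.0; Q5_0 stores an fp16 scale plus a 4-byte `qh` word plus 16 nibble bytes, so (6+16)·8/32 = 5.5; Q8_0 stores a float32 scale plus 32 int8 bytes, so (4+32)·8/32 = 9.0. A quick sanity check (a sketch with header sizes read off the block structs, not ggml code):

```cpp
// Bits per weight for a 32-weight quantization block:
// (header bytes + quantized bytes) * 8 bits, divided by 32 weights.
double bits_per_weight(int header_bytes, int quant_bytes) {
    return (header_bytes + quant_bytes) * 8.0 / 32.0;
}
```

Evaluating it for each format reproduces the table: Q4_0 → 5.0, Q4_1 → 6.0, Q5_0 → 5.5, Q5_1 → 6.0, Q8_0 → 9.0.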

SHA256SUMS (16 additions, 12 deletions)

@@ -1,24 +1,27 @@
 700df0d3013b703a806d2ae7f1bfb8e59814e3d06ae78be0c66368a50059f33d models/7B/consolidated.00.pth
 666a4bb533b303bdaf89e1b6a3b6f93535d868de31d903afdc20983dc526c847 models/7B/ggml-model-f16.bin
-99aeb35f26b577fa2732716cca4d8b5ada39a78ea9b2dca2651fc632b5d101b6 models/7B/ggml-model-q4_0.bin
-cc061458339a3eb8bcecbf0a825e9924fb7d1a8150f63cd5d091caa99215aafe models/7B/ggml-model-q4_1.bin
-25b050337a87344da687a7f2adddc03bd99b7f6c140450e836649f3585fb6496 models/7B/ggml-model-q4_2.bin
+ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff models/7B/ggml-model-q4_0.bin
+ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff models/7B/ggml-model-q4_1.bin
+ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff models/7B/ggml-model-q5_0.bin
+ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff models/7B/ggml-model-q5_1.bin
 7e89e242ddc0dd6f060b43ca219ce8b3e8f08959a72cb3c0855df8bb04d46265 models/7B/params.json
 745bf4e29a4dd6f411e72976d92b452da1b49168a4f41c951cfcc8051823cf08 models/13B/consolidated.00.pth
 d5ccbcc465c71c0de439a5aeffebe8344c68a519bce70bc7f9f92654ee567085 models/13B/consolidated.01.pth
 2b206e9b21fb1076f11cafc624e2af97c9e48ea09312a0962153acc20d45f808 models/13B/ggml-model-f16.bin
-eecb575d325d935157761172e2bf05984dad216eb2b06777b73463cf9b818bab models/13B/ggml-model-q4_0.bin
-d9581b5b88e5622532fe897c9f9b0e67a317d22dd27a6f90fa4ab8c6d23ccdbb models/13B/ggml-model-q4_1.bin
-75a218a47df03f5f96354656329864613abcb67779412b9bc2282b28c1c3cbaa models/13B/ggml-model-q4_2.bin
+ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff models/13B/ggml-model-q4_0.bin
+ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff models/13B/ggml-model-q4_1.bin
+ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff models/13B/ggml-model-q5_0.bin
+ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff models/13B/ggml-model-q5_1.bin
 4ab77bec4d4405ccb66a97b282574c89a94417e3c32e5f68f37e2876fc21322f models/13B/params.json
 e23294a58552d8cdec5b7e8abb87993b97ea6eced4178ff2697c02472539d067 models/30B/consolidated.00.pth
 4e077b7136c7ae2302e954860cf64930458d3076fcde9443f4d0e939e95903ff models/30B/consolidated.01.pth
 24a87f01028cbd3a12de551dcedb712346c0b5cbdeff1454e0ddf2df9b675378 models/30B/consolidated.02.pth
 1adfcef71420886119544949767f6a56cb6339b4d5fcde755d80fe68b49de93b models/30B/consolidated.03.pth
 7e1b524061a9f4b27c22a12d6d2a5bf13b8ebbea73e99f218809351ed9cf7d37 models/30B/ggml-model-f16.bin
-517b9e525742c42b5478a6280a4b41ec66f46298c57aba7f0453d491682fe42d models/30B/ggml-model-q4_0.bin
-7b75ac615fa369ee593493a7e6ef87542bf0350255db928b22c5a24f6d598bcd models/30B/ggml-model-q4_1.bin
-aadbc9cf806313a55be570f62884eed289d30c313fac3b7838717e01bd553204 models/30B/ggml-model-q4_2.bin
+ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff models/30B/ggml-model-q4_0.bin
+ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff models/30B/ggml-model-q4_1.bin
+ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff models/30B/ggml-model-q5_0.bin
+ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff models/30B/ggml-model-q5_1.bin
 2c07118ea98d69dbe7810d88520e30288fa994751b337f8fca02b171955f44cb models/30B/params.json
 135c563f6b3938114458183afb01adc9a63bef3d8ff7cccc3977e5d3664ecafe models/65B/consolidated.00.pth
 9a600b37b19d38c7e43809485f70d17d1dc12206c07efa83bc72bb498a568bde models/65B/consolidated.01.pth

@@ -29,8 +32,9 @@ a287c0dfe49081626567c7fe87f74cce5831f58e459b427b5e05567641f47b78 models/65B/con
 72b4eba67a1a3b18cb67a85b70f8f1640caae9b40033ea943fb166bd80a7b36b models/65B/consolidated.06.pth
 d27f5b0677d7ff129ceacd73fd461c4d06910ad7787cf217b249948c3f3bc638 models/65B/consolidated.07.pth
 60758f2384d74e423dffddfd020ffed9d3bb186ebc54506f9c4a787d0f5367b0 models/65B/ggml-model-f16.bin
-01672072136f8be6ca9d7cebe5f86ed316e8b85851b9fe3de951809233cea4f2 models/65B/ggml-model-q4_0.bin
-4743a28aac3e5f32a6e838a815f51d3779de44fbbe251d745251e66c23c5950f models/65B/ggml-model-q4_1.bin
-1b6f6588d0e2ecfe6c4d849088e48e5e3083466b962daa32e3261363e21fc5e9 models/65B/ggml-model-q4_2.bin
+ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff models/65B/ggml-model-q4_0.bin
+ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff models/65B/ggml-model-q4_1.bin
+ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff models/65B/ggml-model-q5_0.bin
+ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff models/65B/ggml-model-q5_1.bin
 999ed1659b469ccc2a941714c0a9656fa571d17c9f7c8c7589817ca90edef51b models/65B/params.json
 9e556afd44213b6bd1be2b850ebbbd98f5481437a8021afaf58ee7fb1818d347 models/tokenizer.model

examples/quantize/quantize.cpp (5 additions, 6 deletions)

@@ -7,12 +7,11 @@
 #include <string>
 
 static const std::map<std::string, llama_ftype> LLAMA_FTYPE_MAP = {
-    {"q4_0", LLAMA_FTYPE_MOSTLY_Q4_0},
-    {"q4_1", LLAMA_FTYPE_MOSTLY_Q4_1},
-    {"q4_2", LLAMA_FTYPE_MOSTLY_Q4_2},
-    {"q5_0", LLAMA_FTYPE_MOSTLY_Q5_0},
-    {"q5_1", LLAMA_FTYPE_MOSTLY_Q5_1},
-    {"q8_0", LLAMA_FTYPE_MOSTLY_Q8_0},
+    {"q4_0", LLAMA_FTYPE_MOSTLY_Q4_0},
+    {"q4_1", LLAMA_FTYPE_MOSTLY_Q4_1},
+    {"q5_0", LLAMA_FTYPE_MOSTLY_Q5_0},
+    {"q5_1", LLAMA_FTYPE_MOSTLY_Q5_1},
+    {"q8_0", LLAMA_FTYPE_MOSTLY_Q8_0},
 };
 
 bool try_parse_ftype(const std::string & ftype_str, llama_ftype & ftype, std::string & ftype_str_out) {
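With `q4_2` dropped from the map, parsing a user-supplied type name is a plain lookup. A hedged, self-contained sketch of that pattern (the enum and function below are illustrative stand-ins, not the actual `llama_ftype`/`try_parse_ftype` definitions):

```cpp
#include <map>
#include <string>

// Illustrative stand-in for llama_ftype; note q4_2 is intentionally
// absent, mirroring the map above after this commit.
enum fake_ftype { FT_Q4_0, FT_Q4_1, FT_Q5_0, FT_Q5_1, FT_Q8_0 };

static const std::map<std::string, fake_ftype> FTYPE_MAP = {
    {"q4_0", FT_Q4_0},
    {"q4_1", FT_Q4_1},
    {"q5_0", FT_Q5_0},
    {"q5_1", FT_Q5_1},
    {"q8_0", FT_Q8_0},
};

// Returns true and sets `out` if `name` names a known quantization type.
bool parse_ftype(const std::string & name, fake_ftype & out) {
    const auto it = FTYPE_MAP.find(name);
    if (it == FTYPE_MAP.end()) {
        return false;   // unknown (or removed, e.g. "q4_2") type
    }
    out = it->second;
    return true;
}
```

The lookup failing for `"q4_2"` is exactly what makes the quantize tool reject the removed format.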

ggml-cuda.cu (36 additions, 95 deletions)

@@ -49,13 +49,6 @@ typedef struct {
 } block_q4_1;
 static_assert(sizeof(block_q4_1) == sizeof(float) * 2 + QK4_1 / 2, "wrong q4_1 block size/padding");
 
-#define QK4_2 16
-typedef struct {
-    half d;                 // delta
-    uint8_t qs[QK4_2 / 2];  // nibbles / quants
-} block_q4_2;
-static_assert(sizeof(block_q4_2) == sizeof(ggml_fp16_t) + QK4_2 / 2, "wrong q4_2 block size/padding");
-
 #define QK5_0 32
 typedef struct {
     half d;                 // delta

@@ -81,147 +74,102 @@ typedef struct {
 static_assert(sizeof(block_q8_0) == sizeof(float) + QK8_0, "wrong q8_0 block size/padding");
 
 static __global__ void dequantize_block_q4_0(const void * vx, float * y) {
+    static const int qk = QK4_0;
+
     const block_q4_0 * x = (const block_q4_0 *) vx;
 
     const int i = blockIdx.x;
 
     const float d = x[i].d;
 
-    const uint8_t * pp = x[i].qs;
-
-    for (int l = 0; l < QK4_0; l += 2) {
-        const uint8_t vi = pp[l/2];
-
-        const int8_t vi0 = vi & 0xf;
-        const int8_t vi1 = vi >> 4;
+    for (int j = 0; j < qk/2; ++j) {
+        const int x0 = (x[i].qs[j] & 0xf) - 8;
+        const int x1 = (x[i].qs[j] >>  4) - 8;
 
-        const float v0 = (vi0 - 8)*d;
-        const float v1 = (vi1 - 8)*d;
-
-        y[i*QK4_0 + l + 0] = v0;
-        y[i*QK4_0 + l + 1] = v1;
+        y[i*qk + j + 0   ] = x0*d;
+        y[i*qk + j + qk/2] = x1*d;
     }
 }
 
 static __global__ void dequantize_block_q4_1(const void * vx, float * y) {
+    static const int qk = QK4_1;
+
     const block_q4_1 * x = (const block_q4_1 *) vx;
 
     const int i = blockIdx.x;
 
     const float d = x[i].d;
     const float m = x[i].m;
 
-    const uint8_t * pp = x[i].qs;
-
-    for (int l = 0; l < QK4_1; l += 2) {
-        const uint8_t vi = pp[l/2];
-
-        const int8_t vi0 = vi & 0xf;
-        const int8_t vi1 = vi >> 4;
+    for (int j = 0; j < qk/2; ++j) {
+        const int x0 = (x[i].qs[j] & 0xf);
+        const int x1 = (x[i].qs[j] >>  4);
 
-        const float v0 = vi0*d + m;
-        const float v1 = vi1*d + m;
-
-        y[i*QK4_1 + l + 0] = v0;
-        y[i*QK4_1 + l + 1] = v1;
-    }
-}
-
-static __global__ void dequantize_block_q4_2(const void * vx, float * y) {
-    const block_q4_2 * x = (const block_q4_2 *) vx;
-
-    const int i = blockIdx.x;
-
-    const float d = x[i].d;
-
-    const uint8_t * pp = x[i].qs;
-
-    for (int l = 0; l < QK4_2; l += 2) {
-        const uint8_t vi = pp[l/2];
-
-        const int8_t vi0 = vi & 0xf;
-        const int8_t vi1 = vi >> 4;
-
-        const float v0 = (vi0 - 8)*d;
-        const float v1 = (vi1 - 8)*d;
-
-        y[i*QK4_2 + l + 0] = v0;
-        y[i*QK4_2 + l + 1] = v1;
+        y[i*qk + j + 0   ] = x0*d + m;
+        y[i*qk + j + qk/2] = x1*d + m;
     }
 }
 
 static __global__ void dequantize_block_q5_0(const void * vx, float * y) {
+    static const int qk = QK5_0;
+
     const block_q5_0 * x = (const block_q5_0 *) vx;
 
     const int i = blockIdx.x;
 
     const float d = x[i].d;
 
-    const uint8_t * pp = x[i].qs;
-
     uint32_t qh;
     memcpy(&qh, x[i].qh, sizeof(qh));
 
-    for (int l = 0; l < QK5_0; l += 2) {
-        const uint8_t vi = pp[l/2];
-
-        const int8_t vh0 = ((qh & (1 << (l + 0))) >> (l + 0)) << 4;
-        const int8_t vh1 = ((qh & (1 << (l + 1))) >> (l + 1)) << 4;
+    for (int j = 0; j < qk/2; ++j) {
+        const uint8_t xh_0 = ((qh >> (j +  0)) << 4) & 0x10;
+        const uint8_t xh_1 = ((qh >> (j + 12))     ) & 0x10;
 
-        const int8_t vi0 = ((vi & 0xf) | vh0);
-        const int8_t vi1 = ((vi >>  4) | vh1);
+        const int32_t x0 = ((x[i].qs[j] & 0xf) | xh_0) - 16;
+        const int32_t x1 = ((x[i].qs[j] >>  4) | xh_1) - 16;
 
-        const float v0 = (vi0 - 16)*d;
-        const float v1 = (vi1 - 16)*d;
-
-        y[i*QK5_0 + l + 0] = v0;
-        y[i*QK5_0 + l + 1] = v1;
+        y[i*qk + j + 0   ] = x0*d;
+        y[i*qk + j + qk/2] = x1*d;
     }
 }
 
 static __global__ void dequantize_block_q5_1(const void * vx, float * y) {
+    static const int qk = QK5_1;
+
     const block_q5_1 * x = (const block_q5_1 *) vx;
 
     const int i = blockIdx.x;
 
     const float d = x[i].d;
     const float m = x[i].m;
 
-    const uint8_t * pp = x[i].qs;
-
     uint32_t qh;
     memcpy(&qh, x[i].qh, sizeof(qh));
 
-    for (int l = 0; l < QK5_1; l += 2) {
-        const uint8_t vi = pp[l/2];
-
-        const int8_t vh0 = ((qh & (1 << (l + 0))) >> (l + 0)) << 4;
-        const int8_t vh1 = ((qh & (1 << (l + 1))) >> (l + 1)) << 4;
+    for (int j = 0; j < qk/2; ++j) {
+        const uint8_t xh_0 = ((qh >> (j +  0)) << 4) & 0x10;
+        const uint8_t xh_1 = ((qh >> (j + 12))     ) & 0x10;
 
-        const int8_t vi0 = (vi & 0xf) | vh0;
-        const int8_t vi1 = (vi >>  4) | vh1;
+        const int x0 = (x[i].qs[j] & 0xf) | xh_0;
+        const int x1 = (x[i].qs[j] >>  4) | xh_1;
 
-        const float v0 = vi0*d + m;
-        const float v1 = vi1*d + m;
-
-        y[i*QK5_1 + l + 0] = v0;
-        y[i*QK5_1 + l + 1] = v1;
+        y[i*qk + j + 0   ] = x0*d + m;
+        y[i*qk + j + qk/2] = x1*d + m;
     }
 }
 
 static __global__ void dequantize_block_q8_0(const void * vx, float * y) {
+    static const int qk = QK8_0;
+
     const block_q8_0 * x = (const block_q8_0 *) vx;
 
     const int i = blockIdx.x;
 
     const float d = x[i].d;
 
-    const int8_t * pp = x[i].qs;
-
-    for (int l = 0; l < QK8_0; l++) {
-        const int8_t vi = pp[l];
-
-        y[i*QK8_0 + l] = vi*d;
+    for (int j = 0; j < qk; ++j) {
+        y[i*qk + j] = x[i].qs[j]*d;
     }
 }

@@ -235,11 +183,6 @@ static void dequantize_row_q4_1_cuda(const void * vx, float * y, int k, cudaStream_t stream) {
     dequantize_block_q4_1<<<nb, 1, 0, stream>>>(vx, y);
 }
 
-static void dequantize_row_q4_2_cuda(const void * vx, float * y, int k, cudaStream_t stream) {
-    const int nb = k / QK4_2;
-    dequantize_block_q4_2<<<nb, 1, 0, stream>>>(vx, y);
-}
-
 static void dequantize_row_q5_0_cuda(const void * vx, float * y, int k, cudaStream_t stream) {
     const int nb = k / QK5_0;
     dequantize_block_q5_0<<<nb, 1, 0, stream>>>(vx, y);

@@ -274,8 +217,6 @@ static to_fp32_cuda_t ggml_get_to_fp32_cuda(ggml_type type) {
             return dequantize_row_q4_0_cuda;
         case GGML_TYPE_Q4_1:
             return dequantize_row_q4_1_cuda;
-        case GGML_TYPE_Q4_2:
-            return dequantize_row_q4_2_cuda;
         case GGML_TYPE_Q5_0:
             return dequantize_row_q5_0_cuda;
         case GGML_TYPE_Q5_1:
0 commit comments