From 73cae1e306173c8e2413f86a38a2b5ef7e0ed0d9 Mon Sep 17 00:00:00 2001
From: BarfingLemurs <128182951+BarfingLemurs@users.noreply.github.com>
Date: Sun, 10 Sep 2023 02:18:01 -0400
Subject: [PATCH 1/3] Update README.md

---
 README.md | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/README.md b/README.md
index fe7391e01d33b..720462063326a 100644
--- a/README.md
+++ b/README.md
@@ -590,6 +590,11 @@ Several quantization methods are supported. They differ in the resulting model d
 | 13B | ms/tok @ 8th | - | 73 | 82 | 98 | 105 | 128 |
 | 13B | bits/weight | 16.0 | 4.5 | 5.0 | 5.5 | 6.0 | 8.5 |
 
+- [k-quants](https://github.com/ggerganov/llama.cpp/pull/1684)
+- recent k-quants improvements
+  - [#2707](https://github.com/ggerganov/llama.cpp/pull/2707)
+  - [#2807](https://github.com/ggerganov/llama.cpp/pull/2807)
+
 ### Perplexity (measuring model quality)
 
 You can use the `perplexity` example to measure perplexity over a given prompt (lower perplexity is better).

From 086547ec33f6046d6cb81113455dcc1844be5e2f Mon Sep 17 00:00:00 2001
From: BarfingLemurs <128182951+BarfingLemurs@users.noreply.github.com>
Date: Tue, 26 Sep 2023 10:08:18 -0400
Subject: [PATCH 2/3] Update README.md

---
 examples/perplexity/README.md | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/examples/perplexity/README.md b/examples/perplexity/README.md
index eacfb17c67fb2..50e1af0111dd6 100644
--- a/examples/perplexity/README.md
+++ b/examples/perplexity/README.md
@@ -1,3 +1,21 @@
 # perplexity
 
 TODO
+
+## Llama 2 70B Scorechart
+Quantization | Model size (GiB) | Perplexity | Delta to fp16
+-- | -- | -- | --
+Q4_0 | 36.20 | 3.5550 | 3.61%
+Q4_1 | 40.20 | 3.5125 | 2.37%
+Q5_0 | 44.20 | 3.4744 | 1.26%
+Q2_K | 27.27 | 3.7339 | 8.82%
+Q3_K_S | 27.86 | 3.7019 | 7.89%
+Q3_K_M | 30.83 | 3.5932 | 4.72%
+Q3_K_L | 33.67 | 3.5617 | 3.80%
+Q4_K_S | 36.39 | 3.4852 | 1.57%
+Q4_K_M | 38.54 | 3.4725 | 1.20%
+Q5_K_S | 44.20 | 3.4483 | 0.50%
+Q5_K_M | 45.41 | 3.4451 | 0.40%
+Q6_K | 52.70 | 3.4367 | 0.16%
+fp16 | 128.5 | 3.4313 | -
+

From 043e5b99f230998baa30a3831af1fedcdde08ac0 Mon Sep 17 00:00:00 2001
From: BarfingLemurs <128182951+BarfingLemurs@users.noreply.github.com>
Date: Tue, 26 Sep 2023 10:23:45 -0400
Subject: [PATCH 3/3] Update README.md with k-quants bpw measurements

---
 examples/quantize/README.md | 41 +++++++++++++++++++++++++++++++++++++
 1 file changed, 41 insertions(+)

diff --git a/examples/quantize/README.md b/examples/quantize/README.md
index f349e913e3d10..c8b9a27a0b04e 100644
--- a/examples/quantize/README.md
+++ b/examples/quantize/README.md
@@ -1,3 +1,44 @@
 # quantize
 
 TODO
+
+## Llama 2 7B
+
+Quantization | Bits per Weight (BPW)
+-- | --
+Q2_K | 3.35
+Q3_K_S | 3.50
+Q3_K_M | 3.91
+Q3_K_L | 4.27
+Q4_K_S | 4.58
+Q4_K_M | 4.84
+Q5_K_S | 5.52
+Q5_K_M | 5.68
+Q6_K | 6.56
+
+## Llama 2 13B
+Quantization | Bits per Weight (BPW)
+-- | --
+Q2_K | 3.34
+Q3_K_S | 3.48
+Q3_K_M | 3.89
+Q3_K_L | 4.26
+Q4_K_S | 4.56
+Q4_K_M | 4.83
+Q5_K_S | 5.51
+Q5_K_M | 5.67
+Q6_K | 6.56
+
+## Llama 2 70B
+
+Quantization | Bits per Weight (BPW)
+-- | --
+Q2_K | 3.40
+Q3_K_S | 3.47
+Q3_K_M | 3.85
+Q3_K_L | 4.19
+Q4_K_S | 4.53
+Q4_K_M | 4.80
+Q5_K_S | 5.50
+Q5_K_M | 5.65
+Q6_K | 6.56
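
The scorechart in PATCH 2 and the BPW tables in PATCH 3 come from this repo's own `quantize` and `perplexity` tools. A minimal sketch of how such numbers are obtained — the model paths below are illustrative assumptions, not taken from the patches; the main README measures its perplexity tables against the wikitext-2 test set:

```bash
# Quantize an f16 GGUF model; argument order: input model, output model, quantization type.
./quantize models/llama-2-7b/ggml-model-f16.gguf models/llama-2-7b/ggml-model-Q4_K_M.gguf Q4_K_M

# Measure perplexity over a text file (lower is better).
./perplexity -m models/llama-2-7b/ggml-model-Q4_K_M.gguf -f wikitext-2-raw/wiki.test.raw
```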
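
Bits per weight in PATCH 3 is presumably the total GGUF file size divided by the parameter count. A sanity check under that assumption (the ~3.80 GiB file size and the 6.74e9 parameter count for Llama 2 7B are approximations): (3.80 × 2^30 bytes × 8) / 6.74e9 weights ≈ 4.84, matching the Q4_K_M row. The values sit above each type's nominal bit width because k-quant blocks carry scale (and, for some types, min) metadata alongside the quantized values, and the _M and _L mixes keep selected tensors at higher-precision types.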