DeepSeek-R1-0528 ik quants! #477
-
Thanks for these quants and the rest of your work you publish. Could you do one that fits in 128GB RAM and 72GB VRAM with 32K context? I tried the unsloth IQ1_S and got about 2.7 t/s generation on mainline and 2.15 t/s on ik. It was coherent and delivered surprisingly good responses to real-world coding tasks. Oh, but the R4 variants don't support Q1 yet, right?
-
Will -rtr fix the R4 quants so they don't have to use the BF16 path? I downloaded IQ1_S from unsloth and got 90 t/s PP but the same or slightly lower 10.x t/s output, so IQ2_XXS from the previous V3 is not much different. A smaller AMB than 512 often lets you fit a couple more pieces due to the reduced buffer. Every little bit on GPU helps when CPU/memory isn't that strong.
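For anyone wanting to experiment, here is a minimal launch sketch showing where -rtr and -amb fit together; the model path, thread count, and offload pattern are placeholders rather than a tuned configuration:

```bash
# Repack compatible tensors to _R4 at load time (-rtr) and cap the per-GPU
# attention compute buffer (-amb, in MiB); a smaller -amb frees VRAM for layers.
./build/bin/llama-server \
  --model /path/to/DeepSeek-R1-0528-IQ1_S.gguf \
  -mla 3 -fa -fmoe \
  -amb 256 \
  -rtr \
  -ngl 99 \
  -ot exps=CPU \
  --threads 16
```

As I understand it, -rtr implies no mmap, since tensors are rewritten in memory during load.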
-
I had an interesting report from huggingface.co/ciprianv that compiling with -DGGML_CUDA_IQK_FORCE_BF16=1 changes performance. I tried it out myself and confirmed it with my own sweep benchmarks. Interestingly, it does suggest that for some hardware configurations it may be beneficial for PP to compile with -DGGML_CUDA_IQK_FORCE_BF16=1.
Methodology and data logs
Compilation flags with and without FORCE_BF16:
cmake -B ./build -DGGML_CUDA=ON -DGGML_BLAS=OFF -DGGML_SCHED_MAX_COPIES=1 -DGGML_CUDA_IQK_FORCE_BF16=1
cmake --build ./build --config Release -j $(nproc)
cmake -B ./build -DGGML_CUDA=ON -DGGML_BLAS=OFF -DGGML_SCHED_MAX_COPIES=1 -DGGML_CUDA_IQK_FORCE_BF16=0
llama-sweep-bench test:
CUDA_VISIBLE_DEVICES="0" \
./build/bin/llama-sweep-bench \
--model "$model" \
-mla 3 -fa \
-amb 512 \
-fmoe \
-ctk f16 \
-c 16384 \
-ngl 99 \
-ot "blk\.(3|4|5|6|7|8|9)\.ffn_.*=CUDA0" \ # <--- with or without extra layers offloaded to GPU
-ot exps=CPU \
--warmup-batch \
--no-mmap \
--threads 24
llama_model_loader: - type f32: 361 tensors
llama_model_loader: - type q5_0: 61 tensors
llama_model_loader: - type iq4_ks: 116 tensors
llama_model_loader: - type iq5_ks: 435 tensors
llama_model_loader: - type iq2_k_r4: 116 tensors
llama_model_loader: - type iq3_k_r4: 58 tensors
PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
---|---|---|---|---|---|---|
512 | 128 | 0 | 10.448 | 49.00 | 8.496 | 15.07 |
512 | 128 | 512 | 10.626 | 48.18 | 8.548 | 14.97 |
512 | 128 | 1024 | 10.704 | 47.83 | 8.601 | 14.88 |
512 | 128 | 1536 | 10.803 | 47.39 | 9.029 | 14.18 |
512 | 128 | 2048 | 10.938 | 46.81 | 8.645 | 14.81 |
512 | 128 | 2560 | 10.983 | 46.62 | 8.789 | 14.56 |
512 | 128 | 3072 | 11.132 | 46.00 | 8.824 | 14.51 |
512 | 128 | 3584 | 11.152 | 45.91 | 8.845 | 14.47 |
512 | 128 | 4096 | 11.285 | 45.37 | 9.060 | 14.13 |
512 | 128 | 4608 | 11.432 | 44.79 | 8.842 | 14.48 |
512 | 128 | 5120 | 11.415 | 44.85 | 8.893 | 14.39 |
512 | 128 | 5632 | 11.542 | 44.36 | 9.071 | 14.11 |
512 | 128 | 6144 | 11.605 | 44.12 | 9.085 | 14.09 |
512 | 128 | 6656 | 11.719 | 43.69 | 9.258 | 13.83 |
512 | 128 | 7168 | 11.851 | 43.20 | 9.104 | 14.06 |
512 | 128 | 7680 | 11.884 | 43.08 | 9.115 | 14.04 |
512 | 128 | 8192 | 12.052 | 42.48 | 9.434 | 13.57 |
-DGGML_CUDA_IQK_FORCE_BF16=1 -ot exps=CPU
PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
---|---|---|---|---|---|---|
512 | 128 | 0 | 11.488 | 44.57 | 8.968 | 14.27 |
512 | 128 | 512 | 11.665 | 43.89 | 8.923 | 14.34 |
512 | 128 | 1024 | 11.746 | 43.59 | 8.912 | 14.36 |
512 | 128 | 1536 | 11.841 | 43.24 | 9.110 | 14.05 |
512 | 128 | 2048 | 11.981 | 42.73 | 8.966 | 14.28 |
512 | 128 | 2560 | 12.023 | 42.58 | 9.144 | 14.00 |
512 | 128 | 3072 | 12.112 | 42.27 | 9.216 | 13.89 |
512 | 128 | 3584 | 12.257 | 41.77 | 9.215 | 13.89 |
512 | 128 | 4096 | 12.323 | 41.55 | 9.224 | 13.88 |
512 | 128 | 4608 | 12.452 | 41.12 | 9.191 | 13.93 |
512 | 128 | 5120 | 12.512 | 40.92 | 9.220 | 13.88 |
512 | 128 | 5632 | 12.555 | 40.78 | 9.378 | 13.65 |
512 | 128 | 6144 | 12.695 | 40.33 | 9.354 | 13.68 |
512 | 128 | 6656 | 12.822 | 39.93 | 9.480 | 13.50 |
512 | 128 | 7168 | 12.829 | 39.91 | 9.454 | 13.54 |
512 | 128 | 7680 | 12.937 | 39.58 | 9.502 | 13.47 |
512 | 128 | 8192 | 13.148 | 38.94 | 9.604 | 13.33 |
512 | 128 | 8704 | 13.142 | 38.96 | 9.626 | 13.30 |
512 | 128 | 9216 | 13.268 | 38.59 | 9.758 | 13.12 |
512 | 128 | 9728 | 13.410 | 38.18 | 9.604 | 13.33 |
512 | 128 | 10240 | 13.429 | 38.13 | 9.613 | 13.32 |
512 | 128 | 10752 | 13.522 | 37.87 | 9.856 | 12.99 |
512 | 128 | 11264 | 13.653 | 37.50 | 9.790 | 13.08 |
512 | 128 | 11776 | 13.780 | 37.15 | 9.779 | 13.09 |
512 | 128 | 12288 | 13.772 | 37.18 | 9.825 | 13.03 |
512 | 128 | 12800 | 13.886 | 36.87 | 10.041 | 12.75 |
512 | 128 | 13312 | 14.037 | 36.47 | 9.906 | 12.92 |
512 | 128 | 13824 | 14.078 | 36.37 | 10.013 | 12.78 |
512 | 128 | 14336 | 14.178 | 36.11 | 10.172 | 12.58 |
512 | 128 | 14848 | 14.289 | 35.83 | 10.043 | 12.74 |
512 | 128 | 15360 | 14.406 | 35.54 | 9.980 | 12.83 |
512 | 128 | 15872 | 14.414 | 35.52 | 10.023 | 12.77 |
-DGGML_CUDA_IQK_FORCE_BF16=0 -ot exps=CPU
PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
---|---|---|---|---|---|---|
512 | 128 | 0 | 12.572 | 40.73 | 8.800 | 14.55 |
512 | 128 | 512 | 12.639 | 40.51 | 8.911 | 14.36 |
512 | 128 | 1024 | 12.810 | 39.97 | 9.140 | 14.00 |
512 | 128 | 1536 | 12.985 | 39.43 | 8.942 | 14.31 |
512 | 128 | 2048 | 12.998 | 39.39 | 9.217 | 13.89 |
512 | 128 | 2560 | 13.119 | 39.03 | 9.378 | 13.65 |
512 | 128 | 3072 | 13.247 | 38.65 | 9.137 | 14.01 |
512 | 128 | 3584 | 13.293 | 38.52 | 9.186 | 13.93 |
512 | 128 | 4096 | 13.488 | 37.96 | 9.341 | 13.70 |
512 | 128 | 4608 | 13.496 | 37.94 | 9.235 | 13.86 |
512 | 128 | 5120 | 13.522 | 37.86 | 9.405 | 13.61 |
512 | 128 | 5632 | 13.695 | 37.39 | 9.388 | 13.63 |
512 | 128 | 6144 | 13.716 | 37.33 | 9.352 | 13.69 |
512 | 128 | 6656 | 13.905 | 36.82 | 9.530 | 13.43 |
512 | 128 | 7168 | 13.911 | 36.80 | 9.413 | 13.60 |
512 | 128 | 7680 | 14.024 | 36.51 | 9.630 | 13.29 |
512 | 128 | 8192 | 14.210 | 36.03 | 9.601 | 13.33 |
512 | 128 | 8704 | 14.277 | 35.86 | 9.595 | 13.34 |
512 | 128 | 9216 | 14.361 | 35.65 | 9.571 | 13.37 |
512 | 128 | 9728 | 14.438 | 35.46 | 9.798 | 13.06 |
512 | 128 | 10240 | 14.577 | 35.12 | 9.717 | 13.17 |
512 | 128 | 10752 | 14.605 | 35.06 | 9.887 | 12.95 |
512 | 128 | 11264 | 14.683 | 34.87 | 10.044 | 12.74 |
512 | 128 | 11776 | 14.881 | 34.41 | 9.796 | 13.07 |
512 | 128 | 12288 | 14.909 | 34.34 | 9.840 | 13.01 |
512 | 128 | 12800 | 14.982 | 34.18 | 9.832 | 13.02 |
512 | 128 | 13312 | 15.094 | 33.92 | 10.101 | 12.67 |
512 | 128 | 13824 | 15.219 | 33.64 | 10.060 | 12.72 |
512 | 128 | 14336 | 15.265 | 33.54 | 10.282 | 12.45 |
512 | 128 | 14848 | 15.333 | 33.39 | 10.172 | 12.58 |
512 | 128 | 15360 | 15.493 | 33.05 | 9.979 | 12.83 |
512 | 128 | 15872 | 15.553 | 32.92 | 9.987 | 12.82 |
-
I uploaded the custom quant I use for coding here, with some of the information on how I arrived there and relevant benchmarks. I added some teasers on command line arguments to experiment with, as this branch is moving quickly and small performance improvements can add up over time.
-
Quantization Effects of …
-
So here is a new surprise, since I'm eyeing that IQ1 quant you're publishing. On a lark I turned off the -rtr switch, and on unsloth's quant it was cutting my prompt processing by half. It did buff text generation to over 11 t/s though. The mind wobbles. I will try reloading the larger quant of V3 to check results. On Qwens it sped things up 100%. On another note, I tried to test mainline llama.cpp, and its sweep-bench segfaults with DeepSeek and does not recognize the -fa parameter. I was able to load llama-server and get a blazing fast 6 t/s PP, 6 t/s TG. So much for that.
-
Still struggling to understand some things. ✔ All tensors on CPU. With Q4_K_M … With Q4_K_M (IQ4_K_R4) … Either way I have a whole GPU worth of compute just sitting idle. There has to be a way to utilize it. Can I not have the GPU do more of the work?
-
Can anyone explain to me in simple terms: when considering tensor-offload configurations, what exactly is the nature of the stickiness or entanglement between tensors? Which tensors MUST go together as an indivisible unit? ✔ All tensors on CPU works. FACT: attention and exps can be separated between CPU and GPU. But what if you want to do something like this? ✘ attn=CUDA0, blk.3.=CUDA1, exps=CPU. Are these impossible for REASONS, or just "not supported", i.e. go learn the domain and write the code myself?
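To make the question concrete, here is the syntax for the kind of split being described, as a sketch only; the layer indices and device assignments are illustrative, and this is not a claim that every such combination currently works:

```bash
# Tensor overrides are regex=backend pairs; as I understand it, earlier
# -ot patterns take precedence when a tensor name matches more than one.
./build/bin/llama-server --model "$model" -ngl 99 -mla 3 -fa -fmoe \
  -ot "blk\.(3|4|5)\.ffn_.*=CUDA0" \
  -ot "blk\.(6|7|8)\.ffn_.*=CUDA1" \
  -ot exps=CPU
```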
-
Day 4 of chasing performance with bespoke repacking and the delicate and mercurial (i.e. broken) configuration args. I'm ready to give up. I tried so many blends of tensor-offload parameters and static repacking that my head is spinning. Nothing I tried can reach my earlier high-water marks. I made a repacked quant that converts only the exps tensors running on CPU to _r4 (exps 11...60) and runs everything else on CUDA0 and CUDA1 with --sm layer. It should be the best of both worlds, but it's the worst of both worlds: PP 71 and TG 9. The domain may seem like black magic, but at the end of the day all we're doing here is matrix multiplication. My instinct is screaming at me that there's a huge amount of performance left on the table. The wild and frankly shocking comment that "high GPU utilization is actually a bad thing" notwithstanding, the goal is to get the most math done per unit time as possible. It's very telling that seemingly no one can give an explanation that holds water of what operations must be tied to one another on a compute device, or why the tensors can be split one way between CPU and CUDA0, but as soon as you extend the split to involve CUDA1 the performance bombs. We want to run big models on commodity hardware, and that means finding the way of distributing the computation among multiple relatively low-capacity compute units that maximizes the contribution of all the units.
-
I am running on 1x EPYC 9334QS + 12x 48GB DDR5-6400 (running at 4800 MHz) + a 3070 16GB: ~10.3 t/s TG, ~78 t/s PP. It works well, but about 12GB of VRAM is already used, and I am not sure how large a context window will fit. model: unsloth/DeepSeek-R1-0528-GGUF/DeepSeek-R1-0528-Q4_K_M-00001-of-00009.gguf
-
Add -b 4096 -ub 4096 and you will get 3x your PP speed.
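A sketch of where those flags go (values as suggested above; the trade-off is a larger CUDA compute buffer, so expect higher VRAM use):

```bash
# -b sets the logical batch and -ub the physical micro-batch actually
# computed per pass; larger micro-batches amortize weight reads during PP.
./build/bin/llama-server --model "$model" \
  -b 4096 -ub 4096 \
  -ngl 99 -ot exps=CPU --threads 24
```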
-
So I finally cooked a quant after sitting on the BF16 for so long. I ended up going with @ubergarm's imatrix with: … Running a sweep right now, but early impressions are good enough that I may end up using this for a while before attempting some more mixes (PP seems a bit better, TG seems about the same). (As a reminder, the quant I ended up settling on for V3-0324 was a very simple …)
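For reference, a hedged sketch of what such a recipe invocation can look like with ik_llama.cpp's llama-quantize; the --custom-q regexes and tensor types below are illustrative placeholders, not the mix described above, so check the syntax against your checkout:

```bash
# Illustrative per-tensor recipe; regexes and types are made-up examples.
./build/bin/llama-quantize \
  --imatrix imatrix-DeepSeek-R1-0528.dat \
  --custom-q "attn_k_b=q8_0,attn_v_b=q8_0,ffn_down_exps=iq3_k,ffn_(gate|up)_exps=iq2_k" \
  DeepSeek-R1-0528-BF16.gguf DeepSeek-R1-0528-custom.gguf IQ2_K 24
```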
-
Thank you for the discussion. Sharing my experimental results for your reference.
-
PP (prompt processing) speed in ik_llama.cpp is significantly faster than in standard llama.cpp. The biggest challenge in offline DeepSeek deployment is PP performance. Compared to enterprise-grade Prefill/Decode (PD)-separated architectures that deliver robust PP and TG performance, single-machine deployments (for individuals or small teams) struggle with long-context (>10K token) processing due to suboptimal PP efficiency. From my perspective: if GPU VRAM can double PP performance, that is maximizing resource utilization; using VRAM to host sparsely activated expert weights (at only ~4% utilization) seems wasteful. The 270 tokens/s at 16384 batch size represents the peak PP performance I achieved after exhaustive tuning of configurations, CPU/GPU combinations, and offline DeepSeek deployment setups. I sincerely apologize again for my earlier statements. Configurations tested: llama.cpp; ik_llama.cpp; ik_llama.cpp with -mla 3 -fmoe -amb 512/1024. Screenshot evidence will be attached as noted.
-
Turns out I was using ik_llama.cpp incorrectly all along.
-
I can see what I can do, but I don't feel particularly motivated to engage in hunting down integer overflows and CUDA maximum-block-size-exceeded issues in code that I didn't write myself or at least modified at some point. There are still some performance optimizations left that would be more interesting to work on. But based on your performance numbers, I estimate you have a 30 GB/s PCI-E link, so it takes about 13 seconds to upload all experts stored in RAM to the GPU(s). For a u-batch size of 16k tokens you are getting 347 t/s, so the u-batch takes about 47 seconds, and computation is about 34 seconds (and it is easy to verify that this napkin math works for u-batches of 8k and 4k). If you were to go to a u-batch size of 32k tokens, computation for the batch would at least double while offload time stayed the same, so it would take about 81 seconds, putting performance in the range of 390 t/s. In reality, when batch sizes become very large, computing performance goes down due to limited caches, etc., so I'm guessing you will saturate around 350-360 t/s. If I look at the 8k u-batch size, I estimate you have in the range of 30 GB of unused VRAM. Hence, you could have uploaded 5 or 6 layers of experts to the GPU. That would slightly increase your PP performance and also boost your TG performance by about 10%.
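Spelled out, the arithmetic looks like this (the 400 GB of experts and 30 GB/s link speed are the assumed round numbers behind the estimates above):

```bash
awk 'BEGIN {
  experts_gb = 400; pcie_gbs = 30       # assumed: experts in RAM, PCIe speed
  upload = experts_gb / pcie_gbs        # ~13 s to stream experts to the GPU
  batch = 16384; tps = 347              # measured numbers from the post above
  total = batch / tps                   # ~47 s per 16k u-batch
  compute = total - upload              # ~34 s of actual compute
  total32 = 2 * compute + upload        # doubling u-batch doubles compute only
  printf "upload %.1fs, compute %.1fs, est. 32k PP %.0f t/s\n",
         upload, compute, 32768 / total32
}'
```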
-
Just a couple benchmark dumps.
Compiled with -DGGML_CUDA_IQK_FORCE_BF16=1: …
Compiled with -DGGML_CUDA_IQK_FORCE_BF16=0: …
Do not really see a difference between the two.
Hardware: …
-
MOVED: #258 (comment)
-
Let's update the perplexity vs LLM size graph. I suggest we use SVG. [EDIT]: this is an old graph, but it contains the code of the generator in the details. The latest version of the code will be here, but the most up-to-date graphs could be elsewhere; for example, for DeepSeek-R1-0528 it's here: #477 (comment), with QR codes pointing to huggingface's short-version domain name hf.co (to save on the QR data).
INSTRUCTIONS TO GENERATE SVG
To generate the SVG graph of perplexity vs LLM size, keep the data in config.json and use the make.sh script (./make.sh --logscale config.json > ppl-log.svg) to generate the SVG file:
#!/usr/bin/env bash
set -euo pipefail
# ------------------------------------------------------------------
# 1. CLI
# ------------------------------------------------------------------
logscale=0
[[ $# -ge 1 && $1 == "--logscale" ]] && { logscale=1; shift; }
[[ $# -eq 1 ]] || { echo "Usage: $0 [--logscale] config.json" >&2; exit 1; }
config=$1
[[ -f $config ]] || { echo "Config file not found" >&2; exit 1; }
# ------------------------------------------------------------------
# 2. QR-codes (never touch stdin)
# ------------------------------------------------------------------
qr_dir="qrcodes"
mkdir -p "$qr_dir"
while IFS= read -r url; do
[[ -z $url ]] && continue
short=${url//https:\/\/huggingface.co\//hf.co/}
hash=$(printf '%s' "$short" | md5sum | awk '{print $1}')
file="$qr_dir/$hash.svg"
[[ -f $file ]] && continue
tmp="$qr_dir/${hash}_tmp.svg"
qrencode --inline -t svg -l L -s 1 -m 0 "$short" -o "$tmp"
svgo --multipass -q "$tmp" -o "$file" 2>/dev/null
rm -f "$tmp"
done < <(jq -r '.data[] | select(.url) | .url' "$config")
# ------------------------------------------------------------------
# 3. Pre-compute .size and limits
# ------------------------------------------------------------------
mp=$(jq -r '.model_parameters' "$config")
min_ppl=$(jq -r '.data | min_by(.ppl).ppl' "$config")
max_ppl=$(jq -r '.data | max_by(.ppl).ppl' "$config")
sizes=()
while IFS= read -r size; do
sizes+=("$size")
done < <(jq -r --arg mp "$mp" '.data[] | .bpw * ($mp|tonumber) / 8 / 1024 / 1024 / 1024 | round * 1.0' "$config")
max_sz=0
for size in "${sizes[@]}"; do
if (( $(echo "$size > $max_sz" | bc -l) )); then
max_sz=$size
fi
done
max_round=$(awk -v m="$max_sz" 'BEGIN{r=int((m+63)/64)*64; print (r<64?64:r)}')
title=$(jq -r '.title // "Quantization Analysis"' "$config")
subtitle=$(jq -r '.subtitle // "Lower perplexity = better"' "$config")
[[ $logscale -eq 1 ]] && subtitle+=" (log-difference scale)"
if [[ $logscale -eq 1 ]]; then
rng=$(awk -v min="$min_ppl" -v max="$max_ppl" 'BEGIN{print max-min}')
eps=$(awk -v r="$rng" 'BEGIN{print r/100}')
t_min=$(awk -v e="$eps" 'BEGIN{print log(e)/log(10)}')
t_range=$(awk -v min="$min_ppl" -v max="$max_ppl" -v e="$eps" \
'BEGIN{tmax=log(max-min+e)/log(10); print tmax-log(e)/log(10)}')
else
ppl_rng=$(awk -v min="$min_ppl" -v max="$max_ppl" 'BEGIN{print max-min}')
fi
# ------------------------------------------------------------------
# 4. Pareto indices
# ------------------------------------------------------------------
pareto_i=()
item_count=$(jq '.data | length' "$config")
for ((i=0; i<item_count; i++)); do
item=$(jq --argjson i "$i" '.data[$i]' "$config")
bpw=$(jq -r '.bpw' <<<"$item")
ppl=$(jq -r '.ppl' <<<"$item")
size=$(bc <<< "scale=4; $bpw * $mp / 8 / 1024 / 1024 / 1024")
size=$(printf "%.1f" "$size")
is_pareto=1
for ((j=0; j<item_count; j++)); do
[[ $j -eq $i ]] && continue
j_item=$(jq --argjson j "$j" '.data[$j]' "$config")
j_bpw=$(jq -r '.bpw' <<<"$j_item")
j_ppl=$(jq -r '.ppl' <<<"$j_item")
j_size=$(bc <<< "scale=4; $j_bpw * $mp / 8 / 1024 / 1024 / 1024")
j_size=$(printf "%.1f" "$j_size")
if (( $(echo "$j_ppl <= $ppl" | bc -l) && $(echo "$j_size <= $size" | bc -l) )); then
if (( $(echo "$j_ppl < $ppl" | bc -l) || $(echo "$j_size < $size" | bc -l) )); then
is_pareto=0
break
fi
fi
done
[[ $is_pareto -eq 1 ]] && pareto_i+=("$i")
done
# ------------------------------------------------------------------
# 5. SVG header & grid
# ------------------------------------------------------------------
top=100; h=400; gap=50
leg_h=$((70 + item_count * 40))
tot=$((top + h + gap + leg_h + 50))
cat <<EOF
<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 800 $tot">
<defs>
<radialGradient id="halo" cx="50%" cy="50%" r="50%">
<stop offset="0%" stop-color="#00c853" stop-opacity="0.20"/>
<stop offset="100%" stop-color="#00c853" stop-opacity="0"/>
</radialGradient>
</defs>
<style>
.axis{stroke:#555;stroke-width:1.5}
.grid{stroke:#eee;stroke-width:.5}
.label{font:14px sans-serif;fill:#333}
.title{font:bold 18px sans-serif;fill:#111}
.triangle{stroke-width:1.5}
.legend-item{font:12px monospace}
.legend-title{font:bold 13px sans-serif}
</style>
<rect width="100%" height="100%" fill="white"/>
<text class="title" x="400" y="30" text-anchor="middle">$title</text>
<text class="label" x="400" y="55" text-anchor="middle" fill="#666">$subtitle</text>
<line x1="100" y1="100" x2="100" y2="500" class="axis"/>
<line x1="100" y1="500" x2="700" y2="500" class="axis"/>
<text class="label" x="400" y="540" text-anchor="middle">Model Size (GB)</text>
<text class="label" x="20" y="300" text-anchor="middle" transform="rotate(-90,20,300)">Perplexity (lower is better)</text>
EOF
# Y-axis grid
if [[ $logscale -eq 1 ]]; then
for i in {0..4}; do
frac=$(awk -v i="$i" 'BEGIN{print i/4}')
ppl=$(awk -v min="$min_ppl" -v eps="$eps" -v tr="$t_range" -v f="$frac" \
'BEGIN{printf "%.3f", min + 10**(log(eps)/log(10) + f*tr) - eps}')
y=$(awk -v f="$frac" 'BEGIN{printf "%.1f",500-400*f}')
text_y=$(awk -v y="$y" 'BEGIN{printf "%.1f", y+5}')
echo " <line x1=\"100\" y1=\"$y\" x2=\"700\" y2=\"$y\" class=\"grid\"/>"
echo " <text class=\"label\" x=\"80\" y=\"$text_y\" text-anchor=\"end\">$ppl</text>"
done
else
for i in {0..4}; do
ppl=$(awk -v min="$min_ppl" -v max="$max_ppl" -v i="$i" -v r="$ppl_rng" \
'BEGIN{printf "%.1f",max-i*r/4}')
y=$((100+i*100))
text_y=$((y+5))
echo " <line x1=\"100\" y1=\"$y\" x2=\"700\" y2=\"$y\" class=\"grid\"/>"
echo " <text class=\"label\" x=\"80\" y=\"$text_y\" text-anchor=\"end\">$ppl</text>"
done
fi
# X-axis grid
for i in $(seq 0 64 "$max_round"); do
x=$(awk -v s="$i" -v mr="$max_round" 'BEGIN{printf "%.1f",100+(s/mr)*600}')
echo " <line x1=\"$x\" y1=\"100\" x2=\"$x\" y2=\"500\" class=\"grid\"/>"
[[ $((i%256)) -eq 0 || $i -eq $max_round ]] && \
echo " <text class=\"label\" x=\"$x\" y=\"520\" text-anchor=\"middle\">$i</text>"
done
# ------------------------------------------------------------------
# 6. Helpers
# ------------------------------------------------------------------
to_xy() {
local sz=$1 pl=$2 x y
x=$(awk -v s="$sz" -v mr="$max_round" 'BEGIN{printf "%.1f",100+(s/mr)*600}')
if [[ $logscale -eq 1 ]]; then
y=$(awk -v p="$pl" -v min="$min_ppl" -v eps="$eps" \
-v tmin="$(awk -v e="$eps" 'BEGIN{print log(e)/log(10)}')" -v tr="$t_range" \
'BEGIN{d=p-min;printf "%.1f",500-400*((log(d+eps)/log(10)-tmin)/tr)}')
else
y=$(awk -v p="$pl" -v min="$min_ppl" -v r="$ppl_rng" \
'BEGIN{printf "%.1f",500-(p-min)*400/r}')
fi
echo "$x $y"
}
trend="M"
sorted_pareto_i=($(for i in "${pareto_i[@]}"; do
item=$(jq --argjson i "$i" '.data[$i]' "$config")
bpw=$(jq -r '.bpw' <<<"$item")
size=$(bc <<< "scale=4; $bpw * $mp / 8 / 1024 / 1024 / 1024")
size=$(printf "%.1f" "$size")
echo "$size $i"
done | sort -n | awk '{print $2}'))
for i in "${sorted_pareto_i[@]}"; do
item=$(jq --argjson i "$i" '.data[$i]' "$config")
bpw=$(jq -r '.bpw' <<<"$item")
ppl=$(jq -r '.ppl' <<<"$item")
size=$(bc <<< "scale=4; $bpw * $mp / 8 / 1024 / 1024 / 1024")
size=$(printf "%.1f" "$size")
read x y < <(to_xy "$size" "$ppl")
trend+=" $x $y"
done
# ------------------------------------------------------------------
# 7. Draw points (ascending ppl)
# ------------------------------------------------------------------
sorted_indices=($(jq -r '.data | sort_by(.ppl) | keys_unsorted[]' "$config"))
ly=$((top + h + gap + 70))
for idx in "${sorted_indices[@]}"; do
item=$(jq --argjson i "$idx" '.data[$i]' "$config")
name=$(jq -r '.name' <<<"$item")
bpw=$(jq -r '.bpw' <<<"$item")
ppl=$(jq -r '.ppl' <<<"$item")
size=$(bc <<< "scale=4; $bpw * $mp / 8 / 1024 / 1024 / 1024")
size=$(printf "%.1f" "$size")
url=$(jq -r '.url // ""' <<<"$item")
read x y < <(to_xy "$size" "$ppl")
xl=$(awk -v x="$x" 'BEGIN{printf "%.1f",x-10}')
yt=$(awk -v y="$y" 'BEGIN{printf "%.1f",y-10}')
xr=$(awk -v x="$x" 'BEGIN{printf "%.1f",x+10}')
c=$(printf '%s' "$name" | md5sum | awk '{print "#"substr($1,1,6)}')
dc=$(printf '%s' "${c#?}" | awk '{printf "#%02x%02x%02x", strtonum("0x"substr($0,1,2))*8/10, strtonum("0x"substr($0,3,2))*8/10, strtonum("0x"substr($0,5,2))*8/10}')
is_pareto=0
for i in "${pareto_i[@]}"; do
[[ $i -eq $idx ]] && { is_pareto=1; break; }
done
if [[ $is_pareto -eq 1 ]]; then
echo " <circle cx=\"$x\" cy=\"$y\" r=\"14\" fill=\"url(#halo)\"/>"
echo " <polygon points=\"$x,$y $xl,$yt $xr,$yt\" class=\"triangle\" fill=\"$c\" stroke=\"$dc\"/>"
else
echo " <polygon points=\"$x,$y $xl,$yt $xr,$yt\" class=\"triangle\" fill=\"$c\" stroke=\"$dc\"/>"
fi
qr=""
if [[ -n $url ]]; then
short=${url//https:\/\/huggingface.co\//hf.co/}
hsh=$(printf '%s' "$short" | md5sum | awk '{print $1}')
[[ -f $qr_dir/$hsh.svg ]] && qr_base64=$(base64 -w 0 "$qr_dir/$hsh.svg" 2>/dev/null || base64 "$qr_dir/$hsh.svg" | tr -d '\n')
qr="<image x=\"450\" y=\"$((ly-10))\" width=\"32\" height=\"32\" href=\"data:image/svg+xml;base64,$qr_base64\"/>"
fi
points+="\n <polygon points=\"70,$ly 60,$((ly+10)) 80,$((ly+10))\" fill=\"$c\""
if [[ $is_pareto -eq 0 ]]; then
points+=" stroke=\"#ff000050\" stroke-width=\"2\""
else
points+=" stroke=\"$dc\""
fi
points+="/>"
points+="\n <text class=\"legend-item\" x=\"100\" y=\"$((ly+10))\">$name: $bpw bpw, $ppl ppl</text>"
[[ -n $qr ]] && points+="\n $qr"
ly=$((ly+40))
done
points=$(echo -e "$points")
# ------------------------------------------------------------------
# 8. Trendline & legend
# ------------------------------------------------------------------
[[ ${#trend} -gt 1 ]] && \
echo " <path d=\"$trend\" fill=\"none\" stroke=\"#00c853\" stroke-width=\"1.5\" stroke-dasharray=\"6,3\" stroke-opacity=\"0.5\"/>"
cat <<EOF
<rect x="50" y="$((top + h + gap))" width="700" height="$((leg_h-5))" fill="#f8fafc" stroke="#e2e8f0" rx="5"/>
<text class="legend-title" x="70" y="$((top + h + gap + 25))">Quantization Details</text>
<g class="legend">$points
</g>
</svg>
EOF
-
R1 stats (THIREUS quants added).
{
"title": "DeepSeek-R1-0528 (671B) Quantization Analysis",
"subtitle": "Lower perplexity = Better performance",
"model_parameters": 671000000000,
"data": [
{"name": "IQ1_S_R4", "bpw": 1.664, "ppl": 4.8831, "url": "https://huggingface.co/ubergarm/DeepSeek-R1-0528-GGUF/tree/main/IQ1_S_R4"},
{"name": "THIREUS-1.9364", "bpw": 1.9364, "ppl": 4.3533, "url": "https://github.com/Thireus/GGUF-Tool-Suite/blob/main/recipe_examples/DeepSeek-R1-0528.THIREUS-1.9364bpw-4.3533ppl.151GB-GGUF_11GB-GPU_140GB-CPU.3c88ec6_9fd615d.recipe"},
{"name": "IQ2_KT", "bpw": 2.514, "ppl": 3.6378},
{"name": "THIREUS-2.7840", "bpw": 2.7840, "ppl": 3.4341, "url": "https://github.com/Thireus/GGUF-Tool-Suite/blob/main/recipe_examples/DeepSeek-R1-0528.THIREUS-2.7840bpw-3.4341ppl.217GB-GGUF_14GB-GPU_203GB-CPU.3c88ec6_02247be.recipe"},
{"name": "IQ2_K_R4", "bpw": 2.799, "ppl": 3.5069, "url": "https://huggingface.co/ubergarm/DeepSeek-R1-0528-GGUF/tree/main/IQ2_K_R4"},
{"name": "UD_Q2_K_XL", "bpw": 2.994, "ppl": 3.5278, "url": "https://huggingface.co/unsloth/DeepSeek-R1-0528-GGUF/tree/main/UD-Q2_K_XL"},
{"name": "THIREUS-3.1027", "bpw": 3.1027, "ppl": 3.3372, "url": "https://github.com/Thireus/GGUF-Tool-Suite/blob/main/recipe_examples/DeepSeek-R1-0528.THIREUS-3.1027bpw-3.3372ppl.242GB-GGUF_11GB-GPU_231GB-CPU.3c88ec6_adc8101.recipe"},
{"name": "THIREUS-3.1446", "bpw": 3.1446, "ppl": 3.3257, "url": "https://github.com/Thireus/GGUF-Tool-Suite/blob/main/recipe_examples/DeepSeek-R1-0528.THIREUS-3.1446bpw-3.3257ppl.246GB-GGUF_15GB-GPU_231GB-CPU.3c88ec6_7d1efe1.recipe"},
{"name": "THIREUS-3.1447", "bpw": 3.1447, "ppl": 3.3269, "url": "https://github.com/Thireus/GGUF-Tool-Suite/blob/main/recipe_examples/DeepSeek-R1-0528.THIREUS-3.1447bpw-3.3269ppl.246GB-GGUF_15GB-GPU_231GB-CPU.3c88ec6_4b1254a.recipe"},
{"name": "THIREUS-3.1525", "bpw": 3.1525, "ppl": 3.3251, "url": "https://github.com/Thireus/GGUF-Tool-Suite/blob/main/recipe_examples/DeepSeek-R1-0528.THIREUS-3.1525bpw-3.3251ppl.246GB-GGUF_15GB-GPU_231GB-CPU.3c88ec6_5a3fc0f.recipe"},
{"name": "THIREUS-3.1740", "bpw": 3.1740, "ppl": 3.3253, "url": "https://github.com/Thireus/GGUF-Tool-Suite/blob/main/recipe_examples/DeepSeek-R1-0528.THIREUS-3.1740bpw-3.3253ppl.248GB-GGUF_17GB-GPU_231GB-CPU.3c88ec6_6cf3a72.recipe"},
{"name": "THIREUS-3.1858", "bpw": 3.1858, "ppl": 3.3261, "url": "https://github.com/Thireus/GGUF-Tool-Suite/blob/main/recipe_examples/DeepSeek-R1-0528.THIREUS-3.1858bpw-3.3261ppl.249GB-GGUF_18GB-GPU_231GB-CPU.3c88ec6_027b7ff.recipe"},
{"name": "THIREUS-3.2564", "bpw": 3.2564, "ppl": 3.2985, "url": "https://github.com/Thireus/GGUF-Tool-Suite/blob/main/recipe_examples/DeepSeek-R1-0528.THIREUS-3.2564bpw-3.2985ppl.254GB-GGUF_15GB-GPU_239GB-CPU.3c88ec6_7c0be1e.recipe"},
{"name": "IQ3_KT", "bpw": 3.483, "ppl": 3.3056, "url": "https://huggingface.co/ubergarm/DeepSeek-R1-0528-GGUF/tree/main/IQ3_KT"},
{"name": "THIREUS-3.5652", "bpw": 3.5652, "ppl": 3.2734, "url": "https://github.com/Thireus/GGUF-Tool-Suite/blob/main/recipe_examples/DeepSeek-R1-0528.THIREUS-3.5652bpw-3.2734ppl.278GB-GGUF_14GB-GPU_264GB-CPU.3c88ec6_9b5660b.recipe"},
{"name": "IQ3_KS", "bpw": 3.598, "ppl": 3.2991, "url": "https://huggingface.co/ubergarm/DeepSeek-R1-0528-GGUF/tree/main/IQ3_KS"},
{"name": "THIREUS-3.6766", "bpw": 3.6766, "ppl": 3.2741, "url": "https://github.com/ikawrakow/ik_llama.cpp/discussions/477#discussioncomment-13781700"},
{"name": "IQ3_K_R4", "bpw": 3.847, "ppl": 3.2730, "url": "https://huggingface.co/ubergarm/DeepSeek-R1-0528-GGUF/tree/main/IQ3_K_R4"},
{"name": "THIREUS-3.976", "bpw": 3.976, "ppl": 3.2452, "url": "https://github.com/ikawrakow/ik_llama.cpp/discussions/477#discussioncomment-13798329"},
{"name": "IQ4_XS (unsloth)", "bpw": 4.2683, "ppl": 3.2598, "url": "https://huggingface.co/unsloth/DeepSeek-R1-0528-GGUF/tree/main/IQ4_XS"},
{"name": "q4_0", "bpw": 4.508, "ppl": 3.2895, "url": "https://huggingface.co/unsloth/DeepSeek-R1-0528-GGUF/tree/main/Q4_0"},
{"name": "UD_Q4_K_XL", "bpw": 4.578, "ppl": 3.2483, "url": "https://huggingface.co/unsloth/DeepSeek-R1-0528-GGUF/tree/main/UD-Q4_K_XL"},
{"name": "IQ4_KS_R4", "bpw": 4.701, "ppl": 3.2286, "url": "https://huggingface.co/ubergarm/DeepSeek-R1-0528-GGUF/tree/main/IQ4_KS_R4"},
{"name": "DQ4_K_R4", "bpw": 5.289, "ppl": 3.2276, "url": "https://huggingface.co/anikifoss/DeepSeek-R1-0528-DQ4_K_R4"},
{"name": "THIREUS-6.2478", "bpw": 6.2478, "ppl": 3.2240, "url": "https://github.com/ikawrakow/ik_llama.cpp/discussions/477#discussioncomment-13781560"},
{"name": "Q8_0", "bpw": 8.5259260, "ppl": 3.2130, "url": "https://huggingface.co/unsloth/DeepSeek-R1-0528-GGUF/tree/main/Q8_0"}
]
}
-
Is there any way to predict the performance of a quant (prefill and decode) based solely on the quant types used?
-
Yes: RAM bandwidth. Take Qwen3-30B-A3B, for example: it has 3 billion active parameters, so at a given quantization each generated token has to read roughly active_params × bits_per_weight / 8 bytes from RAM.
You can then measure how many tokens per second you get with Qwen3-30B-A3B and calculate your system's effective memory bandwidth (often around 80% of the theoretically possible bandwidth). Once you have the system's effective memory bandwidth, you can reverse the calculation to estimate the tokens per second you will get with X active parameters.
Things get a little more tricky when you have a GPU in the mix. The same formula usually applies to GPU and VRAM (unless the card is very weak at compute, like some older cards). However, if you have both GPU and CPU working together, then the slowest one (the CPU) will be your bottleneck. Then you need to figure out how many active parameters will go on the GPU and how many will go on the CPU.
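For example, a quick sketch of the estimate with made-up numbers (25 t/s is an assumed measurement, 4.5 bits/weight an assumed average, and 37B is R1's active parameter count):

```bash
awk 'BEGIN {
  active = 3e9; bpw = 4.5; tps = 25     # tps here is a made-up measurement
  bytes_per_tok = active * bpw / 8      # ~1.7 GB read from RAM per token
  bw = bytes_per_tok * tps / 1e9        # effective bandwidth in GB/s
  printf "effective bandwidth ~ %.0f GB/s\n", bw
  r1_active = 37e9                      # DeepSeek-R1 active parameters
  printf "predicted R1 TG ~ %.1f t/s\n", bw * 1e9 / (r1_active * bpw / 8)
}'
```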
-
Prompt processing uses a clever workaround to cheat the RAM bandwidth limitation: you multiply several tokens at the same time, so the weights already loaded into the CPU cache get re-used across the batch, side-stepping the RAM bandwidth limit.
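A toy illustration of the effect, with an assumed 20 GB of weights streamed per pass (the real limit comes from compute and cache capacity, which this ignores):

```bash
awk 'BEGIN {
  weights_gb = 20                       # assumed weights read per forward pass
  for (batch = 1; batch <= 512; batch *= 8)
    printf "batch %4d: ~%.3f GB of weight reads per token\n",
           batch, weights_gb / batch
}'
```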
-
I have four MI50 32GB units arriving soon (a great deal at $150 each). Together with an RTX 5090, that should give me 160GB of VRAM, so I can try benchmarking IQ1_S fully from VRAM. Does anyone have experience with MI50s, or with running a mixed ROCm/CUDA setup? If I can get the MI50s working, I'll try hooking up 22 of them into one system for a total of 704GB of VRAM. That should be enough to run my chunky Kimi-K2 quant. I will need to limit power consumption to 120W each to stay within 2x 1600W. I found some articles online with mixed feedback about MI50s; I would really appreciate it if someone could share first-hand experience!
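On power limiting, the usual knobs look roughly like this; I have not verified these flags against current ROCm/driver versions, so treat them as pointers rather than a recipe:

```bash
# Cap an MI50 at ~120 W (rocm-smi flag name per older ROCm releases):
sudo rocm-smi -d 0 --setpoweroverdrive 120
# Cap the NVIDIA card with nvidia-smi (value in watts):
sudo nvidia-smi -i 0 -pl 450
```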
-
Btw, because many people in this thread are running calculations with models that contained …
-
What
Starting this "show and tell" discussion about the updated DeepSeek-R1-0528 model and various quants beginning to emerge.
Info
ik_llama.cpp exclusive quants released at ubergarm/DeepSeek-R1-0528-GGUF. I am curious what other sizes might be of interest to folks, e.g. a larger one for big RAM systems, or maybe a very small one sacrificing quality to fit in lower RAM/VRAM systems perhaps? These quants keep attn_k_b and attn_v_b at Q8_0. ik_llama.cpp now allows _R4 quants to run on CUDA (but this requires explicitly setting -DGGML_CUDA_IQK_FORCE_BF16=1 at compilation for this model).
EDIT: Check out this YouTube video by fahdmirzac showing some examples of installing and running ik_llama.cpp with these quants here. Thanks Fahd!
Benchmarks
Perplexity
So far the perplexity values I've measured are as follows:
DeepSeek-R1-0528-Q8_0 (666GiB): Final estimate: PPL = 3.2130 +/- 0.01698
DeepSeek-R1-0528-IQ3_K_R4 (301GiB): Final estimate: PPL = 3.2730 +/- 0.01738
DeepSeek-R1-0528-IQ2_K_R4 (220GiB): Final estimate: PPL = 3.5069 +/- 0.01893
Compare to my previous recipes for V3-0324:
DeepSeek-V3-0324-Q8_0 (666GiB): Final estimate: PPL = 3.2454 +/- 0.01773
DeepSeek-V3-0324-IQ4_K_R4 (387GiB): Final estimate: PPL = 3.2596 +/- 0.01786
DeepSeek-V3-0324-IQ2_K_R4 (227GiB): Final estimate: PPL = 3.5614 +/- 0.02001
Speed
With time I hope to grab some llama-sweep-bench numbers on these quants too.
Conclusion
Thanks and let me know if you try these out or have questions or comments. Feel free to use the imatrix I uploaded as well to make your own quants. Cheers!