DeepSeek-R1-0528 ik quants! #477
-
Thanks for these quants and the rest of your work you publish. Could you do one that fits in 128GB RAM and 72GB VRAM with 32K context? I tried the unsloth IQ1_S and got about 2.7 t/s generation on mainline and 2.15 t/s on ik. It was coherent and delivered surprisingly good responses to real-world coding tasks. Oh, but the R4 variants don't support Q1 yet, right?
-
Will -rtr fix the R4 quants so they don't have to use the BF16 path? I downloaded IQ1_S from unsloth and got 90 t/s PP but the same or slightly lower 10.x t/s output, so IQ2_XXS from the previous V3 is not much different. A smaller AMB than 512 often lets you fit a couple more pieces due to the reduced buffer. Every little bit on GPU helps when CPU/memory isn't that strong.
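For anyone wanting to experiment, here is a minimal launch sketch showing where -rtr and -amb fit together; the model path, thread count, and offload pattern are placeholders rather than a tuned configuration:

```bash
# Repack compatible tensors to _R4 at load time (-rtr) and cap the per-GPU
# attention compute buffer (-amb, in MiB); a smaller -amb frees VRAM for layers.
./build/bin/llama-server \
  --model /path/to/DeepSeek-R1-0528-IQ1_S.gguf \
  -mla 3 -fa -fmoe \
  -amb 256 \
  -rtr \
  -ngl 99 \
  -ot exps=CPU \
  --threads 16
```

As I understand it, -rtr implies no mmap, since tensors are rewritten in memory during load.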
-
I had an interesting report from huggingface.co/ciprianv that compiling with -DGGML_CUDA_IQK_FORCE_BF16=1 changes performance. I tried it out myself and confirmed it with my own sweep benchmarks. Interestingly, it does suggest that for some hardware configurations it may be beneficial for PP to compile with -DGGML_CUDA_IQK_FORCE_BF16=1.
Methodology and data logs
Compilation flags with and without FORCE_BF16:
cmake -B ./build -DGGML_CUDA=ON -DGGML_BLAS=OFF -DGGML_SCHED_MAX_COPIES=1 -DGGML_CUDA_IQK_FORCE_BF16=1
cmake --build ./build --config Release -j $(nproc)
cmake -B ./build -DGGML_CUDA=ON -DGGML_BLAS=OFF -DGGML_SCHED_MAX_COPIES=1 -DGGML_CUDA_IQK_FORCE_BF16=0
llama-sweep-bench test:
CUDA_VISIBLE_DEVICES="0" \
./build/bin/llama-sweep-bench \
--model "$model" \
-mla 3 -fa \
-amb 512 \
-fmoe \
-ctk f16 \
-c 16384 \
-ngl 99 \
-ot "blk\.(3|4|5|6|7|8|9)\.ffn_.*=CUDA0" \ # <--- with or without extra layers offloaded to GPU
-ot exps=CPU \
--warmup-batch \
--no-mmap \
--threads 24
llama_model_loader: - type f32: 361 tensors
llama_model_loader: - type q5_0: 61 tensors
llama_model_loader: - type iq4_ks: 116 tensors
llama_model_loader: - type iq5_ks: 435 tensors
llama_model_loader: - type iq2_k_r4: 116 tensors
llama_model_loader: - type iq3_k_r4: 58 tensors
PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
---|---|---|---|---|---|---|
512 | 128 | 0 | 10.448 | 49.00 | 8.496 | 15.07 |
512 | 128 | 512 | 10.626 | 48.18 | 8.548 | 14.97 |
512 | 128 | 1024 | 10.704 | 47.83 | 8.601 | 14.88 |
512 | 128 | 1536 | 10.803 | 47.39 | 9.029 | 14.18 |
512 | 128 | 2048 | 10.938 | 46.81 | 8.645 | 14.81 |
512 | 128 | 2560 | 10.983 | 46.62 | 8.789 | 14.56 |
512 | 128 | 3072 | 11.132 | 46.00 | 8.824 | 14.51 |
512 | 128 | 3584 | 11.152 | 45.91 | 8.845 | 14.47 |
512 | 128 | 4096 | 11.285 | 45.37 | 9.060 | 14.13 |
512 | 128 | 4608 | 11.432 | 44.79 | 8.842 | 14.48 |
512 | 128 | 5120 | 11.415 | 44.85 | 8.893 | 14.39 |
512 | 128 | 5632 | 11.542 | 44.36 | 9.071 | 14.11 |
512 | 128 | 6144 | 11.605 | 44.12 | 9.085 | 14.09 |
512 | 128 | 6656 | 11.719 | 43.69 | 9.258 | 13.83 |
512 | 128 | 7168 | 11.851 | 43.20 | 9.104 | 14.06 |
512 | 128 | 7680 | 11.884 | 43.08 | 9.115 | 14.04 |
512 | 128 | 8192 | 12.052 | 42.48 | 9.434 | 13.57 |
-DGGML_CUDA_IQK_FORCE_BF16=1 -ot exps=CPU
PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
---|---|---|---|---|---|---|
512 | 128 | 0 | 11.488 | 44.57 | 8.968 | 14.27 |
512 | 128 | 512 | 11.665 | 43.89 | 8.923 | 14.34 |
512 | 128 | 1024 | 11.746 | 43.59 | 8.912 | 14.36 |
512 | 128 | 1536 | 11.841 | 43.24 | 9.110 | 14.05 |
512 | 128 | 2048 | 11.981 | 42.73 | 8.966 | 14.28 |
512 | 128 | 2560 | 12.023 | 42.58 | 9.144 | 14.00 |
512 | 128 | 3072 | 12.112 | 42.27 | 9.216 | 13.89 |
512 | 128 | 3584 | 12.257 | 41.77 | 9.215 | 13.89 |
512 | 128 | 4096 | 12.323 | 41.55 | 9.224 | 13.88 |
512 | 128 | 4608 | 12.452 | 41.12 | 9.191 | 13.93 |
512 | 128 | 5120 | 12.512 | 40.92 | 9.220 | 13.88 |
512 | 128 | 5632 | 12.555 | 40.78 | 9.378 | 13.65 |
512 | 128 | 6144 | 12.695 | 40.33 | 9.354 | 13.68 |
512 | 128 | 6656 | 12.822 | 39.93 | 9.480 | 13.50 |
512 | 128 | 7168 | 12.829 | 39.91 | 9.454 | 13.54 |
512 | 128 | 7680 | 12.937 | 39.58 | 9.502 | 13.47 |
512 | 128 | 8192 | 13.148 | 38.94 | 9.604 | 13.33 |
512 | 128 | 8704 | 13.142 | 38.96 | 9.626 | 13.30 |
512 | 128 | 9216 | 13.268 | 38.59 | 9.758 | 13.12 |
512 | 128 | 9728 | 13.410 | 38.18 | 9.604 | 13.33 |
512 | 128 | 10240 | 13.429 | 38.13 | 9.613 | 13.32 |
512 | 128 | 10752 | 13.522 | 37.87 | 9.856 | 12.99 |
512 | 128 | 11264 | 13.653 | 37.50 | 9.790 | 13.08 |
512 | 128 | 11776 | 13.780 | 37.15 | 9.779 | 13.09 |
512 | 128 | 12288 | 13.772 | 37.18 | 9.825 | 13.03 |
512 | 128 | 12800 | 13.886 | 36.87 | 10.041 | 12.75 |
512 | 128 | 13312 | 14.037 | 36.47 | 9.906 | 12.92 |
512 | 128 | 13824 | 14.078 | 36.37 | 10.013 | 12.78 |
512 | 128 | 14336 | 14.178 | 36.11 | 10.172 | 12.58 |
512 | 128 | 14848 | 14.289 | 35.83 | 10.043 | 12.74 |
512 | 128 | 15360 | 14.406 | 35.54 | 9.980 | 12.83 |
512 | 128 | 15872 | 14.414 | 35.52 | 10.023 | 12.77 |
-DGGML_CUDA_IQK_FORCE_BF16=0 -ot exps=CPU
PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
---|---|---|---|---|---|---|
512 | 128 | 0 | 12.572 | 40.73 | 8.800 | 14.55 |
512 | 128 | 512 | 12.639 | 40.51 | 8.911 | 14.36 |
512 | 128 | 1024 | 12.810 | 39.97 | 9.140 | 14.00 |
512 | 128 | 1536 | 12.985 | 39.43 | 8.942 | 14.31 |
512 | 128 | 2048 | 12.998 | 39.39 | 9.217 | 13.89 |
512 | 128 | 2560 | 13.119 | 39.03 | 9.378 | 13.65 |
512 | 128 | 3072 | 13.247 | 38.65 | 9.137 | 14.01 |
512 | 128 | 3584 | 13.293 | 38.52 | 9.186 | 13.93 |
512 | 128 | 4096 | 13.488 | 37.96 | 9.341 | 13.70 |
512 | 128 | 4608 | 13.496 | 37.94 | 9.235 | 13.86 |
512 | 128 | 5120 | 13.522 | 37.86 | 9.405 | 13.61 |
512 | 128 | 5632 | 13.695 | 37.39 | 9.388 | 13.63 |
512 | 128 | 6144 | 13.716 | 37.33 | 9.352 | 13.69 |
512 | 128 | 6656 | 13.905 | 36.82 | 9.530 | 13.43 |
512 | 128 | 7168 | 13.911 | 36.80 | 9.413 | 13.60 |
512 | 128 | 7680 | 14.024 | 36.51 | 9.630 | 13.29 |
512 | 128 | 8192 | 14.210 | 36.03 | 9.601 | 13.33 |
512 | 128 | 8704 | 14.277 | 35.86 | 9.595 | 13.34 |
512 | 128 | 9216 | 14.361 | 35.65 | 9.571 | 13.37 |
512 | 128 | 9728 | 14.438 | 35.46 | 9.798 | 13.06 |
512 | 128 | 10240 | 14.577 | 35.12 | 9.717 | 13.17 |
512 | 128 | 10752 | 14.605 | 35.06 | 9.887 | 12.95 |
512 | 128 | 11264 | 14.683 | 34.87 | 10.044 | 12.74 |
512 | 128 | 11776 | 14.881 | 34.41 | 9.796 | 13.07 |
512 | 128 | 12288 | 14.909 | 34.34 | 9.840 | 13.01 |
512 | 128 | 12800 | 14.982 | 34.18 | 9.832 | 13.02 |
512 | 128 | 13312 | 15.094 | 33.92 | 10.101 | 12.67 |
512 | 128 | 13824 | 15.219 | 33.64 | 10.060 | 12.72 |
512 | 128 | 14336 | 15.265 | 33.54 | 10.282 | 12.45 |
512 | 128 | 14848 | 15.333 | 33.39 | 10.172 | 12.58 |
512 | 128 | 15360 | 15.493 | 33.05 | 9.979 | 12.83 |
512 | 128 | 15872 | 15.553 | 32.92 | 9.987 | 12.82 |
-
I uploaded the custom quant I use for coding here, with some of the information on how I arrived there and relevant benchmarks. I added some teasers on command line arguments to experiment with, as this branch is moving quickly and small performance improvements can add up over time.
-
Quantization Effects of …
-
So here is a new surprise, since I'm eyeing that IQ1 quant you're publishing. On a lark I turned off the -rtr switch, and on unsloth's quant it was cutting my prompt processing by half. It did buff text generation to over 11 t/s though. The mind wobbles. I will try reloading the larger quant of V3 to check results. On Qwens it sped things up 100%. On another note, I tried to test mainline llama.cpp, and its sweep-bench segfaults with DeepSeek and does not recognize the -fa parameter. I was able to load llama-server and get a blazing fast 6 t/s PP, 6 t/s TG. So much for that.
-
Still struggling to understand some things. ✔ All tensors on CPU. With Q4_K_M … With Q4_K_M (IQ4_K_R4) … Either way I have a whole GPU worth of compute just sitting idle. There has to be a way to utilize it. Can I not have the GPU do more of the work?
-
Can anyone explain to me in simple terms: when considering tensor-offload configurations, what exactly is the nature of the stickiness or entanglement between tensors? Which tensors MUST go together as an indivisible unit? ✔ All tensors on CPU works. FACT: attention and exps can be separated between CPU and GPU. But what if you want to do something like this? ✘ attn=CUDA0, blk.3.=CUDA1, exps=CPU. Are these impossible for REASONS, or just "not supported", i.e. go learn the domain and write the code myself?
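To make the question concrete, here is the syntax for the kind of split being described, as a sketch only; the layer indices and device assignments are illustrative, and this is not a claim that every such combination currently works:

```bash
# Tensor overrides are regex=backend pairs; as I understand it, earlier
# -ot patterns take precedence when a tensor name matches more than one.
./build/bin/llama-server --model "$model" -ngl 99 -mla 3 -fa -fmoe \
  -ot "blk\.(3|4|5)\.ffn_.*=CUDA0" \
  -ot "blk\.(6|7|8)\.ffn_.*=CUDA1" \
  -ot exps=CPU
```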
-
Day 4 of chasing performance with bespoke repacking and the delicate and mercurial (i.e. broken) configuration args. I'm ready to give up. I tried so many blends of tensor-offload parameters and static repacking that my head is spinning. Nothing I tried can reach my earlier high-water marks. I made a repacked quant that converts only the exps tensors running on CPU to _r4 (exps 11...60) and runs everything else on CUDA0 and CUDA1 with --sm layer. It should be the best of both worlds, but it's the worst of both worlds: PP 71 and TG 9. The domain may seem like black magic, but at the end of the day all we're doing here is matrix multiplication. My instinct is screaming at me that there's a huge amount of performance left on the table. The wild and frankly shocking comment that "high GPU utilization is actually a bad thing" notwithstanding, the goal is to get the most math done per unit time as possible. It's very telling that seemingly no one can give an explanation that holds water of what operations must be tied to one another on a compute device, or why the tensors can be split one way between CPU and CUDA0, but as soon as you extend the split to involve CUDA1 the performance bombs. We want to run big models on commodity hardware, and that means finding the way of distributing the computation among multiple relatively low-capacity compute units that maximizes the contribution of all the units.
-
I am running on 1x EPYC 9334QS + 12x 48GB DDR5-6400 (running at 4800 MHz) + a 3070 16GB: ~10.3 t/s TG, ~78 t/s PP. It works well, but about 12GB of VRAM is already used, and I am not sure how large a context window will fit. model: unsloth/DeepSeek-R1-0528-GGUF/DeepSeek-R1-0528-Q4_K_M-00001-of-00009.gguf
-
Add -b 4096 -ub 4096 and you will get 3x your PP speed.
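A sketch of where those flags go (values as suggested above; the trade-off is a larger CUDA compute buffer, so expect higher VRAM use):

```bash
# -b sets the logical batch and -ub the physical micro-batch actually
# computed per pass; larger micro-batches amortize weight reads during PP.
./build/bin/llama-server --model "$model" \
  -b 4096 -ub 4096 \
  -ngl 99 -ot exps=CPU --threads 24
```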
-
So I finally cooked a quant after sitting on the BF16 for so long. I ended up going with @ubergarm's imatrix with: … Running a sweep right now, but early impressions are good enough that I may end up using this for a while before attempting some more mixes (PP seems a bit better, TG seems about the same). (As a reminder, the quant I ended up settling on for V3-0324 was a very simple …)
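For reference, a hedged sketch of what such a recipe invocation can look like with ik_llama.cpp's llama-quantize; the --custom-q regexes and tensor types below are illustrative placeholders, not the mix described above, so check the syntax against your checkout:

```bash
# Illustrative per-tensor recipe; regexes and types are made-up examples.
./build/bin/llama-quantize \
  --imatrix imatrix-DeepSeek-R1-0528.dat \
  --custom-q "attn_k_b=q8_0,attn_v_b=q8_0,ffn_down_exps=iq3_k,ffn_(gate|up)_exps=iq2_k" \
  DeepSeek-R1-0528-BF16.gguf DeepSeek-R1-0528-custom.gguf IQ2_K 24
```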
-
Thank you for the discussion. Sharing my experimental results for your reference.
-
PP (prompt processing) speed in ik_llama.cpp is significantly faster than in standard llama.cpp. The biggest challenge in offline DeepSeek deployment is PP performance. Compared to enterprise-grade Prefill/Decode (PD)-separated architectures that deliver robust PP and TG performance, single-machine deployments (for individuals or small teams) struggle with long-context (>10K token) processing due to suboptimal PP efficiency. From my perspective: if GPU VRAM can double PP performance, that is maximizing resource utilization; using VRAM to host sparsely activated expert weights (at only ~4% utilization) seems wasteful. The 270 tokens/s at 16384 batch size represents the peak PP performance I achieved after exhaustive tuning of configurations, CPU/GPU combinations, and offline DeepSeek deployment setups. I sincerely apologize again for my earlier statements. Configurations tested: llama.cpp; ik_llama.cpp; ik_llama.cpp with -mla 3 -fmoe -amb 512/1024. Screenshot evidence will be attached as noted.
-
Turns out I was using ik_llama.cpp incorrectly all along.
-
I can see what I can do, but I don't feel particularly motivated to engage in hunting down integer overflows and CUDA maximum-block-size-exceeded issues in code that I didn't write myself or at least modified at some point. There are still some performance optimizations left that would be more interesting to work on. But based on your performance numbers, I estimate you have a 30 GB/s PCI-E link, so it takes about 13 seconds to upload all experts stored in RAM to the GPU(s). For a u-batch size of 16k tokens you are getting 347 t/s, so the u-batch takes about 47 seconds, and computation is about 34 seconds (and it is easy to verify that this napkin math works for u-batches of 8k and 4k). If you were to go to a u-batch size of 32k tokens, computation for the batch would at least double while offload time stayed the same, so it would take about 81 seconds, putting performance in the range of 390 t/s. In reality, when batch sizes become very large, computing performance goes down due to limited caches, etc., so I'm guessing you will saturate around 350-360 t/s. If I look at the 8k u-batch size, I estimate you have in the range of 30 GB of unused VRAM. Hence, you could have uploaded 5 or 6 layers of experts to the GPU. That would slightly increase your PP performance and also boost your TG performance by about 10%.
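Spelled out, the arithmetic looks like this (the 400 GB of experts and 30 GB/s link speed are the assumed round numbers behind the estimates above):

```bash
awk 'BEGIN {
  experts_gb = 400; pcie_gbs = 30       # assumed: experts in RAM, PCIe speed
  upload = experts_gb / pcie_gbs        # ~13 s to stream experts to the GPU
  batch = 16384; tps = 347              # measured numbers from the post above
  total = batch / tps                   # ~47 s per 16k u-batch
  compute = total - upload              # ~34 s of actual compute
  total32 = 2 * compute + upload        # doubling u-batch doubles compute only
  printf "upload %.1fs, compute %.1fs, est. 32k PP %.0f t/s\n",
         upload, compute, 32768 / total32
}'
```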
-
Just a couple benchmark dumps.
Compiled with -DGGML_CUDA_IQK_FORCE_BF16=1: …
Compiled with -DGGML_CUDA_IQK_FORCE_BF16=0: …
Do not really see a difference between the two.
Hardware: …
-
MOVED: #258 (comment)
-
Let's update the perplexity vs LLM size graph. I suggest we use SVG. [EDIT]: this is an old graph, but it contains the code of the generator in the details. The latest version of the code will be here, but the most up-to-date graphs could be elsewhere; for example, for DeepSeek-R1-0528 it's here: #477 (comment), with QR codes pointing to huggingface's short-version domain name hf.co (to save on the QR data).
INSTRUCTIONS TO GENERATE SVG
To generate the SVG graph of perplexity vs LLM size, keep the data in config.json and use the make.sh script (./make.sh --logscale config.json > ppl-log.svg) to generate the SVG file:
#!/usr/bin/env bash
set -euo pipefail
# ------------------------------------------------------------------
# 1. CLI
# ------------------------------------------------------------------
logscale=0
[[ $# -ge 1 && $1 == "--logscale" ]] && { logscale=1; shift; }
[[ $# -eq 1 ]] || { echo "Usage: $0 [--logscale] config.json" >&2; exit 1; }
config=$1
[[ -f $config ]] || { echo "Config file not found" >&2; exit 1; }
# ------------------------------------------------------------------
# 2. QR-codes (never touch stdin)
# ------------------------------------------------------------------
qr_dir="qrcodes"
mkdir -p "$qr_dir"
while IFS= read -r url; do
[[ -z $url ]] && continue
short=${url//https:\/\/huggingface.co\//hf.co/}
hash=$(printf '%s' "$short" | md5sum | awk '{print $1}')
file="$qr_dir/$hash.svg"
[[ -f $file ]] && continue
tmp="$qr_dir/${hash}_tmp.svg"
qrencode --inline -t svg -l L -s 1 -m 0 "$short" -o "$tmp"
svgo --multipass -q "$tmp" -o "$file" 2>/dev/null
rm -f "$tmp"
done < <(jq -r '.data[] | select(.url) | .url' "$config")
# ------------------------------------------------------------------
# 3. Pre-compute .size and limits
# ------------------------------------------------------------------
mp=$(jq -r '.model_parameters' "$config")
min_ppl=$(jq -r '.data | min_by(.ppl).ppl' "$config")
max_ppl=$(jq -r '.data | max_by(.ppl).ppl' "$config")
sizes=()
while IFS= read -r size; do
sizes+=("$size")
done < <(jq -r --arg mp "$mp" '.data[] | .bpw * ($mp|tonumber) / 8 / 1024 / 1024 / 1024 | round * 1.0' "$config")
max_sz=0
for size in "${sizes[@]}"; do
if (( $(echo "$size > $max_sz" | bc -l) )); then
max_sz=$size
fi
done
max_round=$(awk -v m="$max_sz" 'BEGIN{r=int((m+63)/64)*64; print (r<64?64:r)}')
title=$(jq -r '.title // "Quantization Analysis"' "$config")
subtitle=$(jq -r '.subtitle // "Lower perplexity = better"' "$config")
[[ $logscale -eq 1 ]] && subtitle+=" (log-difference scale)"
if [[ $logscale -eq 1 ]]; then
rng=$(awk -v min="$min_ppl" -v max="$max_ppl" 'BEGIN{print max-min}')
eps=$(awk -v r="$rng" 'BEGIN{print r/100}')
t_min=$(awk -v e="$eps" 'BEGIN{print log(e)/log(10)}')
t_range=$(awk -v min="$min_ppl" -v max="$max_ppl" -v e="$eps" \
'BEGIN{tmax=log(max-min+e)/log(10); print tmax-log(e)/log(10)}')
else
ppl_rng=$(awk -v min="$min_ppl" -v max="$max_ppl" 'BEGIN{print max-min}')
fi
# ------------------------------------------------------------------
# 4. Pareto indices
# ------------------------------------------------------------------
pareto_i=()
item_count=$(jq '.data | length' "$config")
for ((i=0; i<item_count; i++)); do
item=$(jq --argjson i "$i" '.data[$i]' "$config")
bpw=$(jq -r '.bpw' <<<"$item")
ppl=$(jq -r '.ppl' <<<"$item")
size=$(bc <<< "scale=4; $bpw * $mp / 8 / 1024 / 1024 / 1024")
size=$(printf "%.1f" "$size")
is_pareto=1
for ((j=0; j<item_count; j++)); do
[[ $j -eq $i ]] && continue
j_item=$(jq --argjson j "$j" '.data[$j]' "$config")
j_bpw=$(jq -r '.bpw' <<<"$j_item")
j_ppl=$(jq -r '.ppl' <<<"$j_item")
j_size=$(bc <<< "scale=4; $j_bpw * $mp / 8 / 1024 / 1024 / 1024")
j_size=$(printf "%.1f" "$j_size")
if (( $(echo "$j_ppl <= $ppl" | bc -l) && $(echo "$j_size <= $size" | bc -l) )); then
if (( $(echo "$j_ppl < $ppl" | bc -l) || $(echo "$j_size < $size" | bc -l) )); then
is_pareto=0
break
fi
fi
done
[[ $is_pareto -eq 1 ]] && pareto_i+=("$i")
done
# ------------------------------------------------------------------
# 5. SVG header & grid
# ------------------------------------------------------------------
top=100; h=400; gap=50
leg_h=$((70 + item_count * 40))
tot=$((top + h + gap + leg_h + 50))
cat <<EOF
<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 800 $tot">
<defs>
<radialGradient id="halo" cx="50%" cy="50%" r="50%">
<stop offset="0%" stop-color="#00c853" stop-opacity="0.20"/>
<stop offset="100%" stop-color="#00c853" stop-opacity="0"/>
</radialGradient>
</defs>
<style>
.axis{stroke:#555;stroke-width:1.5}
.grid{stroke:#eee;stroke-width:.5}
.label{font:14px sans-serif;fill:#333}
.title{font:bold 18px sans-serif;fill:#111}
.triangle{stroke-width:1.5}
.legend-item{font:12px monospace}
.legend-title{font:bold 13px sans-serif}
</style>
<rect width="100%" height="100%" fill="white"/>
<text class="title" x="400" y="30" text-anchor="middle">$title</text>
<text class="label" x="400" y="55" text-anchor="middle" fill="#666">$subtitle</text>
<line x1="100" y1="100" x2="100" y2="500" class="axis"/>
<line x1="100" y1="500" x2="700" y2="500" class="axis"/>
<text class="label" x="400" y="540" text-anchor="middle">Model Size (GB)</text>
<text class="label" x="20" y="300" text-anchor="middle" transform="rotate(-90,20,300)">Perplexity (lower is better)</text>
EOF
# Y-axis grid
if [[ $logscale -eq 1 ]]; then
for i in {0..4}; do
frac=$(awk -v i="$i" 'BEGIN{print i/4}')
ppl=$(awk -v min="$min_ppl" -v eps="$eps" -v tr="$t_range" -v f="$frac" \
'BEGIN{printf "%.3f", min + 10**(log(eps)/log(10) + f*tr) - eps}')
y=$(awk -v f="$frac" 'BEGIN{printf "%.1f",500-400*f}')
text_y=$(awk -v y="$y" 'BEGIN{printf "%.1f", y+5}')
echo " <line x1=\"100\" y1=\"$y\" x2=\"700\" y2=\"$y\" class=\"grid\"/>"
echo " <text class=\"label\" x=\"80\" y=\"$text_y\" text-anchor=\"end\">$ppl</text>"
done
else
for i in {0..4}; do
ppl=$(awk -v min="$min_ppl" -v max="$max_ppl" -v i="$i" -v r="$ppl_rng" \
'BEGIN{printf "%.1f",max-i*r/4}')
y=$((100+i*100))
text_y=$((y+5))
echo " <line x1=\"100\" y1=\"$y\" x2=\"700\" y2=\"$y\" class=\"grid\"/>"
echo " <text class=\"label\" x=\"80\" y=\"$text_y\" text-anchor=\"end\">$ppl</text>"
done
fi
# X-axis grid
for i in $(seq 0 64 "$max_round"); do
x=$(awk -v s="$i" -v mr="$max_round" 'BEGIN{printf "%.1f",100+(s/mr)*600}')
echo " <line x1=\"$x\" y1=\"100\" x2=\"$x\" y2=\"500\" class=\"grid\"/>"
[[ $((i%256)) -eq 0 || $i -eq $max_round ]] && \
echo " <text class=\"label\" x=\"$x\" y=\"520\" text-anchor=\"middle\">$i</text>"
done
# ------------------------------------------------------------------
# 6. Helpers
# ------------------------------------------------------------------
to_xy() {
local sz=$1 pl=$2 x y
x=$(awk -v s="$sz" -v mr="$max_round" 'BEGIN{printf "%.1f",100+(s/mr)*600}')
if [[ $logscale -eq 1 ]]; then
y=$(awk -v p="$pl" -v min="$min_ppl" -v eps="$eps" \
-v tmin="$(awk -v e="$eps" 'BEGIN{print log(e)/log(10)}')" -v tr="$t_range" \
'BEGIN{d=p-min;printf "%.1f",500-400*((log(d+eps)/log(10)-tmin)/tr)}')
else
y=$(awk -v p="$pl" -v min="$min_ppl" -v r="$ppl_rng" \
'BEGIN{printf "%.1f",500-(p-min)*400/r}')
fi
echo "$x $y"
}
trend="M"
sorted_pareto_i=($(for i in "${pareto_i[@]}"; do
item=$(jq --argjson i "$i" '.data[$i]' "$config")
bpw=$(jq -r '.bpw' <<<"$item")
size=$(bc <<< "scale=4; $bpw * $mp / 8 / 1024 / 1024 / 1024")
size=$(printf "%.1f" "$size")
echo "$size $i"
done | sort -n | awk '{print $2}'))
for i in "${sorted_pareto_i[@]}"; do
item=$(jq --argjson i "$i" '.data[$i]' "$config")
bpw=$(jq -r '.bpw' <<<"$item")
ppl=$(jq -r '.ppl' <<<"$item")
size=$(bc <<< "scale=4; $bpw * $mp / 8 / 1024 / 1024 / 1024")
size=$(printf "%.1f" "$size")
read x y < <(to_xy "$size" "$ppl")
trend+=" $x $y"
done
# ------------------------------------------------------------------
# 7. Draw points (ascending ppl)
# ------------------------------------------------------------------
sorted_indices=($(jq -r '.data | sort_by(.ppl) | keys_unsorted[]' "$config"))
ly=$((top + h + gap + 70))
for idx in "${sorted_indices[@]}"; do
item=$(jq --argjson i "$idx" '.data[$i]' "$config")
name=$(jq -r '.name' <<<"$item")
bpw=$(jq -r '.bpw' <<<"$item")
ppl=$(jq -r '.ppl' <<<"$item")
size=$(bc <<< "scale=4; $bpw * $mp / 8 / 1024 / 1024 / 1024")
size=$(printf "%.1f" "$size")
url=$(jq -r '.url // ""' <<<"$item")
read x y < <(to_xy "$size" "$ppl")
xl=$(awk -v x="$x" 'BEGIN{printf "%.1f",x-10}')
yt=$(awk -v y="$y" 'BEGIN{printf "%.1f",y-10}')
xr=$(awk -v x="$x" 'BEGIN{printf "%.1f",x+10}')
c=$(printf '%s' "$name" | md5sum | awk '{print "#"substr($1,1,6)}')
dc=$(printf '%s' "${c#?}" | awk '{printf "#%02x%02x%02x", strtonum("0x"substr($0,1,2))*8/10, strtonum("0x"substr($0,3,2))*8/10, strtonum("0x"substr($0,5,2))*8/10}')
is_pareto=0
for i in "${pareto_i[@]}"; do
[[ $i -eq $idx ]] && { is_pareto=1; break; }
done
if [[ $is_pareto -eq 1 ]]; then
echo " <circle cx=\"$x\" cy=\"$y\" r=\"14\" fill=\"url(#halo)\"/>"
echo " <polygon points=\"$x,$y $xl,$yt $xr,$yt\" class=\"triangle\" fill=\"$c\" stroke=\"$dc\"/>"
else
echo " <polygon points=\"$x,$y $xl,$yt $xr,$yt\" class=\"triangle\" fill=\"$c\" stroke=\"$dc\"/>"
fi
qr=""
if [[ -n $url ]]; then
short=${url//https:\/\/huggingface.co\//hf.co/}
hsh=$(printf '%s' "$short" | md5sum | awk '{print $1}')
[[ -f $qr_dir/$hsh.svg ]] && qr_base64=$(base64 -w 0 "$qr_dir/$hsh.svg" 2>/dev/null || base64 "$qr_dir/$hsh.svg" | tr -d '\n')
qr="<image x=\"450\" y=\"$((ly-10))\" width=\"32\" height=\"32\" href=\"data:image/svg+xml;base64,$qr_base64\"/>"
fi
points+="\n <polygon points=\"70,$ly 60,$((ly+10)) 80,$((ly+10))\" fill=\"$c\""
if [[ $is_pareto -eq 0 ]]; then
points+=" stroke=\"#ff000050\" stroke-width=\"2\""
else
points+=" stroke=\"$dc\""
fi
points+="/>"
points+="\n <text class=\"legend-item\" x=\"100\" y=\"$((ly+10))\">$name: $bpw bpw, $ppl ppl</text>"
[[ -n $qr ]] && points+="\n $qr"
ly=$((ly+40))
done
points=$(echo -e "$points")
# ------------------------------------------------------------------
# 8. Trendline & legend
# ------------------------------------------------------------------
[[ ${#trend} -gt 1 ]] && \
echo " <path d=\"$trend\" fill=\"none\" stroke=\"#00c853\" stroke-width=\"1.5\" stroke-dasharray=\"6,3\" stroke-opacity=\"0.5\"/>"
cat <<EOF
<rect x="50" y="$((top + h + gap))" width="700" height="$((leg_h-5))" fill="#f8fafc" stroke="#e2e8f0" rx="5"/>
<text class="legend-title" x="70" y="$((top + h + gap + 25))">Quantization Details</text>
<g class="legend">$points
</g>
</svg>
EOF
-
R1 stats (THIREUS quants added).
{
"title": "DeepSeek-R1-0528 (671B) Quantization Analysis",
"subtitle": "Lower perplexity = Better performance",
"model_parameters": 671000000000,
"data": [
{"name": "IQ1_S_R4", "bpw": 1.664, "ppl": 4.8831, "url": "https://huggingface.co/ubergarm/DeepSeek-R1-0528-GGUF/tree/main/IQ1_S_R4"},
{"name": "THIREUS-1.9364", "bpw": 1.9364, "ppl": 4.3533, "url": "https://github.com/Thireus/GGUF-Tool-Suite/blob/main/recipe_examples/DeepSeek-R1-0528.THIREUS-1.9364bpw-4.3533ppl.151GB-GGUF_11GB-GPU_140GB-CPU.3c88ec6_9fd615d.recipe"},
{"name": "IQ2_KT", "bpw": 2.514, "ppl": 3.6378},
{"name": "THIREUS-2.7840", "bpw": 2.7840, "ppl": 3.4341, "url": "https://github.com/Thireus/GGUF-Tool-Suite/blob/main/recipe_examples/DeepSeek-R1-0528.THIREUS-2.7840bpw-3.4341ppl.217GB-GGUF_14GB-GPU_203GB-CPU.3c88ec6_02247be.recipe"},
{"name": "IQ2_K_R4", "bpw": 2.799, "ppl": 3.5069, "url": "https://huggingface.co/ubergarm/DeepSeek-R1-0528-GGUF/tree/main/IQ2_K_R4"},
{"name": "UD_Q2_K_XL", "bpw": 2.994, "ppl": 3.5278, "url": "https://huggingface.co/unsloth/DeepSeek-R1-0528-GGUF/tree/main/UD-Q2_K_XL"},
{"name": "THIREUS-3.1027", "bpw": 3.1027, "ppl": 3.3372, "url": "https://github.com/Thireus/GGUF-Tool-Suite/blob/main/recipe_examples/DeepSeek-R1-0528.THIREUS-3.1027bpw-3.3372ppl.242GB-GGUF_11GB-GPU_231GB-CPU.3c88ec6_adc8101.recipe"},
{"name": "THIREUS-3.1446", "bpw": 3.1446, "ppl": 3.3257, "url": "https://github.com/Thireus/GGUF-Tool-Suite/blob/main/recipe_examples/DeepSeek-R1-0528.THIREUS-3.1446bpw-3.3257ppl.246GB-GGUF_15GB-GPU_231GB-CPU.3c88ec6_7d1efe1.recipe"},
{"name": "THIREUS-3.1447", "bpw": 3.1447, "ppl": 3.3269, "url": "https://github.com/Thireus/GGUF-Tool-Suite/blob/main/recipe_examples/DeepSeek-R1-0528.THIREUS-3.1447bpw-3.3269ppl.246GB-GGUF_15GB-GPU_231GB-CPU.3c88ec6_4b1254a.recipe"},
{"name": "THIREUS-3.1525", "bpw": 3.1525, "ppl": 3.3251, "url": "https://github.com/Thireus/GGUF-Tool-Suite/blob/main/recipe_examples/DeepSeek-R1-0528.THIREUS-3.1525bpw-3.3251ppl.246GB-GGUF_15GB-GPU_231GB-CPU.3c88ec6_5a3fc0f.recipe"},
{"name": "THIREUS-3.1740", "bpw": 3.1740, "ppl": 3.3253, "url": "https://github.com/Thireus/GGUF-Tool-Suite/blob/main/recipe_examples/DeepSeek-R1-0528.THIREUS-3.1740bpw-3.3253ppl.248GB-GGUF_17GB-GPU_231GB-CPU.3c88ec6_6cf3a72.recipe"},
{"name": "THIREUS-3.1858", "bpw": 3.1858, "ppl": 3.3261, "url": "https://github.com/Thireus/GGUF-Tool-Suite/blob/main/recipe_examples/DeepSeek-R1-0528.THIREUS-3.1858bpw-3.3261ppl.249GB-GGUF_18GB-GPU_231GB-CPU.3c88ec6_027b7ff.recipe"},
{"name": "THIREUS-3.2564", "bpw": 3.2564, "ppl": 3.2985, "url": "https://github.com/Thireus/GGUF-Tool-Suite/blob/main/recipe_examples/DeepSeek-R1-0528.THIREUS-3.2564bpw-3.2985ppl.254GB-GGUF_15GB-GPU_239GB-CPU.3c88ec6_7c0be1e.recipe"},
{"name": "IQ3_KT", "bpw": 3.483, "ppl": 3.3056, "url": "https://huggingface.co/ubergarm/DeepSeek-R1-0528-GGUF/tree/main/IQ3_KT"},
{"name": "THIREUS-3.5652", "bpw": 3.5652, "ppl": 3.2734, "url": "https://github.com/Thireus/GGUF-Tool-Suite/blob/main/recipe_examples/DeepSeek-R1-0528.THIREUS-3.5652bpw-3.2734ppl.278GB-GGUF_14GB-GPU_264GB-CPU.3c88ec6_9b5660b.recipe"},
{"name": "IQ3_KS", "bpw": 3.598, "ppl": 3.2991, "url": "https://huggingface.co/ubergarm/DeepSeek-R1-0528-GGUF/tree/main/IQ3_KS"},
{"name": "THIREUS-3.6766", "bpw": 3.6766, "ppl": 3.2741, "url": "https://github.com/ikawrakow/ik_llama.cpp/discussions/477#discussioncomment-13781700"},
{"name": "IQ3_K_R4", "bpw": 3.847, "ppl": 3.2730, "url": "https://huggingface.co/ubergarm/DeepSeek-R1-0528-GGUF/tree/main/IQ3_K_R4"},
{"name": "THIREUS-3.976", "bpw": 3.976, "ppl": 3.2452, "url": "https://github.com/ikawrakow/ik_llama.cpp/discussions/477#discussioncomment-13798329"},
{"name": "IQ4_XS (unsloth)", "bpw": 4.2683, "ppl": 3.2598, "url": "https://huggingface.co/unsloth/DeepSeek-R1-0528-GGUF/tree/main/IQ4_XS"},
{"name": "q4_0", "bpw": 4.508, "ppl": 3.2895, "url": "https://huggingface.co/unsloth/DeepSeek-R1-0528-GGUF/tree/main/Q4_0"},
{"name": "UD_Q4_K_XL", "bpw": 4.578, "ppl": 3.2483, "url": "https://huggingface.co/unsloth/DeepSeek-R1-0528-GGUF/tree/main/UD-Q4_K_XL"},
{"name": "IQ4_KS_R4", "bpw": 4.701, "ppl": 3.2286, "url": "https://huggingface.co/ubergarm/DeepSeek-R1-0528-GGUF/tree/main/IQ4_KS_R4"},
{"name": "DQ4_K_R4", "bpw": 5.289, "ppl": 3.2276, "url": "https://huggingface.co/anikifoss/DeepSeek-R1-0528-DQ4_K_R4"},
{"name": "THIREUS-6.2478", "bpw": 6.2478, "ppl": 3.2240, "url": "https://github.com/ikawrakow/ik_llama.cpp/discussions/477#discussioncomment-13781560"},
{"name": "Q8_0", "bpw": 8.5259260, "ppl": 3.2130, "url": "https://huggingface.co/unsloth/DeepSeek-R1-0528-GGUF/tree/main/Q8_0"}
]
}
-
Is there any way to predict the performance of a quant (prefill and decode) based solely on the quant types used?
-
Yes: RAM bandwidth. Take Qwen3-30B-A3B, for example: it has 3 billion active parameters, so at a given quantization each generated token has to read roughly active_params × bits_per_weight / 8 bytes from RAM.
You can then measure how many tokens per second you get with Qwen3-30B-A3B and calculate your system's effective memory bandwidth (often around 80% of the theoretically possible bandwidth). Once you have the system's effective memory bandwidth, you can reverse the calculation to estimate the tokens per second you will get with X active parameters.
Things get a little more tricky when you have a GPU in the mix. The same formula usually applies to GPU and VRAM (unless the card is very weak at compute, like some older cards). However, if you have both GPU and CPU working together, then the slowest one (the CPU) will be your bottleneck. Then you need to figure out how many active parameters will go on the GPU and how many will go on the CPU.
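For example, a quick sketch of the estimate with made-up numbers (25 t/s is an assumed measurement, 4.5 bits/weight an assumed average, and 37B is R1's active parameter count):

```bash
awk 'BEGIN {
  active = 3e9; bpw = 4.5; tps = 25     # tps here is a made-up measurement
  bytes_per_tok = active * bpw / 8      # ~1.7 GB read from RAM per token
  bw = bytes_per_tok * tps / 1e9        # effective bandwidth in GB/s
  printf "effective bandwidth ~ %.0f GB/s\n", bw
  r1_active = 37e9                      # DeepSeek-R1 active parameters
  printf "predicted R1 TG ~ %.1f t/s\n", bw * 1e9 / (r1_active * bpw / 8)
}'
```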
-
Prompt processing uses a clever workaround to cheat the RAM bandwidth limitation: you multiply several tokens at the same time, so the weights already loaded into the CPU cache get re-used across the batch, side-stepping the RAM bandwidth limit.
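A toy illustration of the effect, with an assumed 20 GB of weights streamed per pass (the real limit comes from compute and cache capacity, which this ignores):

```bash
awk 'BEGIN {
  weights_gb = 20                       # assumed weights read per forward pass
  for (batch = 1; batch <= 512; batch *= 8)
    printf "batch %4d: ~%.3f GB of weight reads per token\n",
           batch, weights_gb / batch
}'
```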
-
I have four MI50 32GB units arriving soon (a great deal at $150 each). Together with an RTX 5090, that should give me 160GB of VRAM, so I can try benchmarking IQ1_S fully from VRAM. Does anyone have experience with MI50s, or with running a mixed ROCm/CUDA setup? If I can get the MI50s working, I'll try hooking up 22 of them into one system for a total of 704GB of VRAM. That should be enough to run my chunky Kimi-K2 quant. I will need to limit power consumption to 120W each to stay within 2x 1600W. I found some articles online with mixed feedback about MI50s; I would really appreciate it if someone could share first-hand experience!
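On power limiting, the usual knobs look roughly like this; I have not verified these flags against current ROCm/driver versions, so treat them as pointers rather than a recipe:

```bash
# Cap an MI50 at ~120 W (rocm-smi flag name per older ROCm releases):
sudo rocm-smi -d 0 --setpoweroverdrive 120
# Cap the NVIDIA card with nvidia-smi (value in watts):
sudo nvidia-smi -i 0 -pl 450
```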
-
Btw, because many people in this thread are running calculations with models that contained …
-
What
Starting this "show and tell" discussion about the updated DeepSeek-R1-0528 model and various quants beginning to emerge.
Info
ik_llama.cpp exclusive quants released at ubergarm/DeepSeek-R1-0528-GGUF. I am curious what other sizes might be of interest to folks, e.g. a larger one for big RAM systems, or maybe a very small one sacrificing quality to fit in lower RAM/VRAM systems perhaps? These quants keep attn_k_b and attn_v_b at Q8_0. ik_llama.cpp now allows _R4 quants to run on CUDA (but this requires explicitly setting -DGGML_CUDA_IQK_FORCE_BF16=1 at compilation for this model).
EDIT: Check out this YouTube video by fahdmirzac showing some examples of installing and running ik_llama.cpp with these quants here. Thanks Fahd!
Benchmarks
Perplexity
So far the perplexity values I've measured are as follows:
DeepSeek-R1-0528-Q8_0 (666GiB): Final estimate: PPL = 3.2130 +/- 0.01698
DeepSeek-R1-0528-IQ3_K_R4 (301GiB): Final estimate: PPL = 3.2730 +/- 0.01738
DeepSeek-R1-0528-IQ2_K_R4 (220GiB): Final estimate: PPL = 3.5069 +/- 0.01893
Compare to my previous recipes for V3-0324:
DeepSeek-V3-0324-Q8_0 (666GiB): Final estimate: PPL = 3.2454 +/- 0.01773
DeepSeek-V3-0324-IQ4_K_R4 (387GiB): Final estimate: PPL = 3.2596 +/- 0.01786
DeepSeek-V3-0324-IQ2_K_R4 (227GiB): Final estimate: PPL = 3.5614 +/- 0.02001
Speed
With time I hope to grab some llama-sweep-bench numbers on these quants too.
Conclusion
Thanks and let me know if you try these out or have questions or comments. Feel free to use the imatrix I uploaded as well to make your own quants. Cheers!