CUDA implementation for IQ2_K_R4, IQ3_K_R4, IQ4_K_R4, IQ5_K_R4 #461
Conversation
~10% slower than iq4_k.
We are now within 3% of iq4_k
~3% slower than iq5_k.
I was looking to use the `IQ4_KS_R4`/`IQ5_KS_R4` quants:

```console
$ grep repacked examples/quantize/quantize.cpp | grep KS
{ "IQ4_KS_R4",LLAMA_FTYPE_MOSTLY_IQ4_KS_R4,"IQ4_KS repacked", },
{ "IQ5_KS_R4",LLAMA_FTYPE_MOSTLY_IQ5_KS_R4,"IQ5_KS repacked", },
$ grep KS_R4 include/llama.h
LLAMA_FTYPE_MOSTLY_IQ4_KS_R4 = 345, // except 1d tensors
LLAMA_FTYPE_MOSTLY_IQ5_KS_R4 = 350, // except 1d tensors
$ grep KS_R4 ggml/src/ggml-cuda/convert.cu
case GGML_TYPE_IQ4_KS_R4:
case GGML_TYPE_IQ5_KS_R4:
case GGML_TYPE_IQ4_KS_R4:
case GGML_TYPE_IQ5_KS_R4:
case GGML_TYPE_IQ4_KS_R4:
case GGML_TYPE_IQ5_KS_R4:
```

For now I'll go with ...
No, I haven't done ... yet. Perhaps just use ..., or, if you have the patience to wait for ...
I got a report from the wild that `FORCE_BF16=1` gave a speed boost, and I confirmed that it does seem to do so, at least in this specific hardware configuration and for this specific quant for PP. I added a graph and data to the R1-0528 discussion: #477 (comment). This benchmark also confirms that offloading additional ...
OOOH! I just realized you've been doing the ... I wish I had a way to compare apples-to-apples between exl3 and ik_llama.cpp, but there are no llama-cpp-python bindings for ik_llama.cpp. (I tried for half an hour to get it to work with older versions, but things had diverged too much already a year ago, so I gave up.) Regardless, I'll read up more on your implementation of iq2_kt and check the code for the mcg value etc. Thanks!
The ... On the CPU, performance is not quite as good. PP performance is getting there, but TG is slooow on the CPU.

I did look a bit into the plots in the ExLlamaV3 repository. I absolutely cannot confirm the PPL plots for LLaMA-3-70B. I used the 70B model because, in my experience, when overfitting is going on it is typically based on the small models (nobody has the patience to fool around with meta parameters when testing is done on a large model). Hence, color me skeptical about the ExLlamaV3 results.

The thing about apples-to-apples is that if you use ...
I like the KT quants too, and I tried subbing out the 3INST parameters with superior ones (since the LCG from the QTIP paper, x = 89226354 * x + 64248484, can't be optimal). But for some reason, all the better parameters with lower MSE, both in synthetic trellis codes (without rotations) and in EXL3 (with rotations), show no improvement when I slot them into ik_llama, recompile, quant, and test models. Could the current KT code paths be implicitly tuned to expect certain behavior that the default parameters provide? I haven't gone through the code super carefully, but at first glance I can't immediately figure this out. I've found dozens of better decoder params for 3INST that show ~5% reduction in MSE for the abstract TC, but they seem to do unreasonable harm to the IQx_KT quants rather than help them or leave them mostly unchanged, which is why I suspect there must be some fine tuning on some level. Maybe it's the "slop" factors added to `dequantize_block_iq2_kt`, `dequantize_block_iq3_kt`, and `dequantize_block_iq4_kt`?

Are the 5%, 1%, and 1% just something added to avoid overflow, or to use the distribution slightly more optimally? Should they be changed if I adjust the multiplier in 3INST? What else (if anything) would need to change? [BTW, there seem to be some small inconsistencies between `convert.cu` and `iqk_gemm_ktquants.cpp`, where the former uses 5%, 1%, 1% and the latter still uses 5%, 1.5%, 1%.]

Also, if you want the KT quants to run even faster, the QTIP paper mentions how to combine the 2 masks in 3INST (AND + XOR) into a single LOP3 instruction. It needs to be added in asm because nvcc can't find this optimization, but it improves speed by a measurable amount. The current AND + XOR sequence would become something like this (with slightly different asm input params if you want to use your current variable names).
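A rough sketch of what that could look like, assuming the decoder has the QTIP form `x = ka*x + kb` followed by `(x & kmask) ^ km32`; the mask values and the function name below are placeholders rather than the ones in `convert.cu`:

```cuda
#include <cstdint>

// Hypothetical sketch: fuse the AND + XOR of the 3INST decoder into a single LOP3.
// The LUT immediate 0x6a encodes (a & b) ^ c for operands in the order a, b, c.
__device__ __forceinline__ uint32_t trellis_next_lop3(uint32_t & x) {
    constexpr uint32_t ka    = 89226354u;   // LCG multiplier from the QTIP paper
    constexpr uint32_t kb    = 64248484u;   // LCG increment from the QTIP paper
    constexpr uint32_t kmask = 0x8fff8fffu; // AND mask (assumed value)
    constexpr uint32_t km32  = 0x3b603b60u; // XOR mask (assumed value)
    x = ka * x + kb;
    uint32_t r;
    // nvcc tends to emit separate AND and XOR instructions here; the inline asm
    // forces the fused lop3.b32, as suggested in the QTIP paper.
    asm("lop3.b32 %0, %1, %2, %3, 0x6a;" : "=r"(r) : "r"(x), "r"(kmask), "r"(km32));
    return r; // two packed fp16 values, to be summed by the caller
}
```

The two logical ops collapse into one instruction; how visible that is end-to-end depends on how memory-bound the surrounding kernel already is.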
The quantization implementation does not attempt to find the provably optimal solution to the RMSE minimization problem, for two reasons. Hence, a heuristic is used to determine "optimum" quants. The heuristic is tuned to the specific values being produced by the trellis. But I don't expect you to observe "unreasonable harm", just perhaps a somewhat lower quality quantization.

I did play quite a bit with different generators when working on #113. For instance, I experimented with using the sum of the 8 bytes of 64-bit random variables. This has a number of advantages over the QTIP trellises (a rough sketch of the idea follows).
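A minimal sketch, assuming a 64-bit LCG (the constants and the function name are illustrative, not the ones used in #113); the point is that, by the central limit theorem, the sum of 8 independent bytes is already close to Gaussian:

```cuda
#include <cstdint>

// Illustrative sketch: advance a 64-bit LCG and return the sum of its 8 bytes,
// centered and scaled to be roughly standard normal. The LCG constants are
// placeholders (Knuth's MMIX values), not the generator actually used here.
__host__ __device__ inline float sum_of_bytes_next(uint64_t & state) {
    state = state * 6364136223846793005ULL + 1442695040888963407ULL;
    uint64_t x = state;
    uint32_t sum = 0;
    for (int i = 0; i < 8; ++i) {
        sum += uint32_t(x & 0xff); // each byte is roughly uniform on [0, 255]
        x >>= 8;
    }
    // mean = 8 * 127.5 = 1020, std = sqrt(8 * (256*256 - 1) / 12) ≈ 209
    return (float(sum) - 1020.0f) / 209.0f;
}
```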
But despite the "theoretical advantage", I observed lower quality quantization. My guess: model weights are not really Gaussian, the outliers are very important, and the "3INST" trellis somehow fits better to real-world model weights.

Concerning the LOP3 suggestion: I noticed it too in the QTIP paper, but I did not take it seriously because the integer multiplication is already quite a bit slower than an xor, so I did not expect fusing the two logical ops to matter much. But if you say that you observe a measurable performance difference, I'll try it. Thanks!
OK, using the inline assembly instruction results in a 0.6% speedup for TG-128 (178.7 t/s vs 177.5 t/s on my RTX-4080 for ...).
This closed PR probably isn't the place for this, but given the previous conversation around optimizing the KT quants, I have my first KT quant perplexity/KLD comparison now: DeepSeek-R1-0528-IQ2_KT.

Perplexity: compared to my other ik_llama.cpp quants in this model collection, made with the same imatrix corpus.

KLD: compared to my other ik_llama.cpp quant collection made with the same imatrix corpus, using a very short unreleased "novel text" KLD text corpus against a q8_0 baseline.

The only other piece of data I have is using IQ4_KT for the attention tensors in an otherwise Q4_0 quant, which is in the R1-0528 discussion. Looking forward to playing with these some more and seeing how they perform across various models as more data becomes available. Thanks.
The `IQX_K` quants and their row-interleaved siblings `IQX_K_R4` offer better quantization quality than corresponding i-, k-, or legacy quants at the same bpw. The `IQX_K_R4` quants have better CPU performance, but cannot be used on CUDA as there is no GEMM/GEMV implementation. Hence, "quant cookers" need to release `IQX_K` quantized models so users can use them on their GPUs, but that requires users doing CPU-only inference to repack the model to take advantage of the better CPU performance. In addition, @ubergarm has released various `IQX_K_R4` quantized models (see here), and those cannot be used for GPU inference.

To remove this inconvenience, this PR adds a CUDA implementation for the row-interleaved quants `IQ2_K_R4`, `IQ3_K_R4`, `IQ4_K_R4`, and `IQ5_K_R4`. I'll follow up with a separate PR for `IQ2_KS_R4`, `IQ4_KS_R4`, and `IQ5_KS_R4`.

For now GEMM is implemented via dequantize + cuBLAS. I may add quantized GEMM (a.k.a. MMQ) later.

Note: because of the above, if you want to use an `IQX_K_R4` DeepSeek-V3/R1 model on the GPU, you may need to build with `-DGGML_CUDA_IQK_FORCE_BF16=1` to force `bf16` arithmetic with cuBLAS, as `fp16` has been noted to lead to numerical instabilities and garbled output. I did not enable `GGML_CUDA_IQK_FORCE_BF16` by default as it reduces prompt processing performance while, as far as I can tell, `bf16` is only required for DeepSeek.
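For readers unfamiliar with the dequantize + cuBLAS path, here is a minimal sketch of the idea (placeholder function and buffer names, not the actual ggml-cuda code): the row-interleaved blocks are first expanded to `bf16` (or `fp16`) by the corresponding kernel in `ggml/src/ggml-cuda/convert.cu`, and the matrix multiplication is then handed to `cublasGemmEx`. The `GGML_CUDA_IQK_FORCE_BF16` option corresponds to choosing the `bf16` data type in that call:

```cuda
#include <cublas_v2.h>
#include <cuda_bf16.h>

// Sketch of the GEMM-after-dequantize path (names are placeholders).
// `weights` is assumed to have already been expanded from the row-interleaved
// quantized blocks to bf16. A is the (k x m) weight matrix stored column-major,
// B the (k x n) bf16 activations, C the (m x n) fp32 result.
static void gemm_after_dequant_bf16(cublasHandle_t handle, cudaStream_t stream,
                                    const __nv_bfloat16 * weights,      // dequantized weights
                                    const __nv_bfloat16 * activations,  // bf16 activations
                                    float * result, int m, int n, int k) {
    const float alpha = 1.0f, beta = 0.0f;
    cublasSetStream(handle, stream);
    // GGML_CUDA_IQK_FORCE_BF16 corresponds to using CUDA_R_16BF here; the default
    // fp16 path would pass CUDA_R_16F buffers instead, which has been observed to
    // produce numerical instabilities with DeepSeek-V3/R1.
    cublasGemmEx(handle, CUBLAS_OP_T, CUBLAS_OP_N, m, n, k,
                 &alpha,
                 weights,     CUDA_R_16BF, k,
                 activations, CUDA_R_16BF, k,
                 &beta,
                 result,      CUDA_R_32F,  m,
                 CUBLAS_COMPUTE_32F, CUBLAS_GEMM_DEFAULT);
}
```

Building with `-DGGML_CUDA_IQK_FORCE_BF16=1` selects this `bf16` variant; the default path uses `fp16` buffers instead.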