
quantize: Handle user-defined quantization levels for additional tensors #12511


Merged
@ggerganov merged 35 commits into ggml-org:master from EAddario:quantize on Apr 13, 2025

Conversation

@EAddario (Contributor) commented Mar 22, 2025

This PR adds the ability to quantize other tensors, beyond token-embedding and output-tensor. It handles most of the supported architectures, except Mamba, RWKV6, RWKV6QWEN2 and T5, to avoid having too many command options, but these can be added as well if maintainers request it.

For full background on the PR, please see: Squeezing Tensor Bits: the quest for smaller LLMs

@EAddario changed the title from "Handle user-defined quantization levels for additional tensors" to "quantize: Handle user-defined quantization levels for additional tensors" on Mar 22, 2025
@max-krasnyansky (Collaborator)

How about we add a more generic --tensor-type tensor_name_pattern=type.
@slaren has the PR #11397 that overrides backend mapping per tensor.
Let's make this one similar (same patterns, etc). That way we'll be able to override specific layers (if needed).
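
For illustration only, a hypothetical invocation along those lines might look like the sketch below (flag name, pattern syntax, type names and paths are all assumptions at this stage, not the final interface):

```bash
# Hypothetical sketch: quantize to Q4_K_M overall, but keep the attention V
# and FFN down projections at higher-precision types via per-tensor overrides.
./build/bin/llama-quantize \
    --tensor-type "attn_v=q6_k" \
    --tensor-type "ffn_down=q5_k" \
    ./models/model-F16.gguf \
    ./models/model-Q4_K_M-custom.gguf \
    Q4_K_M
```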

@EAddario (Contributor, Author)

That's an excellent idea! It'll allow adding all supported tensor types (50+) without creating a mess of parameters. Plus, it will give me something to do over the weekend 😆

@jukofyork (Collaborator)

How about we add a more generic --tensor-type tensor_name_pattern=type. @slaren has the PR #11397 that overrides backend mapping per tensor. Let's make this one similar (same patterns, etc). That way we'll be able to override specific layers (if needed).

Yeah, I think this is definitely the way to go - the regex support of that PR gives really good flexibility.

@ddh0 (Contributor) commented Apr 3, 2025

I started a discussion thread if anyone's interested, so we don't clog this PR: #12741

@EAddario (Contributor, Author) commented Apr 3, 2025

I'll add results of my weekend testing there as well

@ubergarm commented Apr 4, 2025

FWIW, I've been doing this with ik_llama.cpp's llama-quantize --custom-q feature with good success, just in case there is any desire to keep this PR's syntax compatible (or not).

Specifying exact quants per tensor becomes more important now that -ot is merged and the MLA PR is in the works here. This allows trading off quality and performance for specific target hardware configurations (e.g. how much VRAM to leave available for the MLA kv-cache when using -ot exps=CPU, etc.).
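
For example, a rough sketch of the kind of hardware split I mean (model path and pattern are placeholders, and the -ot/--override-tensor syntax is assumed from that PR):

```bash
# Rough sketch: keep the routed-expert tensors in system RAM and offload the
# rest, freeing VRAM for kv-cache/context. Path and pattern are placeholders.
./build/bin/llama-server \
    -m ./models/DeepSeek-V3-0324-IQ3_K.gguf \
    -ot "exps=CPU" \
    --n-gpu-layers 99
```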

I have a couple of custom quants up on Hugging Face (ubergarm/DeepSeek-V3-0324-GGUF) that use this technique.

Here is an example bash script recipe for an experimental CPU only speed blend:

CPU only quant performance blend V3-0324 recipe

NOTE: mainline llama.cpp doesn't have all these quants, but you can see how regex tensor<->quant mappings via --custom-q allow easy testing and maintenance of recipe scripts.

```bash
#!/usr/bin/env bash

# CPU only inference blend

# Notes:
# https://github.com/ikawrakow/ik_llama.cpp/issues/296#issuecomment-2765210993
# https://github.com/ikawrakow/ik_llama.cpp/issues/296#issuecomment-2768567062
custom="
# Token embedding and output tensors
# note token_embd cannot be repacked quant type
token_embd\.weight=iq6_k
output\.weight=iq5_k_r4
output_norm\.weight=iq5_k_r4

# First 3 dense layers (0-3)
blk\.[0-2]\.attn_k_b.*=q6_0_r4
blk\.[0-2]\.attn_.*=iq5_k_r4
blk\.[0-2]\..*=iq5_k_r4

# All attention, norm weights, and bias tensors for MoE layers (3-60)
# Except blk.*.attn_k_b.weight, which is not divisible by 256 and has no iq6_k, so go with q6_0_r4 for a CPU-only speed boost
blk\.[3-9]\.attn_k_b.*=q6_0_r4
blk\.[1-5][0-9]\.attn_k_b.*=q6_0_r4
blk\.60\.attn_k_b.*=q6_0_r4

blk\.[3-9]\.attn_.*=iq5_k_r4
blk\.[1-5][0-9]\.attn_.*=iq5_k_r4
blk\.60\.attn_.*=iq5_k_r4

blk\.[3-9]\.ffn_norm\.weight=iq5_k_r4
blk\.[1-5][0-9]\.ffn_norm\.weight=iq5_k_r4
blk\.60\.ffn_norm\.weight=iq5_k_r4

blk\.[3-9]\.exp_probs_b\.bias=iq5_k_r4
blk\.[1-5][0-9]\.exp_probs_b\.bias=iq5_k_r4
blk\.60\.exp_probs_b\.bias=iq5_k_r4

# Shared Experts (3-60)
blk\.[3-9]\.ffn_down_shexp\.weight=iq5_k_r4
blk\.[1-5][0-9]\.ffn_down_shexp\.weight=iq5_k_r4
blk\.60\.ffn_down_shexp\.weight=iq5_k_r4

blk\.[3-9]\.ffn_(gate|up)_shexp\.weight=iq5_k_r4
blk\.[1-5][0-9]\.ffn_(gate|up)_shexp\.weight=iq5_k_r4
blk\.60\.ffn_(gate|up)_shexp\.weight=iq5_k_r4

# Routed Experts (3-60)
# First 16 layers are more sensitive so keep larger
blk\.[3-9]\.ffn_down_exps\.weight=iq5_k_r4
blk\.[1][0-9]\.ffn_down_exps\.weight=iq5_k_r4
blk\.[2-5][0-9]\.ffn_down_exps\.weight=iq4_k_r4
blk\.60\.ffn_down_exps\.weight=iq4_k_r4

blk\.[3-9]\.ffn_(gate|up)_exps\.weight=iq4_k_r4
blk\.[1][0-9]\.ffn_(gate|up)_exps\.weight=iq4_k_r4
blk\.[2-5][0-9]\.ffn_(gate|up)_exps\.weight=iq3_k_r4
blk\.60\.ffn_(gate|up)_exps\.weight=iq3_k_r4
"
custom=$(
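  # drop the comment lines, join the remaining lines with commas, and trim any leading/trailing comma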
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

./build/bin/llama-quantize \
    --imatrix /mnt/raid/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324.imatrix \
    --token-embedding-type iq6_k \
    --output-tensor-type iq5_k_r4 \
    --custom-q "$custom" \
    /mnt/raid/models/deepseek-ai/DeepSeek-V3-0324-bf16-GGUF/DeepSeek-256x21B-V3-0324-BF16-00001-of-00030.gguf \
    /mnt/raid/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-CPU-IQ3_K.gguf \
    IQ3_K \
    24
```

@David-AU-github

Super stoked about this, especially the option to adjust the quantization of the "shared expert" weights.
It's playtime.

@EAddario (Contributor, Author) commented Apr 7, 2025

TL;DR: A combination of Tensor-Wise Quantization (TWQ) and Layer-Wise Quantization (LWQ) is useful to generate custom models. Using DeepSeek-R1-Distill-Llama-8B-Q4_K_M as an example, LWQ yields a 10.4% smaller model with only a 0.83% 𝜌PPL penalty compared to the naive model.

More info here

Test results

| Model | Naive (GB) | TWQ (GB) | Reduction | LWQ (GB) | Reduction | Naive 𝜌PPL | TWQ 𝜌PPL | LWQ 𝜌PPL |
|---|---|---|---|---|---|---|---|---|
| DeepSeek-R1-Distill-Llama-8B-IQ3_M | 3.78 | 3.48 | 7.9% | 3.67 | 2.9% | 93.64% | 91.75% | 94.84% |
| DeepSeek-R1-Distill-Llama-8B-IQ3_S | 3.68 | 3.24 | 12.0% | 3.56 | 3.3% | 93.71% | 91.50% | 93.48% |
| DeepSeek-R1-Distill-Llama-8B-IQ4_NL | 4.68 | 4.3 | 8.1% | 4.4 | 6.0% | 98.82% | 96.44% | 95.87% |
| DeepSeek-R1-Distill-Llama-8B-Q3_K_L | 4.32 | 3.45 | 20.1% | 3.88 | 10.2% | 97.25% | 92.60% | 94.83% |
| DeepSeek-R1-Distill-Llama-8B-Q3_K_M | 4.02 | 3.37 | 16.2% | 3.57 | 11.2% | 96.92% | 91.45% | 94.63% |
| DeepSeek-R1-Distill-Llama-8B-Q3_K_S | 3.66 | 3.28 | 10.4% | 3.43 | 6.3% | 94.59% | 90.73% | 92.46% |
| DeepSeek-R1-Distill-Llama-8B-Q4_K_M | 4.92 | 4.44 | 9.8% | 4.41 | 10.4% | 98.85% | 98.02% | 98.04% |
| DeepSeek-R1-Distill-Llama-8B-Q4_K_S | 4.69 | 4.31 | 8.1% | 4.33 | 7.7% | 99.01% | 97.97% | 97.57% |
| DeepSeek-R1-Distill-Llama-8B-Q5_K_M | 5.73 | 5.35 | 6.6% | 5.38 | 6.1% | 99.09% | 98.83% | 98.95% |
| DeepSeek-R1-Distill-Llama-8B-Q5_K_S | 5.6 | 5.19 | 7.3% | 5.3 | 5.4% | 99.00% | 98.82% | 98.82% |
| DeepSeek-R1-Distill-Llama-8B-Q6_K | 6.6 | 6.17 | 6.5% | 6.51 | 1.4% | 99.47% | 98.91% | 99.18% |
| DeepSeek-R1-Distill-Llama-8B-Q8_0 | 8.54 | 7.84 | 8.2% | 7.47 | 12.5% | 99.93% | 98.99% | 98.97% |

@David-AU-github commented Apr 7, 2025

TL;DR: A combination of Tensor-Wise Quantization (TWQ) and Layer-Wise Quantization (LWQ) is useful to generate custom models. Using DeepSeek-R1-Distill-Llama-8B-Q4_K_M as an example, LWQ yields a 10.4% smaller model with only a 0.83% 𝜌PPL penalty compared to the naive model.


@EAddario
Just a heads up:
Distill (and preview) models are sensitive in layers 0-7 and 28-31.
You could give these more bits, and lower the other layers to maintain or augment function.
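
For example (an illustrative, untested sketch using this PR's flag; tensor choices, types and paths are placeholders, and exact pattern support may differ):

```bash
# Illustrative sketch: give layers 0-7 and 28-31 more bits for the attention V
# and FFN down projections, leaving the rest at the base Q4_K_M mix.
./build/bin/llama-quantize \
    --tensor-type "blk\.[0-7]\.attn_v=q6_k" \
    --tensor-type "blk\.(2[89]|3[01])\.attn_v=q6_k" \
    --tensor-type "blk\.[0-7]\.ffn_down=q6_k" \
    --tensor-type "blk\.(2[89]|3[01])\.ffn_down=q6_k" \
    ./models/DeepSeek-R1-Distill-Llama-8B-F16.gguf \
    ./models/DeepSeek-R1-Distill-Llama-8B-Q4_K_M-LWQ.gguf \
    Q4_K_M
```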

Comment on lines +379 to 381

```c
    void * kv_overrides;  // pointer to vector containing overrides
    void * tensor_types;  // pointer to vector containing tensor types
} llama_model_quantize_params;
```
@ggerganov (Member)

This changes the public interface, so add a comment in #9289.

Note that passing C++ objects here is not correct and we eventually have to fix this API to not do that. It hasn't become a problem yet because the quantization functions are likely not used frequently by 3rd party applications.

@EAddario If you are interested, you can give it a shot in another PR and fix these structs to become C compatible.

@EAddario (Contributor, Author)

Thanks @ggerganov, happy to

@slaren (Member) left a comment

This is a bit too hacky for my preference, but I suppose if people are already creating custom mixes by modifying the code it is better to at least have a tool to do it.

I would prefer if the allowed tensor check is removed, it doesn't really work as a reliable check, and it will prevent some legitimate uses.

@EAddario (Contributor, Author)

Thanks for approving, @slaren. Any particular use case you have in mind that it will prevent? Maybe I can work it into the logic.

@EAddario (Contributor, Author)

Got a better quality LWQ mix using the stats from the modified llama-imatrix. More info here

Test results

| Model | Naive (GB) | TWQ (GB) | Reduction | LWQ (GB) | Reduction | Naive 𝜌PPL | TWQ 𝜌PPL | LWQ 𝜌PPL |
|---|---|---|---|---|---|---|---|---|
| DeepSeek-R1-Distill-Llama-8B-IQ3_M | 3.78 | 3.48 | 7.9% | 3.69 | 2.5% | 93.64% | 91.75% | 94.24% |
| DeepSeek-R1-Distill-Llama-8B-IQ3_S | 3.68 | 3.24 | 12.0% | 3.43 | 6.8% | 93.71% | 91.50% | 92.97% |
| DeepSeek-R1-Distill-Llama-8B-IQ4_NL | 4.68 | 4.30 | 8.1% | 4.39 | 6.1% | 98.82% | 96.44% | 96.12% |
| DeepSeek-R1-Distill-Llama-8B-Q3_K_L | 4.32 | 3.45 | 20.1% | 3.76 | 13.0% | 97.25% | 92.60% | 94.79% |
| DeepSeek-R1-Distill-Llama-8B-Q3_K_M | 4.02 | 3.37 | 16.2% | 3.56 | 11.3% | 96.92% | 91.45% | 94.45% |
| DeepSeek-R1-Distill-Llama-8B-Q3_K_S | 3.66 | 3.28 | 10.4% | 3.31 | 9.7% | 94.59% | 90.73% | 92.23% |
| DeepSeek-R1-Distill-Llama-8B-Q4_K_M | 4.92 | 4.44 | 9.8% | 4.41 | 10.5% | 98.85% | 98.02% | 98.03% |
| DeepSeek-R1-Distill-Llama-8B-Q4_K_S | 4.69 | 4.31 | 8.1% | 4.28 | 8.8% | 99.01% | 97.97% | 97.72% |
| DeepSeek-R1-Distill-Llama-8B-Q5_K_M | 5.73 | 5.35 | 6.6% | 5.38 | 6.2% | 99.09% | 98.83% | 98.94% |
| DeepSeek-R1-Distill-Llama-8B-Q5_K_S | 5.60 | 5.19 | 7.3% | 5.24 | 6.4% | 99.00% | 98.82% | 98.85% |
| DeepSeek-R1-Distill-Llama-8B-Q6_K | 6.60 | 6.17 | 6.5% | 6.57 | 0.5% | 99.47% | 98.91% | 99.19% |
| DeepSeek-R1-Distill-Llama-8B-Q8_0 | 8.54 | 7.84 | 8.2% | 7.73 | 9.4% | 99.93% | 98.99% | 99.26% |

@slaren (Member) commented Apr 12, 2025

Any particular use case you have in mind that it will prevent? Maybe I can work it into the logic.

For example, using ffn as the pattern to set the type of all ffn tensors, or attn of all attention tensors, without having to specify each one individually.

@EAddario (Contributor, Author) commented Apr 13, 2025

I see what you mean. The choice of approach was a trade-off between ensuring the program continues to work exactly as before (backwards compatibility), not introducing new options that are already available (--pure, --output-tensor-type and --token-embedding-type), and adding new capabilities in a way that's consistent with existing error checking logic.

By restricting the tensors, users won't be able to do things that clearly are not useful, like trying to quantize norms, lerps, ropes, etc., but you're right in that users wanting to quantize all attn tensors would need to pass three options (--tensor-type attn_q=q4_k --tensor-type attn_k=q4_k --tensor-type attn_v=q4_k) instead of just one (--tensor-type attn=q4_k).

Once the changes are merged, I'll open a new PR to address this within the tensor checking logic (to avoid matching instances like attn_norm, ffn_norm, etc.), plus implement @ggerganov's recommendation to make the struct C compatible.
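
For reference, the per-tensor workaround described above, spelled out as a full command (paths are placeholders):

```bash
# Quantize all three attention projections to Q4_K by listing them
# individually; a single coarse pattern such as attn=q4_k is left for the
# follow-up PR mentioned above.
./build/bin/llama-quantize \
    --tensor-type attn_q=q4_k \
    --tensor-type attn_k=q4_k \
    --tensor-type attn_v=q4_k \
    ./models/model-F16.gguf \
    ./models/model-Q4_K_M.gguf \
    Q4_K_M
```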

@ggerganov merged commit 71e90e8 into ggml-org:master on Apr 13, 2025
51 checks passed
@acbits commented Apr 13, 2025

By restricting the tensors, users won't be able to do things that clearly are not useful, like trying to quantize norms, lerps, ropes, etc., but you're right in that users wanting to quantize all attn tensors would need to pass three options (--tensor-type attn_q=q4_k --tensor-type attn_k=q4_k --tensor-type attn_v=q4_k) instead of just one (--tensor-type attn=q4_k).

Late to this conversation, but isn't this case already handled by a regex that uses grouping?

--tensor-type 'attn_(q|k)'=q4_k could be applied to attn_q and attn_k?
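
As a quick check of the pattern itself (regex behaviour only; it says nothing about what llama-quantize will accept):

```bash
# Which of these tensor name fragments does the proposed grouping match?
printf '%s\n' attn_q attn_k attn_v attn_norm | grep -E 'attn_(q|k)'
# -> attn_q
# -> attn_k
```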

@EAddario (Contributor, Author)

Not quite, @acbits. For the reasons described above, the program requires the full tensor name, with the regex applying only to preceding characters.

I'll improve this behaviour in the next PR.

@EAddario deleted the quantize branch on April 14, 2025 at 07:27
@joseph777111

@EAddario Congrats! 🚀

@Djip007 (Contributor) commented Apr 19, 2025

Really late on that. And nice PR.

One idea: could we define all the logic in a more general way, for example using a JSON format, and possibly read it from a file for the most advanced cases?
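
For instance, a recipe file could look roughly like the sketch below (a purely hypothetical format, written here only to illustrate the idea; neither this layout nor a flag to read it exists in llama-quantize today):

```bash
# Purely hypothetical recipe format, for illustration only.
cat > quant-recipe.json <<'EOF'
{
  "default": "q4_k_m",
  "overrides": [
    { "pattern": "token_embd\\.weight",       "type": "q6_k" },
    { "pattern": "blk\\.[0-7]\\.ffn_down.*",  "type": "q5_k" },
    { "pattern": "blk\\..*\\.attn_(q|k|v).*", "type": "q4_k" }
  ]
}
EOF
```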

@EAddario (Contributor, Author)

Thanks @Djip007. @ngxson had a similar suggestion, and it's on my to-do list.

The way I'm thinking about it is for llama-imatrix (#12718) to generate a file with "recommended" quants, based on relevant statistics, which can then be processed by llama-quantize. The file can of course be changed/created by hand.

I don't know exactly what "recommended" means yet so open to suggestions.

@Djip007 (Contributor) commented Apr 20, 2025

I don't know exactly what "recommended" means yet so open to suggestions.

I'll think about it... if I have any ideas, I'll try to share them.
Is there a better place than this closed PR to make suggestions?

@EAddario (Contributor, Author)

Feel free to comment on #12718

colout pushed a commit to colout/llama.cpp that referenced this pull request Apr 21, 2025
…ors (ggml-org#12511)

* Add llama_model_quantize_params parameters

* Add new quantize parameters parsing and validation

* Update usage

* Add new parameters defaults

* Add new quantization parameters logic

* Add llama_model_quantize_params parameters

* Add new quantize parameters parsing and validation

* Update usage

* Add new parameters defaults

* Add new quantization parameters logic

* Minor refactoring as per the contributors' coding guidelines

* Update descriptions to match existing style

* Add llama_model_quantize_params parameters

* Add new quantize parameters parsing and validation

* Update usage

* Add new parameters defaults

* Add new quantization parameters logic

* Minor refactoring as per the contributors' guidelines

* Implement general --tensor-type instead of tensor-specific command option

* Fix implied type bug

* Restore missing #includes

* Add regex capability for tensor selection

* Refactor function name and update ALLOWED_TENSOR_TYPE

* Add missing #include

* Handle edge case when tensor name is cls.output

* Minor logging improvement