quantize: Handle user-defined quantization levels for additional tensors #12511
@@ -366,17 +366,18 @@ extern "C" {
     // model quantization parameters
     typedef struct llama_model_quantize_params {
         int32_t nthread;                     // number of threads to use for quantizing, if <=0 will use std::thread::hardware_concurrency()
         enum llama_ftype ftype;              // quantize to this llama_ftype
         enum ggml_type output_tensor_type;   // output tensor type
         enum ggml_type token_embedding_type; // token embeddings tensor type
         bool allow_requantize;               // allow quantizing non-f32/f16 tensors
         bool quantize_output_tensor;         // quantize output.weight
         bool only_copy;                      // only copy tensors - ftype, allow_requantize and quantize_output_tensor are ignored
         bool pure;                           // quantize all tensors to the default type
         bool keep_split;                     // quantize to the same number of shards
         void * imatrix;                      // pointer to importance matrix data
         void * kv_overrides;                 // pointer to vector containing overrides
+        void * tensor_types;                 // pointer to vector containing tensor types
     } llama_model_quantize_params;
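To make the new field concrete, here is a minimal sketch of how a caller might drive it programmatically. The element type `tensor_type_override` is hypothetical; the actual type expected behind the `void *` is defined by the implementation and may differ, so only the shape of the call is shown.

```cpp
// Minimal sketch (hypothetical element type): shows only the shape of the call.
#include <string>
#include <vector>

#include "llama.h"

// Hypothetical override entry; the element type the implementation actually
// expects behind the void * may differ.
struct tensor_type_override {
    std::string name; // tensor name or pattern, e.g. "ffn_down"
    ggml_type   type; // target quantization type, e.g. GGML_TYPE_Q6_K
};

int main() {
    std::vector<tensor_type_override> overrides = {
        { "attn_v",   GGML_TYPE_Q6_K },
        { "ffn_down", GGML_TYPE_Q5_K },
    };

    llama_model_quantize_params params = llama_model_quantize_default_params();
    params.ftype        = LLAMA_FTYPE_MOSTLY_Q4_K_M;
    params.tensor_types = &overrides; // passed as void *, consumed during quantization

    return llama_model_quantize("model-f16.gguf", "model-q4_k_m.gguf", &params);
}
```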
Comment on lines +379 to 381

This changes the public interface, so add a comment in #9289. Note that passing C++ objects here is not correct, and we will eventually have to fix this API to not do that. It hasn't become a problem yet because the quantization functions are likely not used frequently by 3rd-party applications. @EAddario If you are interested, you can give it a shot in another PR and fix these structs to become C compatible.

Thanks @ggerganov, happy to.
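Purely as an illustration of that suggestion (the names below are invented and not part of this PR), a C-compatible variant could replace the `void *` pointers to C++ vectors with a plain array plus an element count:

```cpp
// Hypothetical sketch only (names invented here): a plain C-compatible entry
// that callers can populate without C++ containers.
#include <stddef.h>

#include "ggml.h" // for enum ggml_type

struct llama_tensor_type_override {
    const char *   pattern; // tensor name (or pattern) the override applies to
    enum ggml_type type;    // target quantization type for matching tensors
};

// llama_model_quantize_params could then expose a pointer + count instead of
// a void * pointing at a std::vector:
//
//     const struct llama_tensor_type_override * tensor_types;   // array of overrides
//     size_t                                    n_tensor_types; // number of entries
```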
I'm rethinking this: maybe we can simplify this functionality by adding just two flags:

- `--dump-mapping` to get the list of tensors and the target quantized type. The user can then modify the target quant directly.
- `--mapping FILE` so the user can then specify a custom mapping file produced in the step above.
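Purely to illustrate that proposal (neither flag exists in this PR, and the file format below is invented for the example), a dumped mapping could be a simple editable list of tensor name and target type:

```text
# hypothetical --dump-mapping output, edited by hand and passed back via --mapping FILE
token_embd.weight        q8_0
blk.0.attn_v.weight      q6_k
blk.0.ffn_down.weight    q5_k
output.weight            q6_k
```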
I think it makes sense to only allow certain tensors to be quantized; otherwise users will lobotomize their model and then complain that llama.cpp is broken.

Agree with @ddh0, although I can see how, down the line, something similar to what @ngxson is suggesting may be useful. I'm testing the layer-wise quants using the modified llama-imatrix for guidance, and whilst I'm getting some really encouraging results (I'll publish the full model in my HF repo over the weekend), the process is overly manual and the regexes can get unwieldy, e.g.:

`--tensor-type "(1[3-9]|2[0-9]|30)\.attn_v=q6_k" --tensor-type "([0-9]|[1-2][0-9]|30|31)\.ffn_down=q3_k" --tensor-type "(10|1[3-9]|2[0-9]|30)\.attn_q=q5_k" ...`

I think it would be nice to have a way to AutoMagically generate optimum regexes to be fed into llama-quantize!
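For readers unfamiliar with the option, here is a sketch (not the PR's actual code) of the idea behind `--tensor-type PATTERN=TYPE`: each pattern is treated as a regular expression tested against tensor names, and the first match decides the target type.

```cpp
// Sketch only (not the PR's actual code): resolve the target type for a tensor
// name from (regex pattern, type name) pairs taken from --tensor-type PATTERN=TYPE
// arguments; the first matching pattern wins, otherwise the fallback is used.
#include <regex>
#include <string>
#include <utility>
#include <vector>

std::string resolve_tensor_type(
        const std::string & tensor_name,
        const std::vector<std::pair<std::string, std::string>> & overrides,
        const std::string & fallback) {
    for (const auto & [pattern, type] : overrides) {
        if (std::regex_search(tensor_name, std::regex(pattern, std::regex_constants::icase))) {
            return type;
        }
    }
    return fallback;
}

// Example: resolve_tensor_type("blk.15.attn_v.weight",
//                              {{ R"((1[3-9]|2[0-9]|30)\.attn_v)", "q6_k" }},
//                              "q4_k")  -> "q6_k"
```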
This is a case of full granular control vs. guided hand-holding to competence, and we can have the best of both worlds. What we need is a brief, informative how-to guide that introduces and explains the concepts of per-tensor and per-layer quantization, and then gives concrete examples that users can base their quantization decisions on. The guide should further educate the user on which tensors/weights are good targets for quantization (Embeddings, ATTN_K, ATTN_Q, ATTN_V, ATTN_Output, FFN_Down, FFN_Gate, FFN_Up, and Output), which are more likely not to be (FFN_NORM, etc.), and why. And then, for the more dense of mankind: a brief disclaimer stating that any and all modifications they make to their custom quantized model are their business and responsibility, thereby waiving ggml-org or any of you guys from liability. 🤔

Just because some choose not to read and learn doesn't mean we should have to suffer a loss of "power-user" features because those who aren't paying attention will lobotomize their quantized models. This is all a fun game of trial, error, and experimentation. If users have made it this far, they will have to learn.