HFQuantizer implementation for compressed-tensors library #31704
Changes from 33 commits
New file: documentation page for compressed-tensors quantization.

@@ -0,0 +1,47 @@
<!--Copyright 2024 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.
-->
# Compressed Tensors

Compressed tensors supports the quantization of models to a variety of formats and provides an extensible
framework for adding new formats and strategies.

Compressed models can be easily created using [llm-compressor](https://github.com/vllm-project/llm-compressor).
Alternatively, models can be created independently and serialized with a compressed-tensors config.

Supported formats include:

- FP8, INT4, INT8 (for Q/DQ, arbitrary precision is allowed for INT)
- Activation quantization (static)
- Dynamic per-token activation quantization
- Supports quantization of arbitrary layer types
- Targeted support or ignoring of layers by name or class
## Installation

```bash
pip install compressed-tensors
```
## Sample Model Load

```python
from transformers import AutoModelForCausalLM

compressed_tensors_model = AutoModelForCausalLM.from_pretrained("nm-testing/tinyllama-oneshot-w4a16-group128-v3")
```
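To make the snippet above concrete, here is a minimal inference sketch (not part of this diff); it assumes the checkpoint ships a matching tokenizer, and the prompt text is arbitrary:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nm-testing/tinyllama-oneshot-w4a16-group128-v3"

# Assumption: the checkpoint provides a tokenizer alongside the compressed weights
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Compressed tensors let you", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```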
## More Coming Soon!
New file: the compressed-tensors HfQuantizer implementation.

@@ -0,0 +1,77 @@
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from ..utils import is_compressed_tensors_available, is_torch_available, logging
from ..utils.quantization_config import QuantizationConfigMixin
from .base import HfQuantizer


if is_torch_available():
    import torch


logger = logging.get_logger(__name__)
class CompressedTensorsHfQuantizer(HfQuantizer):
    """
    Quantizer for the compressed_tensors package. Loads and restores models to
    quantized state with compressed_tensors.
    """

    requires_calibration = False
    required_packages = ["compressed_tensors"]
    def __init__(self, quantization_config: QuantizationConfigMixin, **kwargs):
        super().__init__(quantization_config, **kwargs)

        from compressed_tensors.compressors import ModelCompressor

        self.compressor = ModelCompressor.from_compression_config(quantization_config)
    def validate_environment(self, *args, **kwargs):
        if not is_compressed_tensors_available():
            raise ImportError(
                "Using `compressed_tensors` quantized models requires the compressed-tensors library: "
                "`pip install compressed-tensors`"
            )
        if not is_torch_available():
            # torch should already be installed as part of compressed-tensors
            raise ImportError("torch is required for using compressed-tensors quantization")
    def update_torch_dtype(self, torch_dtype: "torch.dtype") -> "torch.dtype":
        if torch_dtype is None:
            logger.info("Loading model using torch.float16 for compressed-tensors quantization")
            torch_dtype = torch.float16
        elif torch_dtype != torch.float16:
            logger.info(
                "We suggest setting `torch_dtype=torch.float16` for better efficiency with compressed_tensors."
            )
        return torch_dtype
**Review comment on lines +53 to +60 (Contributor):** Is there an issue with bfloat16? We should try to allow this for llama models.

**Contributor reply:** No issue with bfloat16, we just recommend float16 as a default since that is what vLLM expects for the scale/zp.
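To ground this exchange, a hedged sketch of what the dtype handling above implies for callers; the checkpoint name is reused from the docs page in this PR:

```python
import torch

from transformers import AutoModelForCausalLM

# Default path: torch_dtype=None is upgraded to float16 by update_torch_dtype,
# since float16 scales/zero-points are what vLLM expects.
model_fp16 = AutoModelForCausalLM.from_pretrained(
    "nm-testing/tinyllama-oneshot-w4a16-group128-v3"
)

# bfloat16 is not rejected: update_torch_dtype only logs a suggestion to use
# float16, so an explicit override goes through.
model_bf16 = AutoModelForCausalLM.from_pretrained(
    "nm-testing/tinyllama-oneshot-w4a16-group128-v3",
    torch_dtype=torch.bfloat16,
)
```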
    def _process_model_before_weight_loading(self, model, **kwargs):
        from compressed_tensors.quantization import apply_quantization_config

        ct_quantization_config = self.compressor.quantization_config
        apply_quantization_config(model, ct_quantization_config, run_compressed=True)
    def _process_model_after_weight_loading(self, model, **kwargs):
        pass

    @property
    def is_trainable(self):
        return False

    @property
    def is_serializable(self):
        return False
Modified file: quantization config utilities.

@@ -42,6 +42,7 @@ class QuantizationMethod(str, Enum):
    QUANTO = "quanto"
    EETQ = "eetq"
    HQQ = "hqq"
    COMPRESSED_TENSORS = "compressed-tensors"
    FBGEMM_FP8 = "fbgemm_fp8"
    TORCHAO = "torchao"
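As a quick illustration, `QuantizationMethod` is a `str`-valued enum, so the new member compares equal to its serialized value:

```python
from transformers.utils.quantization_config import QuantizationMethod

# The member added by this PR; its value matches the quant_method string
# stored in checkpoint configs.
assert QuantizationMethod.COMPRESSED_TENSORS == "compressed-tensors"
```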
@@ -1051,7 +1052,109 @@ def post_init(self):
        raise ValueError(f"Only support weights in {accepted_weights} but found {self.weights}")
@dataclass
class CompressedTensorsConfig(QuantizationConfigMixin):
    """
    This is a wrapper class that handles compressed-tensors quantization config options.
    It is a wrapper around `compressed_tensors.QuantizationConfig`.

    Args:
        config_groups (`typing.Dict[str, typing.Union[ForwardRef('QuantizationScheme'), typing.List[str]]]`, *optional*):
            dictionary mapping group name to a quantization scheme definition
        format (`str`, *optional*, defaults to `"dense"`):
            format the model is represented as
**Review comment on lines +1062 to +1063 (Collaborator):** What are the available formats?

**Contributor reply:** @ArthurZucker this includes the different compression formats, depending on how the model is quantized/saved on disk, including:
        quantization_status (`QuantizationStatus`, *optional*, defaults to `"initialized"`):
            status of the model in the quantization lifecycle, i.e. 'initialized', 'calibration', 'frozen'
        kv_cache_scheme (`typing.Union[QuantizationArgs, NoneType]`, *optional*):
            specifies quantization of the kv cache. If None, the kv cache is not quantized.
        global_compression_ratio (`typing.Union[float, NoneType]`, *optional*):
            0-1 float percentage of model compression
        ignore (`typing.Union[typing.List[str], NoneType]`, *optional*):
            layer names or types to not quantize, supports regex prefixed by 're:'
        sparsity_config (`typing.Dict[str, typing.Any]`, *optional*):
            configuration for sparsity compression
        quant_method (`str`, *optional*, defaults to `"compressed-tensors"`):
            do not override; should be "compressed-tensors"
    """
    def __init__(
        self,
        config_groups: Dict[str, Union["QuantizationScheme", List[str]]] = None,  # noqa: F821
        format: str = "dense",
        quantization_status: "QuantizationStatus" = "initialized",  # noqa: F821
        kv_cache_scheme: Optional["QuantizationArgs"] = None,  # noqa: F821
        global_compression_ratio: Optional[float] = None,
        ignore: Optional[List[str]] = None,
        sparsity_config: Dict[str, Any] = None,
        quant_method: str = "compressed-tensors",
        **kwargs,
    ):
        from compressed_tensors import QuantizationConfig
        from compressed_tensors.config import SparsityCompressionConfig

        self.quantization_config = None
        self.sparsity_config = None
        # parse from dict to load nested QuantizationScheme objects
        if config_groups:
            self.quantization_config = QuantizationConfig.parse_obj(
                {
                    "config_groups": config_groups,
                    "quant_method": quant_method,
                    "format": format,
                    "quantization_status": quantization_status,
                    "kv_cache_scheme": kv_cache_scheme,
                    "global_compression_ratio": global_compression_ratio,
                    "ignore": ignore,
                    **kwargs,
                }
            )

        if sparsity_config:
            self.sparsity_config = SparsityCompressionConfig.load_from_registry(
                sparsity_config.get("format"), **sparsity_config
            )

        super().__init__(quant_method=QuantizationMethod.COMPRESSED_TENSORS)
    @classmethod
    def from_dict(cls, config_dict, return_unused_kwargs=False, **kwargs):
        """
        Instantiates a [`CompressedTensorsConfig`] from a Python dictionary of parameters.
        Optionally unwraps any args from the nested quantization_config.

        Args:
            config_dict (`Dict[str, Any]`):
                Dictionary that will be used to instantiate the configuration object.
            return_unused_kwargs (`bool`, *optional*, defaults to `False`):
                Whether or not to return a list of unused keyword arguments. Used for the `from_pretrained` method in
                `PreTrainedModel`.
            kwargs (`Dict[str, Any]`):
                Additional parameters from which to initialize the configuration object.

        Returns:
            [`QuantizationConfigMixin`]: The configuration object instantiated from those parameters.
        """
        if "quantization_config" in config_dict:
            config_dict = dict(
                sparsity_config=config_dict.get("sparsity_config"),
                **config_dict["quantization_config"],
            )

        return super().from_dict(config_dict, return_unused_kwargs=return_unused_kwargs, **kwargs)
    def to_dict(self) -> Dict[str, Any]:
        """
        Serializes this instance to a Python dictionary.

        Returns:
            `Dict[str, Any]`: Dictionary of all the attributes that make up this configuration instance.
        """
        quantization_config = self.quantization_config.dict() if self.quantization_config is not None else None
        sparsity_config = self.sparsity_config.dict() if self.sparsity_config is not None else None

        return {
            "quantization_config": quantization_config,
            "sparsity_config": sparsity_config,
        }
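To illustrate the unwrapping that `from_dict` performs on the nested `quantization_config` key, a minimal sketch; it assumes compressed-tensors is installed, and leaves `config_groups` empty since the scheme contents are checkpoint-specific:

```python
from transformers.utils.quantization_config import CompressedTensorsConfig

# Hypothetical serialized form, shaped like the quantization section of a
# checkpoint's config.json; the nested "quantization_config" is unwrapped.
serialized = {
    "quantization_config": {
        "quant_method": "compressed-tensors",
        "format": "dense",
        "config_groups": None,  # real checkpoints carry QuantizationScheme definitions here
    },
    "sparsity_config": None,
}

config = CompressedTensorsConfig.from_dict(serialized)
print(config.to_dict())
# {'quantization_config': None, 'sparsity_config': None} when no config_groups are given
```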
class FbgemmFp8Config(QuantizationConfigMixin):
    """
    This is a wrapper class about all possible attributes and features that you can play with a model that has been