huggingface
diff --git a/.circleci/config.yml b/.circleci/config.yml
Lines changed: 2 additions & 2 deletions
diff --git a/.github/ISSUE_TEMPLATE/bug-report.yml b/.github/ISSUE_TEMPLATE/bug-report.yml
Lines changed: 1 addition & 0 deletions
diff --git a/docker/transformers-quantization-latest-gpu/Dockerfile b/docker/transformers-quantization-latest-gpu/Dockerfile
Lines changed: 3 additions & 0 deletions
diff --git a/docs/source/en/_toctree.yml b/docs/source/en/_toctree.yml
Lines changed: 4 additions & 0 deletions
diff --git a/docs/source/en/main_classes/data_collator.md b/docs/source/en/main_classes/data_collator.md
Lines changed: 3 additions & 0 deletions
diff --git a/docs/source/en/main_classes/quantization.md b/docs/source/en/main_classes/quantization.md
Lines changed: 8 additions & 0 deletions
diff --git a/docs/source/en/model_doc/helium.md b/docs/source/en/model_doc/helium.md
Lines changed: 4 additions & 8 deletions
diff --git a/docs/source/en/quantization/finegrained_fp8.md b/docs/source/en/quantization/finegrained_fp8.md
Lines changed: 62 additions & 0 deletions
diff --git a/docs/source/en/quantization/overview.md b/docs/source/en/quantization/overview.md
Lines changed: 2 additions & 1 deletion
diff --git a/docs/source/en/quantization/spqr.md b/docs/source/en/quantization/spqr.md
Lines changed: 35 additions & 0 deletions
@@ -58,7 +58,7 @@ jobs:
             - run:
                 name: "Prepare pipeline parameters"
                 command: |
-                    python utils/process_test_artifacts.py 
+                    python utils/process_test_artifacts.py
 
             # To avoid too long generated_config.yaml on the continuation orb, we pass the links to the artifacts as parameters.
             # Otherwise the list of tests was just too big. Explicit is good but for that it was a limitation.
@@ -110,7 +110,7 @@ jobs:
             - run:
                 name: "Prepare pipeline parameters"
                 command: |
-                    python utils/process_test_artifacts.py 
+                    python utils/process_test_artifacts.py
 
             # To avoid too long generated_config.yaml on the continuation orb, we pass the links to the artifacts as parameters.
             # Otherwise the list of tests was just too big. Explicit is good but for that it was a limitation.
 
@@ -106,6 +106,7 @@ body:
       label: Reproduction
       description: |
         Please provide a code sample that reproduces the problem you ran into. It can be a Colab link or just a code snippet.
+        Please include relevant config information with your code, for example your Trainers, TRL, Peft, and DeepSpeed configs.
         If you have code snippets, error messages, stack traces please provide them here as well.
         Important! Use code tags to correctly format your code. See https://help.github.com/en/github/writing-on-github/creating-and-highlighting-code-blocks#syntax-highlighting
         Do not use screenshots, as they are hard to read and (more importantly) don't allow others to copy-and-paste your code.
 
@@ -53,6 +53,9 @@ RUN python3 -m pip install --no-cache-dir aqlm[gpu]==1.0.2
 # Add vptq for quantization testing
 RUN python3 -m pip install --no-cache-dir vptq
 
+# Add spqr for quantization testing
+RUN python3 -m pip install --no-cache-dir spqr_quant[gpu]
+
 # Add hqq for quantization testing
 RUN python3 -m pip install --no-cache-dir hqq
 
 
@@ -166,6 +166,8 @@
   - local: quantization/aqlm
     title: AQLM
   - local: quantization/vptq
     title: VPTQ
+  - local: quantization/spqr
+    title: SpQR
   - local: quantization/quanto
     title: Quanto
@@ -185,6 +187,8 @@
     title: BitNet
   - local: quantization/compressed_tensors
     title: compressed-tensors
+  - local: quantization/finegrained_fp8
+    title: Fine-grained FP8
   - local: quantization/contribute
     title: Contribute new quantization method
   title: Quantization Methods
 
@@ -71,3 +71,6 @@ Examples of use can be found in the [example scripts](../examples) or [example n
 
 [[autodoc]] data.data_collator.DataCollatorWithFlattening
 
+## DataCollatorForMultipleChoice
+
+[[autodoc]] data.data_collator.DataCollatorForMultipleChoice
@@ -80,3 +80,11 @@ Learn how to quantize models in the [Quantization](../quantization) guide.
 ## BitNetConfig
 
 [[autodoc]] BitNetConfig
+
+## SpQRConfig
+
+[[autodoc]] SpQRConfig
+
+## FineGrainedFP8Config
+
+[[autodoc]] FineGrainedFP8Config
@@ -107,24 +107,20 @@ Tips:
 
 ## Usage tips
 
-`Helium` can be found on the [Huggingface Hub](https://huggingface.co/collections/kyutai/helium-1-preview)
+`Helium` can be found on the [Huggingface Hub](https://huggingface.co/models?other=helium)
 
 In the following, we demonstrate how to use `helium-1-preview` for the inference. 
 
 ```python
 >>> from transformers import AutoModelForCausalLM, AutoTokenizer
 >>> device = "cuda" # the device to load the model onto
 
->>> model = AutoModelForCausalLM.from_pretrained("helium-1-preview", device_map="auto")
->>> tokenizer = AutoTokenizer.from_pretrained("helium-1-preview")
+>>> model = AutoModelForCausalLM.from_pretrained("kyutai/helium-1-preview-2b", device_map="auto")
+>>> tokenizer = AutoTokenizer.from_pretrained("kyutai/helium-1-preview-2b")
 
 >>> prompt = "Give me a short introduction to large language model."
 
->>> messages = [{"role": "user", "content": prompt}]
-
->>> text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
-
->>> model_inputs = tokenizer([text], return_tensors="pt").to(device)
+>>> model_inputs = tokenizer(prompt, return_tensors="pt").to(device)
 
 >>> generated_ids = model.generate(model_inputs.input_ids, max_new_tokens=512, do_sample=True)
 
 
@@ -0,0 +1,62 @@
+<!--Copyright 2024 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+
+⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
+rendered properly in your Markdown viewer.
+
+-->
+
+# Fine-grained FP8
+
+With the fine-grained FP8 quantization method, you can quantize your model to FP8 (W8A8):
+- Weights are quantized to 8 bits (FP8) per 2D block (e.g. `weight_block_size=(128, 128)`), a scheme inspired by the DeepSeek implementation.
+- Activations are quantized to 8 bits (FP8) per group per token, with the group size matching the weight block size along the input channels (128 by default).
+
+This method was implemented to add support for the DeepSeek-V3 and DeepSeek-R1 models. See the [paper](https://arxiv.org/pdf/2412.19437) for details; the image below illustrates the quantization scheme:
+
+![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/b7b3b34bf826a6423ea82ffc57ecac80c46c3c76/transformers/quantization/quantization_deepseek.png)
+
+> [!TIP]
+> You need a GPU with compute capability >= 9 (e.g. H100).
+
+Before you begin, make sure the latest versions of the following libraries are installed:
+
+```bash
+pip install --upgrade accelerate torch
+```
+> [!TIP]
+> You need to install a torch version compatible with the CUDA version of your GPU.
+
+
+By default, the weights are loaded in full precision (`torch.float32`) regardless of the data type they are actually stored in, such as `torch.float16`. Set `torch_dtype="auto"` to load the weights in the data type defined in the model's `config.json` file, so that the most memory-optimal data type is loaded automatically.
+
+```py
+from transformers import FineGrainedFP8Config, AutoModelForCausalLM, AutoTokenizer
+
+model_name = "meta-llama/Meta-Llama-3-8B"
+quantization_config = FineGrainedFP8Config()
+quantized_model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto", quantization_config=quantization_config)
+
+tokenizer = AutoTokenizer.from_pretrained(model_name)
+input_text = "What are we having for dinner?"
+input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
+
+output = quantized_model.generate(**input_ids, max_new_tokens=10)
+print(tokenizer.decode(output[0], skip_special_tokens=True))
+```
+
+A quantized model can be saved with `save_pretrained` and reloaded with `from_pretrained`.
+
+```py
+quant_path = "/path/to/save/quantized/model"
+quantized_model.save_pretrained(quant_path)
+model = AutoModelForCausalLM.from_pretrained(quant_path, device_map="auto")
+```
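The per-block weight scheme described in `finegrained_fp8.md` can be illustrated with a small pure-Python sketch. This is not the kernel used by `FineGrainedFP8Config`; it only shows the idea of one scale per 2D block. The helper names (`round_to_e4m3`, `quantize_blockwise`, `dequantize_blockwise`) are hypothetical, the FP8 format is assumed to be E4M3 with a maximum magnitude of 448, and the rounding helper is a rough emulation that ignores subnormals and the exponent range.

```python
import math

FP8_E4M3_MAX = 448.0  # assumed target format: FP8 E4M3, largest finite value


def round_to_e4m3(x):
    """Rough emulation: keep 4 significant bits (1 implicit + 3 mantissa)."""
    if x == 0.0:
        return 0.0
    mantissa, exponent = math.frexp(x)  # x = mantissa * 2**exponent
    mantissa = round(mantissa * 16) / 16
    value = math.ldexp(mantissa, exponent)
    return max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, value))


def quantize_blockwise(weight, block_size=(128, 128)):
    """Quantize a 2D list of floats with one scale per (rows x cols) block."""
    rows, cols = len(weight), len(weight[0])
    block_rows, block_cols = block_size
    n_block_rows = math.ceil(rows / block_rows)
    n_block_cols = math.ceil(cols / block_cols)
    scales = [[1.0] * n_block_cols for _ in range(n_block_rows)]
    quantized = [[0.0] * cols for _ in range(rows)]
    for bi in range(n_block_rows):
        for bj in range(n_block_cols):
            r_idx = range(bi * block_rows, min((bi + 1) * block_rows, rows))
            c_idx = range(bj * block_cols, min((bj + 1) * block_cols, cols))
            absmax = max(abs(weight[r][c]) for r in r_idx for c in c_idx)
            # Map the largest magnitude in the block onto the FP8 maximum.
            scale = absmax / FP8_E4M3_MAX if absmax > 0 else 1.0
            scales[bi][bj] = scale
            for r in r_idx:
                for c in c_idx:
                    quantized[r][c] = round_to_e4m3(weight[r][c] / scale)
    return quantized, scales


def dequantize_blockwise(quantized, scales, block_size=(128, 128)):
    """Multiply every value by the scale of the block it belongs to."""
    block_rows, block_cols = block_size
    return [
        [value * scales[r // block_rows][c // block_cols]
         for c, value in enumerate(row)]
        for r, row in enumerate(quantized)
    ]


# Tiny demo with 2x2 blocks so the per-block scales are easy to inspect.
weight = [
    [0.10, -0.20, 5.00, 4.00],
    [0.05, 0.15, -3.00, 2.50],
    [1.00, -1.50, 0.01, 0.02],
    [0.75, 1.25, -0.03, 0.04],
]
q, s = quantize_blockwise(weight, block_size=(2, 2))
restored = dequantize_blockwise(q, s, block_size=(2, 2))
```

Because each block has its own scale, the large values in the top-right block do not reduce the resolution available to the small values in the other blocks, which is the motivation for block-wise rather than per-tensor scaling.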
@@ -61,7 +61,8 @@ Use the table below to help you decide which quantization method to use.
 | [FBGEMM_FP8](./fbgemm_fp8.md)                 | 🟢                   | 🔴              | 🟢        | 🔴        | 🔴                                 | 🔴              | 🔴              | 8             | 🔴               | 🟢                          | 🟢                      | https://github.com/pytorch/FBGEMM       |
 | [torchao](./torchao.md)                       | 🟢                   |                 | 🟢        | 🔴        | 🟡 <sub>5</sub> | 🔴              |                 | 4/8         |                  | 🟢🔴                        | 🟢                      | https://github.com/pytorch/ao       |
 | [VPTQ](./vptq.md)                             | 🔴                   | 🔴              |     🟢     | 🟡        | 🔴                                 | 🔴              | 🟢              | 1/8         | 🔴               | 🟢                          | 🟢                      | https://github.com/microsoft/VPTQ            |
-
+| [SpQR](./spqr.md)                          | 🔴                       |  🔴   | 🟢        | 🔴              |    🔴    | 🔴         |         🟢              | 3              |              🔴                     | 🟢           | 🟢                      | https://github.com/Vahe1994/SpQR/       |
+| [FINEGRAINED_FP8](./finegrained_fp8.md)                 | 🟢                   | 🔴              | 🟢        | 🔴        | 🔴                                 | 🔴              | 🔴              | 8             | 🔴               | 🟢                          | 🟢                      |        |
 <Tip>
 
 **1:** bitsandbytes is being refactored to support multiple backends beyond CUDA. Currently, ROCm (AMD GPU) and Intel CPU implementations are mature, with Intel XPU in progress and Apple Silicon support expected by Q4/Q1. For installation instructions and the latest backend updates, visit [this link](https://huggingface.co/docs/bitsandbytes/main/en/installation#multi-backend). Check out [these docs](https://huggingface.co/docs/bitsandbytes/main/en/non_cuda_backends) for more details and feedback links.
 
@@ -0,0 +1,35 @@
+<!--Copyright 2025 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+
+⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
+rendered properly in your Markdown viewer.
+
+-->
+
+# SpQR
+
+The [SpQR](https://github.com/Vahe1994/SpQR) quantization algorithm uses a 16x16 tiled, bi-level group 3-bit quantization structure with sparse outliers, as detailed in [SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression](https://arxiv.org/abs/2306.03078).
+
+To SpQR-quantize a model, refer to the [Vahe1994/SpQR](https://github.com/Vahe1994/SpQR) repository.
+
+Load a pre-SpQR-quantized model with [`~PreTrainedModel.from_pretrained`].
+
+```python
+from transformers import AutoTokenizer, AutoModelForCausalLM
+import torch
+
+quantized_model = AutoModelForCausalLM.from_pretrained(
+    "elvircrn/Llama-2-7b-SPQR-3Bit-16x16-red_pajama-hf",
+    torch_dtype=torch.half,
+    device_map="auto"
+)
+tokenizer = AutoTokenizer.from_pretrained("elvircrn/Llama-2-7b-SPQR-3Bit-16x16-red_pajama-hf")
+```
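The "3-bit groups plus sparse outliers" idea mentioned in `spqr.md` can be sketched in pure Python. This is a heavily simplified illustration, not the SpQR algorithm or the `spqr_quant` kernels: the function names are hypothetical, outliers are picked with a plain magnitude rule (an assumption made here for brevity), and the second quantization level applied to the group statistics is omitted.

```python
def quantize_group(values, bits=3, outlier_factor=4.0):
    """Min-max quantize one group to `bits` bits, holding outliers out.

    Assumption: a value counts as an outlier when its magnitude exceeds
    `outlier_factor` times the mean magnitude of the group.
    """
    levels = 2 ** bits - 1
    mean_abs = sum(abs(v) for v in values) / len(values)
    # Outliers are stored exactly in a sparse index -> value table.
    outliers = {i: v for i, v in enumerate(values)
                if abs(v) > outlier_factor * mean_abs}
    inliers = [v for i, v in enumerate(values) if i not in outliers]
    low, high = min(inliers), max(inliers)
    scale = (high - low) / levels if high > low else 1.0
    codes = [
        0 if i in outliers
        else min(levels, max(0, round((v - low) / scale)))
        for i, v in enumerate(values)
    ]
    return codes, scale, low, outliers


def dequantize_group(codes, scale, zero, outliers):
    """Rebuild the group: exact outliers, reconstructed inliers."""
    return [outliers.get(i, zero + code * scale)
            for i, code in enumerate(codes)]


group = [0.10, -0.20, 0.05, 0.30, -0.10, 0.15, -0.25, 8.00]
codes, scale, zero, outliers = quantize_group(group)
restored = dequantize_group(codes, scale, zero, outliers)
```

Holding the single large value out of the group keeps the 3-bit grid tight around the remaining weights; without that step, the grid would have to span the full range up to 8.0 and most of the small weights would collapse onto one or two codes.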