For gpt-fast, `int4_weight_only()` is the best option at bs=1 as it **2x's the tok/s and reduces VRAM requirements by about 65%** over a torch.compiled baseline.
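Applied directly, that looks roughly like the following (a minimal sketch; the `quantize_` api from `torchao.quantization` and the toy model are assumptions, not taken from this section):

```python
import torch
from torchao.quantization import quantize_, int4_weight_only

# toy stand-in for your model; the int4 weight-only kernels expect bfloat16 weights on CUDA
model = torch.nn.Sequential(torch.nn.Linear(1024, 1024)).cuda().to(torch.bfloat16)

# replace each linear layer's weight with an int4 weight-only quantized version, in place
quantize_(model, int4_weight_only())

# compile afterwards to get the speedups described above
model = torch.compile(model, mode='max-autotune')
```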
If you see slowdowns with any of these techniques or you're unsure which option to use, consider using [autoquant](./torchao/quantization/README.md#autoquantization) which will automatically profile layers and pick the best way to quantize each layer.
```python
model = torchao.autoquant(torch.compile(model, mode='max-autotune'))
```

## Autoquantization
The `autoquant` api can be used to quickly and accurately quantize your model. When used as in the example below, the api first identifies the shapes of the activations that the different linear layers see, then benchmarks these shapes across different types of quantized and non-quantized layers in order to pick the fastest one, attempting to take fusions into account where possible. Finally, once the best class is found for each layer, it swaps in that linear. By default the api only uses int8 techniques, i.e. it chooses between no quantization, int8 dynamic quantization and int8 weight only quantization for each layer, though there is also an option to add int4 weight only quantization to the search, either for maximum performance or to avoid the perf regressions that applying `int4_weight_only()` directly can cause; in such cases `autoquant` will simply avoid quantizing the layers that regress.

```python
import torch
import torchao
from torchao.quantization import DEFAULT_INT4_AUTOQUANT_CLASS_LIST

# Plug in your model and example input
model = torch.nn.Sequential(torch.nn.Linear(32, 64)).cuda().to(torch.bfloat16)
input = torch.randn(32, 32, dtype=torch.bfloat16, device='cuda')

# toggle to choose between the default int8-only search and the int4-enabled search
use_autoquant_default = True

if use_autoquant_default:
    # perform autoquantization and torch.compile with default settings
    model = torchao.autoquant(torch.compile(model, mode='max-autotune'))
elif not use_autoquant_default:
    # perform autoquantization and torch.compile with int4 support
    model = torchao.autoquant(torch.compile(model, mode='max-autotune'), qtensor_class_list=DEFAULT_INT4_AUTOQUANT_CLASS_LIST)

# pass in an input which is used in order to pick fastest quantization operations
# and apply torch compilation.
model(input)
```
Sometimes it is desirable to reuse a quantization plan that `autoquant` came up with. `torchao.quantization.AUTOQUANT_CACHE` is a dictionary holding autoquant's benchmark results. We can save it and restore it later, which will cause `autoquant` to choose the same quantization methods.
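As a sketch, one way to do this is with `pickle` (the file name here is hypothetical, and this assumes the cache entries are picklable):

```python
import pickle
import torchao.quantization

# after autoquant and a forward pass have run, the cache holds benchmark results;
# persist them to disk
with open("autoquant_cache.pkl", "wb") as f:
    pickle.dump(torchao.quantization.AUTOQUANT_CACHE, f)

# in a later run, restore the results before calling autoquant so it picks the
# same quantization methods without re-benchmarking
with open("autoquant_cache.pkl", "rb") as f:
    torchao.quantization.AUTOQUANT_CACHE.update(pickle.load(f))
```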