Commit 03d01ad

readme fixes

1 parent f9f191b commit 03d01ad

File tree

2 files changed: +10, -14 lines changed


README.md

Lines changed: 1 addition & 3 deletions

@@ -39,10 +39,8 @@ quantize_(m, int4_weight_only())
  ```

  For gpt-fast `int4_weight_only()` is the best option at bs=1 as it **2x the tok/s and reduces the VRAM requirements by about 65%** over a torch.compiled baseline.
- Note: For models that are less memory bound, the int4 weight only quantization kernel can be slower than other kernels, if you are seeing slowdowns, using [autoquant](./torchao/quantization/README.md#autoquantization) with int4 quantization
- can solve the issue. See the [quantization readme](./torchao/quantization/README.md#autoquantization) for details.

- If you're unsure which option to use, you can also run [autoquant](./torchao/quantization/README.md#autoquantization) which will automatically profile layers for you and skip quantizing layers where overhead is too large.
+ If you see slowdowns with any of these techniques, or you're unsure which option to use, consider using [autoquant](./torchao/quantization/README.md#autoquantization), which will automatically profile layers and pick the best way to quantize each layer.

  ```python
  model = torchao.autoquant(torch.compile(model, mode='max-autotune'))
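
The hunk above opens mid-example: `quantize_(m, int4_weight_only())` is the tail of the README's int4 snippet. For reference, a minimal self-contained sketch of that flow, assuming a CUDA machine and a torchao build where `quantize_` and `int4_weight_only` are importable from `torchao.quantization`:

```python
# Hypothetical end-to-end version of the int4 weight-only example above;
# assumes torchao is installed with CUDA support.
import torch
from torchao.quantization import quantize_, int4_weight_only

m = torch.nn.Sequential(torch.nn.Linear(1024, 1024)).cuda().to(torch.bfloat16)

# swap each Linear's weight for a packed int4 representation, in place
quantize_(m, int4_weight_only())

# compile to get the bs=1 speedups described above
m = torch.compile(m, mode='max-autotune')
```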

torchao/quantization/README.md

Lines changed: 9 additions & 11 deletions

@@ -30,32 +30,30 @@ And a quick crash course on inference quantization to help parse the above table
  ## Autoquantization

  The `autoquant` api can be used to quickly and accurately quantize your model. When used as in the example below, the api first identifies the shapes
- of the activations that the different linear layers see, it then benchmarks these shapes across different types of quantized and non-quantized layers in order to pick the fastest one, attempting to take into account fusions where possible. Finally once the best class is found for each layer, it swaps the linear. Currently this api chooses between no quantization, int8 dynamic quantization and int8 weight only quantization for each layer by default.
+ of the activations that the different linear layers see, then benchmarks these shapes across different types of quantized and non-quantized layers in order to pick the fastest one, attempting to take fusions into account where possible. Finally, once the best class is found for each layer, it swaps in that linear. By default the api only uses int8 techniques, i.e. it chooses between no quantization, int8 dynamic quantization, and int8 weight only quantization for each layer, though there is also an option to add int4 quantization, which can be used for maximum performance or to avoid perf regressions from `int4_weight_only()`.

  ```python
  import torch
  import torchao
+ from torchao.quantization import DEFAULT_INT4_AUTOQUANT_CLASS_LIST

  # Plug in your model and example input
  model = torch.nn.Sequential(torch.nn.Linear(32, 64)).cuda().to(torch.bfloat16)
  input = torch.randn(32,32, dtype=torch.bfloat16, device='cuda')
+ use_autoquant_default = True

- # perform autoquantization and torch.compile
- model = torchao.autoquant(torch.compile(model, mode='max-autotune'))
+ if use_autoquant_default:
+     # perform autoquantization and torch.compile with default settings
+     model = torchao.autoquant(torch.compile(model, mode='max-autotune'))
+ else:
+     # perform autoquantization and torch.compile with int4 support
+     model = torchao.autoquant(torch.compile(model, mode='max-autotune'), qtensor_class_list=DEFAULT_INT4_AUTOQUANT_CLASS_LIST)

  # pass in an input which is used in order to pick fastest quantization operations
  # and apply torch compilation.
  model(input)
  ```

- There is also an option to add int4 weight only quantization as an `autoquant` option for maximum performance or if applying int4 quantization without `autoquant` causes a perf regression. In such cases, `autoquant` will avoid quantizing the layers that are causing the perf regression.
-
- ```python
- from torchao.quantization import DEFAULT_INT4_AUTOQUANT_CLASS_LIST
- model = torchao.autoquant(torch.compile(model, mode='max-autotune'), qtensor_class_list=torchao.quantization.DEFAULT_INT4_AUTOQUANT_CLASS_LIST)
- model(input)
- ```
-
  Sometimes it is desirable to reuse a quantization plan that `autoquant` came up with. `torchao.quantization.AUTOQUANT_CACHE` is a dictionary holding autoquant's benchmark results. We can save it and restore it later, which will cause `autoquant` to choose the same quantization methods.

  ```python
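
The hunk's context ends inside this final code block, which saves and restores `AUTOQUANT_CACHE`. A minimal sketch of that pattern, assuming the cache is an ordinary picklable dict as the paragraph above says (the exact import path may differ):

```python
# Sketch: persist autoquant's benchmark results so a later run chooses the
# same quantization methods without re-benchmarking. Assumes AUTOQUANT_CACHE
# is a plain picklable dict, per the paragraph above.
import pickle

from torchao.quantization import AUTOQUANT_CACHE

# after torchao.autoquant(...) and model(input) have run once, save the cache
with open("quantization-cache.pkl", "wb") as f:
    pickle.dump(AUTOQUANT_CACHE, f)

# in a later session, restore it before calling torchao.autoquant again
with open("quantization-cache.pkl", "rb") as f:
    AUTOQUANT_CACHE.update(pickle.load(f))
```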
