
Commit e5ee771

Minor README updates (#401)
* Minor README updates
* Update README.md
* push
* push
* push
1 parent 0304281 commit e5ee771

File tree

1 file changed: +16 -19 lines changed


README.md

Lines changed: 16 additions & 19 deletions
@@ -2,10 +2,11 @@
 
 [![](https://dcbadge.vercel.app/api/server/cudamode?style=flat)](https://discord.gg/cudamode)
 
+[Introduction](#introduction) | [Inference](#inference) | [Training](#training) | [Dtypes](#newer-dtypes) | [Composability](#composability) | [Installation](#installation) | [Community Contributions](#community-contributions) | [How to contribute](#how-to-contribute)
 
 ## Introduction
 
-torchao is a library to create and integrate high-performance custom data types, layouts and kernels into their PyTorch workflows with up to **2x speedups** with **65%** less VRAM for [inference](#inference) and support for [training](#training)
+torchao is a library to create and integrate high-performance custom data types, layouts and kernels into your PyTorch workflows with up to **2x speedups** with **65% less VRAM** for [inference](#inference) and support for [training](#training)
 
 All with no intrusive code changes and minimal accuracy degradation.
 
@@ -15,7 +16,7 @@ All with no intrusive code changes and minimal accuracy degradation.
 
 #### Without intrusive code changes
 
-Quantizing your models is a 1 liner that should work on any model with `nn.Linear` including your favorite HuggingFace model. You can find a more comprehensive usage instructions [here](torchao/quantization/) and a hugginface inference example [here](scripts/hf_eval.py)
+Quantizing your models is a 1 liner that should work on any model with an `nn.Linear` including your favorite HuggingFace model. You can find a more comprehensive usage instructions [here](torchao/quantization/) and a HuggingFace inference example [here](scripts/hf_eval.py)
 
 ```python
 from torchao.quantization.quant_api import quantize
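The hunk above cuts off the README's quantization snippet right after the import, so the full one-liner is not visible in this diff. As a minimal sketch of the pattern it describes (any model with an `nn.Linear` can be quantized), the toy model below is illustrative only; the actual config argument passed to `quantize` is not shown here, so the call is left as a commented placeholder.

```python
# Minimal sketch only: the exact argument passed to `quantize` is not visible
# in this diff, so the call is left as a commented placeholder.
import torch
import torch.nn as nn
from torchao.quantization.quant_api import quantize  # import shown in the hunk above

class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(1024, 1024)  # any model containing an nn.Linear qualifies

    def forward(self, x):
        return self.linear(x)

model = ToyModel().eval()
# model = quantize(model, ...)  # the "1 liner" the README refers to; config omitted here
out = model(torch.randn(1, 1024))
```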
@@ -59,12 +60,10 @@ We've added support for semi-structured 2:4 sparsity with 6% end to end speedups
 
 The code change is a 1 liner with the full example available [here](torchao/sparsity/training/)
 
-
 ```python
 swap_linear_with_semi_sparse_linear(model, {"seq.0": SemiSparseLinear})
 ```
 
-
 ## Newer dtypes
 
 * [MX](torchao/prototype/mx_formats) implementing training and inference support with tensors using the [OCP MX spec](https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf) data types, which can be described as groupwise scaled float8/float6/float4/int8, with the scales being constrained to powers of two. This work is prototype as the hardware support is not available yet.
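The MX bullet above describes the data types as groupwise scaled values with scales constrained to powers of two. The snippet below is a rough, self-contained illustration of that idea in pure PyTorch, not the torchao/prototype/mx_formats implementation; the group size of 32 and the int8-style target range are assumptions made for the example.

```python
# Rough illustration of groupwise scaling with power-of-two scales
# (not the MX spec and not the torchao implementation).
import torch

def groupwise_pow2_quantize(x: torch.Tensor, group_size: int = 32, target_max: float = 127.0):
    groups = x.reshape(-1, group_size)
    amax = groups.abs().amax(dim=1, keepdim=True).clamp(min=1e-12)
    # Constrain each group's scale to a power of two
    scale = torch.exp2(torch.floor(torch.log2(target_max / amax)))
    q = torch.clamp(torch.round(groups * scale), -target_max, target_max)
    return q, scale

x = torch.randn(4, 64)
q, scale = groupwise_pow2_quantize(x)
x_hat = (q / scale).reshape(x.shape)  # approximate reconstruction
```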
@@ -73,12 +72,11 @@ swap_linear_with_semi_sparse_linear(model, {"seq.0": SemiSparseLinear})
 
 ## Composability
 
-A key design principle for us is composability as in any new dtype or layout we provide needs to work with `torch.compile()` and needs to work with `FSDP`. It shouldn't matter if the kernels are written are pure PyTorch, CUDA, C++, or Triton - things should just work! And here is our current strategy
+A key design principle for us is composability as in any new dtype or layout we provide needs to work with `torch.compile()` and needs to work with `FSDP`. It shouldn't matter if the kernels are written in pure PyTorch, CUDA, C++, or Triton - things should just work! And here is our current strategy
 1. Write the dtype, layout or bit packing logic in pure PyTorch and code-generate efficient kernels with torch.compile. You can inspect those kernels with `TORCH_LOGS="output_code" python your_code.py` and check if a single kernel is being generated and if any unnecessary buffers are being created
-2. However once you get a kernel, how do you know how good it is? The best way is to benchmark the code-generated code with the best kernel on the market. But packaging custom CPP/CUDA kernels that work on multiple devices is tedious but we've abstracted all the tedium from you with our [custom ops support](./torchao/csrc/) so if you love writing kernels but hate packaging, we'd love to accept contributions for your custom ops. One key benefit is a kernel written as a custom op will just work with no graph breaks with `torch.compile()`. Compilers are great at optimizations like fusions and overhead reduction but it's challenging for compilers to rewrite the math of an algorithm such that it's faster but also numerically stable so we are betting on both compilers and custom ops
-3. Finally while historically most quantization has been done for inference there is now a thriving area of research combining lower dtypes and sharding. One popular example is [NF4](torchao/dtypes/nf4tensor.py) which is used to create the QLoRA algorithm and you can define the semantics for how custom tensors should be sharded over multiple devices. We gave an accessible talk on [how to do this](https://x.com/HamelHusain/status/1800315287574847701).
+2. However once you get a kernel, how do you know how good it is? The best way is to benchmark the compiler generated code with the best kernel on the market. But packaging custom CPP/CUDA kernels that work on multiple devices is tedious but we've abstracted all the tedium from you with our [custom ops support](./torchao/csrc/) so if you love writing kernels but hate packaging, we'd love to accept contributions for your custom ops. One key benefit is a kernel written as a custom op will just work with no graph breaks with `torch.compile()`. Compilers are great at optimizations like fusions and overhead reduction but it's challenging for compilers to rewrite the math of an algorithm such that it's faster but also numerically stable so we are betting on both compilers and custom ops
+3. Finally while historically most quantization has been done for inference, there is now a thriving area of research combining distributed algorithms and quantization. One popular example is [NF4](torchao/dtypes/nf4tensor.py) which was used to implement the QLoRA algorithm. The NF4 tensor also contains semantics for how it should be sharded over multiple devices so it composes with FSDP. We gave an accessible talk on [how to do this](https://x.com/HamelHusain/status/1800315287574847701).
 
-## Get Started
 
 ### Installation
 `torchao` makes liberal use of several new features in Pytorch, it's recommended to use it with the current nightly or latest stable version of PyTorch.
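Strategy point 1 in the hunk above (write bit packing logic in pure PyTorch, let torch.compile generate the kernel, then inspect it with `TORCH_LOGS="output_code"`) can be sketched as follows; the packing scheme is an illustration only and is not code from torchao.

```python
# Illustration of strategy point 1: pure-PyTorch bit packing, code-generated
# by torch.compile. Packs four 2-bit values into each uint8 byte.
import torch

@torch.compile
def pack_uint2(x: torch.Tensor) -> torch.Tensor:
    x = x.to(torch.uint8).reshape(-1, 4)  # values assumed to be in [0, 3]
    return x[:, 0] | (x[:, 1] << 2) | (x[:, 2] << 4) | (x[:, 3] << 6)

vals = torch.randint(0, 4, (1024,), dtype=torch.uint8)
packed = pack_uint2(vals)  # 256 bytes instead of 1024
# Run with TORCH_LOGS="output_code" python your_script.py to inspect the generated
# kernel and check that a single kernel is emitted with no unnecessary buffers.
```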
@@ -93,6 +91,13 @@ Nightly Release
 pip install --pre torchao-nightly --index-url https://download.pytorch.org/whl/nightly/cu121 # full options are cpu/cu118/cu121/cu124
 ```
 
+From source
+```Shell
+git clone https://github.com/pytorch/ao
+cd ao
+python setup.py install
+```
+
 ## Community Contributions
 
 * [jeromeku](https://github.com/jeromeku) has implemented
@@ -101,29 +106,21 @@ pip install --pre torchao-nightly --index-url https://download.pytorch.org/whl/n
 * [Fused int4/fp16 Quant Matmul](torchao/prototype/hqq) which is particularly useful for compute bound kernels showing 4x speedups over tinygemm for larger batch sizes such as 512
 * [gau-nernst](https://github.com/gau-nernst) fp6 kernels that are 4x faster than fp16 [torchao/prototype/fp6_llm](torchao/prototype/fp6_llm)
 * [vayuda](https://github.com/vayuda) with generic bitpacking kernels that were code generated using pure PyTorch [prototype/common](torchao/prototype/common)
-* [andreaskopf](https://github.com/andreaskoepf) and [melvinebenezer](https://github.com/melvinebenezer) with [1 bit LLMs](torchao/prototype/dtypes) Bitnet 1.58 bitpacked into uin2 and fully code-generated with torch.compile
+* [andreaskopf](https://github.com/andreaskoepf) and [melvinebenezer](https://github.com/melvinebenezer) with [1 bit LLMs](torchao/prototype/dtypes) Bitnet 1.58 bitpacked into uint2 and fully code-generated with torch.compile
 
 ## How to contribute
 
 This repository is currently under heavy development
 * If you have suggestions on the API or use cases you'd like to be covered, please open an [issue](https://github.com/pytorch/ao/issues)
 * If you'd like to co-develop the library with us please join us on #torchao on [discord.gg/cudamode](https://discord.gg/cudamode) - there are a lot of dtypes out there and we could use a lot more hands to make them go brrr
 
-Installation instructions
-
-```Shell
-git clone https://github.com/pytorch/ao
-cd ao
-python setup.py install
-```
-
-If you're contributing a feature ao
+If you're contributing a feature to ao
 ```Shell
 pip install -r dev-requirements.txt
 python setup.py develop
 ```
 
-For *most* developers you probably want to skip building custom C++/CUDA extensions for faster iteration cycles
+For *most* developers you probably want to skip building custom C++/CUDA extensions for faster iteration
 
 ```shell
 USE_CPP=0 python setup.py install
