Commit 2525ed3

Add Claude MD file
stack-info: PR: #2311, branch: drisspg/stack/66
1 parent 9cd5851

File tree

1 file changed: +126 additions, -0 deletions

CLAUDE.md

Lines changed: 126 additions & 0 deletions
@@ -0,0 +1,126 @@
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

torchao is PyTorch's official Architecture Optimization library. It accelerates PyTorch models through advanced quantization and sparsification techniques, optimizing weights, gradients, and activations for both inference and training with minimal code changes.

## Development Commands

### Installation & Build

```bash
# Development install (Python-only mode, fastest for development)
USE_CPP=0 python setup.py develop

# Full build with C++/CUDA extensions
python setup.py develop

# Install the pinned ruff version for linting
pip install ruff==0.11.6
```

### Testing

```bash
# Run specific test files
pytest test/float8/test_base.py
pytest test/quantization/test_quant_api.py
pytest test/dtypes/test_affine_quantized.py

# Run comprehensive float8 tests
./test/float8/test_everything.sh

# Run all tutorials
./tutorials/run_all.sh
```

### Linting & Formatting

```bash
# Install pre-commit hooks (one-time setup)
pre-commit install

# Run all pre-commit checks
pre-commit run --all-files

# Run pre-commit on staged files only
pre-commit run
```

## Architecture Overview

### Core Components

**torchao/quantization/** - Primary quantization APIs
- `quant_api.py` - Main `quantize_()` function for one-line model quantization
- `autoquant.py` - Automatic quantization selection
- Weight-only quantization (INT4/INT8), dynamic quantization, QAT support
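
The scale/zero-point ("affine") mapping behind these INT8 schemes can be sketched in plain Python. This is an illustrative toy, not torchao's internal API; the function names here are made up:

```python
def affine_quantize(values, qmin=-128, qmax=127):
    """Map floats to int8 codes via real ~= scale * (code - zero_point)."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / (qmax - qmin) or 1.0
    zero_point = round(qmin - lo / scale)
    # Round to the nearest code and clamp into the int8 range
    codes = [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]
    return codes, scale, zero_point

def affine_dequantize(codes, scale, zero_point):
    return [scale * (c - zero_point) for c in codes]

weights = [-1.0, -0.25, 0.0, 0.5, 1.0]
codes, scale, zp = affine_quantize(weights)
recovered = affine_dequantize(codes, scale, zp)
# codes are ints in [-128, 127]; recovered is close to weights (error <= scale)
```

torchao's real implementations add per-channel/per-group scales and fused kernels, but the round-trip above is the core idea.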

**torchao/dtypes/** - Custom tensor subclasses with layout and dispatch registration
- `AffineQuantizedTensor` - Base quantized tensor class
- `nf4tensor.py` - NF4 (4-bit NormalFloat) implementation for QLoRA
- `uintx/floatx/` - Unsigned integer and floating-point quantized tensors
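
Unlike the linear scale above, NF4 stores 4-bit indices into a fixed codebook that is denser near zero. A toy version of that idea, using a made-up 16-entry codebook (NOT the real NF4 table):

```python
# Illustrative 16-entry codebook, denser near zero to mirror the shape of a
# normal distribution's quantiles. The actual NF4 values differ.
CODEBOOK = [-1.0, -0.7, -0.5, -0.35, -0.22, -0.13, -0.06, 0.0,
            0.06, 0.13, 0.22, 0.35, 0.5, 0.7, 0.85, 1.0]

def nearest_code(x, codebook):
    """Index of the codebook entry closest to x."""
    return min(range(len(codebook)), key=lambda i: abs(codebook[i] - x))

def nf_quantize(values):
    """Absmax-scale into [-1, 1], then store 4-bit codebook indices."""
    absmax = max(abs(v) for v in values) or 1.0
    return [nearest_code(v / absmax, CODEBOOK) for v in values], absmax

def nf_dequantize(indices, absmax):
    return [CODEBOOK[i] * absmax for i in indices]
```

Only the indices (4 bits each) and one `absmax` per block need to be stored, which is where the QLoRA memory savings come from.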

**torchao/float8/** - High-performance float8 training
- Delivers up to 1.5x speedup on 512-GPU clusters
- `convert_to_float8_training()` - Main entry point
- Full `torch.compile` and FSDP2 compatibility

**torchao/sparsity/** - Structured and unstructured sparsity
- 2:4 semi-structured sparsity with up to 2.4x throughput improvements
- `sparse_api.py` - Main sparsity functions
- Wanda pruning, block-sparse operations
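
The 2:4 pattern means that in every group of four weights, only the two largest-magnitude entries survive, which sparsity-aware hardware can exploit by skipping the zeros. A minimal magnitude-based sketch (torchao's actual pruning criteria and kernels are more sophisticated):

```python
def prune_2_to_4(weights):
    """Zero out the two smallest-magnitude values in each group of 4."""
    assert len(weights) % 4 == 0, "2:4 sparsity operates on groups of 4"
    pruned = []
    for i in range(0, len(weights), 4):
        group = weights[i:i + 4]
        # Indices of the two largest-magnitude entries in this group
        keep = sorted(range(4), key=lambda j: abs(group[j]), reverse=True)[:2]
        pruned.extend(v if j in keep else 0.0 for j, v in enumerate(group))
    return pruned

print(prune_2_to_4([0.1, -0.9, 0.4, 0.05, 1.0, 0.0, -0.2, 0.3]))
```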

**torchao/optim/** - Memory-efficient optimizers
- `AdamW8bit`, `AdamW4bit`, `AdamWFp8` - Quantized optimizers (2-4x memory reduction)
- `CPUOffloadOptimizer` - 60% VRAM reduction via CPU offloading
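
The memory reduction comes from storing optimizer state (e.g. Adam moments) in low precision. A common ingredient is blockwise quantization: each small block gets its own absmax scale, so one outlier cannot wreck the precision of every other value. A toy sketch (block size and rounding are illustrative, not torchao's choices):

```python
def blockwise_quantize(values, block=4, qmax=127):
    """Quantize each block of values with its own absmax scale."""
    blocks = []
    for i in range(0, len(values), block):
        chunk = values[i:i + block]
        scale = max(abs(v) for v in chunk) / qmax or 1.0
        blocks.append(([round(v / scale) for v in chunk], scale))
    return blocks

def blockwise_dequantize(blocks):
    out = []
    for codes, scale in blocks:
        out.extend(c * scale for c in codes)
    return out
```

With a single global scale, the tiny values in a block far from any outlier would all collapse to zero; per-block scales keep them resolvable.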

**torchao/csrc/** - Custom CUDA/CPU kernels
- CUTLASS-based implementations for maximum performance
- ROCm support for AMD GPUs
- CPU kernels with AVX512 optimizations

### Key Design Principles

**Composability**: All custom dtypes work with `torch.compile`, FSDP2, and tensor parallelism out of the box.

**Subclass Architecture**: Tensor subclasses handle layout, dispatch, and kernel registration automatically.

**Hardware Optimization**: Architecture-specific optimizations (CUDA, ROCm, CPU, MPS) with automatic detection.
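
The subclass idea can be illustrated without torch: a wrapper type holds quantized codes plus a scale, and ops routed through it transparently dequantize on the fly. torchao does this through torch's real dispatch machinery; the plain-Python analogue below (all names hypothetical) only shows the shape of the pattern:

```python
class QuantizedValue:
    """Toy stand-in for a quantized tensor subclass (illustrative only)."""

    def __init__(self, codes, scale):
        self.codes = codes          # stored int8-style codes
        self.scale = scale          # per-tensor scale factor

    @classmethod
    def from_float(cls, values, qmax=127):
        scale = max(abs(v) for v in values) / qmax or 1.0
        return cls([round(v / scale) for v in values], scale)

    def dequantize(self):
        return [c * self.scale for c in self.codes]

    def __matmul__(self, other):
        # "Dispatch" point: callers use @ as if this were a dense vector;
        # the wrapper dequantizes before computing the dot product.
        return sum(x * y for x, y in zip(self.dequantize(), other))

q = QuantizedValue.from_float([1.0, -2.0, 0.5])
result = q @ [1.0, 1.0, 1.0]   # approximately 1.0 - 2.0 + 0.5
```

Real subclasses register dispatch handlers for many ops and route to fused kernels instead of dequantizing eagerly, but callers get the same "use it like a normal tensor" experience.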

## Build Configuration

The build system uses environment variables for configuration:

**Core Controls:**
- `USE_CPP=0|1` - Build the C++/CUDA extensions (default: 1; set to 0 to skip them for the fastest dev setup)
- `USE_CPU_KERNELS=0|1` - Enable optimized CPU kernels (Linux only, default: 0)
- `DEBUG=0|1` - Debug build mode

**Experimental Features:**
- `BUILD_TORCHAO_EXPERIMENTAL=1` - Enable experimental cmake builds
- `TORCHAO_BUILD_CPU_AARCH64=1` - ARM64 CPU kernels (auto-enabled on Apple Silicon)
- `TORCHAO_BUILD_KLEIDIAI=1` - KleidiAI library integration
- `TORCHAO_BUILD_EXPERIMENTAL_MPS=1` - MPS acceleration (macOS only)

## Integration Points

- **HuggingFace Transformers**: Built-in backend via `TorchAoConfig`
- **vLLM/SGLang**: LLM serving integration
- **TorchTune**: QLoRA and QAT recipes
- **torch.compile**: Full compiler compatibility
- **FSDP2**: Distributed training support

## Common Development Tasks

**Adding a new quantization technique:** Implement it as a tensor subclass in `torchao/dtypes/`, register dispatch kernels, and expose it in `quant_api.py`.

**Performance optimization:** Custom kernels go in `torchao/csrc/`, with separate extensions for different GPU architectures (SM90a, SM100a).

**Testing:** Follow the existing patterns in the `test/` directory and use `pytest` for individual tests.

## Important Notes

- Always run `pre-commit run --all-files` before committing
- Use `USE_CPP=0` for faster iteration during Python-only development
- CUTLASS kernels have architecture-specific builds (SM90a, SM100a) selected by CUDA version
- Git submodules (CUTLASS) are initialized automatically during the build
