Community interest in AMD
Collecting all the requests for better AMD support for AO here:
- Reddit feedback
- optimizer CPU offload doesn't work outside of CUDA #958
- [ROCm] torchao.float8 should work properly on ROCm #1066
- Issues in GPT-fast: GPTQ quantization not working pytorch-labs/gpt-fast#12, Code is extremely slow! pytorch-labs/gpt-fast#78, AMD quantize pytorch-labs/gpt-fast#6
Model Performance Comparison

| GPU | Technique | Tokens/Second | Relative Speedup | Peak Memory (GB) | Model Size (GB) |
|---|---|---|---|---|---|
| H100 (Llama-3-8B) | Base (bfloat16) | 126.9 | 100.00% | 16.75 | 15.01 |
| | int8wo | 198.85 | 156.70% | 11.05 | 7.52 |
| | int4wo-64 | 241.39 | 190.22% | 7.08 | 4.22 |
| | float8wo | 178.46 | 140.63% | 12.09 | 7.51 |
| | float8dq (per-tensor) | 116.4 | 91.73% | 11.14 | 7.51 |
| | float8dq (per-row) | 154.63 | 121.85% | 11.14 | 7.51 |
| AMD MI300X (Llama-3-8B) | Base (bfloat16) | 159.81 | 100.00% | 16.6 | 15.01 |
| | int8wo | 179.38 | 112.25% | 10.8 | 7.52 |
| | int4wo-64 | 46.43 | 25.88% | 6.57 | 4.22 |
| | float8wo | 177.23 | 110.90% | 11.83 | 7.51 |
| | float8dq (per-tensor) | 51.66 | 32.33% | 12.98 | 7.51 |
| | float8dq (per-row) | 141.72 | 88.68% | 12.98 | 7.51 |
TODO:
- tinyGEMM: int4wo quantization struggles outside of raw GEMM performance; it looks like we are not capturing CUDA/HIP graphs properly. The same issue may also affect float8 per-tensor quantization.
- sparse-marlin: need to fix the compilation issues outlined in [wip] sparse marlin rocm compilation #1847
- fp8 weight only, int8 weight only: seeing low tok/s on initial warm-up runs; need to root-cause this issue. Possibly a caching effect?