Community interest in AMD
Collecting all the requests for better AMD support for AO here:
- Reddit feedback
- optimizer CPU offload doesn't work outside of CUDA #958
- [ROCm] torchao.float8 should work properly on ROCm #1066
- Issues in GPT-fast: GPTQ quantization not working pytorch-labs/gpt-fast#12, Code is extremely slow! pytorch-labs/gpt-fast#78, AMD quantize pytorch-labs/gpt-fast#6
Model Performance Comparison

| GPU | Technique | Tokens/Second | Relative Speedup | Peak Memory (GB) | Model Size (GB) |
|---|---|---|---|---|---|
| H100 (Llama-3-8B) | Base (bfloat16) | 126.9 | 100.00% | 16.75 | 15.01 |
| | int8wo | 198.85 | 156.70% | 11.05 | 7.52 |
| | int4wo-64 | 241.39 | 190.22% | 7.08 | 4.22 |
| | float8wo | 178.46 | 140.63% | 12.09 | 7.51 |
| | float8dq (per-tensor) | 116.4 | 91.73% | 11.14 | 7.51 |
| | float8dq (per-row) | 154.63 | 121.85% | 11.14 | 7.51 |
| AMD MI300X (Llama-3-8B) | Base (bfloat16) | 159.81 | 100.00% | 16.6 | 15.01 |
| | int8wo | 179.38 | 112.25% | 10.8 | 7.52 |
| | int4wo-64 | 46.43 | 25.88% | 6.57 | 4.22 |
| | float8wo | 177.23 | 110.90% | 11.83 | 7.51 |
| | float8dq (per-tensor) | 51.66 | 32.33% | 12.98 | 7.51 |
| | float8dq (per-row) | 141.72 | 88.68% | 12.98 | 7.51 |
TODO:
- tinyGEMM: int4wo quantization struggles outside of raw GEMM performance; it looks like we are not capturing CUDA/HIP graphs properly. The same issue may also affect float8 per-tensor quantization.
- sparse-marlin: need to fix the compilation issues outlined in [wip] sparse marlin rocm compilation #1847
- fp8 weight only, int8 weight only: seeing low tok/s on initial warm-up runs; need to root-cause this issue. Possibly a caching effect?