
AMD integration tracker #1260

Open
@jcaip

Description

Community interest in AMD

Collecting all the requests for better AMD support in AO here:

Model Performance Comparison

| Model | Technique | Tokens/Second | Relative Speedup | Peak Memory (GB) | Model Size (GB) |
|---|---|---|---|---|---|
| Llama-3-8B (H100) | Base (bfloat16) | 126.9 | 100.00% | 16.75 | 15.01 |
| | int8wo | 198.85 | 156.70% | 11.05 | 7.52 |
| | int4wo-64 | 241.39 | 190.22% | 7.08 | 4.22 |
| | float8wo | 178.46 | 140.63% | 12.09 | 7.51 |
| | float8dq (per-tensor) | 116.4 | 91.73% | 11.14 | 7.51 |
| | float8dq (per-row) | 154.63 | 121.85% | 11.14 | 7.51 |
| Llama-3-8B (AMD MI300X) | Base (bfloat16) | 159.81 | 100.00% | 16.6 | 15.01 |
| | int8wo | 179.38 | 112.25% | 10.8 | 7.52 |
| | int4wo-64 | 46.43 | 25.88% | 6.57 | 4.22 |
| | float8wo | 177.23 | 110.90% | 11.83 | 7.51 |
| | float8dq (per-tensor) | 51.66 | 32.33% | 12.98 | 7.51 |
| | float8dq (per-row) | 141.72 | 88.68% | 12.98 | 7.51 |
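
For reference, a minimal sketch of how the techniques in the table are applied with torchao's `quantize_` API. This is not the benchmark harness that produced the numbers above: the toy model, shapes, and device are placeholders, and the `PerRow` import assumes a recent torchao version.

```python
import torch
import torch.nn as nn
from torchao.quantization import (
    quantize_,
    int8_weight_only,
    int4_weight_only,
    float8_weight_only,
    float8_dynamic_activation_float8_weight,
    PerRow,
)

# Stand-in for Llama-3-8B: quantize_ swaps out Linear weights in place,
# so any nn.Module with Linear layers works for illustration.
model = nn.Sequential(nn.Linear(4096, 4096)).to(torch.bfloat16).to("cuda")

# Each call corresponds to one row of the table above; pick one per run.
quantize_(model, int8_weight_only())                           # int8wo
# quantize_(model, int4_weight_only(group_size=64))            # int4wo-64
# quantize_(model, float8_weight_only())                       # float8wo
# quantize_(model, float8_dynamic_activation_float8_weight())  # float8dq (per-tensor)
# quantize_(model, float8_dynamic_activation_float8_weight(granularity=PerRow()))  # float8dq (per-row)

x = torch.randn(1, 4096, dtype=torch.bfloat16, device="cuda")
out = model(x)
```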

TODO:

  • tinyGEMM: int4wo quantization struggles outside of raw GEMM performance; it looks like we are not CUDA/HIP graphing properly (see the sketch after this list). The same issue may also affect FP8 per-tensor quantization.
  • sparse-marlin: need to fix the compilation issues outlined in [wip] sparse marlin rocm compilation #1847.
  • FP8 weight-only and int8 weight-only: seeing low tokens/second on initial warm-up runs; need to root-cause this. Possibly some caching going on?
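
A minimal sketch of the graph-capture angle from the tinyGEMM item above, assuming the decode step is wrapped with `torch.compile`'s "reduce-overhead" mode (which uses CUDA graphs on NVIDIA and HIP graphs on ROCm). The toy model is a placeholder; the point is that `fullgraph=True` turns silent graph breaks into hard errors so capture failures are easy to surface.

```python
import torch
import torch.nn as nn

# Placeholder decode step; the real benchmark compiles the model's
# generate/decode path the same way.
model = nn.Sequential(nn.Linear(4096, 4096)).to(torch.bfloat16).to("cuda")

# "reduce-overhead" wraps the compiled region in CUDA/HIP graphs.
# If capture silently fails, every decode step pays full kernel-launch
# overhead, which would match the low int4wo tokens/second on MI300X
# despite a fast GEMM kernel.
decode_step = torch.compile(model, mode="reduce-overhead", fullgraph=True)

x = torch.randn(1, 4096, dtype=torch.bfloat16, device="cuda")
for _ in range(3):  # warm-up runs trigger compilation and graph capture
    decode_step(x)
torch.cuda.synchronize()
```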

DONE:
