
AMD MI300A Unified Memory Support #145693

Open
@lancelotnd

Description


🚀 The feature, motivation and pitch

I am working on improving performance for LLM workloads on the AMD Instinct™ MI300A Accelerator. This APU has a fully unified memory architecture that PyTorch does not take advantage of at this time. Because the GPU and CPU share the same physical memory, memcpy operations become redundant; the duplicated buffers waste capacity and limit the size of the models we can train. Adding unified-memory support to PyTorch's ROCm backend for this APU would enable zero-copy operations.

The motivation is similar to that of #140787, but for ROCm instead of MPS.

Given that this APU targets the most demanding HPC ML workloads, there is great interest in optimizing PyTorch performance for it. Notably, El Capitan, the #1 supercomputer on the TOP500 list, runs exclusively on AMD's MI300A.

Alternatives

No response

Additional context

To facilitate understanding, here are more details on the kind of changes this involves.

To understand the differences in operations between non-unified and unified memory, let us consider a regular matrix multiplication of matrices $A$ and $B$ where the result is stored in matrix $C$.

In a non-unified setup with a discrete GPU (device), the steps are (see the sketch after this list):

  1. malloc matrices $A, B, C$ on the host, each of size $n \times n$.
  2. ... values are written to matrices $A$, $B$
  3. cudaMalloc to allocate device memory for matrices $A', B', C'$
  4. cudaMemcpy $A \rightarrow A'$ and $B \rightarrow B'$ (HostToDevice)
  5. Kernel launch (results are written to $C'$).
  6. cudaMemcpy $C' \rightarrow C$ (DeviceToHost) to get the results back
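
For reference, a minimal CUDA sketch of that flow; the matmul kernel, sizes, and launch configuration are illustrative only, and on ROCm the equivalent hip* calls apply:

```cuda
// Illustrative sketch of the non-unified flow described above.
// Kernel, sizes, and launch configuration are placeholders.
#include <cuda_runtime.h>
#include <cstdlib>

__global__ void matmul(const float* A, const float* B, float* C, int n) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n || col >= n) return;
    float acc = 0.0f;
    for (int k = 0; k < n; ++k) acc += A[row * n + k] * B[k * n + col];
    C[row * n + col] = acc;
}

int main() {
    const int n = 1024;
    const size_t bytes = (size_t)n * n * sizeof(float);

    // 1. Host allocations for A, B, C.
    float* A = (float*)malloc(bytes);
    float* B = (float*)malloc(bytes);
    float* C = (float*)malloc(bytes);
    // 2. ... fill A and B on the host ...

    // 3. Separate device allocations A', B', C'.
    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);

    // 4. Explicit host-to-device copies.
    cudaMemcpy(dA, A, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B, bytes, cudaMemcpyHostToDevice);

    // 5. Kernel launch; results are written into C'.
    dim3 block(16, 16), grid((n + 15) / 16, (n + 15) / 16);
    matmul<<<grid, block>>>(dA, dB, dC, n);

    // 6. Device-to-host copy to bring the result back into C.
    cudaMemcpy(C, dC, bytes, cudaMemcpyDeviceToHost);

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    free(A); free(B); free(C);
    return 0;
}
```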

Whereas with unified memory you would have (see the sketch after this list):

  1. cudaMallocManaged matrices $A, B, C$ (accessible from both host and device), each of size $n \times n$.
  2. ... values are written to matrices $A$, $B$
  3. Kernel launch (results are written to $C$).
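
And the corresponding sketch of the unified-memory flow, reusing the same illustrative kernel; cudaDeviceSynchronize stands in for whatever synchronization the caller would already perform:

```cuda
// Same computation with managed (unified) memory: one allocation per matrix
// and no explicit memcpy calls. Matches steps 1-3 of the list above.
#include <cuda_runtime.h>

__global__ void matmul(const float* A, const float* B, float* C, int n) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n || col >= n) return;
    float acc = 0.0f;
    for (int k = 0; k < n; ++k) acc += A[row * n + k] * B[k * n + col];
    C[row * n + col] = acc;
}

int main() {
    const int n = 1024;
    const size_t bytes = (size_t)n * n * sizeof(float);

    // 1. One managed allocation per matrix, visible to both CPU and GPU.
    float *A, *B, *C;
    cudaMallocManaged(&A, bytes);
    cudaMallocManaged(&B, bytes);
    cudaMallocManaged(&C, bytes);
    // 2. ... fill A and B directly from the host ...

    // 3. Kernel launch writes into C; no HostToDevice/DeviceToHost copies.
    dim3 block(16, 16), grid((n + 15) / 16, (n + 15) / 16);
    matmul<<<grid, block>>>(A, B, C, n);
    cudaDeviceSynchronize();  // results are now visible to the host in C

    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```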

On machines with discrete GPUs, unified memory is purely virtual: it still results in memory movement through page faults and page migrations, which adds significant overhead.

On architectures where the CPU and GPU share the same physical memory, such as Apple silicon Macs and the AMD MI300A, any memcpy operation becomes pointless and wastes both space and time.

The quickest and dirtiest hack to support unified memory in PyTorch is to replace every cudaMalloc with cudaMallocManaged and get rid of the memcpy operations, as done in this paper. This, however, is neither ideal nor portable. A sketch of what that looks like follows.
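
As a rough illustration of that hack, the wrappers below forward device allocations to managed memory and reduce copies to plain memcpy (or nothing when source and destination coincide); unifiedMalloc and unifiedMemcpy are made-up names, not existing PyTorch or CUDA symbols:

```cuda
// Hypothetical "swap the allocator" hack; unifiedMalloc and unifiedMemcpy
// are illustrative names only.
#include <cuda_runtime.h>
#include <cstring>

// Allocate managed memory wherever device memory was requested.
static cudaError_t unifiedMalloc(void** ptr, size_t size) {
    return cudaMallocManaged(ptr, size);  // instead of cudaMalloc
}

// On a truly unified system the "copy" can be skipped when both pointers
// refer to the same managed allocation; otherwise fall back to a plain memcpy.
static cudaError_t unifiedMemcpy(void* dst, const void* src, size_t count,
                                 cudaMemcpyKind kind) {
    (void)kind;
    if (dst != src) std::memcpy(dst, src, count);
    return cudaSuccess;
}

int main() {
    float* p = nullptr;
    unifiedMalloc((void**)&p, 16 * sizeof(float));                    // managed, not device
    unifiedMemcpy(p, p, 16 * sizeof(float), cudaMemcpyHostToDevice);  // becomes a no-op
    cudaFree(p);
    return 0;
}
```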

Perhaps a better way would be to make unified memory something that can be toggled on or off. Given that this is a relatively new architecture, more hardware with this configuration is likely to come from other manufacturers, so device-agnostic support for unified memory would be valuable (see the sketch below).
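
A hedged sketch of what such a toggle could look like at the allocator level; the environment variable and function names below are purely hypothetical, not existing PyTorch options:

```cuda
// Hypothetical allocator-level toggle: pick managed vs. device memory at
// runtime. The environment variable name is illustrative only.
#include <cuda_runtime.h>
#include <cstdlib>
#include <cstring>

static bool use_unified_memory() {
    // A real implementation would more likely key off a device property
    // (e.g. shared physical memory) or an allocator config string.
    const char* env = std::getenv("EXAMPLE_USE_UNIFIED_MEMORY");
    return env != nullptr && std::strcmp(env, "1") == 0;
}

static cudaError_t allocator_malloc(void** ptr, size_t size) {
    return use_unified_memory() ? cudaMallocManaged(ptr, size)
                                : cudaMalloc(ptr, size);
}

int main() {
    float* p = nullptr;
    allocator_malloc((void**)&p, 1024 * sizeof(float));
    cudaFree(p);  // cudaFree releases both managed and device allocations
    return 0;
}
```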

cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang @naromero77amd

Metadata

Assignees

No one assigned

    Labels

    feature (A request for a proper, new feature)
    module: rocm (AMD GPU support for PyTorch)
    topic: new features (topic category)
    triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
