
AMD MI300A Unified Memory Support #145693

Open
@lancelotnd

Description


🚀 The feature, motivation and pitch

I am working on improving performance for LLM workloads on the AMD Instinct™ MI300A Accelerator. This APU has a fully unified memory architecture that PyTorch does not take advantage of at this time. Because the GPU and CPU share the same physical memory, memcpy operations become redundant; the duplicated buffers waste capacity and limit the size of the models we can train. Adding unified-memory support to PyTorch's ROCm backend for this APU would enable zero-copy operations.

The motivation is similar to that of #140787, but for ROCm instead of MPS.

Given that this APU targets the most demanding HPC ML workloads, there is great interest in optimizing PyTorch performance for it. Notably, El Capitan, the #1 supercomputer on the TOP500 list, runs exclusively on AMD's MI300A.

Alternatives

No response

Additional context

To facilitate understanding, here are more details on the kind of changes this involves.

To understand the differences in operations between non-unified and unified memory, let us consider a regular matrix multiplication of matrices $A$ and $B$ where the result is stored in matrix $C$.

In a non-unified setup with a discrete GPU (device), the steps are (see the sketch after this list):

  1. malloc matrices $A, B, C$ on the host, each of size $n \times n$.
  2. ... values are written to matrices $A$, $B$
  3. cudaMalloc to allocate device memory for matrices $A', B', C'$
  4. cudaMemcpy $A \rightarrow A'$ and $B \rightarrow B'$ (HostToDevice)
  5. Kernel launch (results are written to $C'$).
  6. cudaMemcpy $C' \rightarrow C$ (DeviceToHost) to get the results back
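
For reference, a minimal CUDA sketch of that flow; the matmul kernel, sizes, and launch configuration are illustrative only, and on ROCm the equivalent hip* calls apply:

```cuda
// Illustrative sketch of the non-unified flow described above.
// Kernel, sizes, and launch configuration are placeholders.
#include <cuda_runtime.h>
#include <cstdlib>

__global__ void matmul(const float* A, const float* B, float* C, int n) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n || col >= n) return;
    float acc = 0.0f;
    for (int k = 0; k < n; ++k) acc += A[row * n + k] * B[k * n + col];
    C[row * n + col] = acc;
}

int main() {
    const int n = 1024;
    const size_t bytes = (size_t)n * n * sizeof(float);

    // 1. Host allocations for A, B, C.
    float* A = (float*)malloc(bytes);
    float* B = (float*)malloc(bytes);
    float* C = (float*)malloc(bytes);
    // 2. ... fill A and B on the host ...

    // 3. Separate device allocations A', B', C'.
    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);

    // 4. Explicit host-to-device copies.
    cudaMemcpy(dA, A, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B, bytes, cudaMemcpyHostToDevice);

    // 5. Kernel launch; results are written into C'.
    dim3 block(16, 16), grid((n + 15) / 16, (n + 15) / 16);
    matmul<<<grid, block>>>(dA, dB, dC, n);

    // 6. Device-to-host copy to bring the result back into C.
    cudaMemcpy(C, dC, bytes, cudaMemcpyDeviceToHost);

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    free(A); free(B); free(C);
    return 0;
}
```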

Whereas with unified memory you would have (see the sketch after this list):

  1. cudaMallocManaged matrices $A, B, C$ (accessible from both host and device), each of size $n \times n$.
  2. ... values are written to matrices $A$, $B$
  3. Kernel launch (results are written to $C$).
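
And the corresponding sketch of the unified-memory flow, reusing the same illustrative kernel; cudaDeviceSynchronize stands in for whatever synchronization the caller would already perform:

```cuda
// Same computation with managed (unified) memory: one allocation per matrix
// and no explicit memcpy calls. Matches steps 1-3 of the list above.
#include <cuda_runtime.h>

__global__ void matmul(const float* A, const float* B, float* C, int n) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n || col >= n) return;
    float acc = 0.0f;
    for (int k = 0; k < n; ++k) acc += A[row * n + k] * B[k * n + col];
    C[row * n + col] = acc;
}

int main() {
    const int n = 1024;
    const size_t bytes = (size_t)n * n * sizeof(float);

    // 1. One managed allocation per matrix, visible to both CPU and GPU.
    float *A, *B, *C;
    cudaMallocManaged(&A, bytes);
    cudaMallocManaged(&B, bytes);
    cudaMallocManaged(&C, bytes);
    // 2. ... fill A and B directly from the host ...

    // 3. Kernel launch writes into C; no HostToDevice/DeviceToHost copies.
    dim3 block(16, 16), grid((n + 15) / 16, (n + 15) / 16);
    matmul<<<grid, block>>>(A, B, C, n);
    cudaDeviceSynchronize();  // results are now visible to the host in C

    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```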

On machines with discrete GPUs, unified memory is purely virtual: it still results in memory movement through page faults and page migrations, which adds significant overhead.

On architectures where the CPU and GPU share the same physical memory, such as Apple silicon Macs and the AMD MI300A, any memcpy operation becomes pointless and wastes both space and time.

The quickest and dirtiest hack to support unified memory in PyTorch is to replace every cudaMalloc with cudaMallocManaged and get rid of the memcpy operations, as done in this paper. This, however, is neither ideal nor portable. A sketch of what that looks like follows.
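
As a rough illustration of that hack, the wrappers below forward device allocations to managed memory and reduce copies to plain memcpy (or nothing when source and destination coincide); unifiedMalloc and unifiedMemcpy are made-up names, not existing PyTorch or CUDA symbols:

```cuda
// Hypothetical "swap the allocator" hack; unifiedMalloc and unifiedMemcpy
// are illustrative names only.
#include <cuda_runtime.h>
#include <cstring>

// Allocate managed memory wherever device memory was requested.
static cudaError_t unifiedMalloc(void** ptr, size_t size) {
    return cudaMallocManaged(ptr, size);  // instead of cudaMalloc
}

// On a truly unified system the "copy" can be skipped when both pointers
// refer to the same managed allocation; otherwise fall back to a plain memcpy.
static cudaError_t unifiedMemcpy(void* dst, const void* src, size_t count,
                                 cudaMemcpyKind kind) {
    (void)kind;
    if (dst != src) std::memcpy(dst, src, count);
    return cudaSuccess;
}

int main() {
    float* p = nullptr;
    unifiedMalloc((void**)&p, 16 * sizeof(float));                    // managed, not device
    unifiedMemcpy(p, p, 16 * sizeof(float), cudaMemcpyHostToDevice);  // becomes a no-op
    cudaFree(p);
    return 0;
}
```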

Perhaps a better way would be to make unified memory something that can be toggled on or off. Given that this is a relatively new architecture, more hardware with this configuration is likely to come from other manufacturers, so device-agnostic support for unified memory would be valuable (see the sketch below).
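
A hedged sketch of what such a toggle could look like at the allocator level; the environment variable and function names below are purely hypothetical, not existing PyTorch options:

```cuda
// Hypothetical allocator-level toggle: pick managed vs. device memory at
// runtime. The environment variable name is illustrative only.
#include <cuda_runtime.h>
#include <cstdlib>
#include <cstring>

static bool use_unified_memory() {
    // A real implementation would more likely key off a device property
    // (e.g. shared physical memory) or an allocator config string.
    const char* env = std::getenv("EXAMPLE_USE_UNIFIED_MEMORY");
    return env != nullptr && std::strcmp(env, "1") == 0;
}

static cudaError_t allocator_malloc(void** ptr, size_t size) {
    return use_unified_memory() ? cudaMallocManaged(ptr, size)
                                : cudaMalloc(ptr, size);
}

int main() {
    float* p = nullptr;
    allocator_malloc((void**)&p, 1024 * sizeof(float));
    cudaFree(p);  // cudaFree releases both managed and device allocations
    return 0;
}
```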

cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang @naromero77amd

Metadata

Assignees

No one assigned

    Labels

    feature (A request for a proper, new feature)
    module: rocm (AMD GPU support for PyTorch)
    topic: new features (topic category)
    triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
