Component
Other
Problem Statement
To support the most efficient data transfers between GPUs over the network, each GPU must be paired with a NIC that is optimally placed in the PCI topology to enable GPUDirect. In the first version of the SRIOV VF DRA driver, the scheduler did this pairing using a constraint on the PCIRoot attribute. However, newer generations of hardware break the assumption that the NIC and GPU are colocated under the same PCI switch.
For example, the ConnectX-8 introduces a new feature, Data Direct: docs.nvidia.com/multi-node-nvlink-systems/grace-blackwell-cx8-gpudirect-rdma-guide/gpudirect_rdma_testing.html
Proposed Solution
To support Data Direct setups like GB300 and some GB200, and to properly support other setups without Data Direct that share the same concept of inline complexes, we need to add a new attribute for matching GPUs and NICs. One solution could be to use rdma_topo or an equivalent library: https://github.com/linux-rdma/rdma-core/blob/master/kernel-boot/rdma_topo . This tool can output the NVCX complexes, so we could select an identifier for each complex and publish it as an attribute with the same value on every GPU and NIC belonging to that complex. That would allow matching GPUs and NICs in pairs even in complex topologies with multiple NICs per GPU or multiple GPUs per NIC.
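To illustrate the idea, here is a minimal sketch of pairing by a shared complex identifier. It does not call rdma_topo or any DRA API; the `Device` type, the `complex_id` field, and the device names are hypothetical and only stand in for whatever attribute the drivers would end up publishing.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    name: str        # hypothetical device name as published by a driver
    kind: str        # "gpu" or "nic"
    complex_id: str  # hypothetical identifier of the complex the device belongs to


def group_by_complex(devices):
    """Return {complex_id: {"gpu": [...], "nic": [...]}} for the given devices."""
    groups = defaultdict(lambda: {"gpu": [], "nic": []})
    for dev in devices:
        groups[dev.complex_id][dev.kind].append(dev.name)
    return dict(groups)


def pair_devices(devices):
    """Pair each GPU with a NIC from the same complex, round-robin over the NICs."""
    pairs = []
    for _, members in sorted(group_by_complex(devices).items()):
        nics = sorted(members["nic"])
        if not nics:
            continue  # no NIC shares a complex with these GPUs
        for i, gpu in enumerate(sorted(members["gpu"])):
            pairs.append((gpu, nics[i % len(nics)]))
    return pairs


# Example topology: two GPUs sharing one NIC, and one GPU with two NICs.
devices = [
    Device("gpu0", "gpu", "complex-a"),
    Device("gpu1", "gpu", "complex-a"),
    Device("nic0", "nic", "complex-a"),
    Device("gpu2", "gpu", "complex-b"),
    Device("nic1", "nic", "complex-b"),
    Device("nic2", "nic", "complex-b"),
]
pairs = pair_devices(devices)
```

In the driver itself the scheduler would do this matching through a constraint on the new attribute, in the same way the PCIRoot constraint is used today; the round-robin choice above is only one possible policy when a complex holds several NICs.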
In addition, an attribute exposing the NUMA node the GPU is attached to would allow matching GPUs with NICs in VR setups while support for those setups is being added through rdma_topo.
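As a rough sketch of how the NUMA attribute could act as a fallback, the snippet below prefers a shared complex identifier and only falls back to the NUMA node when no complex information is available. The `Dev` type and the `numa_node` and `complex_id` fields are hypothetical names, not existing driver attributes.

```python
from typing import NamedTuple, Optional


class Dev(NamedTuple):
    name: str
    numa_node: int                    # hypothetical NUMA node attribute
    complex_id: Optional[str] = None  # hypothetical; None when no complex is known


def candidate_nics(gpu, nics):
    """Return names of NICs that could be paired with the given GPU."""
    if gpu.complex_id is not None:
        same_complex = [n.name for n in nics if n.complex_id == gpu.complex_id]
        if same_complex:
            return same_complex
    # Fallback: no usable complex information, match on the NUMA node instead.
    return [n.name for n in nics if n.numa_node == gpu.numa_node]


nics = [Dev("nic0", 0, "c0"), Dev("nic1", 0), Dev("nic2", 1)]
by_complex = candidate_nics(Dev("gpu0", 0, "c0"), nics)
by_numa = candidate_nics(Dev("gpu1", 1), nics)
```

The NUMA match is coarser than the complex match, which is why the sketch only uses it when the complex identifier is missing.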
Alternatives Considered
Creating a new DRA driver for the Data Direct DMA function would solve the issue, but it would require two-step matching in the scheduler (GPU to DMA function, then DMA function to NIC), even though the DMA function itself is not needed in the container for proper acceleration of the traffic. Additionally, it would not support setups without the Data Direct interface.
Scope
Small: CLI flag, config option, minor behavior change
Upstream Kubernetes Dependencies
No response
Additional Context
No response