[Feature]: Add support for NVCX complex attributes #1123

@maelk

Description

Component

Other

Problem Statement

To support the most efficient GPU-to-GPU data transfers over the network, each GPU must be paired with a NIC that is optimally placed in the PCI topology for GPUDirect. In the first version of the SRIOV VF DRA driver, the scheduler performed this pairing using a constraint on the PCIRoot attribute. However, newer hardware generations break the assumption that the NIC and GPU are colocated under the same PCI switch.

The ConnectX-8 introduces a new feature, Data Direct: docs.nvidia.com/multi-node-nvlink-systems/grace-blackwell-cx8-gpudirect-rdma-guide/gpudirect_rdma_testing.html

Proposed Solution

To support Data Direct setups such as GB300 and some GB200 systems, and to properly support setups without Data Direct that share the same concept of inline complexes, we need a new attribute for matching GPUs and NICs. One option is to use rdma_topo (https://github.com/linux-rdma/rdma-core/blob/master/kernel-boot/rdma_topo) or an equivalent library. This tool can output the NVCX complexes, so we could select a per-complex identifier that is identical for all GPUs and NICs belonging to that complex. That would allow matching GPUs and NICs in pairs even in complex topologies with multiple NICs per GPU or multiple GPUs per NIC.
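As a rough illustration of the matching idea, here is a minimal Python sketch. It assumes each device carries a hypothetical `complex_id` attribute; the attribute name, the `Device` type, and the sample values are made up for illustration and are not taken from rdma_topo or the driver. It only shows that a shared identifier is enough to enumerate valid GPU/NIC pairs, including when a complex holds several NICs.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    name: str
    kind: str        # "gpu" or "nic"
    complex_id: str  # hypothetical attribute: identifier of the NVCX complex


def group_by_complex(devices):
    """Group device names by complex identifier, split by device kind."""
    groups = defaultdict(lambda: {"gpu": [], "nic": []})
    for dev in devices:
        groups[dev.complex_id][dev.kind].append(dev.name)
    return dict(groups)


def pair_candidates(devices):
    """Return every (gpu, nic) pair whose members share a complex identifier."""
    pairs = []
    for _, members in sorted(group_by_complex(devices).items()):
        for gpu in members["gpu"]:
            for nic in members["nic"]:
                pairs.append((gpu, nic))
    return pairs


# Sample inventory (illustrative values only).
devices = [
    Device("gpu0", "gpu", "complex-a"),
    Device("nic0", "nic", "complex-a"),
    Device("nic1", "nic", "complex-a"),
    Device("gpu1", "gpu", "complex-b"),
    Device("nic2", "nic", "complex-b"),
]

for gpu, nic in pair_candidates(devices):
    print(gpu, nic)
```

In a DRA setting the scheduler would do the equivalent comparison through an attribute constraint across the GPU and NIC requests; the sketch only models the equality check on the shared identifier.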

In addition, an attribute exposing the NUMA node that the GPU is connected to would allow matching the GPU with the NIC in VR setups while support for those setups is being added through rdma_topo.
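A minimal sketch of how the NUMA attribute could serve as an interim match key, written in Python for brevity. The attribute names `complex_id` and `numa_node` are hypothetical, and the precedence shown (use the complex identifier when both devices report one, otherwise fall back to the NUMA node) is an assumption rather than something this issue specifies.

```python
def same_group(gpu: dict, nic: dict) -> bool:
    """Decide whether a GPU and a NIC should be considered a match.

    Assumption: a complex identifier, when present on both devices,
    takes precedence; the NUMA node is only used as a fallback.
    """
    gpu_cx, nic_cx = gpu.get("complex_id"), nic.get("complex_id")
    if gpu_cx is not None and nic_cx is not None:
        return gpu_cx == nic_cx

    gpu_numa, nic_numa = gpu.get("numa_node"), nic.get("numa_node")
    # A missing NUMA node never matches, to avoid pairing unknown devices.
    return gpu_numa is not None and gpu_numa == nic_numa


# Illustrative values only: no complex identifier, so NUMA decides.
print(same_group({"numa_node": 0}, {"numa_node": 0}))
# Complex identifiers are present on both, so they override NUMA.
print(same_group({"complex_id": "a", "numa_node": 0},
                 {"complex_id": "b", "numa_node": 0}))
```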

Alternatives Considered

Creating a new DRA driver for the Data Direct DMA function would solve the issue, but it would require two-step matching in the scheduler (GPU to DMA function, then DMA function to NIC), even though the DMA function itself is not needed inside a container for proper traffic acceleration. Additionally, it would not support setups without the Data Direct interface.

Scope

Small: CLI flag, config option, minor behavior change

Upstream Kubernetes Dependencies

No response

Additional Context

No response

Metadata

Labels

needs-triage, size/small

Projects

Status
Backlog
