Fix VFIO discovery and Unconfigure for pre-bound GPUs #1090
Previously, enumerateGpuVfioDevices treated any NVIDIA GPU not on the nvidia driver as a VFIO candidate. This caused driverless GPUs (stuck after a failed unbind) and nvidia-bound GPUs to be advertised in the ResourceSlice as allocatable VFIO devices. When the scheduler picked one, the prepare would fail or hang trying to unbind from nvidia. Check the actual kernel driver binding via sysfs before adding a GPU to the VFIO device list. Only GPUs currently bound to vfio-pci are advertised.
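The sysfs check described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `boundDriver`, `driverNameFromLink`, and `isVfioCandidate` are hypothetical names, and the PCI address in `main` is only an example. It relies on the kernel exposing the bound driver as the `driver` symlink under `/sys/bus/pci/devices/<addr>/`.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// driverNameFromLink extracts the driver name from the target of a
// /sys/bus/pci/devices/<addr>/driver symlink (hypothetical helper).
func driverNameFromLink(target string) string {
	return filepath.Base(target)
}

// boundDriver returns the kernel driver currently bound to the PCI device,
// or "" when the driver symlink is missing (no driver bound, or no such device).
func boundDriver(sysfsRoot, pciAddr string) (string, error) {
	link := filepath.Join(sysfsRoot, "bus", "pci", "devices", pciAddr, "driver")
	target, err := os.Readlink(link)
	if errors.Is(err, fs.ErrNotExist) {
		return "", nil // treated as driverless in this sketch
	}
	if err != nil {
		return "", fmt.Errorf("reading %s: %w", link, err)
	}
	return driverNameFromLink(target), nil
}

// isVfioCandidate reports whether a device should be advertised as a VFIO
// device: only when it is bound to vfio-pci right now.
func isVfioCandidate(driver string) bool {
	return driver == "vfio-pci"
}

func main() {
	// Example address only; substitute a real GPU address from lspci.
	driver, err := boundDriver("/sys", "0000:3b:00.0")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("driver=%q advertise=%v\n", driver, isVfioCandidate(driver))
}
```

With this shape, driverless and nvidia-bound GPUs both fall out of the VFIO list because neither reports `vfio-pci`.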
Track the driver binding before Configure and check it in Unconfigure. If the GPU was already on vfio-pci (pre-bound at boot via kernel cmdline), leave it on vfio-pci instead of rebinding to nvidia. On H100 SXM5 systems with NVLink, rebinding to nvidia hangs indefinitely during fabric reconfiguration.
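A rough sketch of the tracking described above, under stated assumptions: the `binder` interface, `vfioManager` type, and `fakeBinder` are hypothetical stand-ins, not the types in this repository. Only the idea (record the driver before Configure, skip the rebind in Unconfigure when it was already `vfio-pci`) comes from the commit message. Falling back to the nvidia rebind when no record exists is an assumption of the sketch.

```go
package main

import "fmt"

const vfioPciDriver = "vfio-pci"

// binder is a hypothetical abstraction over sysfs bind/unbind operations.
type binder interface {
	CurrentDriver(pciAddr string) (string, error)
	Bind(pciAddr, driver string) error
}

// vfioManager remembers which driver each GPU was on before Configure.
type vfioManager struct {
	b                  binder
	preConfigureDriver map[string]string
}

func newVfioManager(b binder) *vfioManager {
	return &vfioManager{b: b, preConfigureDriver: map[string]string{}}
}

// Configure records the current driver, then binds to vfio-pci if needed.
func (m *vfioManager) Configure(addr string) error {
	drv, err := m.b.CurrentDriver(addr)
	if err != nil {
		return err
	}
	m.preConfigureDriver[addr] = drv
	if drv == vfioPciDriver {
		return nil // pre-bound at boot: nothing to do
	}
	return m.b.Bind(addr, vfioPciDriver)
}

// Unconfigure rebinds to nvidia unless the GPU was already on vfio-pci
// before Configure ran.
func (m *vfioManager) Unconfigure(addr string) error {
	prev, ok := m.preConfigureDriver[addr]
	delete(m.preConfigureDriver, addr)
	if ok && prev == vfioPciDriver {
		return nil // leave pre-bound GPUs on vfio-pci
	}
	return m.b.Bind(addr, "nvidia")
}

// fakeBinder is an in-memory binder used only to exercise the sketch.
type fakeBinder struct {
	drivers map[string]string
	binds   []string
}

func (f *fakeBinder) CurrentDriver(addr string) (string, error) {
	return f.drivers[addr], nil
}

func (f *fakeBinder) Bind(addr, driver string) error {
	f.drivers[addr] = driver
	f.binds = append(f.binds, addr+"->"+driver)
	return nil
}

func main() {
	fb := &fakeBinder{drivers: map[string]string{
		"0000:3b:00.0": "vfio-pci", // pre-bound
		"0000:af:00.0": "nvidia",
	}}
	m := newVfioManager(fb)
	for _, addr := range []string{"0000:3b:00.0", "0000:af:00.0"} {
		_ = m.Configure(addr)
		_ = m.Unconfigure(addr)
	}
	fmt.Println(fb.binds) // only the nvidia-bound GPU is ever rebound
}
```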
The VfioPciManager checks /sys/module/vfio_pci and /sys/kernel/iommu_groups to verify module loading and IOMMU support. In containers where /host-root is bind-mounted from host /, the container's own /sys mount doesn't expose host sysfs at /host-root/sys. Fall back to checking the unprefixed sysfs path when the host-root prefixed path doesn't exist.
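The fallback in this commit message could look something like the sketch below. `sysfsCandidates` and `firstExisting` are hypothetical helpers introduced for illustration; only the two sysfs paths and the `/host-root` prefix come from the message above.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// sysfsCandidates lists the host-root prefixed path first, then the
// unprefixed path used as a fallback.
func sysfsCandidates(hostRoot, rel string) []string {
	return []string{
		filepath.Join(hostRoot, rel),
		filepath.Join("/", rel),
	}
}

// firstExisting returns the first candidate path that exists.
func firstExisting(candidates ...string) (string, bool) {
	for _, p := range candidates {
		if _, err := os.Stat(p); err == nil {
			return p, true
		}
	}
	return "", false
}

func main() {
	for _, rel := range []string{"sys/module/vfio_pci", "sys/kernel/iommu_groups"} {
		p, ok := firstExisting(sysfsCandidates("/host-root", rel)...)
		fmt.Printf("%s: found=%v path=%q\n", rel, ok, p)
	}
}
```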
- Remove duplicate /dev/vfio/vfio append in GetCommonEdits else branch (already added unconditionally above)
- Extract checkIommuEnabledAt to avoid defer f.Close() inside loop
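For context on the second item: a `defer` inside a loop only runs when the enclosing function returns, so file handles pile up across iterations. Moving the body into a helper scopes each `defer` to one iteration. The sketch below borrows the name `checkIommuEnabledAt` from the commit message, but its body and the `iommuEnabled` wrapper are assumptions (here, "enabled" means the iommu_groups directory has at least one entry).

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"os"
)

// checkIommuEnabledAt reports whether dir contains at least one entry.
// The deferred Close runs when this helper returns, once per call.
func checkIommuEnabledAt(dir string) (bool, error) {
	f, err := os.Open(dir)
	if err != nil {
		return false, err
	}
	defer f.Close()

	names, err := f.Readdirnames(1)
	if errors.Is(err, io.EOF) {
		return false, nil // directory exists but is empty
	}
	if err != nil {
		return false, err
	}
	return len(names) > 0, nil
}

// iommuEnabled tries each candidate directory in order (hypothetical wrapper).
func iommuEnabled(candidates ...string) bool {
	for _, dir := range candidates {
		if ok, err := checkIommuEnabledAt(dir); err == nil && ok {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(iommuEnabled(
		"/host-root/sys/kernel/iommu_groups",
		"/sys/kernel/iommu_groups",
	))
}
```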
What type of PR is this?
/kind bug
What this PR does / why we need it:
Fixes two bugs in the VFIO passthrough lifecycle:
- VFIO discovery advertises non-vfio GPUs: enumerateGpuVfioDevices treats any GPU not on the nvidia driver as a VFIO candidate, including GPUs still bound to nvidia. On a system with 2 A40s (one on nvidia, one on vfio-pci), both are advertised as VFIO devices. If the scheduler allocates the nvidia-bound GPU, Configure has to unbind and rebind at runtime, which can hang on NVLink systems. The fix adds a getDriver() check: only GPUs actually bound to vfio-pci are advertised.
- Unconfigure rebinds pre-bound GPUs: When a GPU was already on vfio-pci before Configure ran (pre-bound at boot), Unconfigure tries to rebind it to nvidia, which hangs indefinitely on NVLink systems during fabric reconfiguration. The fix tracks the pre-Configure driver binding and skips the rebind if the GPU was already on vfio-pci.
Changes from v1:
Reworked based on @varunrsekar's feedback:
- /dev/vfio/vfio CDI change: use DeviceClass config with enableAPIDevice: true instead
- /host-root correctly
- vfio-pci.ids kernel cmdline detection: will file as separate feature request
- preConfigureDriver tracking (issue 3)
How to verify it:
On a multi-GPU system with some GPUs on nvidia and some on vfio-pci:
- Before the fix: GPUs appear as gpu-vfio-* in ResourceSlice regardless of driver binding
- After the fix: only GPUs bound to vfio-pci appear as gpu-vfio-*
Which issue(s) this PR fixes:
Fixes #1089 (issues 2 and 3)