Fix VFIO discovery and Unconfigure for pre-bound GPUs #1090

Open
johnahull wants to merge 5 commits into kubernetes-sigs:main from johnahull:fix/vfio-lifecycle

@johnahull johnahull commented May 1, 2026

What type of PR is this?

/kind bug

What this PR does / why we need it:

Fixes two bugs in the VFIO passthrough lifecycle:

  1. VFIO discovery advertises non-vfio GPUs: enumerateGpuVfioDevices treats any GPU not on the nvidia driver as a VFIO candidate, including GPUs still bound to nvidia. On a system with 2 A40s (one on nvidia, one on vfio-pci), both are advertised as VFIO devices. If the scheduler allocates the nvidia-bound GPU, Configure has to unbind and rebind it at runtime, which can hang on NVLink systems. The fix adds a getDriver() check so that only GPUs actually bound to vfio-pci are advertised.

  2. Unconfigure rebinds pre-bound GPUs: when a GPU was already on vfio-pci before Configure ran (pre-bound at boot), Unconfigure tries to rebind it to nvidia, which hangs indefinitely on NVLink systems during fabric reconfiguration. The fix tracks the pre-Configure driver binding and skips the rebind if the GPU was already on vfio-pci.

Changes from v1:

Reworked based on @varunrsekar's feedback:

  • Dropped /dev/vfio/vfio CDI change — use DeviceClass config with enableAPIDevice: true instead
  • Dropped sysfs fallback for vfio_pci/IOMMU checks — helm chart deployment mounts /host-root correctly
  • Dropped vfio-pci.ids kernel cmdline detection — will file as separate feature request
  • Kept VFIO discovery driver filter (issue 2) and preConfigureDriver tracking (issue 3)
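
To make the preConfigureDriver tracking concrete, here is a rough sketch of the idea under stated assumptions. The `binder` struct, `vfioLifecycle` type, and in-memory map are all hypothetical and exist only for illustration; the actual change may store the pre-Configure driver elsewhere.

```go
package main

const (
	driverNvidia = "nvidia"
	driverVfio   = "vfio-pci"
)

// binder bundles the two driver operations the sketch needs. Both fields are
// hypothetical stand-ins for whatever the real manager uses.
type binder struct {
	currentDriver func(pciAddr string) (string, error)
	bind          func(pciAddr, driver string) error
}

// vfioLifecycle remembers which driver each GPU was on before Configure.
type vfioLifecycle struct {
	b                  binder
	preConfigureDriver map[string]string
}

func newVfioLifecycle(b binder) *vfioLifecycle {
	return &vfioLifecycle{b: b, preConfigureDriver: map[string]string{}}
}

// Configure records the current binding, then moves the GPU to vfio-pci
// only if it is not already there.
func (l *vfioLifecycle) Configure(pciAddr string) error {
	drv, err := l.b.currentDriver(pciAddr)
	if err != nil {
		return err
	}
	l.preConfigureDriver[pciAddr] = drv
	if drv == driverVfio {
		return nil // pre-bound: nothing to do
	}
	return l.b.bind(pciAddr, driverVfio)
}

// Unconfigure skips the rebind for GPUs that were already on vfio-pci before
// Configure ran. Untracked GPUs fall back to rebinding to nvidia, which is an
// assumption of this sketch.
func (l *vfioLifecycle) Unconfigure(pciAddr string) error {
	prev, tracked := l.preConfigureDriver[pciAddr]
	delete(l.preConfigureDriver, pciAddr)
	if tracked && prev == driverVfio {
		return nil
	}
	return l.b.bind(pciAddr, driverNvidia)
}
```

The point of the design is that Unconfigure restores the state it found rather than assuming nvidia was the starting point, so a GPU pre-bound at boot never goes through the rebind path.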

How to verify it:

On a multi-GPU system with some GPUs on nvidia and some on vfio-pci:

  • Before fix: all GPUs appear as gpu-vfio-* in ResourceSlice regardless of driver binding
  • After fix: only GPUs bound to vfio-pci appear as gpu-vfio-*

Which issue(s) this PR fixes:

Fixes #1089 (issues 2 and 3)

Fix VFIO device discovery to only advertise GPUs bound to vfio-pci, and skip Unconfigure rebind for pre-bound GPUs.

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/bug Categorizes issue or PR as related to a bug. labels May 1, 2026
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label May 1, 2026
@k8s-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: johnahull
Once this PR has been reviewed and has the lgtm label, please assign jgehrcke for approval. For more information see the Code Review Process.

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label May 1, 2026
@k8s-ci-robot

Hi @johnahull. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label May 1, 2026
@johnahull johnahull marked this pull request as ready for review May 1, 2026 19:59
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 1, 2026
@johnahull johnahull force-pushed the fix/vfio-lifecycle branch from d85c730 to 7c59221 Compare May 4, 2026 19:41
@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels May 4, 2026
@johnahull johnahull changed the title Fix VFIO passthrough lifecycle — CDI spec, discovery, unconfigure, sysfs Fix VFIO discovery and Unconfigure for pre-bound GPUs May 4, 2026
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 7, 2026
johnahull added 5 commits May 11, 2026 11:16
Previously, enumerateGpuVfioDevices treated any NVIDIA GPU not on the
nvidia driver as a VFIO candidate. This caused driverless GPUs (stuck
after a failed unbind) and nvidia-bound GPUs to be advertised in the
ResourceSlice as allocatable VFIO devices. When the scheduler picked
one, the prepare would fail or hang trying to unbind from nvidia.

Check the actual kernel driver binding via sysfs before adding a GPU
to the VFIO device list. Only GPUs currently bound to vfio-pci are
advertised.

Track the driver binding before Configure and check it in Unconfigure.
If the GPU was already on vfio-pci (pre-bound at boot via kernel
cmdline), leave it on vfio-pci instead of rebinding to nvidia. On
H100 SXM5 systems with NVLink, rebinding to nvidia hangs indefinitely
during fabric reconfiguration.

The VfioPciManager checks /sys/module/vfio_pci and /sys/kernel/
iommu_groups to verify module loading and IOMMU support. In containers
where /host-root is bind-mounted from host /, the container's own
/sys mount doesn't expose host sysfs at /host-root/sys. Fall back to
checking the unprefixed sysfs path when the host-root prefixed path
doesn't exist.

- Remove duplicate /dev/vfio/vfio append in GetCommonEdits else branch
  (already added unconditionally above)
- Extract checkIommuEnabledAt to avoid defer f.Close() inside loop
@johnahull johnahull force-pushed the fix/vfio-lifecycle branch from 7c59221 to 71e8efd Compare May 11, 2026 16:17
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels May 11, 2026

Labels

cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

Status: Backlog

Development

Successfully merging this pull request may close these issues.

[Bug]: VFIO discovery advertises non-vfio GPUs and Unconfigure rebinds pre-bound GPUs

2 participants