Retry device enumeration on startup to prevent empty ResourceSlices by kasia-kujawa · Pull Request #1009 · kubernetes-sigs/dra-driver-nvidia-gpu

kasia-kujawa · 2026-04-10T11:46:49Z

Added a retry loop in NewDeviceState().
If the first enumeration returns 0 devices, the plugin retries every 5 seconds for up to 5 minutes before proceeding.
Errors still propagate immediately without retry.

Before the fix, no log was emitted after Traverse GPU devices and the empty ResourceSlice was published silently.

With the fix (nvidiaDriverRoot: /home/kubernetes/bin/nvidia/, GKE COS, Tesla T4):

I0410 08:11:16.614833  1 nvlib.go:197] Traverse GPU devices
I0410 08:11:16.779628  1 device_state.go:96] No GPU devices found yet (driver may still be initializing), retrying in 5s...
I0410 08:11:21.780288  1 nvlib.go:197] Traverse GPU devices
I0410 08:11:23.111832  1 nvlib.go:278] Adding device gpu-0 to allocatable devices
I0410 08:11:23.111862  1 allocatable.go:254] Adding allocatables for PCI bus ID: 0000:00:04.0

Full logs with the fix from nvidia-dra-driver-gpu-kubelet-plugin when GPU initialization needed slightly more time:
https://gist.github.com/kasia-kujawa/1082b48357a0ae80d663f12ee665e34c

ArangoGutierrez

Solid approach to the GKE-COS / slow-driver-init race — the real-life log evidence in the description is clean. Two structural concerns plus some cleanups, left inline. Requesting changes primarily on the silent empty-slice publish on timeout and the 5-minute synchronous block on startup; the rest are suggestions.

kasia-kujawa · 2026-04-21T14:16:24Z

@ArangoGutierrez @varunrsekar Thanks for the review! ❤️

I'm not sure if I'll have time this week to address your comments, but I definitely will next week.

jgehrcke · 2026-04-24T12:04:38Z

Even if not yet fully understood, #1008 really is an intriguing and interesting problem.

I don't want to undermine any work here, and please consider my feedback as just one item on the feedback shelf. My gut feeling is that if we have to do any retrying, we should do it in the init container, not in the plugin startup code. Once we understand what it would take to detect that situation over there, I think it's also easy to just "wait a little longer".

Such a change can be looked at as

just refining the condition to wait for in the init container
not changing the scope of responsibility of any component involved

Having to retry in the plugin startup code feels like the init container doesn't do its job.

kasia-kujawa · 2026-04-28T10:18:28Z

I initially thought about changes in the init container too but I didn't find good checks to add.
This comment pushed me to think more 😄 and I'll try the approach with checking in the init container if the kernel created the proper number of /dev files (e.g., /dev/nvidia0, /dev/nvidia1).

For the curious, I'm going to test this in kasia-kujawa#8

I'll be back with results ⌛ 🧪

kasia-kujawa · 2026-05-05T14:08:47Z

@ArangoGutierrez @varunrsekar Please take another look.

I think I addressed all review comments except Scenario 2 mentioned in this comment #1009 (comment)

Scenario 2:

PassthroughSupport featuregate disabled
DRA plugin is being initialized for the first time
nvml returns some but not all GPUs
Decision: ??

I think we can skip this scenario in this pull request to limit the scope of changes it introduces and either add an implementation for it in another pull request or skip it now and add it in the future if needed.

I tried the approach of introducing an init container to check NVML initialization (in Go, in exactly the same way as in the kubelet plugin) but it turned out that successful initialization in the init container does not guarantee successful initialization in another container 😿

I could easily reproduce this state -> successful NVML initialization in the init container, but no GPU discovered in gpu-kubelet-plugin.

It seems that there is an issue either with loading the NVML library or with the communication between the NVML library and the kernel 🤔

Most important logs from my tests with additional init container:

[init] gpu-readiness-init logs:
        I0430 06:17:14.863067       1 main.go:125] using driver library: /driver-root/lib64/libnvidia-ml.so.580.105.08
        I0430 06:17:14.863175       1 main.go:128] using devRoot: /
        I0430 06:17:16.635580       1 ???:1] "WARNING: unable to detect IOMMU FD for [0000:00:04.0 open /sys/bus/pci/devices/0000:00:04.0/vfio-dev: no such file or directory]: %!v(MISSING)"
        I0430 06:17:16.735256       1 main.go:296] found /dev/nvidia<N> node(s): [/dev/nvidia0]
        I0430 06:17:16.735311       1 main.go:215] found 1 GPU(s) via NVML, all /dev/nvidia* device nodes present
[container] gpus logs:
        I0430 06:17:18.459633       1 utils.go:44] Commit: 5c99e6b2d2ce116e01c3c9ae5c7025fd8efa435e

        Feature gates: map[string]bool{"AllAlpha":false, "AllBeta":false, "ComputeDomainCliques":true, "ContextualLogging":true, "CrashOnNVLinkFabricErrors":true, "DeviceMetadata":false, "DynamicMIG":false, "IMEXDaemonsWithDNSNames":true, "LoggingAlphaOptions":false, "LoggingBetaOptions":true, "MPSSupport":true, "NVMLDeviceHealthCheck":false, "PassthroughSupport":false, "TimeSlicingSettings":true}
        Flags: (*main.Flags)({
          kubeClientConfig: (flags.KubeClientConfig) {
            KubeConfig: (string) "",
            KubeAPIQPS: (float64) 5,
            KubeAPIBurst: (int) 10
          },
          nodeName: (string) (len=53) "gke-e3e-gke-autoscaler-kasia-04-30-cast-pool-306fc077",
          namespace: (string) (len=6) "nvidia",
          httpEndpoint: (string) (len=5) ":8080",
          metricsPath: (string) (len=8) "/metrics",
          cdiRoot: (string) (len=12) "/var/run/cdi",
          containerDriverRoot: (string) (len=12) "/driver-root",
          hostDriverRoot: (string) (len=28) "/home/kubernetes/bin/nvidia/",
          nvidiaCDIHookPath: (string) "",
          imageName: (string) (len=67) "ghcr.io/kasia-kujawa/k8s-dra-driver-gpu:additional-init-container-5",
          kubeletRegistrarDirectoryPath: (string) (len=33) "/var/lib/kubelet/plugins_registry",
          kubeletPluginsDirectoryPath: (string) (len=24) "/var/lib/kubelet/plugins",
          healthcheckPort: (int) 51516,
          klogVerbosity: (int) 4,
          additionalXidsToIgnore: (string) ""
        })
        I0430 06:17:18.460778       1 envvar.go:195] "Feature gate default state" feature="WatchListClient" enabled=true
        I0430 06:17:18.460869       1 envvar.go:195] "Feature gate default state" feature="AtomicFIFO" enabled=true
        I0430 06:17:18.460903       1 envvar.go:195] "Feature gate default state" feature="ClientsAllowTLSCacheGC" enabled=true
        I0430 06:17:18.460928       1 envvar.go:195] "Feature gate default state" feature="ClientsPreferCBOR" enabled=false
        I0430 06:17:18.460969       1 envvar.go:195] "Feature gate default state" feature="UnlockWhileProcessingFIFO" enabled=true
        I0430 06:17:18.461000       1 envvar.go:195] "Feature gate default state" feature="ClientsAllowCARotation" enabled=true
        I0430 06:17:18.461010       1 envvar.go:195] "Feature gate default state" feature="InformerResourceVersion" enabled=true
        I0430 06:17:18.461015       1 envvar.go:195] "Feature gate default state" feature="InOrderInformers" enabled=true
        I0430 06:17:18.461020       1 envvar.go:195] "Feature gate default state" feature="InOrderInformersBatchProcess" enabled=true
        I0430 06:17:18.461025       1 envvar.go:195] "Feature gate default state" feature="ClientsAllowCBOR" enabled=false
        I0430 06:17:18.567470       1 util.go:68] Started debug signal handler(s)
        I0430 06:17:18.590249       1 device_state.go:79] Using devRoot=/
        I0430 06:17:18.590271       1 prometheus_httpserver.go:78] "Starting metrics HTTP server" endpoint=":8080" path="/metrics"
        I0430 06:17:18.590460       1 nvlib.go:198] Traverse GPU devices
        I0430 06:17:18.806616       1 device_state.go:97] Muting CDI logger (verbosity is smaller 7: 4)
        I0430 06:17:18.970373       1 device_state.go:133] Warming up CDI device spec cache for GPUs []
        I0430 06:17:18.982118       1 draplugin.go:738] "Starting"
        I0430 06:17:18.982430       1 nonblockinggrpcserver.go:90] "GRPC server started" logger="dra" endpoint="/var/lib/kubelet/plugins/gpu.nvidia.com/dra.sock"
        I0430 06:17:18.982730       1 nonblockinggrpcserver.go:90] "GRPC server started" logger="registrar" endpoint="/var/lib/kubelet/plugins_registry/gpu.nvidia.com-reg.sock"
        I0430 06:17:18.986343       1 resourceslicecontroller.go:619] "Starting ResourceSlice informer and waiting for it to sync" logger="ResourceSlice controller"
        I0430 06:17:18.986461       1 cleanup.go:125] Checkpointed RC cleanup: claims in PrepareStarted state: 0 (of 0)
        I0430 06:17:18.986574       1 health.go:103] starting healthcheck service at [::]:51516
        I0430 06:17:18.991100       1 reflector.go:425] "Starting reflector" logger="ResourceSlice controller" type="*v1.ResourceSlice" resyncPeriod="0s" reflector="k8s.io/dynamic-resource-allocation/resourceslice/resourceslicecontroller.go:625"
        I0430 06:17:18.991153       1 reflector.go:472] "Listing and watching" logger="ResourceSlice controller" type="*v1.ResourceSlice" reflector="k8s.io/dynamic-resource-allocation/resourceslice/resourceslicecontroller.go:625"
        I0430 06:17:19.026711       1 reflector.go:1080] "Exiting watch because received the bookmark that marks the end of initial events stream" logger="ResourceSlice controller" reflector="k8s.io/dynamic-resource-allocation/resourceslice/resourceslicecontroller.go:625" totalItems=1 duration="35.361425ms"
        I0430 06:17:19.026816       1 reflector.go:507] "Caches populated" logger="ResourceSlice controller" type="*v1.ResourceSlice" reflector="k8s.io/dynamic-resource-allocation/resourceslice/resourceslicecontroller.go:625"
        I0430 06:17:19.986673       1 resourceslicecontroller.go:634] "ResourceSlice informer has synced" logger="ResourceSlice controller"
        I0430 06:17:19.986759       1 resourceslicecontroller.go:223] "Starting" logger="ResourceSlice controller"
        I0430 06:17:19.986796       1 driver.go:179] Current kubelet plugin registration status: plugin_registered:true

Full logs from my experiments with the additional GPU init container, along with its implementation, are here: kasia-kujawa#9

@varunrsekar Could you check that this pull request doesn't break anything in PassthroughSupport? I don't have a machine on which I can test it.

Full logs from the version with the fix introduced in this pull request when the first device discovery fails:

[init] init-container logs:
        create symlink: /driver-root -> /driver-root-parent/nvidia
        2026-05-05T13:21:57Z  /driver-root (/home/kubernetes/bin/nvidia/ on host): nvidia-smi: '/driver-root/bin/nvidia-smi', libnvidia-ml.so.1: not found, current contents: [NVIDIA-Linux-x86_64-580.105.08.run
        bin
        bin-workdir
        drivers
        drivers-workdir
        firmware
        gpu_driver_versions.bin
        lib64
        lib64-workdir
        nvidia-drivers-580.105.08.tgz
        nvidia-installer.log
        share
        vulkan].
        
        Check failed. Has the NVIDIA GPU driver been set up? It is expected to be installed under NVIDIA_DRIVER_ROOT (currently set to '/home/kubernetes/bin/nvidia/') in the host filesystem. If that path appears to be unexpected: review the DRA driver's 'nvidiaDriverRoot' Helm chart variable. Otherwise, review if the GPU driver has actually been installed under that path.
        Hint: Directory /home/kubernetes/bin/nvidia/ is not empty but at least one of the binaries wasn't found.
        
        2026-05-05T13:22:07Z  /driver-root (/home/kubernetes/bin/nvidia/ on host): nvidia-smi: '/driver-root/bin/nvidia-smi', libnvidia-ml.so.1: '/driver-root/lib64/libnvidia-ml.so.1', current contents: [.cache
        NVIDIA-Linux-x86_64-580.105.08.run
        bin
        bin-workdir
        drivers
        drivers-workdir
        firmware
        gpu_driver_versions.bin
        lib64
        lib64-workdir
        nvidia-drivers-580.105.08.tgz
        nvidia-installer.log
        share
        vulkan].
        invoke: env -i LD_PRELOAD=/driver-root/lib64/libnvidia-ml.so.1 /driver-root/bin/nvidia-smi --version
        NVIDIA-SMI version  : 580.105.08
        NVML version        : 580.105
        DRIVER version      : 580.105.08
        CUDA Version        : N/A
        nvidia-smi returned with code 0: success, leave
[container] compute-domains logs:
        I0505 13:22:11.706484       1 utils.go:44] Commit: 0b6aefb41e3742b02e0dc9133f649e552b2b742d
        
        Feature gates: map[string]bool{"AllAlpha":false, "AllBeta":false, "ComputeDomainCliques":true, "ContextualLogging":true, "CrashOnNVLinkFabricErrors":true, "DeviceMetadata":false, "DynamicMIG":false, "IMEXDaemonsWithDNSNames":true, "LoggingAlphaOptions":false, "LoggingBetaOptions":true, "MPSSupport":false, "NVMLDeviceHealthCheck":false, "PassthroughSupport":false, "TimeSlicingSettings":false}
        Flags: (*main.Flags)({
          kubeClientConfig: (flags.KubeClientConfig) {
            KubeConfig: (string) "",
            KubeAPIQPS: (float64) 5,
            KubeAPIBurst: (int) 10
          },
          nodeName: (string) (len=53) "gke-pool-cb0e0d56",
          httpEndpoint: (string) "",
          metricsPath: (string) (len=8) "/metrics",
          namespace: (string) (len=6) "nvidia",
          cdiRoot: (string) (len=12) "/var/run/cdi",
          containerDriverRoot: (string) (len=12) "/driver-root",
          hostDriverRoot: (string) (len=28) "/home/kubernetes/bin/nvidia/",
          nvidiaCDIHookPath: (string) "",
          kubeletRegistrarDirectoryPath: (string) (len=33) "/var/lib/kubelet/plugins_registry",
          kubeletPluginsDirectoryPath: (string) (len=24) "/var/lib/kubelet/plugins",
          healthcheckPort: (int) 51515,
          klogVerbosity: (int) 4
        })
        I0505 13:22:11.710332       1 envvar.go:195] "Feature gate default state" feature="AtomicFIFO" enabled=true
        I0505 13:22:11.710462       1 envvar.go:195] "Feature gate default state" feature="ClientsPreferCBOR" enabled=false
        I0505 13:22:11.710476       1 envvar.go:195] "Feature gate default state" feature="InOrderInformers" enabled=true
        I0505 13:22:11.710484       1 envvar.go:195] "Feature gate default state" feature="InformerResourceVersion" enabled=true
        I0505 13:22:11.710490       1 envvar.go:195] "Feature gate default state" feature="UnlockWhileProcessingFIFO" enabled=true
        I0505 13:22:11.710495       1 envvar.go:195] "Feature gate default state" feature="InOrderInformersBatchProcess" enabled=true
        I0505 13:22:11.710501       1 envvar.go:195] "Feature gate default state" feature="ClientsAllowCARotation" enabled=true
        I0505 13:22:11.710751       1 envvar.go:195] "Feature gate default state" feature="WatchListClient" enabled=true
        I0505 13:22:11.710768       1 envvar.go:195] "Feature gate default state" feature="ClientsAllowCBOR" enabled=false
        I0505 13:22:11.710774       1 envvar.go:195] "Feature gate default state" feature="ClientsAllowTLSCacheGC" enabled=true
        I0505 13:22:11.785895       1 util.go:68] Started debug signal handler(s)
        I0505 13:22:12.541434       1 mount_linux.go:326] 'umount /tmp/kubelet-detect-safe-umount3962003754' failed with: exit status 1, output: umount: can't unmount /tmp/kubelet-detect-safe-umount3962003754: Invalid argument
        I0505 13:22:12.541504       1 mount_linux.go:328] Detected umount with unsafe 'not mounted' behavior
        I0505 13:22:12.543999       1 device_state.go:696] Starting driver version validation for IMEXDaemonsWithDNSNames feature...
        I0505 13:22:12.544027       1 device_state.go:697] Minimum required version: 570.158.01
        I0505 13:22:12.785184       1 device_state.go:715] Driver version validation passed: 580.105.8 >= 570.158.1
        I0505 13:22:12.787686       1 device_state.go:84] using devRoot=/
        ERROR: init 250 result=11ERROR: init 250 result=11I0505 13:22:13.011219       1 device_state.go:146] Create empty checkpoint
        I0505 13:22:13.115488       1 draplugin.go:738] "Starting"
        I0505 13:22:13.155986       1 nonblockinggrpcserver.go:90] "GRPC server started" logger="dra" endpoint="/var/lib/kubelet/plugins/compute-domain.nvidia.com/dra.sock"
        I0505 13:22:13.156195       1 nonblockinggrpcserver.go:90] "GRPC server started" logger="registrar" endpoint="/var/lib/kubelet/plugins_registry/compute-domain.nvidia.com-reg.sock"
        I0505 13:22:13.172278       1 reflector.go:425] "Starting reflector" type="*v1beta1.ComputeDomain" resyncPeriod="10m0s" reflector="pkg/nvidia.com/informers/externalversions/factory.go:141"
        I0505 13:22:13.172351       1 reflector.go:472] "Listing and watching" type="*v1beta1.ComputeDomain" reflector="pkg/nvidia.com/informers/externalversions/factory.go:141"
        I0505 13:22:13.242278       1 reflector.go:1080] "Exiting watch because received the bookmark that marks the end of initial events stream" reflector="pkg/nvidia.com/informers/externalversions/factory.go:141" totalItems=1 duration="69.876632ms"
        I0505 13:22:13.242388       1 reflector.go:507] "Caches populated" type="*v1beta1.ComputeDomain" reflector="pkg/nvidia.com/informers/externalversions/factory.go:141"
        I0505 13:22:13.264404       1 resourceslicecontroller.go:619] "Starting ResourceSlice informer and waiting for it to sync" logger="ResourceSlice controller"
        I0505 13:22:13.264418       1 health.go:102] Starting healthcheck server on [::]:51515
        I0505 13:22:13.264511       1 reflector.go:425] "Starting reflector" logger="ResourceSlice controller" type="*v1.ResourceSlice" resyncPeriod="0s" reflector="k8s.io/dynamic-resource-allocation/resourceslice/resourceslicecontroller.go:625"
        I0505 13:22:13.264537       1 reflector.go:472] "Listing and watching" logger="ResourceSlice controller" type="*v1.ResourceSlice" reflector="k8s.io/dynamic-resource-allocation/resourceslice/resourceslicecontroller.go:625"
        I0505 13:22:13.280332       1 cleanup.go:125] Checkpointed RC cleanup: claims in PrepareStarted state: 0 (of 0)
        I0505 13:22:13.289063       1 reflector.go:1080] "Exiting watch because received the bookmark that marks the end of initial events stream" logger="ResourceSlice controller" reflector="k8s.io/dynamic-resource-allocation/resourceslice/resourceslicecontroller.go:625" totalItems=1 duration="24.487185ms"
        I0505 13:22:13.289185       1 reflector.go:507] "Caches populated" logger="ResourceSlice controller" type="*v1.ResourceSlice" reflector="k8s.io/dynamic-resource-allocation/resourceslice/resourceslicecontroller.go:625"
        I0505 13:22:14.265278       1 resourceslicecontroller.go:634] "ResourceSlice informer has synced" logger="ResourceSlice controller"
        I0505 13:22:14.265436       1 resourceslicecontroller.go:223] "Starting" logger="ResourceSlice controller"
[container] gpus logs:
        I0505 13:22:11.948660       1 utils.go:44] Commit: 0b6aefb41e3742b02e0dc9133f649e552b2b742d
        
        Feature gates: map[string]bool{"AllAlpha":false, "AllBeta":false, "ComputeDomainCliques":true, "ContextualLogging":true, "CrashOnNVLinkFabricErrors":true, "DeviceMetadata":false, "DynamicMIG":false, "IMEXDaemonsWithDNSNames":true, "LoggingAlphaOptions":false, "LoggingBetaOptions":true, "MPSSupport":false, "NVMLDeviceHealthCheck":false, "PassthroughSupport":false, "TimeSlicingSettings":false}
        Flags: (*main.Flags)({
          kubeClientConfig: (flags.KubeClientConfig) {
            KubeConfig: (string) "",
            KubeAPIQPS: (float64) 5,
            KubeAPIBurst: (int) 10
          },
          nodeName: (string) (len=53) "gke-pool-cb0e0d56",
          namespace: (string) (len=6) "nvidia",
          httpEndpoint: (string) "",
          metricsPath: (string) (len=8) "/metrics",
          cdiRoot: (string) (len=12) "/var/run/cdi",
          containerDriverRoot: (string) (len=12) "/driver-root",
          hostDriverRoot: (string) (len=28) "/home/kubernetes/bin/nvidia/",
          nvidiaCDIHookPath: (string) "",
          imageName: (string) (len=63) "ghcr.io/kasia-kujawa/k8s-dra-driver-gpu:retries-in-background-1",
          kubeletRegistrarDirectoryPath: (string) (len=33) "/var/lib/kubelet/plugins_registry",
          kubeletPluginsDirectoryPath: (string) (len=24) "/var/lib/kubelet/plugins",
          healthcheckPort: (int) 51516,
          klogVerbosity: (int) 4,
          additionalXidsToIgnore: (string) "",
          deviceEnumerationRetrySteps: (int) 15,
          deviceEnumerationRetryMaxInterval: (time.Duration) 30000000000
        })
        I0505 13:22:11.950438       1 envvar.go:195] "Feature gate default state" feature="InformerResourceVersion" enabled=true
        I0505 13:22:11.950547       1 envvar.go:195] "Feature gate default state" feature="AtomicFIFO" enabled=true
        I0505 13:22:11.950594       1 envvar.go:195] "Feature gate default state" feature="InOrderInformers" enabled=true
        I0505 13:22:11.950667       1 envvar.go:195] "Feature gate default state" feature="InOrderInformersBatchProcess" enabled=true
        I0505 13:22:11.950704       1 envvar.go:195] "Feature gate default state" feature="ClientsAllowCARotation" enabled=true
        I0505 13:22:11.950758       1 envvar.go:195] "Feature gate default state" feature="UnlockWhileProcessingFIFO" enabled=true
        I0505 13:22:11.950792       1 envvar.go:195] "Feature gate default state" feature="ClientsAllowTLSCacheGC" enabled=true
        I0505 13:22:11.950845       1 envvar.go:195] "Feature gate default state" feature="ClientsPreferCBOR" enabled=false
        I0505 13:22:11.950909       1 envvar.go:195] "Feature gate default state" feature="ClientsAllowCBOR" enabled=false
        I0505 13:22:11.950969       1 envvar.go:195] "Feature gate default state" feature="WatchListClient" enabled=true
        I0505 13:22:11.984215       1 util.go:68] Started debug signal handler(s)
        I0505 13:22:12.041700       1 device_state.go:94] Using devRoot=/
        I0505 13:22:12.041867       1 device_state.go:107] Muting CDI logger (verbosity is smaller 7: 4)
        I0505 13:22:12.497269       1 nvlib.go:197] Traverse GPU devices
        I0505 13:22:12.664423       1 device_state.go:1321] No GPU devices discovered on enumeration attempt; will retry in background
        I0505 13:22:12.664631       1 draplugin.go:738] "Starting"
        I0505 13:22:12.783646       1 nonblockinggrpcserver.go:90] "GRPC server started" logger="dra" endpoint="/var/lib/kubelet/plugins/gpu.nvidia.com/dra.sock"
        I0505 13:22:12.784609       1 nonblockinggrpcserver.go:90] "GRPC server started" logger="registrar" endpoint="/var/lib/kubelet/plugins_registry/gpu.nvidia.com-reg.sock"
        I0505 13:22:12.817241       1 health.go:103] starting healthcheck service at [::]:51516
        I0505 13:22:12.841200       1 resourceslicecontroller.go:619] "Starting ResourceSlice informer and waiting for it to sync" logger="ResourceSlice controller"
        I0505 13:22:12.842138       1 reflector.go:425] "Starting reflector" logger="ResourceSlice controller" type="*v1.ResourceSlice" resyncPeriod="0s" reflector="k8s.io/dynamic-resource-allocation/resourceslice/resourceslicecontroller.go:625"
        I0505 13:22:12.843037       1 reflector.go:472] "Listing and watching" logger="ResourceSlice controller" type="*v1.ResourceSlice" reflector="k8s.io/dynamic-resource-allocation/resourceslice/resourceslicecontroller.go:625"
        I0505 13:22:12.878669       1 reflector.go:1080] "Exiting watch because received the bookmark that marks the end of initial events stream" logger="ResourceSlice controller" reflector="k8s.io/dynamic-resource-allocation/resourceslice/resourceslicecontroller.go:625" totalItems=1 duration="35.558419ms"
        I0505 13:22:12.878812       1 reflector.go:507] "Caches populated" logger="ResourceSlice controller" type="*v1.ResourceSlice" reflector="k8s.io/dynamic-resource-allocation/resourceslice/resourceslicecontroller.go:625"
        I0505 13:22:12.940780       1 cleanup.go:125] Checkpointed RC cleanup: claims in PrepareStarted state: 0 (of 0)
        I0505 13:22:13.842114       1 resourceslicecontroller.go:634] "ResourceSlice informer has synced" logger="ResourceSlice controller"
        I0505 13:22:13.842169       1 resourceslicecontroller.go:223] "Starting" logger="ResourceSlice controller"
        I0505 13:22:13.842696       1 nvlib.go:197] Traverse GPU devices
        I0505 13:22:13.842757       1 driver.go:184] Current kubelet plugin registration status: plugin_registered:true
        I0505 13:22:15.235357       1 nvlib.go:278] Adding device gpu-0 to allocatable devices
        I0505 13:22:15.236033       1 allocatable.go:243] Adding allocatables for PCI bus ID: 0000:00:04.0
        I0505 13:22:15.605154       1 device_state.go:1367] Warming up CDI device spec cache for GPUs [GPU-8b2035f0-dc4f-bf93-7a32-83e2d13fcec4]
        I0505 13:22:19.114877       1 cdi.go:161] GetDeviceSpecsByID() called for GPU-8b2035f0-dc4f-bf93-7a32-83e2d13fcec4, t_cdi_get_device_specs_by_id 3.510 s
        I0505 13:22:19.115172       1 driver.go:508] About to announce device gpu-0
        I0505 13:22:19.115259       1 driver.go:234] Background device enumeration complete; ResourceSlice republished with populated devices
        I0505 13:22:20.073089       1 driver.go:446] Returning newly prepared devices for claim 'drabasic-3cb498a20ed970517d3c703/sample-dra:0b345f39-971f-4dfc-8025-4790b631c1e4': [{[gpu] gke-pool-cb0e0d56 gpu-0 [k8s.gpu.nvidia.com/claim=0b345f39-971f-4dfc-8025-4790b631c1e4-gpu-0] <nil> <nil>}]

varunrsekar

Partial review. Will do a more thorough pass of the changes over time.

varunrsekar · 2026-05-08T18:42:27Z

+	if !state.AllocatableReady() {
+		driver.wg.Add(1)
+		go driver.backgroundInit(ctx, config)
+	}


There's no reason to do a lazy retry that reimplements driver initialization. Only device enumeration should be retried and Driver shouldn't initialize if device enumeration failed.

it's there because of 3113277117, Primary blocker).

Once we can't block, the post-enumeration steps (MIG cleanup, health monitor, publishResources) have to run after the retry succeeds, which is what backgroundInit does.

If you had something else in mind, let me know 🙏

Thanks for that context. Will review with it in mind.

varunrsekar · 2026-05-08T19:59:55Z

+		}
+		return nil, fmt.Errorf("error enumerating all possible devices: %w", err)
+	}
+	if len(perGPU.allocatablesMap) == 0 {


Here, we're assuming that either ALL GPUs are initialized or NONE of the GPUs are initialized. Is it possible for partial initialization? Do we care about it?

I think we can skip the scenario where only some GPUs are initialized in this pull request to limit its scope, and either implement it in a follow-up pull request or add it later if needed. I haven't observed any issue with partial initialization.

If we want this check, we could probably verify that all GPUs visible as PCI devices are also visible via NVML 🤔

jgehrcke · 2026-05-12T11:34:37Z

I could easily reproduce this state -> successful NVML initialization in the init container, but no GPU discovered in gpu-kubelet-plugin.

That is still somewhat frightening, and we should talk to more people and teams about that.

we're assuming that either ALL GPUs are initialized or NONE of the GPUs are initialized. Is it possible for partial initialization?

That is a really important question.

For posterity, I've found a related discussion (Azure context):
https://learn.microsoft.com/en-us/answers/questions/2285401/tasks-fail-to-detect-gpu-on-some-pool-nodes-due-to

For the scenario where NVML reported at least one GPU in the init container, but zero GPUs in the main container it would be good to confirm explicitly the state of dev nodes in the main container file system. I didn't follow the exchange above in detail, so we may have already done this. (it would be important to confirm if NVML may report zero devices despite the filesystem state looking as expected -- if the filesystem state is unexpected then this may greatly facilitate finding the root cause).

Thanks for the great work here!

varunrsekar

@kasia-kujawa Once you address the changes, please squash the commits

varunrsekar · 2026-05-12T18:15:13Z

+	if len(perGPU.allocatablesMap) == 0 {
+		if checkpointHasPreparedDevices(cp) {
+			klog.Infof("No GPU devices discovered via NVML but the checkpoint has prepared devices, not retrying (unhealthy device state, retry won't help)")
+			return perGPU, nil
+		}
+		klog.Infof("No GPU devices discovered on enumeration attempt; will retry")
+		return nil, nil
+	}


The spirit of my original comment was that we don't need to worry about this differentiation :). If allocatable is empty, we can simply retry. Unhealthy GPUs are a problem but we don't need to solve it here.

Suggested change

-	if len(perGPU.allocatablesMap) == 0 {
-		if checkpointHasPreparedDevices(cp) {
-			klog.Infof("No GPU devices discovered via NVML but the checkpoint has prepared devices, not retrying (unhealthy device state, retry won't help)")
-			return perGPU, nil
-		}
-		klog.Infof("No GPU devices discovered on enumeration attempt; will retry")
-		return nil, nil
-	}
+	if len(perGPU.allocatablesMap) == 0 {
+		// Caveat: we may end up in this state due to unhealthy GPUs. This needs to be revisited in the future
+		klog.Infof("No GPU devices discovered on enumeration attempt; will retry")
+		return nil, nil
+	}

Hope that this time I understood your idea 😅 @varunrsekar please check it

kasia-kujawa · 2026-05-13T07:52:39Z

@jgehrcke some more observations: I can easily reproduce this on an NVIDIA T4; within one or two attempts I see the issue. The issue that you linked also mentions T4. By contrast, when I tested with a P4 I couldn't reproduce it.

I was only checking the dev nodes in the init container, trying to see whether this would help us prepare a better init container — it didn’t help :D I can do one more test 🧪 and check the dev nodes in the main container when NVML reports 0 GPUs.

Signed-off-by: Katarzyna Kujawa <katarzyna@cast.ai>

kasia-kujawa · 2026-05-20T11:39:03Z

@jgehrcke I prepared one more debugging version and ran the test: NVML found the GPU in the additional init container, but NVML did not find the GPU in the gpus container, even though the GPU is visible under /dev.

the most important logs:
additional gpu init container:

[init] gpu-readiness-init logs:
        I0520 10:40:24.441272       1 main.go:125] using driver library: /driver-root/lib64/libnvidia-ml.so.580.105.08
        I0520 10:40:24.441407       1 main.go:128] using devRoot: /
        I0520 10:40:26.208589       1 ???:1] "WARNING: unable to detect IOMMU FD for [0000:00:04.0 open /sys/bus/pci/devices/0000:00:04.0/vfio-dev: no such file or directory]: %!v(MISSING)"
        I0520 10:40:26.286093       1 main.go:296] found /dev/nvidia<N> node(s): [/dev/nvidia0]
        I0520 10:40:26.286144       1 main.go:215] found 1 GPU(s) via NVML, all /dev/nvidia* device nodes present

gpus container:

        I0520 10:40:28.373573       1 device_state.go:80] Using devRoot=/
        I0520 10:40:28.373822       1 nvlib.go:198] Traverse GPU devices
        I0520 10:40:28.373603       1 prometheus_httpserver.go:78] "Starting metrics HTTP server" endpoint=":8080" path="/metrics"
        W0520 10:40:28.664934       1 device_state.go:240] diagnostic: NVML enumerated 0 GPUs but /dev/nvidia* nodes are present under /dev: [/dev/nvidia-caps (mode=drwxr-xr-x) /dev/nvidia-modeset (mode=Dcrw-rw-rw-) /dev/nvidia-uvm (mode=Dcrw-rw-rw-) /dev/nvidia-uvm-tools (mode=Dcrw-rw-rw-) /dev/nvidia0 (mode=Dcrw-rw-rw-) /dev/nvidiactl (mode=Dcrw-rw-rw-)]
        I0520 10:40:28.664985       1 device_state.go:106] Muting CDI logger (verbosity is smaller 7: 4)
        I0520 10:40:28.832405       1 device_state.go:142] Warming up CDI device spec cache for GPUs []

full logs are here with the reference to the code which I used to check it: kasia-kujawa#9 (comment)
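
For anyone who wants to repeat this check, below is a minimal sketch of a diagnostic that lists `/dev/nvidia*` entries with their file modes, producing lines in the same shape as the warning above (for example `/dev/nvidia0 (mode=Dcrw-rw-rw-)`). It is not the code referenced in the linked comment. `describeDevNodes` and `demoInTempDir` are hypothetical names, and the demo uses regular files in a temporary directory instead of real device nodes.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// describeDevNodes lists entries matching <devRoot>/dev/nvidia* together
// with their file mode, one formatted string per entry.
func describeDevNodes(devRoot string) ([]string, error) {
	matches, err := filepath.Glob(filepath.Join(devRoot, "dev", "nvidia*"))
	if err != nil {
		return nil, err
	}
	var out []string
	for _, path := range matches {
		// Lstat reports on the entry itself and does not follow symlinks.
		info, err := os.Lstat(path)
		if err != nil {
			out = append(out, fmt.Sprintf("%s (stat error: %v)", path, err))
			continue
		}
		out = append(out, fmt.Sprintf("%s (mode=%s)", path, info.Mode()))
	}
	return out, nil
}

// demoInTempDir builds a fake devRoot containing two nvidia* files and one
// unrelated file, then runs describeDevNodes against it.
func demoInTempDir() ([]string, error) {
	root, err := os.MkdirTemp("", "devroot")
	if err != nil {
		return nil, err
	}
	defer os.RemoveAll(root)

	devDir := filepath.Join(root, "dev")
	if err := os.Mkdir(devDir, 0o755); err != nil {
		return nil, err
	}
	for _, name := range []string{"nvidia0", "nvidiactl", "null"} {
		if err := os.WriteFile(filepath.Join(devDir, name), nil, 0o644); err != nil {
			return nil, err
		}
	}
	return describeDevNodes(root)
}

func main() {
	// On a GPU node, devRoot "/" inspects the real /dev directory.
	nodes, err := describeDevNodes("/")
	if err != nil {
		fmt.Println("glob failed:", err)
		return
	}
	fmt.Printf("found %d /dev/nvidia* entries: %v\n", len(nodes), nodes)
}
```

Logging this output only when NVML enumerates zero GPUs keeps the normal startup path quiet while still recording the filesystem state for the failure case.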

k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Apr 10, 2026

k8s-ci-robot requested review from klueska and shivamerla April 10, 2026 11:46

k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Apr 10, 2026

kasia-kujawa mentioned this pull request Apr 11, 2026

[Bug]: ResourceSlice published with no devices on GKE #1008

Open

16 tasks

varunrsekar reviewed Apr 20, 2026

Comment thread cmd/gpu-kubelet-plugin/device_state.go Outdated

ArangoGutierrez suggested changes Apr 20, 2026

kasia-kujawa force-pushed the kkujawa_resoruceslice_empty branch from 24e31d3 to 4c4b718 Compare April 29, 2026 09:07

kasia-kujawa force-pushed the kkujawa_resoruceslice_empty branch from 4c4b718 to 4dba81f Compare May 5, 2026 13:42

k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels May 5, 2026

kasia-kujawa force-pushed the kkujawa_resoruceslice_empty branch from 4dba81f to 6adf190 Compare May 5, 2026 14:17

kasia-kujawa requested review from ArangoGutierrez and varunrsekar May 5, 2026 16:55

varunrsekar reviewed May 5, 2026

k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 8, 2026

kasia-kujawa force-pushed the kkujawa_resoruceslice_empty branch from 6adf190 to a73cfe6 Compare May 8, 2026 15:52

k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 8, 2026

varunrsekar reviewed May 8, 2026

k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 11, 2026

kasia-kujawa force-pushed the kkujawa_resoruceslice_empty branch from db47153 to 58c2ce4 Compare May 11, 2026 13:59

k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 11, 2026

varunrsekar reviewed May 11, 2026

varunrsekar reviewed May 12, 2026

shivamerla added this to the v0.4.1 milestone May 13, 2026

github-project-automation Bot added this to DRA Driver for NVIDIA GPUs May 13, 2026

github-project-automation Bot moved this to Backlog in DRA Driver for NVIDIA GPUs May 13, 2026

shivamerla assigned kasia-kujawa May 13, 2026

shivamerla moved this from Backlog to In-Review in DRA Driver for NVIDIA GPUs May 13, 2026

Retry device enumeration to prevent empty ResourceSlices

eac1262

Signed-off-by: Katarzyna Kujawa <katarzyna@cast.ai>

kasia-kujawa force-pushed the kkujawa_resoruceslice_empty branch from f1603fc to eac1262 Compare May 14, 2026 14:50

kasia-kujawa requested a review from varunrsekar May 19, 2026 09:33
