Retry device enumeration on startup to prevent empty ResourceSlices#1009

Open
kasia-kujawa wants to merge 1 commit into
kubernetes-sigs:mainfrom
kasia-kujawa:kkujawa_resoruceslice_empty

Conversation

@kasia-kujawa
Contributor

@kasia-kujawa kasia-kujawa commented Apr 10, 2026

Fixes #1008

Added a retry loop in NewDeviceState().
If the first enumeration returns 0 devices, the plugin retries every 5 seconds for up to 5 minutes before proceeding.
Errors still propagate immediately without retry.

Before the fix, no log was emitted after "Traverse GPU devices", and the empty ResourceSlice was published silently.

With the fix (nvidiaDriverRoot: /home/kubernetes/bin/nvidia/, GKE COS, Tesla T4):

I0410 08:11:16.614833  1 nvlib.go:197] Traverse GPU devices
I0410 08:11:16.779628  1 device_state.go:96] No GPU devices found yet (driver may still be initializing), retrying in 5s...
I0410 08:11:21.780288  1 nvlib.go:197] Traverse GPU devices
I0410 08:11:23.111832  1 nvlib.go:278] Adding device gpu-0 to allocatable devices
I0410 08:11:23.111862  1 allocatable.go:254] Adding allocatables for PCI bus ID: 0000:00:04.0

Full logs with the fix from nvidia-dra-driver-gpu-kubelet-plugin when GPU initialization needed slightly more time:
https://gist.github.com/kasia-kujawa/1082b48357a0ae80d663f12ee665e34c

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Apr 10, 2026
@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Apr 10, 2026
Comment thread cmd/gpu-kubelet-plugin/device_state.go Outdated
Contributor

@ArangoGutierrez ArangoGutierrez left a comment

Solid approach to the GKE-COS / slow-driver-init race — the real-life log evidence in the description is clean. Two structural concerns plus some cleanups, left inline. Requesting changes primarily on the silent empty-slice publish on timeout and the 5-minute synchronous block on startup; the rest are suggestions.

Comment thread cmd/gpu-kubelet-plugin/device_state.go Outdated
Comment thread cmd/gpu-kubelet-plugin/device_state.go Outdated
Comment thread cmd/gpu-kubelet-plugin/device_state.go Outdated
Comment thread cmd/gpu-kubelet-plugin/device_state.go Outdated
Comment thread cmd/gpu-kubelet-plugin/device_state.go Outdated
Comment thread cmd/gpu-kubelet-plugin/device_state.go Outdated
Comment thread cmd/gpu-kubelet-plugin/device_state_test.go Outdated
@kasia-kujawa
Contributor Author

@ArangoGutierrez @varunrsekar Thanks for the review! ❤️

I'm not sure if I'll have time this week to address your comments, but I definitely will next week.

@jgehrcke
Contributor

jgehrcke commented Apr 24, 2026

Even if not yet fully understood, #1008 really is an intriguing and interesting problem.

I don't want to undermine any work here, and please consider my feedback as just one item on the feedback shelf. My gut feeling is that if we have to do any retrying, we should do it in the init container, and not in the plugin startup code. Once we understand what it would take to detect that situation over there, I think it's also easy to just "wait a little longer".

Such change can be looked at as

  1. just refining the condition to wait for in the init container
  2. not changing the scope of responsibility of any component involved

Having to retry in the plugin startup code feels like the init container doesn't do its job.

@kasia-kujawa
Contributor Author

I initially thought about changes in the init container too but I didn't find good checks to add.
This comment pushed me to think more 😄, so I'll try checking in the init container whether the kernel has created the expected number of /dev device nodes (e.g., /dev/nvidia0, /dev/nvidia1).

For the curious, I'm going to test this: kasia-kujawa#8

I'll be back with results ⌛ 🧪

@kasia-kujawa kasia-kujawa force-pushed the kkujawa_resoruceslice_empty branch from 24e31d3 to 4c4b718 Compare April 29, 2026 09:07
@netlify

netlify Bot commented Apr 29, 2026

Deploy Preview for dra-driver-nvidia-gpu ready!

Name Link
🔨 Latest commit eac1262
🔍 Latest deploy log https://app.netlify.com/projects/dra-driver-nvidia-gpu/deploys/6a05e141997e22000857b7f4
😎 Deploy Preview https://deploy-preview-1009--dra-driver-nvidia-gpu.netlify.app

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: kasia-kujawa
Once this PR has been reviewed and has the lgtm label, please assign varunrsekar for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kasia-kujawa kasia-kujawa force-pushed the kkujawa_resoruceslice_empty branch from 4c4b718 to 4dba81f Compare May 5, 2026 13:42
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels May 5, 2026
@kasia-kujawa
Contributor Author

@ArangoGutierrez @varunrsekar Please take another look.

I think I addressed all review comments except Scenario 2 mentioned in this comment #1009 (comment)

Scenario 2:

PassthroughSupport featuregate disabled
DRA plugin is being initialized for the first time
nvml returns some but not all GPUs
Decision: ??

I think we can skip this scenario in this pull request to limit the scope of changes it introduces and either add an implementation for it in another pull request or skip it now and add it in the future if needed.

I tried the approach of introducing an init container to check NVML initialization (in Go, in exactly the same way as in the kubelet plugin) but it turned out that successful initialization in the init container does not guarantee successful initialization in another container 😿

I could easily reproduce this state -> successful NVML initialization in the init container, but no GPU discovered in gpu-kubelet-plugin.

It seems that there is an issue either with loading the NVML library or with the communication between the NVML library and the kernel 🤔

Most important logs from my tests with additional init container:

 [init] gpu-readiness-init logs:
        I0430 06:17:14.863067       1 main.go:125] using driver library: /driver-root/lib64/libnvidia-ml.so.580.105.08
        I0430 06:17:14.863175       1 main.go:128] using devRoot: /
        I0430 06:17:16.635580       1 ???:1] "WARNING: unable to detect IOMMU FD for [0000:00:04.0 open /sys/bus/pci/devices/0000:00:04.0/vfio-dev: no such file or directory]: %!v(MISSING)"
        I0430 06:17:16.735256       1 main.go:296] found /dev/nvidia<N> node(s): [/dev/nvidia0]
        I0430 06:17:16.735311       1 main.go:215] found 1 GPU(s) via NVML, all /dev/nvidia* device nodes present
[container] gpus logs:
        I0430 06:17:18.459633       1 utils.go:44] Commit: 5c99e6b2d2ce116e01c3c9ae5c7025fd8efa435e

        Feature gates: map[string]bool{"AllAlpha":false, "AllBeta":false, "ComputeDomainCliques":true, "ContextualLogging":true, "CrashOnNVLinkFabricErrors":true, "DeviceMetadata":false, "DynamicMIG":false, "IMEXDaemonsWithDNSNames":true, "LoggingAlphaOptions":false, "LoggingBetaOptions":true, "MPSSupport":true, "NVMLDeviceHealthCheck":false, "PassthroughSupport":false, "TimeSlicingSettings":true}
        Flags: (*main.Flags)({
          kubeClientConfig: (flags.KubeClientConfig) {
            KubeConfig: (string) "",
            KubeAPIQPS: (float64) 5,
            KubeAPIBurst: (int) 10
          },
          nodeName: (string) (len=53) "gke-e3e-gke-autoscaler-kasia-04-30-cast-pool-306fc077",
          namespace: (string) (len=6) "nvidia",
          httpEndpoint: (string) (len=5) ":8080",
          metricsPath: (string) (len=8) "/metrics",
          cdiRoot: (string) (len=12) "/var/run/cdi",
          containerDriverRoot: (string) (len=12) "/driver-root",
          hostDriverRoot: (string) (len=28) "/home/kubernetes/bin/nvidia/",
          nvidiaCDIHookPath: (string) "",
          imageName: (string) (len=67) "ghcr.io/kasia-kujawa/k8s-dra-driver-gpu:additional-init-container-5",
          kubeletRegistrarDirectoryPath: (string) (len=33) "/var/lib/kubelet/plugins_registry",
          kubeletPluginsDirectoryPath: (string) (len=24) "/var/lib/kubelet/plugins",
          healthcheckPort: (int) 51516,
          klogVerbosity: (int) 4,
          additionalXidsToIgnore: (string) ""
        })
        I0430 06:17:18.460778       1 envvar.go:195] "Feature gate default state" feature="WatchListClient" enabled=true
        I0430 06:17:18.460869       1 envvar.go:195] "Feature gate default state" feature="AtomicFIFO" enabled=true
        I0430 06:17:18.460903       1 envvar.go:195] "Feature gate default state" feature="ClientsAllowTLSCacheGC" enabled=true
        I0430 06:17:18.460928       1 envvar.go:195] "Feature gate default state" feature="ClientsPreferCBOR" enabled=false
        I0430 06:17:18.460969       1 envvar.go:195] "Feature gate default state" feature="UnlockWhileProcessingFIFO" enabled=true
        I0430 06:17:18.461000       1 envvar.go:195] "Feature gate default state" feature="ClientsAllowCARotation" enabled=true
        I0430 06:17:18.461010       1 envvar.go:195] "Feature gate default state" feature="InformerResourceVersion" enabled=true
        I0430 06:17:18.461015       1 envvar.go:195] "Feature gate default state" feature="InOrderInformers" enabled=true
        I0430 06:17:18.461020       1 envvar.go:195] "Feature gate default state" feature="InOrderInformersBatchProcess" enabled=true
        I0430 06:17:18.461025       1 envvar.go:195] "Feature gate default state" feature="ClientsAllowCBOR" enabled=false
        I0430 06:17:18.567470       1 util.go:68] Started debug signal handler(s)
        I0430 06:17:18.590249       1 device_state.go:79] Using devRoot=/
        I0430 06:17:18.590271       1 prometheus_httpserver.go:78] "Starting metrics HTTP server" endpoint=":8080" path="/metrics"
        I0430 06:17:18.590460       1 nvlib.go:198] Traverse GPU devices
        I0430 06:17:18.806616       1 device_state.go:97] Muting CDI logger (verbosity is smaller 7: 4)
        I0430 06:17:18.970373       1 device_state.go:133] Warming up CDI device spec cache for GPUs []
        I0430 06:17:18.982118       1 draplugin.go:738] "Starting"
        I0430 06:17:18.982430       1 nonblockinggrpcserver.go:90] "GRPC server started" logger="dra" endpoint="/var/lib/kubelet/plugins/gpu.nvidia.com/dra.sock"
        I0430 06:17:18.982730       1 nonblockinggrpcserver.go:90] "GRPC server started" logger="registrar" endpoint="/var/lib/kubelet/plugins_registry/gpu.nvidia.com-reg.sock"
        I0430 06:17:18.986343       1 resourceslicecontroller.go:619] "Starting ResourceSlice informer and waiting for it to sync" logger="ResourceSlice controller"
        I0430 06:17:18.986461       1 cleanup.go:125] Checkpointed RC cleanup: claims in PrepareStarted state: 0 (of 0)
        I0430 06:17:18.986574       1 health.go:103] starting healthcheck service at [::]:51516
        I0430 06:17:18.991100       1 reflector.go:425] "Starting reflector" logger="ResourceSlice controller" type="*v1.ResourceSlice" resyncPeriod="0s" reflector="k8s.io/dynamic-resource-allocation/resourceslice/resourceslicecontroller.go:625"
        I0430 06:17:18.991153       1 reflector.go:472] "Listing and watching" logger="ResourceSlice controller" type="*v1.ResourceSlice" reflector="k8s.io/dynamic-resource-allocation/resourceslice/resourceslicecontroller.go:625"
        I0430 06:17:19.026711       1 reflector.go:1080] "Exiting watch because received the bookmark that marks the end of initial events stream" logger="ResourceSlice controller" reflector="k8s.io/dynamic-resource-allocation/resourceslice/resourceslicecontroller.go:625" totalItems=1 duration="35.361425ms"
        I0430 06:17:19.026816       1 reflector.go:507] "Caches populated" logger="ResourceSlice controller" type="*v1.ResourceSlice" reflector="k8s.io/dynamic-resource-allocation/resourceslice/resourceslicecontroller.go:625"
        I0430 06:17:19.986673       1 resourceslicecontroller.go:634] "ResourceSlice informer has synced" logger="ResourceSlice controller"
        I0430 06:17:19.986759       1 resourceslicecontroller.go:223] "Starting" logger="ResourceSlice controller"
        I0430 06:17:19.986796       1 driver.go:179] Current kubelet plugin registration status: plugin_registered:true

The full logs from my experiments with the additional GPU init container, along with its implementation, are here: kasia-kujawa#9

@varunrsekar Could you make sure that this pull request doesn't break anything in PassthroughSupport? I don't have a machine on which I can test it.

Full logs from the version with the fix introduced in this pull request when the first device discovery fails:

[init] init-container logs:
        create symlink: /driver-root -> /driver-root-parent/nvidia
        2026-05-05T13:21:57Z  /driver-root (/home/kubernetes/bin/nvidia/ on host): nvidia-smi: '/driver-root/bin/nvidia-smi', libnvidia-ml.so.1: not found, current contents: [NVIDIA-Linux-x86_64-580.105.08.run
        bin
        bin-workdir
        drivers
        drivers-workdir
        firmware
        gpu_driver_versions.bin
        lib64
        lib64-workdir
        nvidia-drivers-580.105.08.tgz
        nvidia-installer.log
        share
        vulkan].
        
        Check failed. Has the NVIDIA GPU driver been set up? It is expected to be installed under NVIDIA_DRIVER_ROOT (currently set to '/home/kubernetes/bin/nvidia/') in the host filesystem. If that path appears to be unexpected: review the DRA driver's 'nvidiaDriverRoot' Helm chart variable. Otherwise, review if the GPU driver has actually been installed under that path.
        Hint: Directory /home/kubernetes/bin/nvidia/ is not empty but at least one of the binaries wasn't found.
        
        2026-05-05T13:22:07Z  /driver-root (/home/kubernetes/bin/nvidia/ on host): nvidia-smi: '/driver-root/bin/nvidia-smi', libnvidia-ml.so.1: '/driver-root/lib64/libnvidia-ml.so.1', current contents: [.cache
        NVIDIA-Linux-x86_64-580.105.08.run
        bin
        bin-workdir
        drivers
        drivers-workdir
        firmware
        gpu_driver_versions.bin
        lib64
        lib64-workdir
        nvidia-drivers-580.105.08.tgz
        nvidia-installer.log
        share
        vulkan].
        invoke: env -i LD_PRELOAD=/driver-root/lib64/libnvidia-ml.so.1 /driver-root/bin/nvidia-smi --version
        NVIDIA-SMI version  : 580.105.08
        NVML version        : 580.105
        DRIVER version      : 580.105.08
        CUDA Version        : N/A
        nvidia-smi returned with code 0: success, leave
[container] compute-domains logs:
        I0505 13:22:11.706484       1 utils.go:44] Commit: 0b6aefb41e3742b02e0dc9133f649e552b2b742d
        
        Feature gates: map[string]bool{"AllAlpha":false, "AllBeta":false, "ComputeDomainCliques":true, "ContextualLogging":true, "CrashOnNVLinkFabricErrors":true, "DeviceMetadata":false, "DynamicMIG":false, "IMEXDaemonsWithDNSNames":true, "LoggingAlphaOptions":false, "LoggingBetaOptions":true, "MPSSupport":false, "NVMLDeviceHealthCheck":false, "PassthroughSupport":false, "TimeSlicingSettings":false}
        Flags: (*main.Flags)({
          kubeClientConfig: (flags.KubeClientConfig) {
            KubeConfig: (string) "",
            KubeAPIQPS: (float64) 5,
            KubeAPIBurst: (int) 10
          },
          nodeName: (string) (len=53) "gke-pool-cb0e0d56",
          httpEndpoint: (string) "",
          metricsPath: (string) (len=8) "/metrics",
          namespace: (string) (len=6) "nvidia",
          cdiRoot: (string) (len=12) "/var/run/cdi",
          containerDriverRoot: (string) (len=12) "/driver-root",
          hostDriverRoot: (string) (len=28) "/home/kubernetes/bin/nvidia/",
          nvidiaCDIHookPath: (string) "",
          kubeletRegistrarDirectoryPath: (string) (len=33) "/var/lib/kubelet/plugins_registry",
          kubeletPluginsDirectoryPath: (string) (len=24) "/var/lib/kubelet/plugins",
          healthcheckPort: (int) 51515,
          klogVerbosity: (int) 4
        })
        I0505 13:22:11.710332       1 envvar.go:195] "Feature gate default state" feature="AtomicFIFO" enabled=true
        I0505 13:22:11.710462       1 envvar.go:195] "Feature gate default state" feature="ClientsPreferCBOR" enabled=false
        I0505 13:22:11.710476       1 envvar.go:195] "Feature gate default state" feature="InOrderInformers" enabled=true
        I0505 13:22:11.710484       1 envvar.go:195] "Feature gate default state" feature="InformerResourceVersion" enabled=true
        I0505 13:22:11.710490       1 envvar.go:195] "Feature gate default state" feature="UnlockWhileProcessingFIFO" enabled=true
        I0505 13:22:11.710495       1 envvar.go:195] "Feature gate default state" feature="InOrderInformersBatchProcess" enabled=true
        I0505 13:22:11.710501       1 envvar.go:195] "Feature gate default state" feature="ClientsAllowCARotation" enabled=true
        I0505 13:22:11.710751       1 envvar.go:195] "Feature gate default state" feature="WatchListClient" enabled=true
        I0505 13:22:11.710768       1 envvar.go:195] "Feature gate default state" feature="ClientsAllowCBOR" enabled=false
        I0505 13:22:11.710774       1 envvar.go:195] "Feature gate default state" feature="ClientsAllowTLSCacheGC" enabled=true
        I0505 13:22:11.785895       1 util.go:68] Started debug signal handler(s)
        I0505 13:22:12.541434       1 mount_linux.go:326] 'umount /tmp/kubelet-detect-safe-umount3962003754' failed with: exit status 1, output: umount: can't unmount /tmp/kubelet-detect-safe-umount3962003754: Invalid argument
        I0505 13:22:12.541504       1 mount_linux.go:328] Detected umount with unsafe 'not mounted' behavior
        I0505 13:22:12.543999       1 device_state.go:696] Starting driver version validation for IMEXDaemonsWithDNSNames feature...
        I0505 13:22:12.544027       1 device_state.go:697] Minimum required version: 570.158.01
        I0505 13:22:12.785184       1 device_state.go:715] Driver version validation passed: 580.105.8 >= 570.158.1
        I0505 13:22:12.787686       1 device_state.go:84] using devRoot=/
        ERROR: init 250 result=11
        ERROR: init 250 result=11
        I0505 13:22:13.011219       1 device_state.go:146] Create empty checkpoint
        I0505 13:22:13.115488       1 draplugin.go:738] "Starting"
        I0505 13:22:13.155986       1 nonblockinggrpcserver.go:90] "GRPC server started" logger="dra" endpoint="/var/lib/kubelet/plugins/compute-domain.nvidia.com/dra.sock"
        I0505 13:22:13.156195       1 nonblockinggrpcserver.go:90] "GRPC server started" logger="registrar" endpoint="/var/lib/kubelet/plugins_registry/compute-domain.nvidia.com-reg.sock"
        I0505 13:22:13.172278       1 reflector.go:425] "Starting reflector" type="*v1beta1.ComputeDomain" resyncPeriod="10m0s" reflector="pkg/nvidia.com/informers/externalversions/factory.go:141"
        I0505 13:22:13.172351       1 reflector.go:472] "Listing and watching" type="*v1beta1.ComputeDomain" reflector="pkg/nvidia.com/informers/externalversions/factory.go:141"
        I0505 13:22:13.242278       1 reflector.go:1080] "Exiting watch because received the bookmark that marks the end of initial events stream" reflector="pkg/nvidia.com/informers/externalversions/factory.go:141" totalItems=1 duration="69.876632ms"
        I0505 13:22:13.242388       1 reflector.go:507] "Caches populated" type="*v1beta1.ComputeDomain" reflector="pkg/nvidia.com/informers/externalversions/factory.go:141"
        I0505 13:22:13.264404       1 resourceslicecontroller.go:619] "Starting ResourceSlice informer and waiting for it to sync" logger="ResourceSlice controller"
        I0505 13:22:13.264418       1 health.go:102] Starting healthcheck server on [::]:51515
        I0505 13:22:13.264511       1 reflector.go:425] "Starting reflector" logger="ResourceSlice controller" type="*v1.ResourceSlice" resyncPeriod="0s" reflector="k8s.io/dynamic-resource-allocation/resourceslice/resourceslicecontroller.go:625"
        I0505 13:22:13.264537       1 reflector.go:472] "Listing and watching" logger="ResourceSlice controller" type="*v1.ResourceSlice" reflector="k8s.io/dynamic-resource-allocation/resourceslice/resourceslicecontroller.go:625"
        I0505 13:22:13.280332       1 cleanup.go:125] Checkpointed RC cleanup: claims in PrepareStarted state: 0 (of 0)
        I0505 13:22:13.289063       1 reflector.go:1080] "Exiting watch because received the bookmark that marks the end of initial events stream" logger="ResourceSlice controller" reflector="k8s.io/dynamic-resource-allocation/resourceslice/resourceslicecontroller.go:625" totalItems=1 duration="24.487185ms"
        I0505 13:22:13.289185       1 reflector.go:507] "Caches populated" logger="ResourceSlice controller" type="*v1.ResourceSlice" reflector="k8s.io/dynamic-resource-allocation/resourceslice/resourceslicecontroller.go:625"
        I0505 13:22:14.265278       1 resourceslicecontroller.go:634] "ResourceSlice informer has synced" logger="ResourceSlice controller"
        I0505 13:22:14.265436       1 resourceslicecontroller.go:223] "Starting" logger="ResourceSlice controller"
[container] gpus logs:
        I0505 13:22:11.948660       1 utils.go:44] Commit: 0b6aefb41e3742b02e0dc9133f649e552b2b742d
        
        Feature gates: map[string]bool{"AllAlpha":false, "AllBeta":false, "ComputeDomainCliques":true, "ContextualLogging":true, "CrashOnNVLinkFabricErrors":true, "DeviceMetadata":false, "DynamicMIG":false, "IMEXDaemonsWithDNSNames":true, "LoggingAlphaOptions":false, "LoggingBetaOptions":true, "MPSSupport":false, "NVMLDeviceHealthCheck":false, "PassthroughSupport":false, "TimeSlicingSettings":false}
        Flags: (*main.Flags)({
          kubeClientConfig: (flags.KubeClientConfig) {
            KubeConfig: (string) "",
            KubeAPIQPS: (float64) 5,
            KubeAPIBurst: (int) 10
          },
          nodeName: (string) (len=53) "gke-pool-cb0e0d56",
          namespace: (string) (len=6) "nvidia",
          httpEndpoint: (string) "",
          metricsPath: (string) (len=8) "/metrics",
          cdiRoot: (string) (len=12) "/var/run/cdi",
          containerDriverRoot: (string) (len=12) "/driver-root",
          hostDriverRoot: (string) (len=28) "/home/kubernetes/bin/nvidia/",
          nvidiaCDIHookPath: (string) "",
          imageName: (string) (len=63) "ghcr.io/kasia-kujawa/k8s-dra-driver-gpu:retries-in-background-1",
          kubeletRegistrarDirectoryPath: (string) (len=33) "/var/lib/kubelet/plugins_registry",
          kubeletPluginsDirectoryPath: (string) (len=24) "/var/lib/kubelet/plugins",
          healthcheckPort: (int) 51516,
          klogVerbosity: (int) 4,
          additionalXidsToIgnore: (string) "",
          deviceEnumerationRetrySteps: (int) 15,
          deviceEnumerationRetryMaxInterval: (time.Duration) 30000000000
        })
        I0505 13:22:11.950438       1 envvar.go:195] "Feature gate default state" feature="InformerResourceVersion" enabled=true
        I0505 13:22:11.950547       1 envvar.go:195] "Feature gate default state" feature="AtomicFIFO" enabled=true
        I0505 13:22:11.950594       1 envvar.go:195] "Feature gate default state" feature="InOrderInformers" enabled=true
        I0505 13:22:11.950667       1 envvar.go:195] "Feature gate default state" feature="InOrderInformersBatchProcess" enabled=true
        I0505 13:22:11.950704       1 envvar.go:195] "Feature gate default state" feature="ClientsAllowCARotation" enabled=true
        I0505 13:22:11.950758       1 envvar.go:195] "Feature gate default state" feature="UnlockWhileProcessingFIFO" enabled=true
        I0505 13:22:11.950792       1 envvar.go:195] "Feature gate default state" feature="ClientsAllowTLSCacheGC" enabled=true
        I0505 13:22:11.950845       1 envvar.go:195] "Feature gate default state" feature="ClientsPreferCBOR" enabled=false
        I0505 13:22:11.950909       1 envvar.go:195] "Feature gate default state" feature="ClientsAllowCBOR" enabled=false
        I0505 13:22:11.950969       1 envvar.go:195] "Feature gate default state" feature="WatchListClient" enabled=true
        I0505 13:22:11.984215       1 util.go:68] Started debug signal handler(s)
        I0505 13:22:12.041700       1 device_state.go:94] Using devRoot=/
        I0505 13:22:12.041867       1 device_state.go:107] Muting CDI logger (verbosity is smaller 7: 4)
        I0505 13:22:12.497269       1 nvlib.go:197] Traverse GPU devices
        I0505 13:22:12.664423       1 device_state.go:1321] No GPU devices discovered on enumeration attempt; will retry in background
        I0505 13:22:12.664631       1 draplugin.go:738] "Starting"
        I0505 13:22:12.783646       1 nonblockinggrpcserver.go:90] "GRPC server started" logger="dra" endpoint="/var/lib/kubelet/plugins/gpu.nvidia.com/dra.sock"
        I0505 13:22:12.784609       1 nonblockinggrpcserver.go:90] "GRPC server started" logger="registrar" endpoint="/var/lib/kubelet/plugins_registry/gpu.nvidia.com-reg.sock"
        I0505 13:22:12.817241       1 health.go:103] starting healthcheck service at [::]:51516
        I0505 13:22:12.841200       1 resourceslicecontroller.go:619] "Starting ResourceSlice informer and waiting for it to sync" logger="ResourceSlice controller"
        I0505 13:22:12.842138       1 reflector.go:425] "Starting reflector" logger="ResourceSlice controller" type="*v1.ResourceSlice" resyncPeriod="0s" reflector="k8s.io/dynamic-resource-allocation/resourceslice/resourceslicecontroller.go:625"
        I0505 13:22:12.843037       1 reflector.go:472] "Listing and watching" logger="ResourceSlice controller" type="*v1.ResourceSlice" reflector="k8s.io/dynamic-resource-allocation/resourceslice/resourceslicecontroller.go:625"
        I0505 13:22:12.878669       1 reflector.go:1080] "Exiting watch because received the bookmark that marks the end of initial events stream" logger="ResourceSlice controller" reflector="k8s.io/dynamic-resource-allocation/resourceslice/resourceslicecontroller.go:625" totalItems=1 duration="35.558419ms"
        I0505 13:22:12.878812       1 reflector.go:507] "Caches populated" logger="ResourceSlice controller" type="*v1.ResourceSlice" reflector="k8s.io/dynamic-resource-allocation/resourceslice/resourceslicecontroller.go:625"
        I0505 13:22:12.940780       1 cleanup.go:125] Checkpointed RC cleanup: claims in PrepareStarted state: 0 (of 0)
        I0505 13:22:13.842114       1 resourceslicecontroller.go:634] "ResourceSlice informer has synced" logger="ResourceSlice controller"
        I0505 13:22:13.842169       1 resourceslicecontroller.go:223] "Starting" logger="ResourceSlice controller"
        I0505 13:22:13.842696       1 nvlib.go:197] Traverse GPU devices
        I0505 13:22:13.842757       1 driver.go:184] Current kubelet plugin registration status: plugin_registered:true
        I0505 13:22:15.235357       1 nvlib.go:278] Adding device gpu-0 to allocatable devices
        I0505 13:22:15.236033       1 allocatable.go:243] Adding allocatables for PCI bus ID: 0000:00:04.0
        I0505 13:22:15.605154       1 device_state.go:1367] Warming up CDI device spec cache for GPUs [GPU-8b2035f0-dc4f-bf93-7a32-83e2d13fcec4]
        I0505 13:22:19.114877       1 cdi.go:161] GetDeviceSpecsByID() called for GPU-8b2035f0-dc4f-bf93-7a32-83e2d13fcec4, t_cdi_get_device_specs_by_id 3.510 s
        I0505 13:22:19.115172       1 driver.go:508] About to announce device gpu-0
        I0505 13:22:19.115259       1 driver.go:234] Background device enumeration complete; ResourceSlice republished with populated devices
        I0505 13:22:20.073089       1 driver.go:446] Returning newly prepared devices for claim 'drabasic-3cb498a20ed970517d3c703/sample-dra:0b345f39-971f-4dfc-8025-4790b631c1e4': [{[gpu] gke-pool-cb0e0d56 gpu-0 [k8s.gpu.nvidia.com/claim=0b345f39-971f-4dfc-8025-4790b631c1e4-gpu-0] <nil> <nil>}]

@kasia-kujawa kasia-kujawa force-pushed the kkujawa_resoruceslice_empty branch from 4dba81f to 6adf190 Compare May 5, 2026 14:17
Contributor

@varunrsekar varunrsekar left a comment

Partial review. Will do a more thorough pass of the changes over time.

Comment thread cmd/gpu-kubelet-plugin/device_state.go Outdated
Comment thread cmd/gpu-kubelet-plugin/device_state.go Outdated
Comment thread cmd/gpu-kubelet-plugin/device_state.go Outdated
Comment thread cmd/gpu-kubelet-plugin/device_state.go Outdated
Comment thread cmd/gpu-kubelet-plugin/driver.go Outdated
Comment thread cmd/gpu-kubelet-plugin/device_state.go Outdated
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 8, 2026
@kasia-kujawa kasia-kujawa force-pushed the kkujawa_resoruceslice_empty branch from 6adf190 to a73cfe6 Compare May 8, 2026 15:52
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 8, 2026
Comment thread cmd/gpu-kubelet-plugin/device_state.go
Comment on lines +181 to +184
	if !state.AllocatableReady() {
		driver.wg.Add(1)
		go driver.backgroundInit(ctx, config)
	}
Contributor

There's no reason to do a lazy retry that reimplements driver initialization. Only device enumeration should be retried and Driver shouldn't initialize if device enumeration failed.

Contributor Author

It's there because of 3113277117 (Primary blocker).

Once we can't block, the post-enumeration steps (MIG cleanup, health monitor, publishResources) have to run after the retry succeeds, which is what backgroundInit does.

If you had something else in mind, let me know 🙏

Contributor

Thanks for that context. Will review with it in mind.

Comment thread cmd/gpu-kubelet-plugin/device_state.go Outdated
Comment thread cmd/gpu-kubelet-plugin/device_state.go Outdated
Comment thread cmd/gpu-kubelet-plugin/device_state.go Outdated
}
return nil, fmt.Errorf("error enumerating all possible devices: %w", err)
}
if len(perGPU.allocatablesMap) == 0 {
Contributor

Here, we're assuming that either ALL GPUs are initialized or NONE of the GPUs are initialized. Is it possible for partial initialization? Do we care about it?

Contributor Author

I think we can leave partial initialization (some GPUs not yet initialized) out of this pull request to limit its scope. We can handle it in a follow-up pull request, or add it later if it turns out to be needed. I haven't observed partial initialization so far.

If we do want this check, we could probably verify that every GPU visible as a PCI device is also visible via NVML 🤔

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 11, 2026
@kasia-kujawa kasia-kujawa force-pushed the kkujawa_resoruceslice_empty branch from db47153 to 58c2ce4 Compare May 11, 2026 13:59
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 11, 2026
Comment thread cmd/gpu-kubelet-plugin/device_enumerator.go Outdated
Comment thread cmd/gpu-kubelet-plugin/device_enumerator.go Outdated
Comment thread cmd/gpu-kubelet-plugin/device_enumerator.go
Comment thread cmd/gpu-kubelet-plugin/device_state.go
Comment thread cmd/gpu-kubelet-plugin/device_state.go Outdated
Comment thread cmd/gpu-kubelet-plugin/device_state.go Outdated
Comment thread deployments/helm/dra-driver-nvidia-gpu/values.yaml Outdated
@jgehrcke
Contributor

jgehrcke commented May 12, 2026

I could easily reproduce this state -> successful NVML initialization in the init container, but no GPU discovered in gpu-kubelet-plugin.

That is still somewhat frightening, and we should talk to more people and teams about that.

we're assuming that either ALL GPUs are initialized or NONE of the GPUs are initialized. Is it possible for partial initialization?

That is a really important question.

For posterity, I've found a related discussion (Azure context):
https://learn.microsoft.com/en-us/answers/questions/2285401/tasks-fail-to-detect-gpu-on-some-pool-nodes-due-to

For the scenario where NVML reported at least one GPU in the init container, but zero GPUs in the main container it would be good to confirm explicitly the state of dev nodes in the main container file system. I didn't follow the exchange above in detail, so we may have already done this. (it would be important to confirm if NVML may report zero devices despite the filesystem state looking as expected -- if the filesystem state is unexpected then this may greatly facilitate finding the root cause).

Thanks for the great work here!

Contributor

@varunrsekar varunrsekar left a comment

@kasia-kujawa Once you address the changes, please squash the commits

Comment on lines +59 to +66
if len(perGPU.allocatablesMap) == 0 {
if checkpointHasPreparedDevices(cp) {
klog.Infof("No GPU devices discovered via NVML but the checkpoint has prepared devices, not retrying (unhealthy device state, retry won't help)")
return perGPU, nil
}
klog.Infof("No GPU devices discovered on enumeration attempt; will retry")
return nil, nil
}
Contributor

The spirit of my original comment was that we don't need to worry about this differentiation :). If allocatable is empty, we can simply retry. Unhealthy GPUs are a problem, but we don't need to solve it here.

Suggested change
if len(perGPU.allocatablesMap) == 0 {
if checkpointHasPreparedDevices(cp) {
klog.Infof("No GPU devices discovered via NVML but the checkpoint has prepared devices, not retrying (unhealthy device state, retry won't help)")
return perGPU, nil
}
klog.Infof("No GPU devices discovered on enumeration attempt; will retry")
return nil, nil
}
if len(perGPU.allocatablesMap) == 0 {
// Caveat: we may end up in this state due to unhealthy GPUs. This needs to be revisited in the future
klog.Infof("No GPU devices discovered on enumeration attempt; will retry")
return nil, nil
}

Contributor Author

@kasia-kujawa kasia-kujawa May 14, 2026

Hope that this time I understood your idea 😅 @varunrsekar please check it

Comment thread cmd/gpu-kubelet-plugin/device_enumerator.go Outdated
Comment thread cmd/gpu-kubelet-plugin/driver.go
Comment thread cmd/gpu-kubelet-plugin/device_enumerator.go
Comment thread cmd/gpu-kubelet-plugin/device_enumerator.go Outdated
@kasia-kujawa
Contributor Author

kasia-kujawa commented May 13, 2026

I could easily reproduce this state -> successful NVML initialization in the init container, but no GPU discovered in gpu-kubelet-plugin.

That is still somewhat frightening, and we should talk to more people and teams about that.

we're assuming that either ALL GPUs are initialized or NONE of the GPUs are initialized. Is it possible for partial initialization?

That is a really important question.

For posterity, I've found a related discussion (Azure context): https://learn.microsoft.com/en-us/answers/questions/2285401/tasks-fail-to-detect-gpu-on-some-pool-nodes-due-to

For the scenario where NVML reported at least one GPU in the init container, but zero GPUs in the main container it would be good to confirm explicitly the state of dev nodes in the main container file system. I didn't follow the exchange above in detail, so we may have already done this. (it would be important to confirm if NVML may report zero devices despite the filesystem state looking as expected -- if the filesystem state is unexpected then this may greatly facilitate finding the root cause).

Thanks for the great work here!

@jgehrcke some more observations: I can easily reproduce this on NVIDIA T4, where I see the issue within one or two retries. The discussion you linked also mentions T4. When I tested on P4, by contrast, I couldn't reproduce the issue.

So far I had only checked the dev nodes in the init container, to see whether that would help us build a better init container. It didn't help :D I can do one more test 🧪 and check the dev nodes in the main container when NVML reports 0 GPUs.

Signed-off-by: Katarzyna Kujawa <katarzyna@cast.ai>
@kasia-kujawa kasia-kujawa force-pushed the kkujawa_resoruceslice_empty branch from f1603fc to eac1262 Compare May 14, 2026 14:50
@kasia-kujawa kasia-kujawa requested a review from varunrsekar May 19, 2026 09:33
@kasia-kujawa
Contributor Author

kasia-kujawa commented May 20, 2026

@jgehrcke I prepared one more debugging build and ran the test: NVML found the GPU in the additional init container, but NVML didn't find the GPU in the gpus container, even though the GPU is visible under /dev.

the most important logs:
additional gpu init container:

[init] gpu-readiness-init logs:
        I0520 10:40:24.441272       1 main.go:125] using driver library: /driver-root/lib64/libnvidia-ml.so.580.105.08
        I0520 10:40:24.441407       1 main.go:128] using devRoot: /
        I0520 10:40:26.208589       1 ???:1] "WARNING: unable to detect IOMMU FD for [0000:00:04.0 open /sys/bus/pci/devices/0000:00:04.0/vfio-dev: no such file or directory]: %!v(MISSING)"
        I0520 10:40:26.286093       1 main.go:296] found /dev/nvidia<N> node(s): [/dev/nvidia0]
        I0520 10:40:26.286144       1 main.go:215] found 1 GPU(s) via NVML, all /dev/nvidia* device nodes present

gpus container:

        I0520 10:40:28.373573       1 device_state.go:80] Using devRoot=/
        I0520 10:40:28.373822       1 nvlib.go:198] Traverse GPU devices
        I0520 10:40:28.373603       1 prometheus_httpserver.go:78] "Starting metrics HTTP server" endpoint=":8080" path="/metrics"
        W0520 10:40:28.664934       1 device_state.go:240] diagnostic: NVML enumerated 0 GPUs but /dev/nvidia* nodes are present under /dev: [/dev/nvidia-caps (mode=drwxr-xr-x) /dev/nvidia-modeset (mode=Dcrw-rw-rw-) /dev/nvidia-uvm (mode=Dcrw-rw-rw-) /dev/nvidia-uvm-tools (mode=Dcrw-rw-rw-) /dev/nvidia0 (mode=Dcrw-rw-rw-) /dev/nvidiactl (mode=Dcrw-rw-rw-)]
        I0520 10:40:28.664985       1 device_state.go:106] Muting CDI logger (verbosity is smaller 7: 4)
        I0520 10:40:28.832405       1 device_state.go:142] Warming up CDI device spec cache for GPUs []

full logs are here with the reference to the code which I used to check it: kasia-kujawa#9 (comment)

Labels

cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.

Projects

Status: In-Review

Development

Successfully merging this pull request may close these issues.

[Bug]: ResourceSlice published with no devices on GKE

6 participants