Component
gpu-kubelet-plugin
Bug Description
On GKE (COS image), I occasionally see a ResourceSlice published with an empty spec.devices:
metadata:
  creationTimestamp: "2026-04-03T11:14:29Z"
  generateName: gke-pool-94048ba1-gpu.nvidia.com-
  generation: 1
  name: gke-pool-94048ba1-gpu.nssw75
  ownerReferences:
  - apiVersion: v1
    controller: true
    kind: Node
    name: gke-pool-94048ba1
    uid: 8dece9be-6d55-4886-868a-b32012ad1cec
  resourceVersion: "1775214869927359005"
  uid: a4051477-ef9b-41ab-8151-55196e8d6315
spec:
  driver: gpu.nvidia.com
  nodeName: gke-pool-94048ba1
  pool:
    generation: 1
    name: gke-pool-94048ba1
    resourceSliceCount: 1
The spec.devices field is missing entirely. Restarting the nvidia-dra-driver-gpu-kubelet-plugin always fixes it: after the restart, the ResourceSlice contains the expected GPU devices.
The issue is not easy to reproduce. I captured it once manually, but it also happens from time to time in automated tests.
On GKE COS, the GPU driver is installed by a DaemonSet, so it seems the GPU is not yet fully initialized when nvidia-dra-driver-gpu-kubelet-plugin starts.
The issue has never appeared on AKS or EKS, which I test in the same way. On AKS and EKS, GPU drivers are installed as part of node preparation, so the node appears with GPU drivers already installed.
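If the hypothesis above is right (the plugin enumerates GPUs before the DaemonSet-installed driver is ready), one possible mitigation is to keep polling until at least one device shows up before publishing anything. The sketch below only illustrates that idea. It is not the plugin's actual code (the plugin is written in Go), and `enumerate_devices` is a hypothetical callable standing in for whatever the plugin uses to discover GPUs.

```python
import time


def wait_for_devices(enumerate_devices, timeout_s=300.0, interval_s=5.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Poll a device-enumeration callable until it returns a non-empty result.

    enumerate_devices is a HYPOTHETICAL stand-in for GPU discovery; it is
    assumed to return a list that is empty while the driver is not ready.
    Raises TimeoutError if no device appears within timeout_s seconds.
    """
    deadline = clock() + timeout_s
    while True:
        devices = enumerate_devices()
        if devices:
            # Driver is ready and at least one GPU is visible.
            return devices
        if clock() >= deadline:
            raise TimeoutError("no devices found within %.0f seconds" % timeout_s)
        # Back off before asking again, so a slow driver install is tolerated.
        sleep(interval_s)
```

The `sleep` and `clock` parameters exist only so the loop can be exercised without real waiting. Whether an empty device list should block publication or just trigger a later republish is a design decision for the maintainers.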
Steps to Reproduce
The issue is not easy to reproduce as it happens only sometimes.
- Install NVIDIA DRA driver on a GKE cluster with COS image (nvidiaDriverRoot: /home/kubernetes/bin/nvidia/)
- Create a new node
- Check the ResourceSlice
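For automated tests, the last step can be scripted. The following is a minimal sketch that assumes the input is the parsed JSON from `kubectl get resourceslices.resource.k8s.io -o json` (a list object with an `items` array). The function name `slices_without_devices` is my own, not part of any tool.

```python
def slices_without_devices(slice_list, driver="gpu.nvidia.com"):
    """Return the names of ResourceSlices for `driver` with no devices.

    slice_list is assumed to be the parsed output of
    `kubectl get resourceslices.resource.k8s.io -o json`.
    A slice is reported when spec.devices is missing or an empty list.
    """
    names = []
    for item in slice_list.get("items", []):
        spec = item.get("spec", {})
        if spec.get("driver") != driver:
            # Ignore slices published by other DRA drivers.
            continue
        if not spec.get("devices"):
            names.append(item["metadata"]["name"])
    return names
```

Feeding it the slice shown in the bug description (loaded with `json.load`) would report that slice by name, while a slice with a populated spec.devices list would not be reported.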
Expected Behavior
ResourceSlice contains GPU devices in spec.devices.
DRA Driver Version
v25.12.0
Kubernetes Version
1.34.4-gke.1193000
GPU Model
Tesla T4
NVIDIA Driver Version
580.105.08
OS / Kernel
COS, 6.12.55+
Container Runtime
No response
Feature Gates (non-default settings)
No response
Helm Values (non-default)
nvidiaDriverRoot: /home/kubernetes/bin/nvidia/
Relevant Logs
Debug Information Attached
- kubectl get pods -n nvidia-dra-driver-gpu -o wide
- kubectl get resourceclaims -n <namespace>
- kubectl get resourceslices.resource.k8s.io
- kubectl describe pod <pod> or kubectl events --for pod/<pod>
- kubectl logs -n nvidia-dra-driver-gpu -l nvidia-dra-driver-gpu-component=kubelet-plugin --all-containers --prefix --tail=400
- nvidia-smi output from the host
IMEX / ComputeDomain Debug Information (if applicable)
- systemctl status nvidia-imex.service (must be masked/disabled)
- kubectl get nodes -L nvidia.com/gpu.clique
- nvidia-smi -q | grep -E 'ClusterUUID|CliqueId'
- nvidia-smi topo -m
- kubectl get computedomains.resource.nvidia.com -o yaml
- kubectl get computedomaincliques.resource.nvidia.com -o yaml
- kubectl get pods -n <namespace> -l resource.nvidia.com/computeDomain
- kubectl logs -n nvidia-dra-driver-gpu -l resource.nvidia.com/computeDomain --all-containers --prefix --tail=-1
- nvidia-imex-ctl -c /etc/nvidia-imex/config.cfg -N