
[Bug]: ResourceSlice published with no devices on GKE #1008

@kasia-kujawa


Component

gpu-kubelet-plugin

Bug Description

On GKE (COS image), I occasionally see a ResourceSlice published with an empty spec.devices:

metadata:
  creationTimestamp: "2026-04-03T11:14:29Z"
  generateName: gke-pool-94048ba1-gpu.nvidia.com-
  generation: 1
  name: gke-pool-94048ba1-gpu.nssw75
  ownerReferences:
  - apiVersion: v1
    controller: true
    kind: Node
    name: gke-pool-94048ba1
    uid: 8dece9be-6d55-4886-868a-b32012ad1cec
  resourceVersion: "1775214869927359005"
  uid: a4051477-ef9b-41ab-8151-55196e8d6315
spec:
  driver: gpu.nvidia.com
  nodeName: gke-pool-94048ba1
  pool:
    generation: 1
    name: gke-pool-94048ba1
    resourceSliceCount: 1

The spec.devices field is absent entirely. Restarting the nvidia-dra-driver-gpu-kubelet-plugin always fixes it: after the restart, the ResourceSlice contains the expected GPU devices.
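To spot affected nodes without reading each object by hand, the check can be scripted. The following is a minimal sketch, not part of the driver: it assumes the JSON comes from `kubectl get resourceslices.resource.k8s.io -o json`, and the helper names `slices_without_devices` and `restart_plugin_command` are hypothetical. The restart command assumes that deleting the kubelet-plugin pod on the node is an acceptable way to restart it; the namespace and label are taken from the debug commands in this issue template.

```python
import json


def slices_without_devices(resource_slices_json, driver="gpu.nvidia.com"):
    """Return (slice name, node name) pairs for slices of `driver` that list no devices."""
    doc = json.loads(resource_slices_json)
    # kubectl returns a List object with "items"; a single object is accepted too.
    items = doc.get("items", [doc])
    empty = []
    for item in items:
        spec = item.get("spec", {})
        if spec.get("driver") != driver:
            continue
        # The field may be missing entirely (as in this report) or be an empty list.
        if not spec.get("devices"):
            empty.append((item["metadata"]["name"], spec.get("nodeName")))
    return empty


def restart_plugin_command(node_name, namespace="nvidia-dra-driver-gpu"):
    """Build (but do not run) a kubectl command that deletes the plugin pod on one node.

    Assumption: deleting the pod makes its controller recreate it, which acts as a restart.
    """
    return [
        "kubectl", "delete", "pod",
        "-n", namespace,
        "-l", "nvidia-dra-driver-gpu-component=kubelet-plugin",
        "--field-selector", f"spec.nodeName={node_name}",
    ]
```

Feeding the function the output of the `kubectl get` command above yields the names of the empty slices and their nodes; each node name can then be passed to `restart_plugin_command` and the resulting argument list handed to `subprocess.run` if the workaround should be automated.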

The issue is not easy to reproduce. I captured it once manually, and it also happens from time to time in automated tests.

On GKE COS, the GPU driver is installed by a DaemonSet, so it seems the GPU is not yet fully initialized when nvidia-dra-driver-gpu-kubelet-plugin starts.

The issue has never appeared on AKS or EKS, which I test in the same way. On AKS and EKS, GPU drivers are installed as part of node preparation, so the node appears with GPU drivers already installed.
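If the startup-ordering hypothesis is right, one way to tolerate a late driver install is to retry device enumeration before publishing anything. The sketch below only illustrates that idea; it is not the driver's actual code, and `wait_for_devices`, `enumerate_devices`, and the default attempt count and delay are all hypothetical.

```python
import time


def wait_for_devices(enumerate_devices, attempts=10, delay_seconds=5.0, sleep=time.sleep):
    """Call `enumerate_devices` until it returns a non-empty list or attempts run out.

    `enumerate_devices` stands in for whatever discovers GPUs on the node; it is a
    placeholder, not a real driver API. `sleep` is injectable to keep tests fast.
    """
    for attempt in range(1, attempts + 1):
        devices = enumerate_devices()
        if devices:
            return devices
        # Do not sleep after the final attempt; fail straight away instead.
        if attempt < attempts:
            sleep(delay_seconds)
    raise TimeoutError(f"no GPU devices found after {attempts} attempts")
```

Raising on timeout instead of returning an empty list is a deliberate choice in this sketch: the caller then has to decide explicitly whether to publish an empty ResourceSlice or to fail and be restarted.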

Steps to Reproduce

The issue is intermittent, so the steps below do not trigger it every time.

  1. Install the NVIDIA DRA driver on a GKE cluster with the COS image (nvidiaDriverRoot: /home/kubernetes/bin/nvidia/)
  2. Create a new node
  3. Check the ResourceSlice

Expected Behavior

ResourceSlice contains GPU devices in spec.devices.

DRA Driver Version

v25.12.0

Kubernetes Version

1.34.4-gke.1193000

GPU Model

Tesla T4

NVIDIA Driver Version

580.105.08

OS / Kernel

COS, 6.12.55+

Container Runtime

No response

Feature Gates (non-default settings)

No response

Helm Values (non-default)

nvidiaDriverRoot: /home/kubernetes/bin/nvidia/

Relevant Logs

Debug Information Attached

  • kubectl get pods -n nvidia-dra-driver-gpu -o wide
  • kubectl get resourceclaims -n <namespace>
  • kubectl get resourceslices.resource.k8s.io
  • kubectl describe pod <pod> or kubectl events --for pod/<pod>
  • Kubelet plugin logs: kubectl logs -n nvidia-dra-driver-gpu -l nvidia-dra-driver-gpu-component=kubelet-plugin --all-containers --prefix --tail=400
  • nvidia-smi output from the host
  • Kubelet logs

IMEX / ComputeDomain Debug Information (if applicable)

  • Host IMEX service disabled: systemctl status nvidia-imex.service (must be masked/disabled)
  • Node clique labels: kubectl get nodes -L nvidia.com/gpu.clique
  • Per-GPU clique info: nvidia-smi -q | grep -E 'ClusterUUID|CliqueId'
  • NVLink/NVSwitch topology: nvidia-smi topo -m
  • ComputeDomain status: kubectl get computedomains.resource.nvidia.com -o yaml
  • ComputeDomainClique status: kubectl get computedomaincliques.resource.nvidia.com -o yaml
  • IMEX daemon pods: kubectl get pods -n <namespace> -l resource.nvidia.com/computeDomain
  • IMEX daemon logs: kubectl logs -n nvidia-dra-driver-gpu -l resource.nvidia.com/computeDomain --all-containers --prefix --tail=-1
  • IMEX domain status: nvidia-imex-ctl -c /etc/nvidia-imex/config.cfg -N


Labels

component/gpu-kubelet-plugin, kind/bug, needs-triage


Projects

Status
In-Review
