[Bug]: ResourceSlice published with no devices on GKE #1008

### Component

gpu-kubelet-plugin

### Bug Description

On GKE (COS image), I occasionally see a `ResourceSlice` published with an empty `spec.devices`:

```yaml
metadata:
  creationTimestamp: "2026-04-03T11:14:29Z"
  generateName: gke-pool-94048ba1-gpu.nvidia.com-
  generation: 1
  name: gke-pool-94048ba1-gpu.nssw75
  ownerReferences:
  - apiVersion: v1
    controller: true
    kind: Node
    name: gke-pool-94048ba1
    uid: 8dece9be-6d55-4886-868a-b32012ad1cec
  resourceVersion: "1775214869927359005"
  uid: a4051477-ef9b-41ab-8151-55196e8d6315
spec:
  driver: gpu.nvidia.com
  nodeName: gke-pool-94048ba1
  pool:
    generation: 1
    name: gke-pool-94048ba1
    resourceSliceCount: 1
```

There is no `spec.devices` field at all. Restarting `nvidia-dra-driver-gpu-kubelet-plugin` always fixes it: after the restart, the `ResourceSlice` contains the expected GPU devices.

The issue is hard to reproduce. I captured it once manually, but it also shows up from time to time in automated tests.

On GKE COS, the GPU driver is installed by a DaemonSet, so it seems the GPU may not be fully initialized yet when `nvidia-dra-driver-gpu-kubelet-plugin` starts.

The issue has never appeared on AKS or EKS, which I test in the same way. On AKS and EKS, GPU drivers are installed as part of node preparation, so the node appears with GPU drivers already installed.
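Since restarting the plugin reliably repopulates the slice, a per-node restart can serve as a stopgap in test automation. The sketch below is only an illustration of that workaround, not a fix: the helper names are hypothetical, the namespace and label selector are taken from the debug checklist further down, and it assumes the plugin pod is recreated automatically (for example by a DaemonSet) after deletion.

```python
import subprocess

# Namespace and label selector as they appear in the debug checklist of this
# report; adjust them if your installation differs.
NAMESPACE = "nvidia-dra-driver-gpu"
PLUGIN_SELECTOR = "nvidia-dra-driver-gpu-component=kubelet-plugin"


def restart_plugin_command(node_name: str) -> list:
    """Build the kubectl command that deletes the kubelet-plugin pod on one node."""
    if not node_name:
        raise ValueError("node_name must not be empty")
    return [
        "kubectl", "delete", "pod",
        "-n", NAMESPACE,
        "-l", PLUGIN_SELECTOR,
        # Restrict the deletion to pods scheduled on the affected node.
        "--field-selector", f"spec.nodeName={node_name}",
    ]


def restart_plugin(node_name: str) -> None:
    """Delete the plugin pod (hypothetical helper; assumes it gets recreated)."""
    subprocess.run(restart_plugin_command(node_name), check=True)
```

For the node in the example above, this would be `restart_plugin("gke-pool-94048ba1")`.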


### Steps to Reproduce

The issue is hard to reproduce because it happens only intermittently.

1. Install the NVIDIA DRA driver on a GKE cluster with the COS image (`nvidiaDriverRoot: /home/kubernetes/bin/nvidia/`)
2. Create a new node
3. Check the `ResourceSlice`
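Because the failure is intermittent, the check in step 3 is easier to run automatically after each node is created. The following is a minimal sketch of such a check, assuming `kubectl` access to the cluster; the function names are hypothetical, and a slice is treated as empty when `spec.devices` is either missing or an empty list.

```python
import json
import subprocess


def load_slices() -> dict:
    """Fetch all ResourceSlices as parsed JSON (requires kubectl access)."""
    result = subprocess.run(
        ["kubectl", "get", "resourceslices.resource.k8s.io", "-o", "json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)


def find_empty_slices(slice_list: dict, driver: str = "gpu.nvidia.com") -> list:
    """Return (slice name, node name) pairs for slices of `driver` with no devices."""
    empty = []
    for item in slice_list.get("items", []):
        spec = item.get("spec", {})
        if spec.get("driver") != driver:
            continue
        # A missing field and an empty list both count as "no devices".
        if not spec.get("devices"):
            empty.append((item["metadata"]["name"], spec.get("nodeName", "")))
    return empty
```

Calling `find_empty_slices(load_slices())` returns an empty list when every `gpu.nvidia.com` slice has devices, so a test can assert on that.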

### Expected Behavior

`ResourceSlice` contains GPU devices in `spec.devices`.

### DRA Driver Version

v25.12.0

### Kubernetes Version

1.34.4-gke.1193000

### GPU Model

Tesla T4

### NVIDIA Driver Version

580.105.08

### OS / Kernel

COS, 6.12.55+

### Container Runtime

_No response_

### Feature Gates (non-default settings)

_No response_

### Helm Values (non-default)

```yaml
nvidiaDriverRoot: /home/kubernetes/bin/nvidia/
```

### Relevant Logs

```shell

```

### Debug Information Attached

- [ ] `kubectl get pods -n nvidia-dra-driver-gpu -o wide`
- [ ] `kubectl get resourceclaims -n <namespace>`
- [ ] `kubectl get resourceslices.resource.k8s.io`
- [ ] `kubectl describe pod <pod>` or `kubectl events --for pod/<pod>`
- [ ] Kubelet plugin logs: `kubectl logs -n nvidia-dra-driver-gpu -l nvidia-dra-driver-gpu-component=kubelet-plugin --all-containers --prefix --tail=400`
- [ ] `nvidia-smi` output from the host
- [ ] Kubelet logs

### IMEX / ComputeDomain Debug Information (if applicable)

- [ ] Host IMEX service disabled: `systemctl status nvidia-imex.service` (must be masked/disabled)
- [ ] Node clique labels: `kubectl get nodes -L nvidia.com/gpu.clique`
- [ ] Per-GPU clique info: `nvidia-smi -q | grep -E 'ClusterUUID|CliqueId'`
- [ ] NVLink/NVSwitch topology: `nvidia-smi topo -m`
- [ ] ComputeDomain status: `kubectl get computedomains.resource.nvidia.com -o yaml`
- [ ] ComputeDomainClique status: `kubectl get computedomaincliques.resource.nvidia.com -o yaml`
- [ ] IMEX daemon pods: `kubectl get pods -n <namespace> -l resource.nvidia.com/computeDomain`
- [ ] IMEX daemon logs: `kubectl logs -n nvidia-dra-driver-gpu -l resource.nvidia.com/computeDomain --all-containers --prefix --tail=-1`
- [ ] IMEX domain status: `nvidia-imex-ctl -c /etc/nvidia-imex/config.cfg -N`
