[Bug]: MPS DaemonSet missing imagePullSecrets causes ImagePullBackOff on Kubernetes 1.35+ #1045

@shengnuo

Description

Component

gpu-kubelet-plugin

Bug Description

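See #958 for details.

The same issue also affects the MPS DaemonSets: they are created without imagePullSecrets, so they cannot pull the driver image from a private registry.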

Steps to Reproduce

  1. Deploy the GPU DRA driver using an image from a private registry
  2. Create a ResourceClaim with the MPS sharing strategy
apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
 name: gpu-mps-app
spec:
 devices:
   requests:
   - name: mps-gpu
     exactly:
       deviceClassName: gpu.nvidia.com
   config:
   - requests: ["mps-gpu"]
     opaque:
       driver: gpu.nvidia.com
       parameters:
         apiVersion: resource.nvidia.com/v1beta1
         kind: GpuConfig
         sharing:
           strategy: MPS
           mpsConfig:
             defaultActiveThreadPercentage: 33
             defaultPinnedDeviceMemoryLimit: 5Gi
  3. Create a Kubernetes resource to consume the ResourceClaim
  4. Observe that the MPS DaemonSet is stuck in ImagePullBackOff
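
For the step that creates a resource to consume the ResourceClaim, a minimal Pod along the following lines should be enough to trigger creation of the MPS DaemonSet. This is only a sketch: the Pod name, container name, image, command, and the claim reference name `shared-gpu` are placeholders, not values taken from the original report. Only `gpu-mps-app` comes from the ResourceClaim above.

```yaml
# Hypothetical consumer Pod; every name except gpu-mps-app is a placeholder.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-mps-consumer
spec:
  restartPolicy: Never
  containers:
  - name: cuda-app
    image: registry.example.com/private/cuda-sample:latest  # placeholder image
    command: ["nvidia-smi", "-L"]
    resources:
      claims:
      - name: shared-gpu          # must match an entry in spec.resourceClaims
  resourceClaims:
  - name: shared-gpu
    resourceClaimName: gpu-mps-app  # the ResourceClaim defined above
```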

Expected Behavior

The MPS DaemonSet should pull the private image using the same image pull secret that is used to create the rest of the DRA components.
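
To illustrate the expected outcome, the pod template of the MPS DaemonSet would carry an `imagePullSecrets` entry, roughly as in the fragment below. This is an assumption-laden sketch rather than the driver's actual template: the DaemonSet name, labels, container name, image path, and the secret name `my-registry-secret` are all hypothetical. The namespace is the one used in the debug commands in this issue.

```yaml
# Illustrative fragment only; names are placeholders, not the driver's real template.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: mps-control-daemon-example      # hypothetical name
  namespace: dra-driver-nvidia-gpu
spec:
  selector:
    matchLabels:
      app: mps-control-daemon-example   # hypothetical label
  template:
    metadata:
      labels:
        app: mps-control-daemon-example
    spec:
      imagePullSecrets:
      - name: my-registry-secret        # hypothetical; same secret the other DRA components use
      containers:
      - name: mps-control-daemon        # hypothetical container name
        image: registry.example.com/private/k8s-dra-driver-gpu:v25.12.0  # placeholder path
```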

DRA Driver Version

v25.12.0

Kubernetes Version

k8s v1.35

GPU Model

NVIDIA A100

NVIDIA Driver Version

580.105.08

OS / Kernel

No response

Container Runtime

No response

Feature Gates (non-default settings)

No response

Helm Values (non-default)

Relevant Logs

Debug Information Attached

  • kubectl get pods -n dra-driver-nvidia-gpu -o wide
  • kubectl get resourceclaims -n <namespace>
  • kubectl get resourceslices.resource.k8s.io
  • kubectl describe pod <pod> or kubectl events --for pod/<pod>
  • Kubelet plugin logs: kubectl logs -n dra-driver-nvidia-gpu -l dra-driver-nvidia-gpu-component=kubelet-plugin --all-containers --prefix --tail=400
  • nvidia-smi output from the host
  • Kubelet logs

IMEX / ComputeDomain Debug Information (if applicable)

  • Host IMEX service disabled: systemctl status nvidia-imex.service (must be masked/disabled)
  • Node clique labels: kubectl get nodes -L nvidia.com/gpu.clique
  • Per-GPU clique info: nvidia-smi -q | grep -E 'ClusterUUID|CliqueId'
  • NVLink/NVSwitch topology: nvidia-smi topo -m
  • ComputeDomain status: kubectl get computedomains.resource.nvidia.com -o yaml
  • ComputeDomainClique status: kubectl get computedomaincliques.resource.nvidia.com -o yaml
  • IMEX daemon pods: kubectl get pods -n <namespace> -l resource.nvidia.com/computeDomain
  • IMEX daemon logs: kubectl logs -n dra-driver-nvidia-gpu -l resource.nvidia.com/computeDomain --all-containers --prefix --tail=-1
  • IMEX domain status: nvidia-imex-ctl -c /etc/nvidia-imex/config.cfg -N

Status
In-Review
