Component
gpu-kubelet-plugin
Bug Description
See #958 for details.
The same issue also affects MPS daemonsets.
Steps to Reproduce
- Deploy the GPU DRA driver using an image from a private repository
- Create a ResourceClaim with the MPS sharing strategy:
apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: gpu-mps-app
spec:
  devices:
    requests:
    - name: mps-gpu
      exactly:
        deviceClassName: gpu.nvidia.com
    config:
    - requests: ["mps-gpu"]
      opaque:
        driver: gpu.nvidia.com
        parameters:
          apiVersion: resource.nvidia.com/v1beta1
          kind: GpuConfig
          sharing:
            strategy: MPS
            mpsConfig:
              defaultActiveThreadPercentage: 33
              defaultPinnedDeviceMemoryLimit: 5Gi
- Create a workload (for example, a Pod) that consumes the ResourceClaim
- Observe that the MPS daemonset pod is stuck in ImagePullBackOff
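For the step that consumes the ResourceClaim, a minimal sketch of a Pod could look like the following. Only the claim name `gpu-mps-app` comes from the manifest above. The Pod name, container name, image, command, and the local claim alias `shared-gpu` are illustrative assumptions. The Pod is assumed to be created in the same namespace as the ResourceClaim.

```yaml
# Hypothetical consumer Pod; names, image, and command are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-mps-consumer
spec:
  restartPolicy: Never
  containers:
  - name: cuda-app
    image: ubuntu:22.04
    command: ["bash", "-c", "nvidia-smi -L; sleep 3600"]
    resources:
      claims:
      - name: shared-gpu            # must match an entry in spec.resourceClaims
  resourceClaims:
  - name: shared-gpu                # Pod-local alias for the claim
    resourceClaimName: gpu-mps-app  # the ResourceClaim created in the previous step
```

Once this Pod is scheduled, the driver prepares the claim and starts the MPS daemon, which is the point at which the ImagePullBackOff shows up.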
Expected Behavior
The MPS daemonset should pull the private image using the same image pull secret that is used to create the rest of the DRA components.
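In pod-spec terms, this expectation means the MPS daemon's pod template would carry the same `imagePullSecrets` entry as the other driver pods. The fragment below is only a sketch of that shape. The secret name, container name, and image path are hypothetical, and the template the driver actually renders may be structured differently.

```yaml
# Illustrative pod-template fragment; secret name, container name, and image path are assumptions.
spec:
  template:
    spec:
      imagePullSecrets:
      - name: private-registry-creds   # same secret the other DRA components use
      containers:
      - name: mps-control-daemon
        image: registry.example.com/nvidia/k8s-dra-driver-gpu:v25.12.0
```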
DRA Driver Version
v25.12.0
Kubernetes Version
k8s v1.35
GPU Model
NVIDIA A100
NVIDIA Driver Version
580.105.08
OS / Kernel
No response
Container Runtime
No response
Feature Gates (non-default settings)
No response
Helm Values (non-default)
Relevant Logs
Debug Information Attached
- kubectl get pods -n dra-driver-nvidia-gpu -o wide
- kubectl get resourceclaims -n <namespace>
- kubectl get resourceslices.resource.k8s.io
- kubectl describe pod <pod> or kubectl events --for pod/<pod>
- kubectl logs -n dra-driver-nvidia-gpu -l dra-driver-nvidia-gpu-component=kubelet-plugin --all-containers --prefix --tail=400
- nvidia-smi output from the host
IMEX / ComputeDomain Debug Information (if applicable)
- systemctl status nvidia-imex.service (must be masked/disabled)
- kubectl get nodes -L nvidia.com/gpu.clique
- nvidia-smi -q | grep -E 'ClusterUUID|CliqueId'
- nvidia-smi topo -m
- kubectl get computedomains.resource.nvidia.com -o yaml
- kubectl get computedomaincliques.resource.nvidia.com -o yaml
- kubectl get pods -n <namespace> -l resource.nvidia.com/computeDomain
- kubectl logs -n dra-driver-nvidia-gpu -l resource.nvidia.com/computeDomain --all-containers --prefix --tail=-1
- nvidia-imex-ctl -c /etc/nvidia-imex/config.cfg -N