fix mps env var for chroot #1143
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: guptaNswati
if [ -x /driver-root/bin/sh ] || [ -x /driver-root/usr/bin/sh ]; then
  # Use chroot to avoid library mismatch between container and host
  # when driver root is / (default value) or /run/nvidia/driver (default location for driver installation by GPU Operator)
  # Export the paths explicitly for the chroot environment
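The idea in the comment above can be sketched as follows. This is a minimal illustration, not the PR's actual diff: the helper name `run_in_root`, the fallback directory values, and the use of `CUDA_MPS_LOG_DIRECTORY` alongside `CUDA_MPS_PIPE_DIRECTORY` are assumptions made for the example. It relies on the fact that `chroot` executes the command with the caller's environment, so a variable has to be exported, not merely set, to be visible inside the new root.

```shell
#!/bin/sh
# Hypothetical fallback values; in the driver these would come from the
# generated configuration rather than being hard-coded here.
: "${CUDA_MPS_PIPE_DIRECTORY:=/tmp/nvidia-mps}"
: "${CUDA_MPS_LOG_DIRECTORY:=/tmp/nvidia-log}"

# run_in_root ROOT CMD [ARGS...]
# Hypothetical helper: run CMD inside ROOT via chroot when ROOT contains a
# shell, otherwise run CMD directly in the current root.
run_in_root() {
    root=$1
    shift
    # Export explicitly so the child process inherits the variables;
    # an unexported shell variable is not passed across exec.
    export CUDA_MPS_PIPE_DIRECTORY CUDA_MPS_LOG_DIRECTORY
    if [ -x "$root/bin/sh" ] || [ -x "$root/usr/bin/sh" ]; then
        chroot "$root" "$@"
    else
        "$@"
    fi
}
```

Under these assumptions, a call such as `run_in_root /driver-root nvidia-cuda-mps-control -d` would start the control daemon with both directories visible in its environment.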
These are set via the CDI edits that we generate. Maybe you have CDI disabled on your system?
I need to check that. I have lost access to the system.
I checked again, and this issue persists even when CDI is enabled. The machine runs containerd github.com/containerd/containerd/v2 v2.2.3, where CDI is enabled by default.
=== kind-dra-1-worker ===
/etc/containerd/config.toml: enable_cdi = true
=== kind-dra-1-control-plane ===
/etc/containerd/config.toml: enable_cdi = true
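A check like the one above can be reproduced with a small helper. This is a sketch only: the function name `cdi_status` is hypothetical, and it simply greps for the `enable_cdi` key rather than parsing the TOML, so it does not tell which plugin section the key belongs to.

```shell
#!/bin/sh
# cdi_status [FILE]
# Hypothetical helper: print the first enable_cdi setting found in a
# containerd config file, or a note when the key is absent.
cdi_status() {
    conf=${1:-/etc/containerd/config.toml}
    if [ ! -r "$conf" ]; then
        echo "$conf: not readable"
        return 1
    fi
    # Match the key regardless of indentation, then strip leading whitespace.
    line=$(grep -E '^[[:space:]]*enable_cdi[[:space:]]*=' "$conf" \
        | head -n 1 | sed 's/^[[:space:]]*//')
    if [ -n "$line" ]; then
        echo "$conf: $line"
    else
        # With the key absent, containerd falls back to its built-in default.
        echo "$conf: enable_cdi not set"
    fi
}
```

On a kind cluster, where each node is a container, the same check could be run per node, for example with `docker exec kind-dra-1-worker grep enable_cdi /etc/containerd/config.toml`, assuming the nodes run under Docker.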
BTW, like you said, it's not specific to GB10. Earlier I also tested on A30.
This is what that edit command is doing, but the MPS daemon hangs with "CUDA_MPS_PIPE_DIRECTORY must be set to start or to communicate with MPS control daemon":
Starting MPS control daemon for '374aeab2-ca3c-47f8-8d7d-0872d18c1b69-2627f', with settings: &{DefaultActiveThreadPercentage:0x1782eab8b120 DefaultPinnedDeviceMemoryLimit:10Gi DefaultPerDevicePinnedMemoryLimit:map[]}
I0528 23:41:56.472948 1 mount_linux.go:259] Mounting cmd (mount) with arguments (-t tmpfs -o rw,nosuid,nodev,noexec,relatime,size=63767666k shm /var/lib/kubelet/plugins/gpu.nvidia.com/mps/374aeab2-ca3c-47f8-8d7d-0872d18c1b69-2627f/shm)
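One way to surface this kind of failure earlier is a preflight check that refuses to start the daemon when a required variable is empty. The sketch below is illustrative only; `require_env` is a hypothetical helper and is not part of the driver or of this PR.

```shell
#!/bin/sh
# require_env NAME...
# Hypothetical helper: return non-zero and name the first variable that is
# unset or empty, so a missing setting is reported before the daemon starts.
require_env() {
    for name in "$@"; do
        # Indirect lookup of the variable called $name (POSIX sh has no ${!name}).
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "missing required env: $name" >&2
            return 1
        fi
    done
    return 0
}
```

A startup script could then call `require_env CUDA_MPS_PIPE_DIRECTORY || exit 1` just before launching the control daemon, turning a crash loop into a single clear error message.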
What type of PR is this?
/kind bug
What this PR does / why we need it:
Special notes for your reviewer:
Found MPS daemon crashlooping because of missing env
Does this PR introduce a user-facing change?
Additional documentation (design docs, usage docs, etc.):
Tested on GB10 and A30.
After setting the env