[WIP] Host managed imex#1163

Open
dims wants to merge 3 commits into kubernetes-sigs:main from dims:host-managed-imex
@dims
Member

@dims dims commented Jun 1, 2026

What type of PR is this?

Please see docs/proposals/0001-host-managed-imex.md

What this PR does / why we need it:

Which issue(s) this PR is related to:

Special notes for your reviewer:

Does this PR introduce a user-facing change?


Additional documentation (design docs, usage docs, etc.):


Checklist

  • make check test passes locally
  • make check-generate passes if api/ changed (CRDs, deepcopy, informers, listers, clientset)
  • make check-modules passes if go.mod / go.sum changed
  • Tests added or updated for the change
  • Helm chart (deployments/helm) updated if flags, RBAC, or defaults changed

dims added 3 commits May 29, 2026 09:58
Adds an alpha, install-wide HostManagedIMEX feature gate for clusters
where the operator owns the host nvidia-imex daemon lifecycle. When
enabled, the driver keeps the ComputeDomain API and the DRA channel-0
injection path but stops creating per-ComputeDomain IMEX DaemonSets,
daemon ResourceClaimTemplates, daemon DeviceClasses/RBAC, and
ComputeDomain node labels.

- featuregates: register HostManagedIMEX (alpha, default false) and force
  IMEXDaemonsWithDNSNames and ComputeDomainCliques off before dependency
  validation runs.
- controller: reconcile only the workload ResourceClaimTemplate and the
  ComputeDomain finalizer; report Ready without per-node daemon tracking.
- kubelet plugin: accept only allocationMode Single/empty, reject daemon
  claims, require a non-empty NVLink clique, skip node-label add/remove,
  and omit daemon devices from the published ResourceSlice.
- helm: hide the daemon DeviceClass and daemon RBAC when the gate is on,
  using an explicit "true" check so --set-string ...=false is not treated
  as enabled.

Signed-off-by: Davanum Srinivas <davanum@gmail.com>
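
The Helm bullet in the commit message above relies on a property of Go's `text/template`, which Helm charts are written in: a bare `if` treats any non-empty string as true, so the string "false" produced by `--set-string` passes a plain truthiness check. The sketch below reproduces that with the standard library only. The value name `gate` and the `hidden`/`rendered` outputs are illustrative assumptions, not the chart's actual template.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

const (
	// Truthiness check: any non-empty string passes, including "false".
	naive = `{{ if .gate }}hidden{{ else }}rendered{{ end }}`
	// Explicit check: stringify the value, then compare against "true".
	explicit = `{{ if eq (printf "%v" .gate) "true" }}hidden{{ else }}rendered{{ end }}`
)

// render executes a one-line template against a single value exposed as
// .gate (a hypothetical name) and returns the rendered text.
func render(tmplText string, gate any) string {
	tmpl := template.Must(template.New("t").Parse(tmplText))
	var sb strings.Builder
	if err := tmpl.Execute(&sb, map[string]any{"gate": gate}); err != nil {
		panic(err)
	}
	return sb.String()
}

func main() {
	// Booleans come from --set, strings from --set-string.
	for _, v := range []any{true, false, "true", "false"} {
		fmt.Printf("%#v naive=%s explicit=%s\n", v, render(naive, v), render(explicit, v))
	}
}
```

With the string "false", the naive template hides the daemon resources while the explicit comparison still renders them, which is the case the commit message calls out.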
Operator-facing artifacts for the HostManagedIMEX alpha gate (no driver code
change):

- docs/prerequisites.md: a "Host-managed IMEX" subsection — host nvidia-imex
  must be running (not masked), channel-0 device prereqs, the two compatible
  gates are auto-forced off, and Single-only / numNodes:0 guidance.
- demo/specs/imex/host-managed/: a smoke spec (channel0 injection), a negative
  allocationMode:All spec, a DGXC GB200 Helm values overlay (skyhook toleration,
  arm64 controller pin, nvidiaDriverRoot=/run/nvidia/driver for the
  containerized driver), and a README runbook.

Signed-off-by: Davanum Srinivas <davanum@gmail.com>
A provisional, KEP-style proposal (per docs/proposals/README.md) for an alpha,
install-wide HostManagedIMEX feature gate: for clusters where the operator owns
the host nvidia-imex daemon, the driver keeps the ComputeDomain API + channel-0
DRA injection but stops creating per-ComputeDomain IMEX DaemonSets. Written
forward-looking; status provisional.

Signed-off-by: Davanum Srinivas <davanum@gmail.com>
@k8s-ci-robot
Contributor

Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Jun 1, 2026
@netlify

netlify Bot commented Jun 1, 2026

Deploy Preview for dra-driver-nvidia-gpu ready!

| Name | Link |
|---|---|
| 🔨 Latest commit | 4945384 |
| 🔍 Latest deploy log | https://app.netlify.com/projects/dra-driver-nvidia-gpu/deploys/6a1da3d2bdf55d0007f0fc85 |
| 😎 Deploy Preview | https://deploy-preview-1163--dra-driver-nvidia-gpu.netlify.app |
@k8s-ci-robot k8s-ci-robot added the needs-kind Indicates a PR lacks a `kind/foo` label and requires one. label Jun 1, 2026
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jun 1, 2026
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dims

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Jun 1, 2026
@dims
Member Author

dims commented Jun 1, 2026

/retest

@shivamerla
Contributor

@dims we need to merge this fix for the mock nvml test to work again: NVIDIA/go-nvlib#89

@dims
Member Author

dims commented Jun 1, 2026

> @dims we need to merge this fix for the mock nvml test to work again: NVIDIA/go-nvlib#89

@shivamerla Merged!

@shivamerla
Contributor

/retest


## Summary

This proposal introduces an alpha, install-wide `HostManagedIMEX` feature gate
Contributor

Is the NVIDIA driver also supposed to be host-managed in this case?


- Introduces `HostManagedIMEX` at **Alpha** stability, default `false`, applied
install-wide.
- The gate is operationally mutually exclusive with driver-managed daemon
Contributor

This is important. Can we add a check to enforce it?


### Feature gate & graduation

- Introduces `HostManagedIMEX` at **Alpha** stability, default `false`, applied
Contributor

It would be good to add a table comparing HostManaged and DriverManaged: design differences, components involved, scalability expectations, latency, etc.

@klueska
Contributor

klueska commented Jun 3, 2026

By default, I would assume having a host managed IMEX daemon would imply that the DRA driver does channel management on top of it. Meaning that each workload would be assigned a per-clique channel ID to inject into its pods.

It's fine if we (additionally) want to support a mode where users don't care about channel isolation (as this PR does, by only injecting channel 0 into all workloads), but that should be its own flag then.

If this second variant is what you want to support first, then you can create the flag, make it default to doing per-workload channel allocation, but error out in this default mode (forcing one to explicitly set it to "no-isolation" or whatever you want to call it).

So ...

HostManagedIMEX = false --> ignore IsolationStrategy
HostManagedIMEX = true, IsolationStrategy=IMEXDomain --> error
HostManagedIMEX = true, IsolationStrategy=IMEXChannel --> eventually support channel isolation (error out for now)
HostManagedIMEX = true, IsolationStrategy=None --> always inject IMEX channel 0

With all of that said, it feels like this should be actual helm options of sorts and not just feature gates. Feature gates are meant to be something that eventually has a path to being always on by default. You can protect the setting of the helm options by the feature gate, but you shouldn't just use the feature gate as a (forever) optional toggle by itself.

@shivamerla
Contributor

> With all of that said, it feels like this should be actual helm options of sorts and not just feature gates. Feature gates are meant to be something that eventually has a path to being always on by default. You can protect the setting of the helm options by the feature gate, but you shouldn't just use the feature gate as a (forever) optional toggle by itself.

A Helm option makes sense here, something like the following:

 computeDomain:
   imex:
     deployment:
       mode: driverManaged   # or hostManaged
       scope: perJob         # or perClique (static)
       isolation: channel    # or daemon

The possible combinations are driverManaged+perJob+daemon, driverManaged+perClique(static)+channel, hostManaged+perClique+daemon, and hostManaged+perClique+channel. These options also cannot be safely toggled during a Helm upgrade while active workloads are running. That is hard to enforce with Helm without pre-upgrade hooks, and even with a pre-upgrade hook there are many combinations where an upgrade can go wrong:

  • Switch to hostManaged with active CD instances (workloads + daemons) running. On upgrade, the plugin can no longer unprepare claims because it cannot clean up the previously created driverManaged daemons.
  • Switch to driverManaged with active CD instances (workloads + host daemons) running. On upgrade, the plugin will try to create and clean up daemon pods that overlap with the host daemons on subsequent prepare/unprepare calls.
  • Switch to driverManaged+perJob with active CD instances (workloads + per-clique daemons) running. On upgrade, the plugin will attempt to create overlapping daemons and will also attempt to clean up per-job daemon pods that do not exist.
  • Switch to driverManaged+perClique with active CD instances (workloads + per-job daemons) running. On upgrade, the plugin will skip cleanup of the per-job daemons that were previously prepared.
  • Switch to driverManaged+perClique+channel with active CD instances (workloads + per-job daemons) running. Channel-based isolation is not possible for workloads that are already running.
  • Switch to hostManaged+perClique+daemon with active CD instances (workloads running with per-clique scope and channel isolation enabled). Workloads will be consuming non-zero channels, and on restart the plugin will end up publishing only channel 0.

So I think these should be Helm options, but with a clear upgrade rule: changing any of mode, scope, or isolation requires draining or removing active ComputeDomains and dependent workloads first. @dims, shall I expand on this PR and add those changes?

@dims
Member Author

dims commented Jun 4, 2026

> @dims, shall I expand on this PR and add those changes?

YES please go for it.

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. needs-kind Indicates a PR lacks a `kind/foo` label and requires one. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.

Projects

Status: Backlog

5 participants