
Add DaemonSet as a way to deploy the device plugin #77

Open
michalad1 wants to merge 2 commits into main

Conversation

michalad1

Based on the docs:
https://github.com/intel/intel-device-plugins-for-kubernetes/blob/main/cmd/gpu_plugin/advanced-install.md#install-to-all-nodes

Added the possibility to deploy the device plugin as a DaemonSet without NFD and the operator.

michalad1 requested review from mythi and poussa as code owners May 8, 2025 06:43
mythi (Contributor) commented May 8, 2025

> Added the possibility to deploy the device plugin as a DaemonSet without NFD and the operator.

"Without NFD" is already possible. We need a mechanism that covers all plugins and is low maintenance. Can you explain what problems the existing setup has?

michalad1 (Author) commented May 8, 2025

> > Added the possibility to deploy the device plugin as a DaemonSet without NFD and the operator.
>
> "Without NFD" is already possible. We need a mechanism that covers all plugins and is low maintenance. Can you explain what problems the existing setup has?

AFAIK we need to install the operator in order to have it working, but, as in the mentioned docs, it is possible to just deploy the DaemonSet.
In my case we already have a lot of pods and applications, so I don't want to install additional applications.

Note: I created this PR because installation of apps in our cluster is done via Helm charts, so instead of deploying the DaemonSet manually I wanted to use the official Helm chart to do the same thing.
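
For context, a minimal sketch of the kind of values override this PR would enable (the override file name is just an example, and the value names follow this PR's proposed values.yaml, so they may still change during review):

# my-values.yaml -- passed to the chart with `helm install -f my-values.yaml ...`
# deploy the plugin as a plain DaemonSet instead of a GpuDevicePlugin CR
daemonSet:
  enabled: true
# skip the NodeFeatureRule when NFD is not installed in the cluster
nodeFeatureRule: false

With the rename suggested below, the same intent would be expressed as deployWithoutOperator: true.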

tkatila (Contributor) left a comment

Please also add a note to the README about this alternative install method.

@@ -21,3 +21,10 @@ nodeSelector:
tolerations: []

nodeFeatureRule: true

# to preserve backward compatibility
operator: true
Contributor

Suggested change:
-operator: true
+deployWithoutOperator: false

Comment on lines 28 to 30
# to deploy the device plugin as a DaemonSet
daemonSet:
enabled: false
Contributor

Remove these.

@@ -0,0 +1,79 @@
{{- if .Values.daemonSet.enabled }}
Contributor

Use deployWithoutOperator
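
In Helm template terms that gating would look roughly like this (a sketch of the suggested conditional, not the final template):

{{- if .Values.deployWithoutOperator }}
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: intel-gpu-plugin
  labels:
    app: intel-gpu-plugin
spec:
  # ... rest of the DaemonSet spec (selector, pod template, containers) ...
{{- end }}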

@@ -2,7 +2,7 @@
based on
deployments/operator/samples/deviceplugin_v1_gpudeviceplugin.yaml
*/}}

{{- if .Values.operator }}
Contributor

Use ! deployWithoutOperator
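
That is, the operator-managed resource would render only when the DaemonSet mode is off; in Helm the negation is written with `not`, roughly as below (the apiVersion shown follows the operator's sample CR naming and should be checked against the actual template):

{{- if not .Values.deployWithoutOperator }}
apiVersion: deviceplugin.intel.com/v1
kind: GpuDevicePlugin
# ... existing GpuDevicePlugin spec, unchanged ...
{{- end }}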

path: /var/run/cdi
type: DirectoryOrCreate
nodeSelector:
kubernetes.io/arch: amd64
Contributor

This should use nodeSelector from values.

michalad1 (Author) May 8, 2025

Okay, but the default value is:
intel.feature.node.kubernetes.io/gpu: 'true'

and I could not find a way to replace this selector with a different one.
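
One common pattern, sketched here as a suggestion rather than as the maintainers' preferred approach, is to keep the NFD label as the chart default and let users override it from their own values file:

# values.yaml default (unchanged, keeps NFD-based node selection)
nodeSelector:
  intel.feature.node.kubernetes.io/gpu: 'true'

# example user override for clusters without NFD labels
# nodeSelector:
#   kubernetes.io/arch: amd64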

metadata:
labels:
app: intel-gpu-plugin
spec:
Contributor

I'd like to have the tolerations defined here as well.
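
A sketch of how the DaemonSet pod spec could pick up both the nodeSelector and the tolerations from values (indentation is approximate and depends on where this sits in the template):

    spec:
      {{- with .Values.nodeSelector }}
      nodeSelector:
        {{- toYaml . | nindent 8 }}
      {{- end }}
      {{- with .Values.tolerations }}
      tolerations:
        {{- toYaml . | nindent 8 }}
      {{- end }}
      containers:
        - name: intel-gpu-plugin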

Comment on lines 29 to 32
- name: HOST_IP
valueFrom:
fieldRef:
fieldPath: status.hostIP
Contributor

You can drop this. It's only used with resourceManager which is being EoL'd.

spec:
containers:
- name: intel-gpu-plugin
env:
Contributor

Please also add support for configuring the GPU plugin with its different modes: sharedDevNum, enableMonitoring, allocationPolicy and logLevel. No need for "resourceManager" as it's being EoL'd.
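
A hedged sketch of how those options could be wired through the chart; the value names are placeholders and the flags follow the GPU plugin's documented command-line options, so they should be verified against the plugin version the chart targets:

# values.yaml (illustrative names)
sharedDevNum: 1
enableMonitoring: false
allocationPolicy: "none"   # none, balanced or packed
logLevel: 2

# DaemonSet template, container args
          args:
            - "-shared-dev-num={{ .Values.sharedDevNum }}"
            - "-allocation-policy={{ .Values.allocationPolicy }}"
            - "-v={{ .Values.logLevel }}"
            {{- if .Values.enableMonitoring }}
            - "-enable-monitoring"
            {{- end }}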

michalad1 requested a review from tkatila May 8, 2025 10:39
mythi (Contributor) commented May 9, 2025

> > > Added the possibility to deploy the device plugin as a DaemonSet without NFD and the operator.
> >
> > "Without NFD" is already possible. We need a mechanism that covers all plugins and is low maintenance. Can you explain what problems the existing setup has?
>
> AFAIK we need to install the operator in order to have it working, but, as in the mentioned docs, it is possible to just deploy the DaemonSet. In my case we already have a lot of pods and applications, so I don't want to install additional applications.
>
> Note: I created this PR because installation of apps in our cluster is done via Helm charts, so instead of deploying the DaemonSet manually I wanted to use the official Helm chart to do the same thing.

I don't see enough justification to accept the maintenance burden, especially in this repo, which is decoupled from the original reference YAML we have, and as long as it's GPU-only.
