Add daemon set as a way to deploy device plugin. #77

base: main
Conversation
"without NFD" is already possible. We need a mechanism that covers all plugins and is low maintenance. Can you explain what problems the existing setup has?
AFAIK we need to install the operator in order to have it working, but as in the mentioned docs it is possible to just deploy a DaemonSet. Note: I created this PR because installation of apps in our cluster is done via Helm charts, so instead of deploying the DaemonSet manually I wanted to use the official Helm chart to do the same thing.
Please also add a note to README about this alternative install method.
charts/gpu-device-plugin/values.yaml (outdated)

```diff
@@ -21,3 +21,10 @@ nodeSelector:
 tolerations: []
+
+nodeFeatureRule: true
+
+# to preserve backward compatibility
+operator: true
```
Suggested change:

```diff
-operator: true
+deployWithoutOperator: false
```
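With the suggested rename, the chart's values could look like the sketch below. This is only an illustration of the naming proposed in this review; the final keys and defaults are still under discussion in the thread:

```yaml
# charts/gpu-device-plugin/values.yaml (sketch; key names follow the review suggestion)
nodeSelector: {}
tolerations: []

# false (default): deploy the GpuDevicePlugin CR and rely on the operator.
# true: render a plain DaemonSet instead, so no operator (or NFD) is required.
deployWithoutOperator: false
```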
charts/gpu-device-plugin/values.yaml (outdated)

```yaml
# to deploy the device plugin as a DaemonSet
daemonSet:
  enabled: false
```
Remove these.
```diff
@@ -0,0 +1,79 @@
+{{- if .Values.daemonSet.enabled }}
```
Use `deployWithoutOperator` here.
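With the reviewer's suggestion applied, the DaemonSet template would be guarded as sketched below. The manifest fields shown are illustrative, not the PR's actual template; only the `{{- if }}` guard is the point here:

```yaml
# templates/daemonset.yaml (sketch; deployWithoutOperator is the key proposed in this review)
{{- if .Values.deployWithoutOperator }}
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: intel-gpu-plugin
spec:
  selector:
    matchLabels:
      app: intel-gpu-plugin
  template:
    metadata:
      labels:
        app: intel-gpu-plugin
    spec:
      containers:
      - name: intel-gpu-plugin
        image: intel/intel-gpu-plugin:{{ .Chart.AppVersion }}  # illustrative image reference
{{- end }}
```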
```diff
@@ -2,7 +2,7 @@
 based on
 deployments/operator/samples/deviceplugin_v1_gpudeviceplugin.yaml
 */}}

+{{- if .Values.operator }}
```
Use `! deployWithoutOperator` here.
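In Helm template syntax, the inverted check the reviewer asks for would read as sketched below, using the `deployWithoutOperator` key proposed above (the surrounding CR fields are illustrative):

```yaml
# templates/gpudeviceplugin.yaml (sketch)
{{- if not .Values.deployWithoutOperator }}
apiVersion: deviceplugin.intel.com/v1
kind: GpuDevicePlugin
metadata:
  name: gpudeviceplugin
{{- end }}
```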
```yaml
          path: /var/run/cdi
          type: DirectoryOrCreate
      nodeSelector:
        kubernetes.io/arch: amd64
```
This should use `nodeSelector` from values.
Okay, but the default value is `intel.feature.node.kubernetes.io/gpu: 'true'`, and I could not find a way to replace this selector with a different one.
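One way to reconcile both comments would be to render the selector from values while falling back to the hard-coded default, as in this sketch (the exact keys and indentation depend on the final template):

```yaml
      # sketch: honor .Values.nodeSelector, else keep the current default
      nodeSelector:
        {{- if .Values.nodeSelector }}
        {{- toYaml .Values.nodeSelector | nindent 8 }}
        {{- else }}
        kubernetes.io/arch: amd64
        {{- end }}
```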
```yaml
    metadata:
      labels:
        app: intel-gpu-plugin
    spec:
```
I'd like to have the tolerations defined here as well.
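Rendering the tolerations from values could look like this sketch, mirroring the `tolerations: []` default already present in the chart's values.yaml:

```yaml
    spec:
      # sketch: pass through tolerations from values
      tolerations:
        {{- toYaml .Values.tolerations | nindent 8 }}
```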
```yaml
        - name: HOST_IP
          valueFrom:
            fieldRef:
              fieldPath: status.hostIP
```
You can drop this. It's only used with `resourceManager`, which is being EoL'd.
```yaml
    spec:
      containers:
      - name: intel-gpu-plugin
        env:
```
Please also add support for configuring the GPU plugin's different modes: `sharedDevNum`, `enableMonitoring`, `allocationPolicy` and `logLevel`. No need for `resourceManager`, as it's being EoL'd.
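Wiring those values through to the plugin's command line might look like the sketch below. The flag names (`-shared-dev-num`, `-enable-monitoring`, `-allocation-policy`, `-v`) are assumed from the upstream GPU plugin's manifests and should be verified against the target plugin version:

```yaml
# values.yaml (sketch)
sharedDevNum: 1
enableMonitoring: false
allocationPolicy: none
logLevel: 2
```

```yaml
# templates/daemonset.yaml container args (sketch)
        args:
          - "-shared-dev-num={{ .Values.sharedDevNum }}"
          - "-v={{ .Values.logLevel }}"
          {{- if .Values.enableMonitoring }}
          - "-enable-monitoring"
          {{- end }}
          {{- if .Values.allocationPolicy }}
          - "-allocation-policy={{ .Values.allocationPolicy }}"
          {{- end }}
```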
I don't see enough justification to accept the maintenance burden, especially in this repo, which is decoupled from the original reference YAML we have, and as long as it's GPU-only.
Based on the docs:
https://github.com/intel/intel-device-plugins-for-kubernetes/blob/main/cmd/gpu_plugin/advanced-install.md#install-to-all-nodes
this adds the possibility to deploy the device plugin as a DaemonSet, without NFD and the operator.
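If this PR were merged with the naming suggested in review, installation from this repo's chart might look like the following hypothetical invocation (the release name and flag are assumptions from this thread, not a documented command):

```shell
# sketch: install the chart with the operator-less DaemonSet path enabled
helm install gpu-plugin ./charts/gpu-device-plugin \
  --set deployWithoutOperator=true
```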