
Conversation

@olevski
Contributor

@olevski olevski commented Oct 25, 2021

I noticed that there does not seem to be an option to add tolerations, affinities, etc.

This is definitely useful when you have nodes with taints but still want the daemonSets (for the node plugins) to run on all of your nodes. I could not find any way to do this in the current state of the code.

So I followed how affinities/tolerations/nodeSelectors are added when you create a helm chart with the helm CLI (e.g. helm create foo). Unlike the helm CLI template, though, I separated the values for the DaemonSets from everything else (i.e. the controllers/attachers). For example, you may want to run the controller on a specific node but the DaemonSet on all nodes (or on more nodes).
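
Roughly, the idea on the values side looks like this (the key names follow the standard helm create scaffolding; the taint key below is only a placeholder):

# values.yaml sketch for a controller-style component (e.g. the dataset operator)
nodeSelector: {}
affinity: {}
tolerations: []

# values.yaml sketch for a node-plugin DaemonSet, so it can also run on tainted nodes
tolerations:
  - key: example.com/dedicated   # placeholder taint key
    operator: Exists
    effect: NoSchedule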

Let me know what you think. I am happy to make adjustments.

Signed-off-by: Tasko Olevski [email protected]

@olevski
Contributor Author

olevski commented Oct 25, 2021

The build of the ppc64le image errored out. It seems to have just stalled or timed out. I think just a simple rerun should help because I did not touch any of the Docker images with this PR.

@YiannisGkoufas
Collaborator

Thanks @olevski for the PR, I will have a look and let you know!

@leonardonormando

About NodeAffinity: I've got the following error when trying to mount a volume in a pod with Affinity and/or Tolerations set:

MountVolume.MountDevice failed for volume "pvc-XXX" : kubernetes.io/csi: attacher.MountDevice failed to create newCsiDriverClient: driver name ch.ctrox.csi.s3-driver not found in the list of registered CSI drivers

I tried setting the nodeAffinity in the Persistent Volume associated with the dataset, but no luck.

If I clear the affinity and tolerations from the pod, the volume is mounted correctly.

@olevski
Contributor Author

olevski commented Dec 9, 2021

About NodeAffinity: I've got the following error when trying to mount a volume in a pod with Affinity and/or Tolerations set:

MountVolume.MountDevice failed for volume "pvc-XXX" : kubernetes.io/csi: attacher.MountDevice failed to create newCsiDriverClient: driver name ch.ctrox.csi.s3-driver not found in the list of registered CSI drivers

I tried setting the nodeAffinity in the Persistent Volume associated with the dataset, but no luck.

If I clear the affinity and tolerations from the pod, the volume is mounted correctly.

@leonardonormando I am not sure I fully follow what you did.

This PR adds NodeAffinity for several different components. For some of them (like the plugins) this is probably not very useful. The plugins are DaemonSets and they run on all nodes, so setting the NodeAffinity there is a bit redundant. But setting the toleration there is useful, especially if you have nodes with specific taints but you would like to use datashim in the pods that run on those nodes.

Can you provide the values.yaml file you used to deploy datashim so we are clear which affinities and tolerations you set?

I tried setting the nodeAffinity in the Persistent Volume associated with the dataset, but no luck.

Huh, I did not even know you could set a NodeAffinity on a PV (https://kubernetes.io/docs/concepts/storage/persistent-volumes/#node-affinity), but it turns out that indeed you can. This PR does not assign affinities on the PVs, only on the deployments/daemonsets/statefulsets. According to the docs, when you set a NodeAffinity on a volume it limits which nodes can access said volume. So if you specified a nodeAffinity for node A on the PV but the pod is running on some other node B, then it makes sense why you could not access it.
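
For reference, per the linked docs, node affinity on a PV looks roughly like this (a fragment under the PV's spec; node-a is just a placeholder):

spec:
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - node-a   # only pods running on node-a can use this PV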

If you share your values file and the manifests for the pod and the dataset you are trying to run, that will be helpful. Otherwise, without more info, I can only guess what the problem is.

@leonardonormando

Sorry! I should have been more clear about the setup:

I have not tested the helm chart (yet) and used the manifest in the repository as-is (dlf.yaml) to deploy Datashim.

I know this PR is about setting nodeAffinity via the helm chart, but I thought it would be beneficial to bring this discussion over here.

If I use these manifests for dataset and pod creation it works as expected:

apiVersion: com.ie.ibm.hpsys/v1alpha1
kind: Dataset
metadata:
  namespace: sandbox
  name: dataset-sandbox
spec:
  local:
    type: 'COS'
    accessKeyID: 'XXX'
    secretAccessKey: 'XXX'
    endpoint: 'http://minio.minio.svc.cluster.local:9000'
    bucket: 'sandbox-bucket'

---
apiVersion: v1
kind: Pod
metadata:
  name: test-bucket-mount
  namespace: sandbox
spec:
  volumes:
    - name: test-mount
      persistentVolumeClaim:
        claimName: dataset-sandbox
  containers:
    - name: test
      image: ubuntu:20.04
      command: ['/bin/sleep', '3650d']
      volumeMounts:
        - mountPath: '/test'
          name: test-mount

However, if I set any affinity or toleration on the pod, the volume can't be mounted and I get the error reported above.

Am I missing something?

Thanks!

@olevski
Contributor Author

olevski commented Dec 14, 2021

However, if I set any affinity or toleration on the pod, the volume can't be mounted and I get the error reported above.

@leonardonormando to me what you describe makes sense. I am pretty sure that setting the affinity and toleration on the pod that is using the dataset makes it get scheduled on a node where the daemonset (which is supposed to run on all nodes) is actually not running. And because the daemonset pod is not running on that node, things don't work.

When you remove the affinity and toleration from your test pod, it gets scheduled on a node where the datashim daemonset pods are running, and things work as expected.

This PR aims to fix exactly this problem by adding the ability to assign a toleration on the daemonsets so they truly run on all nodes.

You can use kustomize on the rendered manifests you deploy to add a toleration so that the daemonsets are scheduled on all nodes. Then things should work as expected.
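
For example, a kustomization along these lines should do it (assuming dlf.yaml is saved locally; the csi-s3 daemonset is shown here and the other node-plugin daemonsets can be patched the same way; the taint key is a placeholder):

# kustomization.yaml
resources:
  - dlf.yaml
patches:
  - target:
      kind: DaemonSet
      name: csi-s3
      namespace: dlf
    patch: |-
      - op: add
        path: /spec/template/spec/tolerations
        value:
          - key: example.com/dedicated   # replace with the taint key on your nodes
            operator: Exists
            effect: NoSchedule

Then kubectl apply -k . deploys the patched manifests.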

@olevski
Contributor Author

olevski commented Dec 14, 2021

The daemonset that I am referring to is here: https://github.com/datashim-io/datashim/blob/master/release-tools/manifests/dlf.yaml#L1059. This is where you should add the same toleration you are adding in your pod that is trying to use the dataset.
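
Concretely, the toleration goes under the daemonset's pod template spec, something like this (placeholder key; it must match the taint on your nodes):

spec:
  template:
    spec:
      tolerations:
        - key: example.com/dedicated   # the same taint your pod tolerates
          operator: Exists
          effect: NoSchedule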

@leonardonormando

leonardonormando commented Dec 15, 2021

Thank you very much, @olevski, for the detailed explanation. Indeed, that solved the problem.

As soon as I can, I will also test the helm chart for this PR.

Thanks @YiannisGkoufas and @olevski for such great work!

@srikumar003
Collaborator

Hi @olevski thanks for the PR and apologies for taking so long to get around to it. The ability to add tolerations is very useful. I also like how the values are distributed over the different components.

However, I have a few issues:

  1. The nodeSelector is useful for the operator and generate-keys. In the case of the csi-* plugins, however, it excludes the daemonSets from some nodes. This could lead to the situation demonstrated by @leonardonormando, where a pod with a dataset label is scheduled to a node without the required daemonSets. I suggest we remove nodeSelector from the charts (except for dataset-operator).

     However, if the controller were aware that some nodes are not running the required csi-* plugins, it could add a nodeSelector to the dataset/PV. We could then use the webhook to mutate the Pods with a nodeSelector. This could be a potentially useful feature, though I cannot currently imagine a scenario for it. Please add an issue if you think this is useful.

  2. As you mentioned, affinities are redundant for daemonSets. Let's remove them.

  3. The attachers and provisioners of the plugins have separate blocks for taints. This could lead to problems with deployment configuration. I'll mention this in my review comments on the files.

If you can modify and resubmit, I'll review and accept the PR. Thanks again! (@leonardonormando thank you for the comments and please do let us know if you have tested the helm chart)

@olevski
Contributor Author

olevski commented Jan 18, 2022

@srikumar003 thanks for the review. I am making the changes right now.

On a slightly unrelated note, is it possible to have a quick chat with you and/or whoever else is maintaining this project? We are very interested in using this over at https://renkulab.io/ but we are a bit unclear on the long-term goal/plan for datashim.io. Can we touch base via email in this regard? My email is [email protected].

Signed-off-by: Tasko Olevski <[email protected]>
@olevski olevski force-pushed the add-affinities-tollerations branch from 669e10a to f7d2b4a on January 18, 2022 16:43
@olevski olevski requested a review from srikumar003 January 18, 2022 16:44
Collaborator

@srikumar003 srikumar003 left a comment

Tested on minikube as follows and it works!

$ minikube start --nodes 2 -p multinode-demo

$ kubectl get nodes
NAME                 STATUS   ROLES                  AGE    VERSION
multinode-demo       Ready    control-plane,master   8m1s   v1.22.3
multinode-demo-m02   Ready    <none>                 7m4s   v1.22.3

$ kubectl taint node multinode-demo node-role.kubernetes.io/master:NoSchedule

$ make deployment

$ kubectl get pods -n dlf -o wide
NAME                                READY   STATUS             RESTARTS      AGE     IP             NODE                 NOMINATED NODE   READINESS GATES
csi-attacher-nfsplugin-0            2/2     Running            0             3m15s   10.244.1.5     multinode-demo-m02   <none>           <none>
csi-attacher-s3-0                   1/1     Running            0             3m15s   10.244.1.7     multinode-demo-m02   <none>           <none>
csi-controller-h3-0                 2/3     CrashLoopBackOff   1 (11s ago)   3m15s   10.244.1.6     multinode-demo-m02   <none>           <none>
csi-nodeplugin-h3-9gz8t             2/2     Running            0             3m15s   192.168.58.3   multinode-demo-m02   <none>           <none>
csi-nodeplugin-nfsplugin-m7s55      2/2     Running            0             3m15s   192.168.58.3   multinode-demo-m02   <none>           <none>
csi-provisioner-s3-0                1/1     Running            0             3m15s   10.244.1.8     multinode-demo-m02   <none>           <none>
csi-s3-6mztz                        2/2     Running            0             3m15s   10.244.1.3     multinode-demo-m02   <none>           <none>
dataset-operator-644f75fcbb-qz9nb   1/1     Running            0             3m15s   10.244.1.4     multinode-demo-m02   <none>           <none>

$ cat chart/charts/dataset-operator-chart/values.yaml
…
nodeSelector:
  node-role.kubernetes.io/master: ''
tolerations:
  - effect: NoSchedule
    key: node-role.kubernetes.io/master
affinity: {}

$ cat chart/charts/csi-s3-chart/values.yaml 
…
tolerations:
  - effect: NoSchedule
    key: node-role.kubernetes.io/master

$ make manifests

$ make deployment

$  kubectl get pods -n dlf -o wide
NAME                                READY   STATUS              RESTARTS   AGE    IP             NODE                 NOMINATED NODE   READINESS GATES
csi-attacher-nfsplugin-0            2/2     Running             0          4s     10.244.1.16    multinode-demo-m02   <none>           <none>
csi-attacher-s3-0                   0/1     ContainerCreating   0          4s     <none>         multinode-demo       <none>           <none>
csi-controller-h3-0                 2/2     Running             0          4s     10.244.1.17    multinode-demo-m02   <none>           <none>
csi-nodeplugin-h3-g8l7q             2/2     Running             0          104s   192.168.58.3   multinode-demo-m02   <none>           <none>
csi-nodeplugin-nfsplugin-29r4r      2/2     Running             0          104s   192.168.58.3   multinode-demo-m02   <none>           <none>
csi-provisioner-s3-0                0/1     ContainerCreating   0          4s     <none>         multinode-demo       <none>           <none>
csi-s3-4zdxj                        2/2     Running             0          103s   10.244.1.15    multinode-demo-m02   <none>           <none>
csi-s3-zmjnv                        2/2     Running             0          103s   10.244.0.3     multinode-demo       <none>           <none>
dataset-operator-7654bd6c44-h8cd7   0/1     Init:0/1            0          4s     <none>         multinode-demo       <none>           <none>

@srikumar003 srikumar003 merged commit c2f4492 into datashim-io:master Jan 19, 2022
srikumar003 pushed a commit that referenced this pull request Jan 19, 2022
* feat(chart): add nodeSelector, tolerations, affinity

Signed-off-by: Tasko Olevski <[email protected]>

* squashme: address comments

Signed-off-by: Tasko Olevski <[email protected]>