This repository was archived by the owner on Oct 22, 2024. It is now read-only.

Persistency model changes #177

Merged
merged 6 commits into intel:devel from csi-1.0+cache on Mar 4, 2019

Conversation

avalluri
Contributor

Currently supported persistency types are:

  • cache volume
  • ephemeral volume

These changes are on top of CSI v1.0, so they might not work on Kubernetes < v1.13. The idea is that we maintain the pre-CSI-1.0 changes in a branch (csi-0.0.3). I will review the supported persistency models within the limitations of CSI v0.0.3 and submit a PR to the csi-0.0.3 branch.

Contributor

@pohly pohly left a comment

So far I've only looked at the documentation to figure out how this is meant to work.

README.md Outdated
volume gets created initially, without actually allocating space. Then
when an application starts and it is time to mount that "existing"
volume, an attempt is made to allocate the space required for the
volume locally. Only if there is enough space can application startup
Contributor

Can pmem-csi-sequence-diagram.png be deleted now?

Contributor Author

Hmm, maybe yes. I was thinking of reusing it to illustrate the normal persistent volume type, but that may not be needed, since Kubernetes users already know how normal dynamic PVs work.

Contributor

But the current diagram isn't the "normal" PVC, is it? It's the fake PVC with delayed allocation on pod startup.

Do you intend to keep this special mode? I thought we agreed to implement the normal PVC mode (allocate space in CreateVolume) with low priority, because it isn't really needed.

Contributor Author

My intention was to reuse the old diagram, with the changes needed to match the normal PVC flow. There is no fake/delayed volume creation anymore.

Please review the PR code to see how cache and ephemeral volumes are actually implemented.

Contributor

As I said in my other comment, it is easier to review how this is meant to work by reading the user-visible documentation for it.

It makes sense to update the existing diagram. Right now it still shows that CreateVolume doesn't actually create a volume. Do we agree that after merging this PR, CreateVolume will end up creating the pmem device on some node?

@pohly
Contributor

pohly commented Feb 25, 2019 via email

@avalluri
Contributor Author

@pohly, I updated the document with usage details; can you check whether it helps?

@avalluri
Contributor Author

@okartau and @pohly, I updated the PR with:

  • reverted ephemeral support
  • merged the default {identity,node}servers code with the actual implementation
  • updated the documentation with the persistency models and their usage.

@@ -48,6 +48,10 @@ TEST_PMEM_MEM_SIZE=32768 # 32GB
TEST_PMEM_SHARE=on
TEST_PMEM_LABEL_SIZE=2097152

# Kubernetes feature gates to enable/disable
# featurename=true,feature=false
TEST_FEATURE_GATES="CSINodeInfo=true,CSIDriverRegistry=true"
Contributor

The installation of the corresponding CRDs also belongs into the cluster setup. Currently the PR has that as part of the driver installation (like in some older CSI documentation), but that's conceptually the wrong place and was just a workaround for cluster setup tools not knowing about those CRDs yet.

Contributor Author

agree. Done
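
For reference, a minimal sketch of a kubeadm configuration that would enable these feature gates during cluster setup; the apiVersion and field names follow the kubeadm and kubelet v1beta1 config APIs, and the file below is an illustration rather than the exact content of the test scripts:

apiVersion: kubeadm.k8s.io/v1beta1
kind: ClusterConfiguration
kubernetesVersion: v1.13.3            # illustrative version
apiServer:
  extraArgs:
    feature-gates: "CSINodeInfo=true,CSIDriverRegistry=true"
---
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  CSINodeInfo: true
  CSIDriverRegistry: true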

Contributor

@pohly pohly left a comment

Commit message e81ec79:

the newly created volume to that node by using 'Volume.Topology'. OC has to make

s/OC/The container orchestrator/

Let's avoid acronyms, the reader might not know them. It was spelled wrong, too.

Unlike normal persistent volume, Cache volume creates set of pmem volumes on

You've spelled "Cache" and further down also "Persistent" with a capital letter when talking about the model. I think "cache volume" and "persistent volume" are normal English words and not specific names, so normal capitalization seems more appropriate.

deploy/kubernetes-1.13/pmem-pvc.yaml says storageClassName: pmem-csi-sc # defined in pmem-csi.yaml. That should say deploy/kubernetes-1.13/pmem-storageclass.yaml.

I just brought up a cluster based on e81ec79 and followed the example (cat deploy/kubernetes-1.13/pmem-pvc.yaml | _work/ssh-clear-kvm kubectl create -f -). The resulting PV has three entries in its topology information:

$ kubectl get pv/pvc-b0354387-3c0d-11e9-895f-deadbeef0100 -o yaml
...
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - host-3
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - host-1
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - host-2

That doesn't look right. I did make push-images before rebuilding the cluster, so I should be running the latest driver. However, this is hard to verify because the driver does not identify itself in the log output and has no unique version number. I've filed #183 to track this.

From the logs it looks like the driver treated this volume as a cache volume?

$ kubectl log pmem-csi-controller-0 pmem-driver
...
I0301 10:35:10.049651       1 glog.go:58] GRPC call: /csi.v1.Controller/CreateVolume
I0301 10:35:10.049673       1 glog.go:58] GRPC request: name:"pvc-b0354387-3c0d-11e9-895f-deadbeef0100" capacity_range:<required_bytes:8589934592 > volume_capabilities:<mount:<fs_type:"ext4" > access_mode:<mode:SINGLE_NODE_WRITER > > accessibility_requirements:<requisite:<segments:<key:"kubernetes.io/hostname" value:"host-0" > > requisite:<segments:<key:"kubernetes.io/hostname" value:"host-1" > > requisite:<segments:<key:"kubernetes.io/hostname" value:"host-2" > > requisite:<segments:<key:"kubernetes.io/hostname" value:"host-3" > > preferred:<segments:<key:"kubernetes.io/hostname" value:"host-0" > > preferred:<segments:<key:"kubernetes.io/hostname" value:"host-1" > > preferred:<segments:<key:"kubernetes.io/hostname" value:"host-2" > > preferred:<segments:<key:"kubernetes.io/hostname" value:"host-3" > > > 
I0301 10:35:10.049861       1 glog.go:79] CreateVolume: Name: pvc-b0354387-3c0d-11e9-895f-deadbeef0100 req.Required: 8589934592 req.Limit: 0
I0301 10:35:10.049965       1 glog.go:79] Connecting to 192.168.8.8:10001
I0301 10:35:10.114505       1 glog.go:79] node host-3 has requested capacity
I0301 10:35:10.114536       1 glog.go:79] Connecting to 192.168.8.4:10001
I0301 10:35:10.181366       1 glog.go:79] node host-1 has requested capacity
I0301 10:35:10.181404       1 glog.go:79] Connecting to 192.168.8.6:10001
I0301 10:35:10.249569       1 glog.go:79] node host-2 has requested capacity
I0301 10:35:10.249621       1 glog.go:79] Found nodes: [host-3 host-1 host-2]
I0301 10:35:10.249667       1 glog.go:79] CreateVolume: Record new volume as {b03c6245-3c0d-11e9-ba0d-26c6b0ca2305 pvc-b0354387-3c0d-11e9-895f-deadbeef0100 8589934592 1  true fsdax}

Some more log output would be useful to understand what the driver is doing (which mode, which parameters, etc.).
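
For comparison, a normal persistent volume that is pinned to a single node would be expected to carry only one node selector term in its topology, roughly along these lines (hostname purely illustrative):

  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - host-1          # a single node, not three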

README.md Outdated

Below are the volume persistency models considered for implementing by PMEM-CSI to serve different application use cases:

* Persistent Volumes</br>
Contributor

Resorting to HTML is unnecessary, plain Markdown also works.

Just use two spaces:

* Persistent Volumes  
A volume gets created...

https://meta.stackexchange.com/questions/72395/is-it-possible-to-have-definition-lists-in-markdown/298872#298872

Contributor Author

Thanks for the tip. I fixed all the </br>s.

README.md Outdated

Kubernetes cluster administrators can expose above mentioned [volume persistency types](#volume-persistency) to applications using [`StorageClass Parameters`](https://kubernetes.io/docs/concepts/storage/storage-classes/#parameters). An optional `persistencyModel` parameter differentiates how the provisioned volume can be used.

* if no `persistencyModel` parameter specified in `StorageClass` then it is treated as normal Kubernetes persistent volume. In this case PMEM-CSI creates PMEM volume on a node and the application that claims to use this volume is supposed to schedule on this node by Kubernetes. Choosing of node is depend on StorageClass `volumeBindingMode`. In case of `volumeBindingMode: Immediate` PMEM-CSI chooses a node randomly, and in case of `volumeBindingMode: WaitForFirstConsumer` Kubernetes first chooses a node for scheduling the application, and PMEM-CSI creates the volume on that node. Applications which claim a normal persistent volume has to use `ReadOnlyOnce` access mode in its `accessModes` list. This [diagram](/docs/images/sequence/pmem-csi-persistent-sequence-diagram.png) illustrates how a normal persistent volume gets provisioned in Kubernetes using PMEM-CSI driver.
Contributor

s/on this node/be scheduled onto this node/

Contributor Author

Done
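
To illustrate the paragraph quoted above, a normal persistent volume could be requested with a StorageClass and PVC roughly like the following sketch; the provisioner string and the requested size are assumptions for illustration, not the exact contents of the deploy/kubernetes-1.13 files:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: pmem-csi-sc                # referenced by the PVC below
provisioner: pmem-csi.intel.com    # assumed CSI driver name
volumeBindingMode: WaitForFirstConsumer
# no persistencyModel parameter -> normal persistent volume
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pmem-pvc
spec:
  accessModes:
  - ReadWriteOnce                  # corresponds to SINGLE_NODE_WRITER in CSI
  resources:
    requests:
      storage: 8Gi
  storageClassName: pmem-csi-sc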

@avalluri
Contributor Author

avalluri commented Mar 1, 2019

That doesn't look right. I did make push-images before rebuilding the cluster, so I should be running the latest driver. However, this is hard to verify because the driver does not identify itself in the log output and has no unique version number. I've filed #183 to track this.

From the logs it looks like the driver treated this volume as a cache volume?

$ kubectl log pmem-csi-controller-0 pmem-driver
...
I0301 10:35:10.049651       1 glog.go:58] GRPC call: /csi.v1.Controller/CreateVolume
I0301 10:35:10.049673       1 glog.go:58] GRPC request: name:"pvc-b0354387-3c0d-11e9-895f-deadbeef0100" capacity_range:<required_bytes:8589934592 > volume_capabilities:<mount:<fs_type:"ext4" > access_mode:<mode:SINGLE_NODE_WRITER > > accessibility_requirements:<requisite:<segments:<key:"kubernetes.io/hostname" value:"host-0" > > requisite:<segments:<key:"kubernetes.io/hostname" value:"host-1" > > requisite:<segments:<key:"kubernetes.io/hostname" value:"host-2" > > requisite:<segments:<key:"kubernetes.io/hostname" value:"host-3" > > preferred:<segments:<key:"kubernetes.io/hostname" value:"host-0" > > preferred:<segments:<key:"kubernetes.io/hostname" value:"host-1" > > preferred:<segments:<key:"kubernetes.io/hostname" value:"host-2" > > preferred:<segments:<key:"kubernetes.io/hostname" value:"host-3" > > > 
I0301 10:35:10.049861       1 glog.go:79] CreateVolume: Name: pvc-b0354387-3c0d-11e9-895f-deadbeef0100 req.Required: 8589934592 req.Limit: 0
I0301 10:35:10.049965       1 glog.go:79] Connecting to 192.168.8.8:10001
I0301 10:35:10.114505       1 glog.go:79] node host-3 has requested capacity
I0301 10:35:10.114536       1 glog.go:79] Connecting to 192.168.8.4:10001
I0301 10:35:10.181366       1 glog.go:79] node host-1 has requested capacity
I0301 10:35:10.181404       1 glog.go:79] Connecting to 192.168.8.6:10001
I0301 10:35:10.249569       1 glog.go:79] node host-2 has requested capacity
I0301 10:35:10.249621       1 glog.go:79] Found nodes: [host-3 host-1 host-2]
I0301 10:35:10.249667       1 glog.go:79] CreateVolume: Record new volume as {b03c6245-3c0d-11e9-ba0d-26c6b0ca2305 pvc-b0354387-3c0d-11e9-895f-deadbeef0100 8589934592 1  true fsdax}

I doubt that your cluster is built on e81ec79 . The log message you shared

node host-3 has requested capacity

is not present in this commit. Could you please verify this?

avalluri added 4 commits March 1, 2019 15:18
CSI v0.0.3 is deprecated and not supported by future Kubernetes releases.
CSI 1.0 is not backward compatible with older (<= 0.0.3) CSI releases.

Kubernetes releases earlier than v1.13 do not support CSI 1.0, so we cannot
support older Kubernetes releases.

Made changes to the driver deployment files to support CSI 1.0 on Kubernetes v1.13:
 - dropped the deployment files under kubernetes-1.{11,12}
 - updated the sidecar versions to v1.0.1

Signed-off-by: Amarnath Valluri <[email protected]>
These features are needed to make the driver fully topology aware.
external-provisioner reads topology keys from CSINodeInfo objects and uses those
keys to prepare CreateVolumeRequest.TopologyRequirement.

Made changes to the test cluster setup script to enable these feature gates via
a kubeadm configuration file while setting up the cluster.

Signed-off-by: Amarnath Valluri <[email protected]>
The motivation behind this change is:
- it makes the node controller self-contained, i.e., each node can create and
  delete volumes independently of the master controller.
- it clearly divides which parts are handled by the master and by the node; for
  example, nothing related to DeviceManager should be in the master code.
- it is an enabler for implementing fully topology-aware provisioning by
  dropping the workaround of creating the pmem device on attachment.
These default generic implementations were inherited from the deprecated
kubernetes-csi/drivers csi-common package, which is not needed anymore. Merging
them with their derived implementations simplifies the code and allows cleaning
up a lot of unused code.

The default controller server implementation is left in place, as it can be
shared by both the node controller and the controller server.

Signed-off-by: Amarnath Valluri <[email protected]>
@avalluri avalluri force-pushed the csi-1.0+cache branch 2 times, most recently from 8db4e39 to c717c66 on March 1, 2019 14:18
@pohly
Contributor

pohly commented Mar 1, 2019

Indeed, I was one commit before the persistency changes. We need this embedded version number to catch such mistakes faster :-/

I'll continue testing.

@avalluri
Contributor Author

avalluri commented Mar 1, 2019

Commit message e81ec79:

the newly created volume to that node by using 'Volume.Topology'. OC has to make

s/OC/The container orchestrator/
Done

You've spelled "Cache" and further down also "Persistent" with capital letter when talking about the model. I think "cache volume" and "persistent volume" are normal English words and not specific names, so normal capitalization seems more appropriate.

s/Cache/cache, s/Persistent/persistent. Done.

deploy/kubernetes-1.13/pmem-pvc.yaml says storageClassName: pmem-csi-sc # defined in pmem-csi.yaml. That should say deploy/kubernetes-1.13/pmem-storageclass.yaml.

Fixed the comment.

avalluri added 2 commits March 2, 2019 01:24
With this change, PMEM volumes are created in the CreateVolume() call, irrespective
of the persistency model. This means there is no 'delayed volume' creation in
ControllerPublishVolume anymore.

In the normal persistent volume case, the driver creates a pmem volume on a free
node, chosen in the order of 'CreateVolumeRequest.TopologyRequirement.Preferred',
and locks the newly created volume to that node by using 'Volume.Topology'. The
container orchestrator has to make sure that the application which claims this
volume runs on this node.

Unlike a normal persistent volume, a cache volume creates a set of pmem volumes
on a number of different nodes, each with its own local data, chosen in the order
of 'CreateVolumeRequest.TopologyRequirement.Preferred'. Applications are started
on those nodes and then get to use the volume on their node. Data persists across
application restarts. This is useful when the data is only cached information
that can be discarded and reconstructed at any time and the application can reuse
existing local data when restarting.

Introduced new volume/StorageClass parameters:
- "persistencyModel", which shall be set to "cache" for cache volumes, and
- "cacheSize", to request the number of nodes on which the created volume can be used.

Updated README with different possible persistency models and Kubernetes usage
for cache and persistent volumes.

Removed node locking from the volume tests, as full topology support is now
implemented and provisioning anywhere in the cluster is supposed to work.

protosanitizer supports suppressing secrets while logging gRPC messages.

Signed-off-by: Amarnath Valluri <[email protected]>
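
A minimal sketch of a StorageClass requesting the cache model described in the commit message above; the provisioner string and the cache size are assumptions for illustration:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: pmem-csi-sc-cache          # illustrative name
provisioner: pmem-csi.intel.com    # assumed CSI driver name
parameters:
  persistencyModel: cache          # parameter introduced by this change
  cacheSize: "2"                   # number of nodes that get their own pmem volume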
@avalluri avalluri requested a review from pohly March 1, 2019 23:26
Contributor

@pohly pohly left a comment

Looks good enough for merging. I found some more typos in the new documentation, but instead of iterating over this again and again I suggest that we merge and then let a docs writer proof-read everything and fix those.

Regarding the code: I tried both examples manually and also ran make test_e2e. That worked. We don't have good tests for aspects of the persistency model, though: for example, accidentally creating a cache volume when the user asked for a normal persistent volume won't get caught. Let's add E2E tests separately. I can look into that.

@okartau if you agree, can you merge and rebase your docs PR?

@avalluri
Contributor Author

avalluri commented Mar 4, 2019

Offline, @okartau agreed to merge this change.

@avalluri avalluri merged commit 9cd0dc9 into intel:devel Mar 4, 2019