doc: design doc for non-graceful node shutdown #5409
Conversation
738c3b0 to 92347a4 (Compare)
92347a4 to 8513a0c (Compare)
> @@ -0,0 +1,127 @@
> # Non graceful node shutdown
>
> In Kubernetes, when a node becomes unhealthy or is intentionally drained,
unhealthy is also used for volumes, maybe call it dysfunctional?
> up the volume (via `NodeUnstageVolume` and `NodeUnpublishVolume`), the CSI driver
> has no opportunity to revoke the node's access to the volume. The node may still
> hold active mounts, open file handles, or client sessions. This can lead to data
> corruption due to writes from disconnected yet still-active clients.
disconnected isn't really correct, as a disconnected client cannot reach the storage. A client may re-connect later on, causing havoc. Or the node may be somehow broken while apps are still running on it; this is the major concern.
> To ensure safe volume reuse and prevent stale client access during node disruptions,
> the proposed solution is to track the client address during the `NodeStageVolume()`
> operation and store it in the image or subvolume metadata under the key:
> `csi.storage.k8s.io/clientAddress/<NodeId>`.
csi.storage.k8s.io isn't appropriate, use csi.ceph.io or something similar instead
Ensure that this is not copied to the clones and snapshots.
How about the below?

- For RBD, use something like `.rbd.csi.ceph.com` so that we don't need to worry about
  the replication, but we still need to take care of clones/snapshots.
- For CephFS, use `.cephfs.csi.ceph.com`?
➕ `.rbd.csi.ceph.com` and `.cephfs.csi.ceph.com` will be consistent with the other metadata keys as well.
ceph-csi/internal/rbd/encryption.go
Lines 57 to 66 in cfb3979
    encryptionMetaKey    = "rbd.csi.ceph.com/encrypted"
    oldEncryptionMetaKey = ".rbd.csi.ceph.com/encrypted"
    // metadataDEK is the key in the image metadata where the (encrypted)
    // DEK is stored.
    metadataDEK    = "rbd.csi.ceph.com/dek"
    oldMetadataDEK = ".rbd.csi.ceph.com/dek"
    // luks2 header size metadata key.
    luks2HeaderSizeKey = "rbd.csi.ceph.com/luks2HeaderSize"
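To make the proposal concrete, here is a minimal sketch of how such a key could be recorded, using the CLI equivalents of what `NodeStageVolume` would do via go-ceph. The pool, image, subvolume, filesystem, and node names are placeholders, and the key prefix follows the `.rbd.csi.ceph.com`/`.cephfs.csi.ceph.com` suggestion above:

```bash
# RBD: store the staging client address as image metadata (placeholder names).
rbd image-meta set replicapool/csi-vol-0001 \
  ".rbd.csi.ceph.com/clientAddress/worker-1" "192.168.39.34:0/2458835906"

# CephFS: store it as subvolume metadata (assumes a Ceph release that
# supports "ceph fs subvolume metadata set").
ceph fs subvolume metadata set myfs csi-vol-0002 \
  ".cephfs.csi.ceph.com/clientAddress/worker-1" "192.168.39.34" \
  --group_name csi
```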
> csi.storage.k8s.io/controller-publish-secret-namespace
> ```
>
> **Solution 1**: Fetch secrets somehow using volumeID.
how?
> To ensure safe volume reuse and prevent stale client access during node disruptions,
> the proposed solution is to track the client address during the `NodeStageVolume()`
> operation and store it in the image or subvolume metadata under the key:
> `csi.storage.k8s.io/clientAddress/<NodeId>`.
Ensure that this is not copied to the clones and snapshots.
> (`NodeId` from the plugin container argument `--nodeid`)
>
> ```
> csi.storage.k8s.io/clientAddress/<NodeId>: <clientAddress+nonce>
This step is missing the details on how to get the clientAddress+nonce.
@Madhu-1, does this get us the required address?
ceph-csi/internal/util/connection.go
Lines 154 to 163 in cfb3979
    // GetAddrs returns the addresses of the RADOS session,
    // suitable for blocklisting.
    func (cc *ClusterConnection) GetAddrs() (string, error) {
        if cc.conn == nil {
            return "", errors.New("cluster is not connected yet")
        }

        return cc.conn.GetAddrs()
    }
it will give us the IP address, not the nonce
Ahh okay, from the comment I thought we get the IP+nonce.
ceph-csi/internal/csi-addons/rbd/network_fence.go
Lines 153 to 163 in cfb3979
    address, err := conn.GetAddrs()
    if err != nil {
        return nil, status.Errorf(codes.Internal, "failed to get client address: %s", err)
    }

    // The example address we get is 10.244.0.1:0/2686266785 from
    // which we need to extract the IP address.
    addr, err := nf.ParseClientIP(address)
    if err != nil {
        return nil, status.Errorf(codes.Internal, "failed to parse client address: %s", err)
    }
For RBD, we can get the clientAddress+nonce from the below node path:

    [root@c1-m03 /]# cat /sys/devices/rbd/0/client_addr
    192.168.39.34:0/2458835906
suggested by @nixpanic
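As a rough sketch of how this could be generalized beyond device `0`, assuming the kernel exposes per-device `pool`, `name`, and `client_addr` attributes under `/sys/bus/rbd/devices/` (the attribute names and path reflect my understanding of the krbd sysfs layout; pool and image names are placeholders):

```bash
# Find the krbd client address+nonce for a specific mapped image.
POOL="replicapool"      # placeholder
IMAGE="csi-vol-0001"    # placeholder

for dev in /sys/bus/rbd/devices/*; do
  if [ "$(cat "$dev/pool")" = "$POOL" ] && [ "$(cat "$dev/name")" = "$IMAGE" ]; then
    cat "$dev/client_addr"   # e.g. 192.168.39.34:0/2458835906
  fi
done
```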
> @Madhu-1, does this get us the required address?
>
> ceph-csi/internal/util/connection.go, Lines 154 to 163 in cfb3979:
>
>     // GetAddrs returns the addresses of the RADOS session,
>     // suitable for blocklisting.
>     func (cc *ClusterConnection) GetAddrs() (string, error) {
>         if cc.conn == nil {
>             return "", errors.New("cluster is not connected yet")
>         }
>         return cc.conn.GetAddrs()
>     }
I tried this, but it returns a different nonce, the IP being the same.
The method mentioned in the earlier comment only applies to block devices (`/sys/devices/rbd/0/client_addr`).
I haven't found any other way to retrieve the IP and nonce together except calling `rbd status`/`ceph tell` API calls.
That said, I don't see any difference between blocklisting by just the IP vs IP+nonce. Since the primary goal is to block access from the tainted node, blocklisting by IP alone should suffice?
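For illustration (pool/image names are placeholders), `rbd status` is one of the CLI paths mentioned above that reports watchers as `ip:0/nonce`, and the two blocklisting granularities would look like this:

```bash
# Watchers of an image include the full client address with nonce.
rbd status replicapool/csi-vol-0001
#   Watchers:
#     watcher=192.168.39.34:0/2458835906 client.24135 cookie=...

# Blocklisting by IP alone fences every client instance on that node,
# while IP:port/nonce only fences that single session.
ceph osd blocklist add 192.168.39.34
ceph osd blocklist add 192.168.39.34:0/2458835906
```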
> ```
>
> - **For CephFS**:
>   - List active clients and match against `clientAddress` to get the `clientId`.
Are we not going to store it in the metadata?
we could store the clientId for CephFS subvolumes; that would add one extra metadata entry, but could save us the listing of active clients and the matching operation for each subvolume request.
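For reference, a rough sketch of the list-and-match flow being discussed, using MDS admin commands. The filesystem name, rank, address, and the JSON field used for matching are assumptions and may vary between Ceph releases:

```bash
# List the sessions on MDS rank 0 and match the stored client address to
# find the client id (field names in the JSON output may differ by release).
ceph tell mds.myfs:0 client ls > /tmp/clients.json
CLIENT_ID="$(jq -r '.[] | select(.inst | contains("192.168.39.34")) | .id' /tmp/clients.json)"

# Evict the matching client so it can no longer access the subvolume.
ceph tell mds.myfs:0 client evict id="${CLIENT_ID}"
```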
8513a0c to fe22844 (Compare)
> has no opportunity to revoke the node's access to the volume. The node may still
> hold active mounts, open file handles, or client sessions. This can lead to data
> corruption as applications may still be running on the broken node with active
> client sessions, even though the node is marked as out of service.
https://kubernetes.io/docs/concepts/cluster-administration/node-shutdown/#non-graceful-node-shutdown says that node.kubernetes.io/out-of-service taint should be added only if "the node is already in shutdown or power off state (not in the middle of restarting)". How can there be active client sessions and applications still be running there if the node is physically powered off?
@idryomov we are trying to cover the worst case where we lose access to the node and want to block the clients on that node, to be on the safer side. When the taint is added, Kubernetes assumes the node is already powered off and starts moving the workload to other nodes; we are adding a further fencing mechanism to ensure the node's clients cannot access the storage (if they are still running or ever come back while the taint still exists).
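For context, the taint in question is applied and removed manually with kubectl (the node name is a placeholder), and as discussed below, Kubernetes itself does not verify the node's power state at either step:

```bash
# Admin marks the broken node as out of service; Kubernetes then force-deletes
# its pods and detaches their volumes (ControllerUnpublishVolume).
kubectl taint nodes worker-1 node.kubernetes.io/out-of-service=nodeshutdown:NoExecute

# After the node has gone through a full power cycle, the admin removes the
# taint (the trailing "-" removes it); nothing enforces the power cycle.
kubectl taint nodes worker-1 node.kubernetes.io/out-of-service=nodeshutdown:NoExecute-
```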
What about the point when the taint is removed? Since the admin can apply the taint without knowing that the node is actually powered off (and that being precisely what this design is trying to guard against), I'm wondering whether the admin can also remove the taint whenever they feel like it?
Put differently, is it enforced that the node goes through (an equivalent of) a power cycle before the taint is removed?
When the node is recovered and the admin can access it: if the node is available (kubelet + api-server), ensure that all pods are removed from the node (to keep it consistent with the etcd data for pods) and that no new pods are scheduled on it.
When everything is fine, the admin can remove the taint and mark the node available for scheduling new pods. The expectation is that the admin removes the taint only when there are no pods running on the node. As we are implementing this in the RPC spec, CSI takes care of unfencing (when a pod is scheduled for the node) and allows future operations.
Does the term "recovered" include a power cycle?
No pods running on the node is very different from (an equivalent of) a power cycle. Let's say the taint gets added on an active/running node with a bunch of mapped RBD devices and the corresponding mounts. Later, the admin realizes that they screwed up and decides to remove the taint. What is supposed to ensure that those mounts are torn down before Ceph CSI unfences?
> The kubelet will take care of it as soon as the node comes back.

Can you go into more detail here? How exactly does kubelet do that? If one of the steps there fails for some reason, is the admin prevented from removing the taint?

> And later it's the admin's responsibility to power off the node as well.

... but this isn't enforced?
> The kubelet will take care of it as soon as the node comes back.
>
> Can you go into more detail here? How exactly does kubelet do that? If one of the steps there fails for some reason, is the admin prevented from removing the taint?

I haven't looked into the kubelet code but we can check on that one, @iPraveenParihar PTAL. Not yet: Kubernetes doesn't have any checks before removing the taint; it is left to the admin to validate everything and then remove the taint (we can also document anything required from our side as well).

> And later it's the admin's responsibility to power off the node as well.
>
> ... but this isn't enforced?

Yes, nothing is enforced in Kubernetes; it is only a documented step.
From what I have tested...
> Let's say the taint gets added on an active/running node with a bunch of mapped RBD devices and the corresponding mounts. Later, the admin realizes that they screwed up and decides to remove the taint. What is supposed to ensure that those mounts are torn down before Ceph CSI unfences?
Scenario 1: Node is healthy (Node is Ready, kubelet is running), taints are removed without reboot.

- If a taint is added to a healthy node where the kubelet is functional, existing pods are marked for deletion but stay in a Terminating or Error state.
- These pods aren't forcefully deleted because the node's status is not NotReady.
- NodeUnpublishVolume and NodeUnstageVolume are not called, since the CSI nodeplugin pods are also removed.
- ControllerUnpublishVolume is not triggered either; VolumeAttachment objects stay intact until pods are forcefully deleted.
- New pods scheduled to another node will fail with attach errors due to the image still being mapped.
- The admin realizes that they screwed up and decides to remove the taint; nodeplugin pods come back up, the kubelet issues the pending NodeUnpublishVolume/NodeUnstageVolume calls for the pod(s) in Terminating/Error state, mappings are removed, and new pods on the other node can finally attach the volume successfully.
I'm trying to explore the scenario described in this doc -- one where the NodeUnstageVolume and NodeUnpublishVolume calls either aren't issued at all or don't make it, and only ControllerUnpublishVolume goes into effect. If the node isn't powered off, the mounts would still be there and would be left behind, right? My understanding is that only blocklisting would occur as part of handling ControllerUnpublishVolume.
Scenario 2: Node is unresponsive (Node is NotReady, kubelet is not running), taints are removed without reboot.

- The unresponsive node is tainted.
- Pods are forcefully deleted.
- VolumeAttachment deletion triggers ControllerUnpublishVolume, which blocklists the RBD client.
- New pods are scheduled on other nodes and can attach without issue.
- Here, if the node now comes back without a reboot and the kubelet is running, NodeUnpublishVolume/NodeUnstageVolume would not be called as the pod(s) are already deleted. Since the node hasn't rebooted, the original RBD device mappings and mount points still persist.
> The kubelet will take care of it as soon as the node comes back. And later it's the admin's responsibility to power off the node as well.

I hope @iPraveenParihar already tested this case as well.
@Madhu-1, the mounts and mappings will persist unless the node is rebooted. That’s why we discussed that the admin should only remove the taint after the node has come back from a successful shutdown. This ensures clean teardown and avoids stale device state.
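To illustrate the clean-teardown check being described, an admin could verify on the recovered node that nothing is left over before removing the taint (a sketch only, not part of the design):

```bash
# A node that really went through a power cycle should have no leftover
# krbd mappings and no leftover RBD/CephFS mounts.
rbd device list
mount | grep -E '/dev/rbd|type ceph' || echo "no stale ceph-csi mounts"
```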
> Here, if the node now comes back without a reboot and the kubelet is running, NodeUnpublishVolume/NodeUnstageVolume would not be called as the pod(s) are already deleted. Since the node hasn't rebooted, the original RBD device mappings and mount points still persist.
This is exactly what I suspected and why I kept probing here ;) This should be obvious by now but I want to restate it just in case: if the node a) isn't properly cleaned up (i.e. all mounts and RBD device mappings are torn down via NodeUnpublishVolume/NodeUnstageVolume) or b) doesn't go through a power cycle to waive any concerns around cleanup, removing the blocklist entry in ControllerPublishVolume can lead to data corruption. This was added after my earlier comment:

> # The blocklist must persist until we can confirm the node has gone through
> # a complete power cycle, as premature expiration could lead to data corruption

but not in the main body of the doc, and previously wasn't mentioned at all. I'd suggest highlighting this in a separate paragraph.
Thanks @idryomov for probing on this one to ensure we don't miss anything and document it :). @iPraveenParihar, thanks for confirming, as my memory of this is quite old; it has existed for many releases and I tested it a long time ago. We need to document in our docs what is expected around ceph-csi here to avoid any data-related problems.
> - **For RBD**:
>
>   ```bash
>   ceph osd blocklist add <clientAddress+nonce>
A "problem" with OSD blocklist entries is that by default they expire after 1 hour. The expiration timeout is in seconds and can be changed by passing an additional integer argument. For example, ceph osd blocklist add 1.2.3.4/1234 86400 would persist the blocklist entry for 1 day.
If the goal here is to accommodate scenarios where the admin applies the out-of-service taint without knowing that the node is actually powered off, the blocklist entry needs to persist until it becomes known for sure that the node went through (an equivalent of) a power cycle. Allowing the blocklist entry to expire before that point can lead to "stale client access" and therefore data corruption.
we will blocklist for the max period (like 5 years) or an indefinite time. @iPraveenParihar can you please update the time for that as well.
just for reference:
ceph-csi/internal/csi-addons/networkfence/fencing.go
Lines 256 to 261 in cfb3979
    // TODO: add blocklist till infinity.
    // Currently, ceph does not provide the functionality to blocklist IPs
    // for infinite time. As a workaround, add a blocklist for 5 YEARS to
    // represent infinity from ceph-csi side.
    // At any point in this time, the IPs can be unblocked by an UnfenceClusterReq.
    // This needs to be updated once ceph provides functionality for the same.
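For reference, the CLI equivalent of that workaround would be roughly the following; the expiry argument is in seconds, and since the exact value ceph-csi uses is not shown in the snippet above, the number here is only illustrative:

```bash
# Blocklist the node's client IP for ~5 years (5 * 365 * 86400 seconds) as a
# stand-in for "infinity"; the entry can still be removed earlier on unfence.
ceph osd blocklist add 192.168.39.34 157680000
```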
3ef2b32 to 814b583 (Compare)
> ## Problem
>
> When `ControllerUnpublishVolume` is called without the node first having cleaned
> up the volume (via `NodeUnstageVolume` and `NodeUnpublishVolume`), the CSI driver
(via `NodeUnstageVolume` and `NodeUnpublishVolume`) to (via `NodeUnpublishVolume` and `NodeUnstageVolume`)
> - Remove the client from the blocklist:
>
>   ```bash
>   ceph osd blocklist rm <clientAddress+nonce>
remove the nonce, as we will go with IP blocklisting?
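If the nonce is dropped, the unfence step would presumably become IP-only as well (the address is a placeholder):

```bash
# Inspect the current blocklist and remove the node's IP-only entry.
ceph osd blocklist ls
ceph osd blocklist rm 192.168.39.34
```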
>
> ```
> # RBD
> rbd image-meta remove <pool>/<image> .rbd.csi.ceph.com/clientAddress/<NodeId>
add the rbd namespace to the command as well
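That is, something like the following, using the `<pool>/<namespace>/<image>` image spec (all names are placeholders):

```bash
# Remove the tracked client address key from an image in a RADOS namespace.
rbd image-meta remove replicapool/ns-1/csi-vol-0001 \
  ".rbd.csi.ceph.com/clientAddress/worker-1"
```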
> corresponding StorageClass at the time of provisioning:
>
> ```
> csi.ceph.io/controller-publish-secret-name
why do we have `csi.ceph.io`?
a mistake, it should be `csi.storage.k8s.io/controller-publish-secret-name`
> ...
> }
> }
> ]
Can you please leave a note that the node has to go through the power lifecycle before removing the taint, or else we are going to have data inconsistency/corruption problems.
Added as an Important note above.
4a3c34e to ee6f072 (Compare)
> csi.storage.k8s.io/controller-publish-secret-namespace
> ```
>
> **Solution 1**: Fallback to default secrets if available in csi-config-map
@Madhu-1, are we good with this solution to address the older PVCs?
Yes we are good with this one 👍🏻
LGTM
small nits, LGTM
>
> > ⚠️ **WARNING**: When a node becomes out of service, its mounts and device
> mappings will persist until the node goes through a complete power lifecycle
> (shutdown and restart). To prevent data inconsistency or corruption,
we don't need "restart" here, just mention power cycle (which includes shutdown and start)
> (shutdown and restart). To prevent data inconsistency or corruption,
> administrators **MUST NOT** remove the `node.kubernetes.io/out-of-service`
> taint until the node has successfully completed a full shutdown and restart
> cycle. This ensures proper cleanup of stale device state and prevents data
change "full shutdown and restart cycle" to "full power cycle"
> cycle. This ensures proper cleanup of stale device state and prevents data
> corruption from lingering mounts or active client sessions.
Suggested change (keep the two quoted lines above and append):

> Removing the taint prematurely may leave stale device state, active client sessions, or lingering mounts, which can lead to serious data integrity issues.
ee6f072 to 516221e (Compare)
This pull request has been removed from the queue for the following reason: The pull request can't be updated. You should update or rebase your pull request manually. If you do, this pull request will automatically be requeued once the queue conditions match again.
@Mergifyio rebase
Signed-off-by: Praveen M <[email protected]>
✅ Branch has been successfully rebased
516221e to 6df44c7 (Compare)
@Mergifyio requeue
✅ The queue state of this pull request has been cleaned. It can be re-embarked automatically |
Describe what this PR does
doc: design doc for non-graceful node shutdown
When `ControllerUnpublishVolume` is called without the node first having cleaned up the volume (via `NodeUnstageVolume` and `NodeUnpublishVolume`), the CSI driver has no opportunity to revoke the node's access to the volume. The node may still hold active mounts, open file handles, or client sessions. This can lead to data corruption due to writes from disconnected yet still-active clients.

Checklist:
Show available bot commands
These commands are normally not required, but in case of issues, leave any of
the following bot commands in an otherwise empty comment in this PR:
`/retest ci/centos/<job-name>`: retest the `<job-name>` after an unrelated failure (please report the failure too!)