Conversation

@iPraveenParihar
Contributor

Describe what this PR does

doc: design doc for non-graceful node shutdown

When ControllerUnpublishVolume is called without the node first having cleaned up the volume (via NodeUnstageVolume and NodeUnpublishVolume), the CSI driver has no opportunity to revoke the node’s access to the volume. The node may still hold active mounts, open file handles, or client sessions. This can lead to data corruption due to writes from disconnected yet still-active clients.

Checklist:

  • Commit Message Formatting: Commit titles and messages follow guidelines in the developer guide.
  • Reviewed the developer guide on Submitting a Pull Request
  • Pending release notes updated with breaking and/or notable changes for the next major release.
  • Documentation has been updated, if necessary.
  • Unit tests have been added, if necessary.
  • Integration tests have been added, if necessary.

Show available bot commands

These commands are normally not required, but in case of issues, leave any of
the following bot commands in an otherwise empty comment in this PR:

  • /retest ci/centos/<job-name>: retest the <job-name> after unrelated
    failure (please report the failure too!)

@mergify mergify bot added the ci/skip/e2e (skip running e2e CI jobs), ci/skip/multi-arch-build (skip building on multiple architectures), and component/docs (Issues and PRs related to documentation) labels on Jul 3, 2025
@iPraveenParihar iPraveenParihar force-pushed the design/handle-non-graceful-node-shutdown branch from 738c3b0 to 92347a4 Compare July 3, 2025 08:53
@iPraveenParihar iPraveenParihar marked this pull request as ready for review July 3, 2025 08:53
@iPraveenParihar iPraveenParihar force-pushed the design/handle-non-graceful-node-shutdown branch from 92347a4 to 8513a0c Compare July 3, 2025 09:02
@@ -0,0 +1,127 @@
# Non graceful node shutdown

In Kubernetes, when a node becomes unhealthy or is intentionally drained,
Member

unhealthy is also used for volumes, maybe call it dysfunctional?

up the volume (via `NodeUnstageVolume` and `NodeUnpublishVolume`), the CSI driver
has no opportunity to revoke the node’s access to the volume. The node may still
hold active mounts, open file handles, or client sessions. This can lead to data
corruption due to writes from disconnected yet still-active clients.
Member

disconnected isn't really correct, as a disconnected client cannot reach the storage. A client may re-connect later on, causing havoc. Or, the node may be somehow broken but apps may still be running; this is the major concern.

To ensure safe volume reuse and prevent stale client access during node disruptions,
the proposed solution is to track the client address during the `NodeStageVolume()`
operation and store it in the image or subvolume metadata under the key:
`csi.storage.k8s.io/clientAddress/<NodeId>`.
Member

csi.storage.k8s.io isn't appropriate, use csi.ceph.io or something similar instead

Collaborator

Ensure that this is not copied to the clones and snapshots.

Collaborator

How about the below:

  • For RBD, use something like .rbd.csi.ceph.com so that we don't need to worry about replication, but we still need to take care of clones/snapshots
  • For CephFS, use .cephfs.csi.ceph.com?

Contributor Author

.rbd.csi.ceph.com and .cephfs.csi.ceph.com will be consistent with the other metadata keys as well.

encryptionMetaKey = "rbd.csi.ceph.com/encrypted"
oldEncryptionMetaKey = ".rbd.csi.ceph.com/encrypted"
// metadataDEK is the key in the image metadata where the (encrypted)
// DEK is stored.
metadataDEK = "rbd.csi.ceph.com/dek"
oldMetadataDEK = ".rbd.csi.ceph.com/dek"
// luks2 header size metadata key.
luks2HeaderSizeKey = "rbd.csi.ceph.com/luks2HeaderSize"
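
For illustration only (not part of the design doc), setting such a key by hand could look like the sketch below, assuming the `.rbd.csi.ceph.com`/`.cephfs.csi.ceph.com` prefixes and placeholder pool, image, subvolume, and node names:

```bash
# RBD: attach the client address to the image metadata (illustrative names)
rbd image-meta set replicapool/csi-vol-0001 \
  .rbd.csi.ceph.com/clientAddress/node-1 192.168.39.34

# CephFS: the equivalent for a subvolume (needs a Ceph release that has
# `ceph fs subvolume metadata`, e.g. Quincy or later)
ceph fs subvolume metadata set myfs csi-vol-0002 \
  .cephfs.csi.ceph.com/clientAddress/node-1 192.168.39.34 --group_name csi
```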

csi.storage.k8s.io/controller-publish-secret-namespace
```

**Solution 1**: Fetch secrets somehow using volumeID.
Member

how?

To ensure safe volume reuse and prevent stale client access during node disruptions,
the proposed solution is to track the client address during the `NodeStageVolume()`
operation and store it in the image or subvolume metadata under the key:
`csi.storage.k8s.io/clientAddress/<NodeId>`.
Collaborator

Ensure that this is not copied to the clones and snapshots.

(`NodeId` from the plugin container argument `--nodeid`)

```
csi.storage.k8s.io/clientAddress/<NodeId>: <clientAddress+nonce>
Collaborator

This step is missing the details on how to get the clientAddress+nonce.

Contributor Author

@Madhu-1, does this get us the required address?

// GetAddrs returns the addresses of the RADOS session,
// suitable for blocklisting.
func (cc *ClusterConnection) GetAddrs() (string, error) {
    if cc.conn == nil {
        return "", errors.New("cluster is not connected yet")
    }
    return cc.conn.GetAddrs()
}

Collaborator

It will give us the IP address, not the nonce.

Contributor Author

Ahh okay, from the comment I thought we would get the IP+nonce.

address, err := conn.GetAddrs()
if err != nil {
    return nil, status.Errorf(codes.Internal, "failed to get client address: %s", err)
}
// The example address we get is 10.244.0.1:0/2686266785 from
// which we need to extract the IP address.
addr, err := nf.ParseClientIP(address)
if err != nil {
    return nil, status.Errorf(codes.Internal, "failed to parse client address: %s", err)
}

For RBD, we can get the clientAddress+nonce from the below node path:

[root@c1-m03 /]# cat /sys/devices/rbd/0/client_addr
192.168.39.34:0/2458835906

suggested by @nixpanic
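
As a cross-check (illustrative pool/image names and output), the same address and nonce also show up as a watcher in `rbd status`:

```bash
rbd status replicapool/csi-vol-0001
# Watchers:
#   watcher=192.168.39.34:0/2458835906 client.24135 cookie=...
```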

Contributor Author

@Madhu-1, does this get us the required address?

// GetAddrs returns the addresses of the RADOS session,
// suitable for blocklisting.
func (cc *ClusterConnection) GetAddrs() (string, error) {
    if cc.conn == nil {
        return "", errors.New("cluster is not connected yet")
    }
    return cc.conn.GetAddrs()
}

I tried this, but it returns a different nonce (the IP being the same).
The method mentioned in the previous comment only applies to block devices (/sys/devices/rbd/0/client_addr).
I haven't found any other way to retrieve the IP and nonce together except calling the rbd status/ceph tell API calls.

That said, I don't see any difference between blocklisting by just the IP vs the IP+nonce. Since the primary goal is to block access from the tainted node, blocklisting by the IP alone should suffice?

^^ @Madhu-1 @nixpanic @Rakshith-R

```

- **For CephFS**:
- List active clients and match against `clientAddress` to get the `clientId`.
Collaborator

Are we not going to store it in the metadata?

Contributor Author

We could store the clientId for CephFS subvolumes; that would add one extra metadata entry, but it could save us from listing the active clients and doing the matching for each subvolume request.
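
For reference, a rough sketch of the list-and-match flow with the Ceph CLI (the MDS rank, client id, and address are placeholders):

```bash
# List active CephFS clients; each entry carries the client id and its address
ceph tell mds.0 client ls
# ...
#   "id": 24135,
#   "inst": "client.24135 v1:192.168.39.34:0/2458835906",
# ...

# Evict the matched client by id to fence it
ceph tell mds.0 client evict id=24135
```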

@iPraveenParihar iPraveenParihar force-pushed the design/handle-non-graceful-node-shutdown branch from 8513a0c to fe22844 Compare July 4, 2025 05:18
has no opportunity to revoke the node's access to the volume. The node may still
hold active mounts, open file handles, or client sessions. This can lead to data
corruption as applications may still be running on the broken node with active
client sessions, even though the node is marked as out of service.
Contributor

https://kubernetes.io/docs/concepts/cluster-administration/node-shutdown/#non-graceful-node-shutdown says that the node.kubernetes.io/out-of-service taint should be added only if "the node is already in shutdown or power off state (not in the middle of restarting)". How can there be active client sessions and applications still running there if the node is physically powered off?

Collaborator

@idryomov we are trying to cover the worst case where we have lost access to the node and want to block the client on that node to be on the safer side. When the taint is added, Kubernetes assumes the node is already in a powered-off state and starts moving the workload to other nodes; we are trying to add a further fencing mechanism to ensure that the node is fenced off and its clients cannot get access (if they are still running or ever come back while the taint still exists).

Contributor

What about the point when the taint is removed? Since the admin can apply the taint without knowing that the node is actually powered off (and that being precisely what this design is trying to guard against), I'm wondering whether the admin can also remove the taint whenever they feel like it?

Put differently, is it enforced that the node goes through (an equivalent of) a power cycle before the taint is removed?

Collaborator

When the node is recovered and the admin can access it, and if the node is available (kubelet + api-server), ensure that all pods are removed from the node (to keep it consistent with the etcd data of pods) and that no new pods are scheduled on the node.

When everything is fine, the admin can remove the taint and mark the node available for scheduling new pods. The expectation is that the admin will remove the taint when there are no pods running on the node. As we are implementing this in the RPC spec, CSI takes care of unfencing (when a pod is scheduled for the node) and allows future operations.

Contributor

Does the term "recovered" include a power cycle?

No pods running on the node is very different from (an equivalent of) a power cycle. Let's say the taint gets added on an active/running node with a bunch of mapped RBD devices and the corresponding mounts. Later, the admin realizes that they screwed up and decides to remove the taint. What is supposed to ensure that those mounts are torn down before Ceph CSI unfences?

Contributor

The kubelet will take care of it as soon as the node comes back.

Can you go into more detail here? How exactly does kubelet do that? If one of the steps there fails for some reason, is the admin prevented from removing the taint?

And later it's the admin's responsibility to power off the node as well.

... but this isn't enforced?

Collaborator

The kubelet will take care of it as soon as the node comes back.

Can you go into more detail here? How exactly does kubelet do that? If one of the steps there fails for some reason, is the admin prevented from removing the taint?

I haven't looked into the kubelet code, but we can check on that one, @iPraveenParihar PTAL. Not yet, Kubernetes doesn't have any checks before removing the taint; it's left to the admin to validate everything and remove the taint (we can also document if anything is required from our side as well).

And later it's the admin's responsibility to power off the node as well.

... but this isn't enforced?

Yes, nothing is enforced in Kubernetes; it's only a documented step.

Contributor Author

@iPraveenParihar iPraveenParihar Jul 7, 2025

From what I have tested...

Let's say the taint gets added on an active/running node with a bunch of mapped RBD devices and the corresponding mounts. Later, the admin realizes that they screwed up and decides to remove the taint. What is supposed to ensure that those mounts are torn down before Ceph CSI unfences?

Scenario 1: Node is healthy (Node is Ready, kubelet is running), taints are removed without reboot.

  • If a taint is added to a healthy node where the kubelet is functional, existing pods are marked for deletion but stay in a Terminating or Error state.

  • These pods aren’t forcefully deleted because the node's status is not NotReady.

  • NodeUnpublishVolume and NodeUnstageVolume are not called, since the CSI nodeplugin pods are also removed.

  • ControllerUnpublishVolume is not triggered either — VolumeAttachment objects stay intact until pods are forcefully deleted.

  • New pods scheduled to another node will fail with attach errors due to the image still being mapped.

  • The admin realizes that they screwed up and decides to remove the taint, nodeplugin pods come back up, kubelet issues the pending NodeUnpublishVolume/NodeUnstageVolume calls for the pod/s in terminating/error state, mappings are removed, and new pods on the other node can finally attach the volume successfully.

I'm trying to explore the scenario described in this doc -- one where NodeUnstageVolume and NodeUnpublishVolume calls either aren't issued at all or don't make it, and only ControllerUnpublishVolume goes into effect. If the node isn't powered off, the mounts would still be there and would be left behind, right? My understanding is that only blocklisting would occur as part of handling ControllerUnpublishVolume.

Scenario 2: Node is unresponsive (Node is NotReady, kubelet is not running), taints are removed without reboot

  • The unresponsive node is tainted.

  • Pods are forcefully deleted.

  • VolumeAttachment deletion triggers ControllerUnpublishVolume, which blocks the RBD image.

  • New pods are scheduled on other nodes and can attach without issue.

Here, now if the node comes back without a reboot and the kubelet is running, NodeUnpublishVolume/NodeUnstageVolume would not be called as the pod(s) are already deleted. Since the node hasn't rebooted, the original RBD device mappings and mount points still persist.

The kubelet will take care of it as soon as the node comes back. And later it's the admin's responsibility to power off the node as well. I hope @iPraveenParihar has already tested this case as well.

@Madhu-1, the mounts and mappings will persist unless the node is rebooted. That’s why we discussed that the admin should only remove the taint after the node has come back from a successful shutdown. This ensures clean teardown and avoids stale device state.

Contributor

Here, now if the node comes back without a reboot and the kubelet is running, NodeUnpublishVolume/NodeUnstageVolume would not be called as the pod(s) are already deleted. Since the node hasn't rebooted, the original RBD device mappings and mount points still persist.

This is exactly what I suspected and why I kept probing here ;) This should be obvious by now but I want to restate it just in case: if the node a) isn't properly cleaned up (i.e. all mounts and RBD device mappings are torn down via NodeUnpublishVolume/NodeUnstageVolume) or b) doesn't go through a power cycle to waive any concerns around cleanup, removing the blocklist entry in ControllerPublishVolume can lead to data corruption. This was added after my earlier comment

       # The blocklist must persist until we can confirm the node has gone through
       # a complete power cycle, as premature expiration could lead to data corruption

but not in the main body of the doc and previously wasn't mentioned at all. I'd suggest highlighting this in a separate paragraph.

Collaborator

Thanks @idryomov for probing on this one to ensure we don't miss anything and that we document it :) @iPraveenParihar, thanks for confirming, as my memory of this is really old; this has existed for many releases and I tested it very long ago. We need to document in our docs what is expected from the ceph-csi side to avoid any data-related problems.

- **For RBD**:

```bash
ceph osd blocklist add <clientAddress+nonce>
Contributor

A "problem" with OSD blocklist entries is that by default they expire after 1 hour. The expiration timeout is in seconds and can be changed by passing an additional integer argument. For example, ceph osd blocklist add 1.2.3.4/1234 86400 would persist the blocklist entry for 1 day.

If the goal here is to accommodate scenarios where the admin applies the out-of-service taint without knowing that the node is actually powered off, the blocklist entry needs to persist until it becomes known for sure that the node went through (an equivalent of) a power cycle. Allowing the blocklist entry to expire before that point can lead to "stale client access" and therefore data corruption.

Collaborator

We will blocklist for the max period (like 5 years) or an indefinite time. @iPraveenParihar, can you please update the time for that as well?

Contributor Author

Just for reference:

// TODO: add blocklist till infinity.
// Currently, ceph does not provide the functionality to blocklist IPs
// for infinite time. As a workaround, add a blocklist for 5 YEARS to
// represent infinity from ceph-csi side.
// At any point in this time, the IPs can be unblocked by an UnfenceClusterReq.
// This needs to be updated once ceph provides functionality for the same.
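
For reference, a blocklist entry with an explicit long expiry could look like the sketch below (the address is a placeholder; 157680000 seconds is roughly 5 years):

```bash
ceph osd blocklist add 192.168.39.34 157680000
```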

@iPraveenParihar iPraveenParihar force-pushed the design/handle-non-graceful-node-shutdown branch 2 times, most recently from 3ef2b32 to 814b583 Compare July 7, 2025 10:07
## Problem

When `ControllerUnpublishVolume` is called without the node first having cleaned
up the volume (via `NodeUnstageVolume` and `NodeUnpublishVolume`), the CSI driver
Collaborator

(via `NodeUnstageVolume` and `NodeUnpublishVolume`) to (via `NodeUnpublishVolume` and `NodeUnstageVolume`)

- Remove the client from the blocklist:

```bash
ceph osd blocklist rm <clientAddress+nonce>
Collaborator

Remove the nonce, as we will go with IP blocklisting?
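
If we do go with IP-only blocklisting, the removal would presumably drop the nonce too (illustrative address):

```bash
ceph osd blocklist rm 192.168.39.34
```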


```
# RBD
rbd image-meta remove <pool>/<image> .rbd.csi.ceph.com/clientAddress/<NodeId>
Collaborator

Add the RBD namespace to the command as well.
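
For example (pool, namespace, and image names are placeholders), the namespaced form could look like:

```bash
rbd image-meta remove replicapool/ns-1/csi-vol-0001 .rbd.csi.ceph.com/clientAddress/<NodeId>
```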

corresponding StorageClass at the time of provisioning:

```
csi.ceph.io/controller-publish-secret-name
Collaborator

Why do we have csi.ceph.io here?

Contributor Author

My mistake, it should be csi.storage.k8s.io/controller-publish-secret-name.

...
}
}
]
Collaborator

Can you please leave a note that the node has to go through the power lifecycle before removing the taint, or else we are going to have data inconsistency/corruption problems?

Contributor Author

Added as an Important note above.

@iPraveenParihar iPraveenParihar force-pushed the design/handle-non-graceful-node-shutdown branch 9 times, most recently from 4a3c34e to ee6f072 Compare July 8, 2025 12:33
csi.storage.k8s.io/controller-publish-secret-namespace
```

**Solution 1**: Fallback to default secrets if available in csi-config-map
Contributor Author

@Madhu-1, are we good with this solution to address the older PVCs?

Collaborator

Yes we are good with this one 👍🏻

Rakshith-R
Rakshith-R previously approved these changes Jul 8, 2025
Contributor

@Rakshith-R Rakshith-R left a comment

LGTM

Collaborator

@Madhu-1 Madhu-1 left a comment

small nits, LGTM


> ⚠️ **WARNING**: When a node becomes out of service, its mounts and device
mappings will persist until the node goes through a complete power lifecycle
(shutdown and restart). To prevent data inconsistency or corruption,
Collaborator

We don't need restart here; just mention power cycle (which includes a shutdown and a start).

(shutdown and restart). To prevent data inconsistency or corruption,
administrators **MUST NOT** remove the `node.kubernetes.io/out-of-service`
taint until the node has successfully completed a full shutdown and restart
cycle. This ensures proper cleanup of stale device state and prevents data
Collaborator

Change "full shutdown and restart cycle" to "full power cycle".

Comment on lines 29 to 30
cycle. This ensures proper cleanup of stale device state and prevents data
corruption from lingering mounts or active client sessions.
Collaborator

Suggested change
cycle. This ensures proper cleanup of stale device state and prevents data
corruption from lingering mounts or active client sessions.
Removing the taint prematurely may leave stale device state, active client sessions, or lingering mounts, which can lead to serious data integrity issues.

@iPraveenParihar iPraveenParihar force-pushed the design/handle-non-graceful-node-shutdown branch from ee6f072 to 516221e Compare July 9, 2025 09:44
@mergify mergify bot dismissed Rakshith-R’s stale review July 9, 2025 09:45

Pull request has been modified.

@Madhu-1 Madhu-1 requested a review from Rakshith-R July 9, 2025 09:48
@mergify
Contributor

mergify bot commented Jul 9, 2025

This pull request has been removed from the queue for the following reason: pull request branch update failed.

The pull request can't be updated.

You should update or rebase your pull request manually. If you do, this pull request will automatically be requeued once the queue conditions match again.
If you think this was a flaky issue, you can requeue the pull request, without updating it, by posting a @mergifyio requeue comment.

@Rakshith-R
Contributor

@Mergifyio rebase

@mergify
Contributor

mergify bot commented Jul 9, 2025

rebase

✅ Branch has been successfully rebased

@Rakshith-R Rakshith-R force-pushed the design/handle-non-graceful-node-shutdown branch from 516221e to 6df44c7 Compare July 9, 2025 10:03
@Rakshith-R
Contributor

@Mergifyio requeue

@mergify
Contributor

mergify bot commented Jul 9, 2025

requeue

✅ The queue state of this pull request has been cleaned. It can be re-embarked automatically

@mergify mergify bot merged commit 4870de1 into ceph:devel Jul 9, 2025
15 checks passed