Conversation

@iPraveenParihar
Contributor

Describe what this PR does

doc: design doc for non-graceful node shutdown

When ControllerUnpublishVolume is called without the node first having cleaned up the volume (via NodeUnstageVolume and NodeUnpublishVolume), the CSI driver has no opportunity to revoke the node’s access to the volume. The node may still hold active mounts, open file handles, or client sessions. This can lead to data corruption due to writes from disconnected yet still-active clients.

Checklist:

  • Commit Message Formatting: Commit titles and messages follow guidelines in the developer guide.
  • Reviewed the developer guide on Submitting a Pull Request
  • Pending release notes updated with breaking and/or notable changes for the next major release.
  • Documentation has been updated, if necessary.
  • Unit tests have been added, if necessary.
  • Integration tests have been added, if necessary.

Show available bot commands

These commands are normally not required, but in case of issues, leave any of
the following bot commands in an otherwise empty comment in this PR:

  • /retest ci/centos/<job-name>: retest the <job-name> after unrelated
    failure (please report the failure too!)

@mergify mergify bot added the ci/skip/e2e (skip running e2e CI jobs), ci/skip/multi-arch-build (skip building on multiple architectures), and component/docs (Issues and PRs related to documentation) labels on Jul 3, 2025
@iPraveenParihar iPraveenParihar force-pushed the design/handle-non-graceful-node-shutdown branch from 738c3b0 to 92347a4 Compare July 3, 2025 08:53
@iPraveenParihar iPraveenParihar marked this pull request as ready for review July 3, 2025 08:53
@iPraveenParihar iPraveenParihar force-pushed the design/handle-non-graceful-node-shutdown branch from 92347a4 to 8513a0c Compare July 3, 2025 09:02
@@ -0,0 +1,127 @@
# Non graceful node shutdown

In Kubernetes, when a node becomes unhealthy or is intentionally drained,
Member

unhealthy is also used for volumes, maybe call it dysfunctional?

up the volume (via `NodeUnstageVolume` and `NodeUnpublishVolume`), the CSI driver
has no opportunity to revoke the node’s access to the volume. The node may still
hold active mounts, open file handles, or client sessions. This can lead to data
corruption due to writes from disconnected yet still-active clients.
Member

disconnected isn't really correct, as a disconnected client cannot reach the storage. A client may re-connect later on, causing havoc. Or, the node may be somehow broken but apps may still be running; this is the major concern.

To ensure safe volume reuse and prevent stale client access during node disruptions,
the proposed solution is to track the client address during the `NodeStageVolume()`
operation and store it in the image or subvolume metadata under the key:
`csi.storage.k8s.io/clientAddress/<NodeId>`.
Member

csi.storage.k8s.io isn't appropriate, use csi.ceph.io or something similar instead

Collaborator

Ensure that this is not copied to the clones and snapshots.

Collaborator

How about the below:

  • For RBD, use something like .rbd.csi.ceph.com so that we don't need to worry about replication, but we still need to take care of clones/snapshots
  • For CephFS, use .cephfs.csi.ceph.com?

Contributor Author

.rbd.csi.ceph.com and .cephfs.csi.ceph.com will be consistent with the other metadata keys as well.

encryptionMetaKey = "rbd.csi.ceph.com/encrypted"
oldEncryptionMetaKey = ".rbd.csi.ceph.com/encrypted"
// metadataDEK is the key in the image metadata where the (encrypted)
// DEK is stored.
metadataDEK = "rbd.csi.ceph.com/dek"
oldMetadataDEK = ".rbd.csi.ceph.com/dek"
// luks2 header size metadata key.
luks2HeaderSizeKey = "rbd.csi.ceph.com/luks2HeaderSize"
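
For illustration only (not part of the design doc), setting such a key by hand could look like the sketch below, assuming the `.rbd.csi.ceph.com`/`.cephfs.csi.ceph.com` prefixes and placeholder pool, image, subvolume, and node names:

```bash
# RBD: attach the client address to the image metadata (illustrative names)
rbd image-meta set replicapool/csi-vol-0001 \
  .rbd.csi.ceph.com/clientAddress/node-1 192.168.39.34

# CephFS: the equivalent for a subvolume (needs a Ceph release that has
# `ceph fs subvolume metadata`, e.g. Quincy or later)
ceph fs subvolume metadata set myfs csi-vol-0002 \
  .cephfs.csi.ceph.com/clientAddress/node-1 192.168.39.34 --group_name csi
```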

csi.storage.k8s.io/controller-publish-secret-namespace
```

**Solution 1**: Fetch secrets somehow using volumeID.
Member

how?

To ensure safe volume reuse and prevent stale client access during node disruptions,
the proposed solution is to track the client address during the `NodeStageVolume()`
operation and store it in the image or subvolume metadata under the key:
`csi.storage.k8s.io/clientAddress/<NodeId>`.
Collaborator

Ensure that this is not copied to the clones and snapshots.

(`NodeId` from the plugin container argument `--nodeid`)

```
csi.storage.k8s.io/clientAddress/<NodeId>: <clientAddress+nonce>
Collaborator

This step is missing the details on how to get the clientAddress+nonce.

Contributor Author

@Madhu-1, does this get us the required address?

// GetAddrs returns the addresses of the RADOS session,
// suitable for blocklisting.
func (cc *ClusterConnection) GetAddrs() (string, error) {
    if cc.conn == nil {
        return "", errors.New("cluster is not connected yet")
    }
    return cc.conn.GetAddrs()
}

Collaborator

It will give us the IP address, not the nonce.

Contributor Author

Ahh okay, from the comment I thought we would get the IP+nonce.

address, err := conn.GetAddrs()
if err != nil {
    return nil, status.Errorf(codes.Internal, "failed to get client address: %s", err)
}
// The example address we get is 10.244.0.1:0/2686266785 from
// which we need to extract the IP address.
addr, err := nf.ParseClientIP(address)
if err != nil {
    return nil, status.Errorf(codes.Internal, "failed to parse client address: %s", err)
}

For RBD, we can get the clientAddress+nonce from the below node path:

[root@c1-m03 /]# cat /sys/devices/rbd/0/client_addr
192.168.39.34:0/2458835906

suggested by @nixpanic
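
As a cross-check (illustrative pool/image names and output), the same address and nonce also show up as a watcher in `rbd status`:

```bash
rbd status replicapool/csi-vol-0001
# Watchers:
#   watcher=192.168.39.34:0/2458835906 client.24135 cookie=...
```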

Contributor Author

@Madhu-1, does this get us the required address?

// GetAddrs returns the addresses of the RADOS session,
// suitable for blocklisting.
func (cc *ClusterConnection) GetAddrs() (string, error) {
    if cc.conn == nil {
        return "", errors.New("cluster is not connected yet")
    }
    return cc.conn.GetAddrs()
}

I tried this, but it returns a different nonce (the IP being the same).
The method mentioned in the previous comment only applies to block devices (/sys/devices/rbd/0/client_addr).
I haven't found any other way to retrieve the IP and nonce together except calling the rbd status/ceph tell API calls.

That said, I don't see any difference between blocklisting by just the IP vs the IP+nonce. Since the primary goal is to block access from the tainted node, blocklisting by the IP alone should suffice?

^^ @Madhu-1 @nixpanic @Rakshith-R

```

- **For CephFS**:
- List active clients and match against `clientAddress` to get the `clientId`.
Collaborator

Are we not going to store it in the metadata?

Contributor Author

We could store the clientId for CephFS subvolumes; that would add one extra metadata entry, but it could save us from listing the active clients and doing the matching for each subvolume request.
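
For reference, a rough sketch of the list-and-match flow with the Ceph CLI (the MDS rank, client id, and address are placeholders):

```bash
# List active CephFS clients; each entry carries the client id and its address
ceph tell mds.0 client ls
# ...
#   "id": 24135,
#   "inst": "client.24135 v1:192.168.39.34:0/2458835906",
# ...

# Evict the matched client by id to fence it
ceph tell mds.0 client evict id=24135
```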

@iPraveenParihar iPraveenParihar force-pushed the design/handle-non-graceful-node-shutdown branch from 8513a0c to fe22844 Compare July 4, 2025 05:18
has no opportunity to revoke the node's access to the volume. The node may still
hold active mounts, open file handles, or client sessions. This can lead to data
corruption as applications may still be running on the broken node with active
client sessions, even though the node is marked as out of service.
Contributor

https://kubernetes.io/docs/concepts/cluster-administration/node-shutdown/#non-graceful-node-shutdown says that the node.kubernetes.io/out-of-service taint should be added only if "the node is already in shutdown or power off state (not in the middle of restarting)". How can there be active client sessions and applications still running there if the node is physically powered off?

Collaborator

@idryomov we are trying to cover the worst case where we have lost access to the node and want to block the client on that node to be on the safer side. When the taint is added, Kubernetes assumes the node is already in a powered-off state and starts moving the workload to other nodes; we are trying to add a further fencing mechanism to ensure that the node is fenced off and its clients cannot get access (if they are still running or ever come back while the taint still exists).

Contributor

What about the point when the taint is removed? Since the admin can apply the taint without knowing that the node is actually powered off (and that being precisely what this design is trying to guard against), I'm wondering whether the admin can also remove the taint whenever they feel like it?

Put differently, is it enforced that the node goes through (an equivalent of) a power cycle before the taint is removed?

Collaborator

When the node is recovered and the admin can access it, and if the node is available (kubelet + api-server), ensure that all pods are removed from the node (to keep it consistent with the etcd data of pods) and that no new pods are scheduled on the node.

When everything is fine, the admin can remove the taint and mark the node available for scheduling new pods. The expectation is that the admin will remove the taint when there are no pods running on the node. As we are implementing this in the RPC spec, CSI takes care of unfencing (when a pod is scheduled for the node) and allows future operations.

Contributor

Does the term "recovered" include a power cycle?

No pods running on the node is very different from (an equivalent of) a power cycle. Let's say the taint gets added on an active/running node with a bunch of mapped RBD devices and the corresponding mounts. Later, the admin realizes that they screwed up and decides to remove the taint. What is supposed to ensure that those mounts are torn down before Ceph CSI unfences?

Contributor

The kubelet will take care of it as soon as the node comes back.

Can you go into more detail here? How exactly does kubelet do that? If one of the steps there fails for some reason, is the admin prevented from removing the taint?

And later it's the admin's responsibility to power off the node as well.

... but this isn't enforced?

Collaborator

The kubelet will take care of it as soon as the node comes back.

Can you go into more detail here? How exactly does kubelet do that? If one of the steps there fails for some reason, is the admin prevented from removing the taint?

I haven't looked into the kubelet code, but we can check on that one, @iPraveenParihar PTAL. Not yet, Kubernetes doesn't have any checks before removing the taint; it's left to the admin to validate everything and remove the taint (we can also document if anything is required from our side as well).

And later it's the admin's responsibility to power off the node as well.

... but this isn't enforced?

Yes, nothing is enforced in Kubernetes; it's only a documented step.

Contributor Author

@iPraveenParihar iPraveenParihar Jul 7, 2025

From what I have tested...

Let's say the taint gets added on an active/running node with a bunch of mapped RBD devices and the corresponding mounts. Later, the admin realizes that they screwed up and decides to remove the taint. What is supposed to ensure that those mounts are torn down before Ceph CSI unfences?

Scenario 1: Node is healthy (Node is Ready, kubelet is running), taints are removed without reboot.

  • If a taint is added to a healthy node where the kubelet is functional, existing pods are marked for deletion but stay in a Terminating or Error state.

  • These pods aren’t forcefully deleted because the node's status is not NotReady.

  • NodeUnpublishVolume and NodeUnstageVolume are not called, since the CSI nodeplugin pods are also removed.

  • ControllerUnpublishVolume is not triggered either — VolumeAttachment objects stay intact until pods are forcefully deleted.

  • New pods scheduled to another node will fail with attach errors due to the image still being mapped.

  • The admin realizes that they screwed up and decides to remove the taint, nodeplugin pods come back up, kubelet issues the pending NodeUnpublishVolume/NodeUnstageVolume calls for the pod/s in terminating/error state, mappings are removed, and new pods on the other node can finally attach the volume successfully.

I'm trying to explore the scenario described in this doc -- one where NodeUnstageVolume and NodeUnpublishVolume calls either aren't issued at all or don't make it, and only ControllerUnpublishVolume goes into effect. If the node isn't powered off, the mounts would still be there and would be left behind, right? My understanding is that only blocklisting would occur as part of handling ControllerUnpublishVolume.

Scenario 2: Node is unresponsive (Node is NotReady, kubelet is not running), taints are removed without reboot

  • The unresponsive node is tainted.

  • Pods are forcefully deleted.

  • VolumeAttachment deletion triggers ControllerUnpublishVolume, which blocks the RBD image.

  • New pods are scheduled on other nodes and can attach without issue.

Here, now if the node comes back without a reboot and the kubelet is running, NodeUnpublishVolume/NodeUnstageVolume would not be called as the pod(s) are already deleted. Since the node hasn't rebooted, the original RBD device mappings and mount points still persist.

The kubelet will take care of it as soon as the node comes back. And later it's the admin's responsibility to power off the node as well. I hope @iPraveenParihar has already tested this case as well.

@Madhu-1, the mounts and mappings will persist unless the node is rebooted. That’s why we discussed that the admin should only remove the taint after the node has come back from a successful shutdown. This ensures clean teardown and avoids stale device state.

Contributor

Here, now if the node comes back without a reboot and the kubelet is running, NodeUnpublishVolume/NodeUnstageVolume would not be called as the pod(s) are already deleted. Since the node hasn't rebooted, the original RBD device mappings and mount points still persist.

This is exactly what I suspected and why I kept probing here ;) This should be obvious by now but I want to restate it just in case: if the node a) isn't properly cleaned up (i.e. all mounts and RBD device mappings are torn down via NodeUnpublishVolume/NodeUnstageVolume) or b) doesn't go through a power cycle to waive any concerns around cleanup, removing the blocklist entry in ControllerPublishVolume can lead to data corruption. This was added after my earlier comment

       # The blocklist must persist until we can confirm the node has gone through
       # a complete power cycle, as premature expiration could lead to data corruption

but not in the main body of the doc and previously wasn't mentioned at all. I'd suggest highlighting this in a separate paragraph.

Collaborator

Thanks @idryomov for probing on this one to ensure we don't miss anything and that we document it :) @iPraveenParihar, thanks for confirming, as my memory of this is really old; this has existed for many releases and I tested it very long ago. We need to document in our docs what is expected from the ceph-csi side to avoid any data-related problems.

- **For RBD**:

```bash
ceph osd blocklist add <clientAddress+nonce>
Contributor

A "problem" with OSD blocklist entries is that by default they expire after 1 hour. The expiration timeout is in seconds and can be changed by passing an additional integer argument. For example, ceph osd blocklist add 1.2.3.4/1234 86400 would persist the blocklist entry for 1 day.

If the goal here is to accommodate scenarios where the admin applies the out-of-service taint without knowing that the node is actually powered off, the blocklist entry needs to persist until it becomes known for sure that the node went through (an equivalent of) a power cycle. Allowing the blocklist entry to expire before that point can lead to "stale client access" and therefore data corruption.

Collaborator

We will blocklist for the max period (like 5 years) or an indefinite time. @iPraveenParihar, can you please update the time for that as well?

Contributor Author

Just for reference:

// TODO: add blocklist till infinity.
// Currently, ceph does not provide the functionality to blocklist IPs
// for infinite time. As a workaround, add a blocklist for 5 YEARS to
// represent infinity from ceph-csi side.
// At any point in this time, the IPs can be unblocked by an UnfenceClusterReq.
// This needs to be updated once ceph provides functionality for the same.
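
For reference, a blocklist entry with an explicit long expiry could look like the sketch below (the address is a placeholder; 157680000 seconds is roughly 5 years):

```bash
ceph osd blocklist add 192.168.39.34 157680000
```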

@iPraveenParihar iPraveenParihar force-pushed the design/handle-non-graceful-node-shutdown branch 2 times, most recently from 3ef2b32 to 814b583 Compare July 7, 2025 10:07
## Problem

When `ControllerUnpublishVolume` is called without the node first having cleaned
up the volume (via `NodeUnstageVolume` and `NodeUnpublishVolume`), the CSI driver
Collaborator

(via `NodeUnstageVolume` and `NodeUnpublishVolume`) to (via `NodeUnpublishVolume` and `NodeUnstageVolume`)

- Remove the client from the blocklist:

```bash
ceph osd blocklist rm <clientAddress+nonce>
Collaborator

Remove the nonce, as we will go with IP blocklisting?
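
If we do go with IP-only blocklisting, the removal would presumably drop the nonce too (illustrative address):

```bash
ceph osd blocklist rm 192.168.39.34
```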


```
# RBD
rbd image-meta remove <pool>/<image> .rbd.csi.ceph.com/clientAddress/<NodeId>
Collaborator

Add the RBD namespace to the command as well.
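
For example (pool, namespace, and image names are placeholders), the namespaced form could look like:

```bash
rbd image-meta remove replicapool/ns-1/csi-vol-0001 .rbd.csi.ceph.com/clientAddress/<NodeId>
```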

corresponding StorageClass at the time of provisioning:

```
csi.ceph.io/controller-publish-secret-name
Collaborator

Why do we have csi.ceph.io here?

Contributor Author

My mistake, it should be csi.storage.k8s.io/controller-publish-secret-name.

...
}
}
]
Collaborator

Can you please leave a note that the node has to go through the power lifecycle before removing the taint, or else we are going to have data inconsistency/corruption problems?

Contributor Author

Added as an Important note above.

@iPraveenParihar iPraveenParihar force-pushed the design/handle-non-graceful-node-shutdown branch 9 times, most recently from 4a3c34e to ee6f072 Compare July 8, 2025 12:33
csi.storage.k8s.io/controller-publish-secret-namespace
```

**Solution 1**: Fallback to default secrets if available in csi-config-map
Contributor Author

@Madhu-1, are we good with this solution to address the older PVCs?

Collaborator

Yes we are good with this one 👍🏻

Rakshith-R
Rakshith-R previously approved these changes Jul 8, 2025
Contributor

@Rakshith-R Rakshith-R left a comment

LGTM

Collaborator

@Madhu-1 Madhu-1 left a comment

small nits, LGTM


> ⚠️ **WARNING**: When a node becomes out of service, its mounts and device
mappings will persist until the node goes through a complete power lifecycle
(shutdown and restart). To prevent data inconsistency or corruption,
Collaborator

We don't need restart here; just mention power cycle (which includes a shutdown and a start).

(shutdown and restart). To prevent data inconsistency or corruption,
administrators **MUST NOT** remove the `node.kubernetes.io/out-of-service`
taint until the node has successfully completed a full shutdown and restart
cycle. This ensures proper cleanup of stale device state and prevents data
Collaborator

Change "full shutdown and restart cycle" to "full power cycle".

Comment on lines 29 to 30
cycle. This ensures proper cleanup of stale device state and prevents data
corruption from lingering mounts or active client sessions.
Collaborator

Suggested change
cycle. This ensures proper cleanup of stale device state and prevents data
corruption from lingering mounts or active client sessions.
Removing the taint prematurely may leave stale device state, active client sessions, or lingering mounts, which can lead to serious data integrity issues.

@iPraveenParihar iPraveenParihar force-pushed the design/handle-non-graceful-node-shutdown branch from ee6f072 to 516221e Compare July 9, 2025 09:44
@mergify mergify bot dismissed Rakshith-R’s stale review July 9, 2025 09:45

Pull request has been modified.

@Madhu-1 Madhu-1 requested a review from Rakshith-R July 9, 2025 09:48
@mergify
Contributor

mergify bot commented Jul 9, 2025

This pull request has been removed from the queue for the following reason: pull request branch update failed.

The pull request can't be updated.

You should update or rebase your pull request manually. If you do, this pull request will automatically be requeued once the queue conditions match again.
If you think this was a flaky issue, you can requeue the pull request, without updating it, by posting a @mergifyio requeue comment.

@Rakshith-R
Contributor

@Mergifyio rebase

@mergify
Contributor

mergify bot commented Jul 9, 2025

rebase

✅ Branch has been successfully rebased

@Rakshith-R Rakshith-R force-pushed the design/handle-non-graceful-node-shutdown branch from 516221e to 6df44c7 Compare July 9, 2025 10:03
@Rakshith-R
Contributor

@Mergifyio requeue

@mergify
Contributor

mergify bot commented Jul 9, 2025

requeue

✅ The queue state of this pull request has been cleaned. It can be re-embarked automatically

@mergify mergify bot merged commit 4870de1 into ceph:devel Jul 9, 2025
15 checks passed