Decom old nodes: CORE-7111 by joe-redpanda · Pull Request #28946 · redpanda-data/redpanda

joe-redpanda · 2025-12-11T22:58:17Z

Feature PR for allowing brokers to automatically decommission brokers which have been unavailable for longer than a configured timeout.

Adds last seen to a new report in the nodewise health report.

Processes last seen reports to find nodes which are past the decommission timeout on a quorum of nodes.

This creates a list of auto decom candidates, one of which (the lowest node id) will be selected to be automatically decommissioned. Nothing will be submitted for auto decommission if there is an ongoing decommission.
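The selection rule described above can be sketched as follows. This is a minimal illustration in Python, not the PR's C++ code; the function name `pick_auto_decom_target` and its arguments are hypothetical.

```python
from typing import Iterable, Optional


def pick_auto_decom_target(
    candidates: Iterable[int],
    decommissioning: Iterable[int],
) -> Optional[int]:
    """Return the one node id to submit for auto decommission, or None.

    Hypothetical helper: `candidates` are node ids past the timeout on a
    quorum of nodes, `decommissioning` are node ids already being removed.
    """
    # An ongoing decommission blocks any new automatic submission.
    if list(decommissioning):
        return None
    # Only one candidate is chosen: the lowest node id.
    return min(candidates, default=None)
```

For example, `pick_auto_decom_target([5, 3, 9], [])` selects node 3, while the same candidates with any node already decommissioning yield `None`.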

Completes https://redpandadata.atlassian.net/browse/CORE-7111

Backports Required

Release Notes

Improvements

Allow nodes to automatically decommission after a certain timeout

joe-redpanda · 2026-01-08T19:02:07Z

Rebase onto dev to cut my build times, no relevant changes

Adds a convenience formatter for flat hash set so that it can be easily logged.

Adds partition_autobalancing_node_autodecommission_timeout_sec which is the time in seconds after which partition balancer planner should begin decommissioning a node which is unresponsive.

Wires partition_autobalancing_node_autodecommission_time into partition balancer. This commit adds the most basic implementation of auto decommissioning, which is based on the last seen from the perspective of the current controller broker. This implementation will run into problems when controller leadership changes. In future commits, this will be replaced by a coordinated approach where the partition_balancer_planner will instead use the cluster health report to seek the consent of a quorum of nodes before decommissioning a broker.

Adds node_status to health monitor backend. This will be used in future commits to create an auto decom status report.

Adds a new struct to health_monitor_types: node_liveness_report. This part of the health report describes internode connectivity by recording when each node last heard from every other node. Adds this struct to node_health_report and node_health_report_serde. Adds build fixes needed given the above.

Adds population of node liveness report from the node_status_table. This will iterate over all cluster members, fetching their last seen from the node status table (if present) and populating the node_liveness_report with the results.
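As a rough illustration of the shape of this report and how it might be populated, here is a Python sketch. The class `NodeLivenessReport`, the function `build_liveness_report`, and the representation of the node status table as a plain mapping from node id to a last-seen timestamp are all assumptions made for the example; the actual types are C++ and are not shown in this PR text.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Mapping


@dataclass
class NodeLivenessReport:
    # peer node id -> time (in seconds) at which the reporting node
    # last heard from that peer. Field name is an assumption.
    last_seen: Dict[int, float] = field(default_factory=dict)


def build_liveness_report(
    members: Iterable[int],
    node_status_table: Mapping[int, float],
) -> NodeLivenessReport:
    """Iterate over cluster members and copy their last seen, if present."""
    report = NodeLivenessReport()
    for node_id in members:
        seen = node_status_table.get(node_id)
        # Assumption: members with no status entry are left out of the report.
        if seen is not None:
            report.last_seen[node_id] = seen
    return report
```

Entries in the status table for nodes that are not cluster members are ignored, since the iteration is driven by the member list.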

Adds node_autodecommission_timeout to partition_balancer_planner::planner_config so that it can later be used in the auto decommission logic. Adds the necessary build fixes for this.

Adds the logic which performs coordinated auto decommission of nodes that have exceeded their auto decom timeout. Nodes are now only slated for auto decommission if a majority of the nodes in the cluster have indicated that the node is derelict, by sending a node_liveness_report in which the node in question has been unresponsive for longer than the autodecommission timeout. partition_balancer_planner will not attempt to auto decom a node if any node is currently decommissioning. It will only attempt to decommission one node at a time instead of decommissioning all nodes it finds to be derelict.
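The majority rule in this commit message could look roughly like the Python sketch below. It is a hedged illustration only: `find_derelict_nodes`, the nested-mapping representation of the liveness reports, and the choice of `cluster_size // 2 + 1` as the majority threshold are assumptions, not the planner's actual C++ code.

```python
from typing import Dict, List, Mapping


def find_derelict_nodes(
    reports: Mapping[int, Mapping[int, float]],
    cluster_size: int,
    now: float,
    timeout_sec: float,
) -> List[int]:
    """Return node ids that a majority of the cluster reports as unresponsive.

    `reports` maps reporter node id -> {peer node id -> last seen time}.
    """
    votes: Dict[int, int] = {}
    for reporter, last_seen in reports.items():
        for peer, seen_at in last_seen.items():
            # A reporter votes against a peer it has not heard from
            # for longer than the timeout.
            if peer != reporter and now - seen_at > timeout_sec:
                votes[peer] = votes.get(peer, 0) + 1
    # Assumption: the majority is taken over all nodes in the cluster,
    # including nodes that sent no report.
    majority = cluster_size // 2 + 1
    return sorted(node for node, count in votes.items() if count >= majority)
```

A node that only one or two peers have lost contact with does not reach the threshold in a five-node cluster, which is the property that protects against decommissioning on the view of a single broker.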

Adds unit tests for the partition_balancer_planner's logic of determining when to auto decommission a node.

Adds two tests. 1. Smoke test: check that we can auto decom a node once it exceeds the auto decom timeout. 2. Reset test: check that node restarts DO reset the timer on auto decommissioning.
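The property these two tests check can be modelled with a toy tracker, sketched here in Python. This is not the test code from the PR; `LastSeenTracker` and its methods are hypothetical, and the real tests presumably run against actual brokers.

```python
from typing import Dict


class LastSeenTracker:
    """Toy model: a node is past the timeout once it has been silent too long."""

    def __init__(self, timeout_sec: float) -> None:
        self.timeout_sec = timeout_sec
        self._last_seen: Dict[int, float] = {}

    def heard_from(self, node_id: int, now: float) -> None:
        # Any contact, including after a restart, resets the timer.
        self._last_seen[node_id] = now

    def past_timeout(self, node_id: int, now: float) -> bool:
        seen = self._last_seen.get(node_id)
        return seen is not None and now - seen > self.timeout_sec


tracker = LastSeenTracker(timeout_sec=60)
tracker.heard_from(3, now=0)
# Smoke case: silent for longer than the timeout.
smoke = tracker.past_timeout(3, now=61)
# Reset case: the node comes back at t=50, so t=100 is only 50s of silence.
tracker.heard_from(3, now=50)
reset = tracker.past_timeout(3, now=100)
```

Here `smoke` is true and `reset` is false, mirroring the two scenarios named in the commit message.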

dotnwat

@joe-redpanda please update this PR cover letter to explain what the PR is doing, give it a meaningful title, and link to any related JIRA tickets.

david-yu · 2026-01-13T18:06:44Z

@joe-redpanda What are the chances of us being able to backport this to 25.3, 25.2, and 25.1? The reason we ask is that we could entirely replace the logic we have in our decommission controller, as per @andrewstucki, which would be huge for us in terms of maintenance as this approach is much cleaner.

joe-redpanda · 2026-01-13T18:21:48Z

Backports are almost exclusively reserved for high impact bug fixes. I would say backporting a feature, and especially one of this magnitude, is not a good idea.

We should also give this time to bake so that any unexpected behavior can surface. I would like to avoid a situation where automatic decommission somehow triggers prematurely, leading to a Kubernetes cluster with a pod that has been kicked out of the underlying Redpanda cluster, because I imagine there are no automatic mitigation pathways that will correct this.

david-yu · 2026-01-13T18:27:17Z

Ok thank you I'll take your lead on this one. Not urgent but we thought it was worth asking.

Redpanda 26.1 adds native support for automatically decommissioning nodes that have been unavailable for a configurable timeout via the partition_autobalancing_node_autodecommission_time cluster property (redpanda-data/redpanda#28946). This eliminates the need for the operator's own ghost node ejection controller. Enable this feature by default with a 30-minute timeout (1800s) when managing Redpanda 26.1+ clusters. The timeout is only effective when partition_autobalancing_mode is set to "continuous" and can be overridden by users via config.cluster or config.extraClusterConfiguration. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
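The defaulting behaviour this commit message describes might be sketched as below. This Python snippet is illustrative only: the function names, the `(major, minor)` version tuple, and the flat dict standing in for the merged cluster configuration are assumptions, and the property name is taken as written in the commit message.

```python
from typing import Any, Dict, Tuple

# Property names as they appear in the commit message above.
TIMEOUT_PROPERTY = "partition_autobalancing_node_autodecommission_time"
MODE_PROPERTY = "partition_autobalancing_mode"
DEFAULT_TIMEOUT_SEC = 1800  # 30 minutes


def with_autodecommission_default(
    version: Tuple[int, int],
    user_config: Dict[str, Any],
) -> Dict[str, Any]:
    """Add the default timeout for 26.1+ clusters unless the user set one."""
    merged = dict(user_config)
    if version >= (26, 1):
        # setdefault keeps any user-supplied override intact.
        merged.setdefault(TIMEOUT_PROPERTY, DEFAULT_TIMEOUT_SEC)
    return merged


def autodecommission_effective(config: Dict[str, Any]) -> bool:
    """The timeout only takes effect in continuous balancing mode."""
    return TIMEOUT_PROPERTY in config and config.get(MODE_PROPERTY) == "continuous"
```

With this shape, a user value such as 600 survives the merge, and clusters older than 26.1 are left without the property.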

github-actions Bot added the area/redpanda label on Dec 11, 2025

joe-redpanda force-pushed the decom_old_nodes branch 3 times, most recently from bf81399 to 4006e5b, on December 15, 2025 23:50

github-actions Bot added the area/build label on Dec 15, 2025

joe-redpanda force-pushed the decom_old_nodes branch 7 times, most recently from efb685e to 818194f, on December 18, 2025 19:38