Decom old nodes: CORE-7111 #28946

Merged
joe-redpanda merged 10 commits into redpanda-data:dev from joe-redpanda:decom_old_nodes on Jan 9, 2026

@joe-redpanda joe-redpanda commented Dec 11, 2025

Contributor

Feature PR for allowing brokers to automatically decommission brokers which have been unavailable for a certain timeout.

Adds last seen to a new report in the nodewise health report.

Processes last seen reports to find nodes which are past the decommission timeout on a quorum of nodes.

This creates a list of auto-decommission candidates, one of which (the one with the lowest node ID) is selected to be automatically decommissioned. Nothing is submitted for auto decommission if there is an ongoing decommission.
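The selection rule described above can be sketched in a few lines. This is an illustrative Python model, not the C++ code in partition_balancer_planner.cc: the function name, the report layout (reporter ID mapped to per-peer "seconds since last seen"), and the choice to count a quorum as a strict majority of all cluster members are assumptions made for the example.

```python
from collections import Counter
from typing import Mapping, Optional, Set


def pick_auto_decommission_target(
    reports: Mapping[int, Mapping[int, float]],
    cluster_size: int,
    timeout_sec: float,
    decommissioning: Set[int],
) -> Optional[int]:
    """Hypothetical helper: reports[reporter][peer] is the number of
    seconds since `reporter` last heard from `peer`."""
    # Never start an auto decommission while another one is in progress.
    if decommissioning:
        return None

    # Count how many reporters consider each peer past the timeout.
    votes: Counter = Counter()
    for reporter, ages in reports.items():
        for peer, age_sec in ages.items():
            if peer != reporter and age_sec > timeout_sec:
                votes[peer] += 1

    # Assumption: quorum is a strict majority of all cluster members.
    quorum = cluster_size // 2 + 1
    candidates = [node for node, count in votes.items() if count >= quorum]

    # Only one node is submitted at a time: the lowest node ID.
    return min(candidates) if candidates else None


# Five-node cluster; nodes 3 and 4 are down, so only 0, 1 and 2 report.
reports = {
    0: {1: 1.0, 2: 2.0, 3: 900.0, 4: 950.0},
    1: {0: 1.0, 2: 1.5, 3: 905.0, 4: 960.0},
    2: {0: 2.0, 1: 1.0, 3: 910.0, 4: 970.0},
}
target = pick_auto_decommission_target(
    reports, cluster_size=5, timeout_sec=600.0, decommissioning=set()
)
```

Here both nodes 3 and 4 are past the timeout on three of five nodes, so both are candidates, and only the lower ID is returned.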

Completes https://redpandadata.atlassian.net/browse/CORE-7111

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v25.3.x
  • v25.2.x
  • v25.1.x

Release Notes

Improvements

  • Allow brokers to automatically decommission nodes that have been unavailable for longer than a configurable timeout

@joe-redpanda joe-redpanda force-pushed the decom_old_nodes branch 3 times, most recently from bf81399 to 4006e5b Compare December 15, 2025 23:50
@joe-redpanda joe-redpanda force-pushed the decom_old_nodes branch 7 times, most recently from efb685e to 818194f Compare December 18, 2025 19:38
Comment thread src/v/cluster/partition_balancer_planner.cc Outdated
Comment thread src/v/cluster/health_monitor_types.cc Outdated
Comment thread src/v/cluster/health_monitor_types.h Outdated
Comment thread src/v/cluster/tests/health_bench.cc Outdated
Comment thread src/v/cluster/tests/health_monitor_bench.cc Outdated
Comment thread src/v/cluster/tests/partition_balancer_simulator_test.cc Outdated
Comment thread src/v/cluster/tests/randoms.h Outdated
Comment thread src/v/cluster/tests/serialization_rt_test.cc Outdated
Comment thread src/v/cluster/controller.cc Outdated
Comment thread src/v/cluster/partition_balancer_planner.cc Outdated
Comment thread tests/rptest/tests/auto_decommission_test.py Outdated
Comment thread tests/rptest/tests/auto_decommission_test.py Outdated
@joe-redpanda joe-redpanda force-pushed the decom_old_nodes branch 3 times, most recently from 2260824 to 6e9fb88 Compare December 19, 2025 02:47
Comment thread src/v/config/configuration.cc Outdated
Comment thread tests/rptest/tests/auto_decommission_test.py Outdated
Comment thread src/v/utils/to_string.h Outdated
Comment thread src/v/cluster/partition_balancer_planner.cc Outdated
Comment thread src/v/cluster/partition_balancer_planner.cc Outdated
Comment thread src/v/cluster/health_monitor_types.h
Comment thread src/v/cluster/health_monitor_types.h
Comment thread src/v/cluster/partition_balancer_planner.cc Outdated
@joe-redpanda

Contributor Author

Rebased onto dev to cut my build times; no relevant changes.

Comment thread src/v/cluster/partition_balancer_planner.cc Outdated
Adds a convenience formatter for flat hash set so that it can be easily
logged.
Adds partition_autobalancing_node_autodecommission_timeout_sec
which is the time in seconds after which partition balancer
planner should begin decommissioning a node which is
unresponsive.
Wires partition_autobalancing_node_autodecommission_time into partition
balancer.

This commit adds the most basic implementation of auto decommissioning,
which is based on the last seen time from the perspective of the current
controller broker. This implementation will run into problems when
controller leadership changes.

In future commits, this will be changed for a coordinated approach where
the partition_balancer_planner will instead use the cluster health
report to seek the consent of a quorum of nodes before decommissioning a
broker.
Adds node_status to health monitor backend. This will be
used in future commits to create an auto decom status report.
Adds a new struct to health_monitor_types: node_liveness_report.
This part of the health report describes inter-node connectivity by
recording when each node last heard from every other node.

Adds this struct to node_health_report and node_health_report_serde.

Adds build fixes needed given the above.
Adds population of node liveness report from the node_status_table.
This will iterate over all cluster members, fetching their last seen
from the node status table (if present) and populating the
node_liveness_report with the results.
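The population step described in this commit message amounts to a filtered lookup over the member list. The Python sketch below models it with plain dictionaries; `build_liveness_report` and the dictionary shapes are hypothetical stand-ins for the real node_status_table and node_liveness_report types, which are C++ structures not shown on this page.

```python
from typing import Dict, Iterable, Mapping


def build_liveness_report(
    members: Iterable[int],
    status_table: Mapping[int, float],
) -> Dict[int, float]:
    """Hypothetical model: status_table maps node ID to a last-seen
    timestamp; the result holds one entry per known cluster member."""
    report: Dict[int, float] = {}
    for node_id in members:
        last_seen = status_table.get(node_id)
        # Members with no entry in the status table are skipped
        # ("if present" in the commit message).
        if last_seen is not None:
            report[node_id] = last_seen
    return report


# Node 1 has no status entry yet; node 7 is not a cluster member.
liveness = build_liveness_report(
    members=[0, 1, 2],
    status_table={0: 10.0, 2: 30.0, 7: 5.0},
)
```

Iterating over members rather than over the status table keeps entries for nodes that are no longer cluster members out of the report.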
to config

Adds node_autodecommission_timeout to
partition_balancer_planner::planner_config such that it can later be used
in auto decommission logic.

Adds necessary build fixes for this.
Adds the logic which will perform coordinated auto decommission of nodes
which have exceeded their auto decom timeout.

Now, nodes will only be slated for auto decommission if a majority of
the nodes in the cluster have indicated that the node is derelict, by
sending a node_liveness_report where the node in question has been
unresponsive for longer than the autodecommission timeout.

partition_balancer_planner will not attempt to auto decom a node if
any node is currently decommissioning.

It will only attempt to decommission one node at a time instead of
decommissioning all nodes it finds to be derelict.
Adds unit tests for the partition_balancer_planner's logic of
determining when to auto decommission a node.
Adds two tests.
1. smoke test: check that we can auto decom a node if it exceeds the
   auto decom timeout
2. reset test: check that node restarts DO reset the timer on auto
   decommissioning
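The two behaviours these tests check can be illustrated with a toy model of the last-seen timer. This is not the ducktape test in tests/rptest/tests/auto_decommission_test.py; `LastSeenTracker` and its methods are hypothetical names, and treating a never-seen node as "not past the timeout" is an assumption of the sketch.

```python
from typing import Dict


class LastSeenTracker:
    """Toy model of a per-node last-seen timer (hypothetical)."""

    def __init__(self, timeout_sec: float) -> None:
        self.timeout_sec = timeout_sec
        self._last_seen: Dict[int, float] = {}

    def heard_from(self, node_id: int, now: float) -> None:
        # Any contact, including the first one after a restart,
        # resets the timer for that node.
        self._last_seen[node_id] = now

    def is_past_timeout(self, node_id: int, now: float) -> bool:
        last = self._last_seen.get(node_id)
        if last is None:
            # Assumption: a node that was never seen is not a candidate.
            return False
        return now - last > self.timeout_sec


tracker = LastSeenTracker(timeout_sec=60.0)
tracker.heard_from(3, now=0.0)

# Smoke case: node 3 stays silent past the timeout.
smoke = tracker.is_past_timeout(3, now=61.0)

# Reset case: node 3 restarts at t=50, so at t=100 it is within the timeout.
tracker.heard_from(3, now=50.0)
after_restart = tracker.is_past_timeout(3, now=100.0)
```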
@joe-redpanda joe-redpanda merged commit 244dcba into redpanda-data:dev Jan 9, 2026
21 checks passed

@dotnwat dotnwat left a comment

Member

@joe-redpanda please update this PR cover letter to explain what the PR is doing, give it a meaningful title, and link to any related JIRA tickets.

@joe-redpanda joe-redpanda changed the title from "Decom old nodes" to "Decom old nodes: CORE-7111" on Jan 12, 2026
@david-yu

Contributor

@joe-redpanda What are the chances of us being able to backport this to 25.3, 25.2, and 25.1? We ask because, as per @andrewstucki, we could entirely replace the logic we have in our decommission controller, which would be huge for us in terms of maintenance, as this approach is much cleaner.

@joe-redpanda

Contributor Author

> @joe-redpanda What are the chances of us being able to backport this to 25.3, 25.2, and 25.1? We ask because, as per @andrewstucki, we could entirely replace the logic we have in our decommission controller, which would be huge for us in terms of maintenance, as this approach is much cleaner.

Backports are almost exclusively reserved for high-impact bug fixes. I would say backporting a feature, and especially one of this magnitude, is not a good idea.

We should also give this time to bake so that any unexpected behavior can surface. I would like to avoid a situation where automatic decommission somehow triggers prematurely, leading to a Kubernetes cluster with a pod that has been kicked out of the underlying Redpanda cluster, because I imagine there are no automatic mitigation pathways that will correct this.

@david-yu

Contributor

OK, thank you. I'll take your lead on this one. Not urgent, but we thought it was worth asking.

david-yu added a commit to redpanda-data/redpanda-operator that referenced this pull request Mar 24, 2026
Redpanda 26.1 adds native support for automatically decommissioning
nodes that have been unavailable for a configurable timeout via the
partition_autobalancing_node_autodecommission_time cluster property
(redpanda-data/redpanda#28946). This eliminates the need for the
operator's own ghost node ejection controller.

Enable this feature by default with a 30-minute timeout (1800s) when
managing Redpanda 26.1+ clusters. The timeout is only effective when
partition_autobalancing_mode is set to "continuous" and can be overridden
by users via config.cluster or config.extraClusterConfiguration.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
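
The defaulting rule in the commit message above can be modelled as a small function. This is a hedged sketch, not the operator's actual code: the function name, the version tuple, and the flat `cluster_config` dictionary are assumptions, and the sketch only looks at an explicitly set partition_autobalancing_mode rather than Redpanda's own default for that property.

```python
from typing import Mapping, Optional, Tuple

# Property names and the 1800s default are taken from the commit message.
MODE_PROP = "partition_autobalancing_mode"
TIMEOUT_PROP = "partition_autobalancing_node_autodecommission_time"
DEFAULT_TIMEOUT_SEC = 1800


def effective_autodecommission_timeout(
    version: Tuple[int, int],
    cluster_config: Mapping[str, object],
) -> Optional[int]:
    """Hypothetical helper: returns the timeout in seconds that would
    take effect, or None when auto decommission is not in play."""
    # The property only exists from Redpanda 26.1 onwards.
    if version < (26, 1):
        return None
    # The timeout is only effective in continuous balancing mode.
    if cluster_config.get(MODE_PROP) != "continuous":
        return None
    # A user-supplied value overrides the 30-minute default.
    return int(cluster_config.get(TIMEOUT_PROP, DEFAULT_TIMEOUT_SEC))


default_case = effective_autodecommission_timeout(
    (26, 1), {MODE_PROP: "continuous"}
)
override_case = effective_autodecommission_timeout(
    (26, 1), {MODE_PROP: "continuous", TIMEOUT_PROP: 3600}
)
```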

8 participants