partition_balancer: report quorum loss moves as... #28393

Merged
joe-redpanda merged 1 commit into redpanda-data:dev from joe-redpanda:deflake_decom_test
Nov 21, 2025

@joe-redpanda joe-redpanda commented Nov 6, 2025

Contributor

...immutable

nodes_decommissioning_test.py::NodesDecommissioningTest.test_decommissioning_node_rf_1_replica would periodically fail because partitions were not reported as allocation failures. The cause was a race: a partition would NOT be reported as an allocation failure if it had a move in progress.

In this test, the node is stopped and then decommissioned. As a result, the broker could be picked up as unresponsive, which would initiate a move before the decommission became visible to the planner.

This commit changes the partition balancer planner (pbp) to report in-progress moves with quorum loss on the original replica set as immutable.

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v25.3.x
  • v25.2.x
  • v25.1.x
  • v24.3.x

Release Notes

Improvements

  • partitions with an in-flight move and original replica quorum loss will now be reported as immutable

@joe-redpanda

Contributor Author

/ci-repeat 1

@vbotbuildovich

vbotbuildovich commented Nov 6, 2025

Collaborator

CI test results

test results on build#75724
| test_class | test_method | test_arguments | test_kind | job_url | test_status | passed | reason | test_history |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ShadowLinkingReplicationTests | test_topic_delete | {"source_cluster_spec": {"cluster_type": "kafka", "kafka_quorum": "COMBINED_KRAFT", "kafka_version": "3.8.0"}} | integration | https://buildkite.com/redpanda/redpanda/builds/75724#019a57b8-c47a-440a-b737-fee55e0619b9 | FLAKY | 20/21 | upstream reliability is '99.7584541062802'. current run reliability is '95.23809523809523'. drift is 4.52036 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkingReplicationTests&test_method=test_topic_delete |
| WriteCachingFailureInjectionE2ETest | test_crash_all | {"use_transactions": false} | integration | https://buildkite.com/redpanda/redpanda/builds/75724#019a57b8-c484-471b-9366-78f92601c7dd | FLAKY | 19/21 | upstream reliability is '90.0925925925926'. current run reliability is '90.47619047619048'. drift is -0.3836 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=WriteCachingFailureInjectionE2ETest&test_method=test_crash_all |
test results on build#75764
| test_class | test_method | test_arguments | test_kind | job_url | test_status | passed | reason | test_history |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ShadowLinkingReplicationTests | test_replication_basic | {"shuffle_leadership": true, "source_cluster_spec": {"cluster_type": "redpanda"}} | integration | https://buildkite.com/redpanda/redpanda/builds/75764#019a5a01-c31c-46e3-a17e-95a99496a739 | FLAKY | 19/21 | upstream reliability is '96.55581947743468'. current run reliability is '90.47619047619048'. drift is 6.07963 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkingReplicationTests&test_method=test_replication_basic |
| RedpandaNodeOperationsSmokeTest | test_node_ops_smoke_test | {"cloud_storage_type": 1, "mixed_versions": true} | integration | https://buildkite.com/redpanda/redpanda/builds/75764#019a5a01-c31b-4fae-b894-8533847f43aa | FLAKY | 20/21 | upstream reliability is '97.796817625459'. current run reliability is '95.23809523809523'. drift is 2.55872 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=RedpandaNodeOperationsSmokeTest&test_method=test_node_ops_smoke_test |
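
The reliability and drift figures in the CI results above can be reproduced by hand. The sketch below assumes that current run reliability is `passed / total * 100` and that drift is upstream reliability minus current run reliability, which matches the reported numbers. The function names and the `<=` comparison against the allowed drift are illustrative assumptions, not the bot's actual implementation.

```python
def reliability_pct(passed: int, total: int) -> float:
    """Current run reliability as a percentage, e.g. 20/21 runs passed."""
    return 100.0 * passed / total


def drift(upstream_pct: float, current_pct: float) -> float:
    """Assumed definition: how far the current run falls below upstream."""
    return upstream_pct - current_pct


def should_pass(upstream_pct: float, passed: int, total: int,
                allowed_drift: float = 50.0) -> bool:
    """Assumed rule: a flaky test passes while drift stays within the allowance."""
    return drift(upstream_pct, reliability_pct(passed, total)) <= allowed_drift


# test_topic_delete on build#75724: 20/21 passed, upstream 99.7584541062802
d = drift(99.7584541062802, reliability_pct(20, 21))
print(f"drift = {d:.5f}")
```

A negative drift, as reported for test_crash_all, means the current run was slightly more reliable than the upstream history.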

@joe-redpanda joe-redpanda marked this pull request as ready for review November 6, 2025 15:58

Copilot AI left a comment

Contributor

Pull Request Overview

This PR fixes a race condition in the partition balancer planner where partitions that have lost quorum during an in-progress move were not being reported as allocation failures. The fix ensures that partitions with an in-flight move and quorum loss on the original replica set are now correctly reported as immutable.

Key Changes:

  • Added quorum loss detection for partitions during in-progress moves
  • Partitions that lost quorum during moves are now reported as immutable with no_quorum reason
  • Restructured control flow to handle the new quorum loss case before attempting cancellations
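
The decision order described in the key changes can be sketched as follows. This is a simplified Python model, not the C++ in partition_balancer_planner.cc: `InFlightMove`, `Action`, `has_quorum`, and `plan_in_flight_move` are hypothetical names, quorum is assumed to mean a strict majority of the original replica set on available nodes, and the cancellation condition (a target replica on an unavailable node) is an assumption made only to show that the quorum check comes first.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet, Optional, Tuple


class Action(Enum):
    REPORT_IMMUTABLE = "report_immutable"
    CANCEL_MOVE = "cancel_move"
    LEAVE_ALONE = "leave_alone"


@dataclass(frozen=True)
class InFlightMove:
    original_replicas: FrozenSet[int]  # node ids before the move started
    target_replicas: FrozenSet[int]    # node ids the move is heading to


def has_quorum(replicas: FrozenSet[int], unavailable: FrozenSet[int]) -> bool:
    """Assumed quorum rule: a strict majority of replicas is on available nodes."""
    alive = len(replicas - unavailable)
    return alive >= len(replicas) // 2 + 1


def plan_in_flight_move(
    move: InFlightMove, unavailable: FrozenSet[int]
) -> Tuple[Action, Optional[str]]:
    # Quorum loss on the original replica set is checked first: such a
    # partition is reported as immutable instead of being skipped.
    if not has_quorum(move.original_replicas, unavailable):
        return (Action.REPORT_IMMUTABLE, "no_quorum")
    # Only then is cancellation considered (hypothetical condition).
    if move.target_replicas & unavailable:
        return (Action.CANCEL_MOVE, None)
    return (Action.LEAVE_ALONE, None)


# RF=1 partition whose only replica sits on the stopped node 1.
rf1_move = InFlightMove(frozenset({1}), frozenset({2}))
print(plan_in_flight_move(rf1_move, unavailable=frozenset({1})))
```

In the RF=1 case from the failing test, the single original replica is on the stopped node, so the sketch reports the partition as immutable rather than leaving it unreported while the move is in flight.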

Comment thread src/v/cluster/partition_balancer_planner.cc

@bharathv bharathv left a comment

Contributor

link the jira?

@joe-redpanda

Contributor Author

@joe-redpanda joe-redpanda merged commit 9947b85 into redpanda-data:dev Nov 21, 2025
22 of 24 checks passed
@vbotbuildovich

Collaborator

/backport v25.3.x

@vbotbuildovich

Collaborator

/backport v25.2.x

@vbotbuildovich

Collaborator

/backport v25.1.x

@vbotbuildovich

Collaborator

/backport v24.3.x
