Skip to content

[CORE-14581] cluster: partition_balancer multiple fibers in do_tick() check#28460

Merged
WillemKauf merged 2 commits into
redpanda-data:devfrom
WillemKauf:partition_balancer_check
Nov 11, 2025
Merged

[CORE-14581] cluster: partition_balancer multiple fibers in do_tick() check#28460
WillemKauf merged 2 commits into
redpanda-data:devfrom
WillemKauf:partition_balancer_check

Conversation

@WillemKauf

@WillemKauf WillemKauf commented Nov 10, 2025

Copy link
Copy Markdown
Contributor

Previously, we saw a SIGILL in CI that traces back to this lambda in partition_balancer_backend:

co_await ss::max_concurrent_for_each(
nodes_to_finish, 32, [this](model::node_id node) {
_tick_in_progress->check();
return _members_frontend
.finish_node_reallocations(node, _cur_term->id)
.then([node](auto errc) {
if (errc) {
vlog(
clusterlog.warn,
"submitting finish_reallocations for node {} failed, "
"error: {}",
node,
errc.message());
}
});
});
}

The issue here appears to be multiple fibers entering the continuation within tick(): this could easily result in _tick_in_progress being unset in one fiber's invocation while the other fiber is still working:

void partition_balancer_backend::tick() {
ssx::background
= ssx::spawn_with_gate_then(
_gate,
[this] {
return do_tick().finally([this] {
_tick_in_progress = {};
maybe_rearm_timer(
_cur_term && _cur_term->_force_health_report_refresh);
});
})

Prevent concurrent access in partition_balancer_backend::do_tick() by assigning and checking _tick_in_progress up front.

Fixes https://redpandadata.atlassian.net/browse/CORE-14581.

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v25.3.x
  • v25.2.x
  • v25.1.x
  • v24.3.x

Release Notes

Bug Fixes

  • Prevent a very narrow race condition in the partition_balancer_backend.

Copilot AI review requested due to automatic review settings November 10, 2025 20:27

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR adds a defensive check to prevent multiple concurrent executions of the partition balancer's tick operation. The change introduces an early validation to detect when do_tick() is already running and immediately abort with an error rather than allowing potentially overlapping executions.

Key Changes:

  • Added a guard check at the start of the tick lambda to detect concurrent tick execution attempts

Comment on lines +276 to +280
if (_tick_in_progress) {
vlog(clusterlog.error, "tick already in progress!");
throw balancer_tick_aborted_exception(
"tick already in progress");
}

Copilot AI Nov 10, 2025

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The check for _tick_in_progress happens after the fiber has been spawned, which creates a race condition. If two ticks are scheduled concurrently, both fibers could read _tick_in_progress as false before either sets it to true. Move this check outside of spawn_with_gate_then or use proper synchronization (e.g., atomic compare-and-exchange) to prevent the race.

Copilot uses AI. Check for mistakes.
@WillemKauf

Copy link
Copy Markdown
Contributor Author

/ci-repeat 5
dt-repeat=20
skip-units
tests/rptest/transactions/tx_atomic_produce_consume_test.py

@WillemKauf WillemKauf force-pushed the partition_balancer_check branch from ead50ec to 98fae50 Compare November 10, 2025 20:56
@WillemKauf

Copy link
Copy Markdown
Contributor Author

/ci-repeat 5
dt-repeat=20
skip-units
tests/rptest/transactions/tx_atomic_produce_consume_test.py

@vbotbuildovich

vbotbuildovich commented Nov 10, 2025

Copy link
Copy Markdown
Collaborator

CI test results

test results on build#75956
test_class test_method test_arguments test_kind job_url test_status passed reason test_history
TxAtomicProduceConsumeTest test_basic_tx_consumer_transform_produce {"with_failures": true} integration https://buildkite.com/redpanda/redpanda/builds/75956#019a6fb7-b878-45cf-b165-553455581e6e FLAKY 39/40 upstream reliability is '99.55156950672645'. current run reliability is '95.23809523809523'. drift is 4.31347 and the allowed drift is set to 50. The test should PASS https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=TxAtomicProduceConsumeTest&test_method=test_basic_tx_consumer_transform_produce
TxAtomicProduceConsumeTest test_basic_tx_consumer_transform_produce {"with_failures": true} integration https://buildkite.com/redpanda/redpanda/builds/75956#019a6fb7-b87b-4023-91a7-5c2c754781a6 FLAKY 39/40 upstream reliability is '99.5505617977528'. current run reliability is '95.23809523809523'. drift is 4.31247 and the allowed drift is set to 50. The test should PASS https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=TxAtomicProduceConsumeTest&test_method=test_basic_tx_consumer_transform_produce
TxAtomicProduceConsumeTest test_basic_tx_consumer_transform_produce {"with_failures": true} integration https://buildkite.com/redpanda/redpanda/builds/75956#019a6fea-cd8c-4f0d-8010-a5bcbb90379d FLAKY 39/40 upstream reliability is '100.0'. current run reliability is '95.23809523809523'. drift is 4.7619 and the allowed drift is set to 50. The test should PASS https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=TxAtomicProduceConsumeTest&test_method=test_basic_tx_consumer_transform_produce
test results on build#75974
test_class test_method test_arguments test_kind job_url test_status passed reason test_history
ShadowLinkingReplicationTests test_replication_with_failures null integration https://buildkite.com/redpanda/redpanda/builds/75974#019a7044-d3c7-405b-a0ae-c1e58d767689 FLAKY 20/21 upstream reliability is '96.73704414587331'. current run reliability is '95.23809523809523'. drift is 1.49895 and the allowed drift is set to 50. The test should PASS https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkingReplicationTests&test_method=test_replication_with_failures
ShadowLinkingReplicationTests test_replication_with_failures null integration https://buildkite.com/redpanda/redpanda/builds/75974#019a7046-2504-49e9-b04d-b5fdad32213e FLAKY 20/21 upstream reliability is '96.74952198852772'. current run reliability is '95.23809523809523'. drift is 1.51143 and the allowed drift is set to 50. The test should PASS https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkingReplicationTests&test_method=test_replication_with_failures

@WillemKauf

Copy link
Copy Markdown
Contributor Author

https://ci-artifacts.dev.vectorized.cloud/production-devprod/redpanda/75956/019a6fea-cd8c-4f0d-8010-a5bcbb90379d/vbuild/ducktape/results/2025-11-10--001/report.html

rptest.services.utils.BadLogLines: <BadLogLines nodes=docker-rp-10(2) example="ERROR 2025-11-10 23:02:24,082 [shard 0:clus] cluster - partition_balancer_backend.cc:277 - tick already in progress!">

nice!

@WillemKauf WillemKauf force-pushed the partition_balancer_check branch from 98fae50 to 0820cfb Compare November 10, 2025 23:44
@WillemKauf WillemKauf changed the title cluster: partition_balancer multiple fibers in do_tick() check [CORE-14581] cluster: partition_balancer multiple fibers in do_tick() check Nov 10, 2025
rockwotj
rockwotj previously approved these changes Nov 11, 2025
[this] {
if (_tick_in_progress) {
throw balancer_tick_aborted_exception(
"tick already in progress");

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is going to give a confusing log message right? I wonder if we should just debug log and then return instead.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it will just be a log at INFO level, not too confusing i don't think, but also perhaps more fit for DEBUG.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pushed an update to log at DEBUG and early return instead of use exception handling to log at INFO.

Check the value of `_tick_in_progress` before attempting a tick.
Multiple concurrent fibers in this function at once is an issue.
@WillemKauf WillemKauf enabled auto-merge November 11, 2025 14:51
@WillemKauf WillemKauf merged commit 7905c38 into redpanda-data:dev Nov 11, 2025
18 checks passed
@vbotbuildovich

Copy link
Copy Markdown
Collaborator

/backport v25.2.x

@vbotbuildovich

Copy link
Copy Markdown
Collaborator

/backport v25.1.x

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants