[CORE-14581] cluster: partition_balancer multiple fibers in do_tick() check#28460
Pull Request Overview
This PR adds a defensive check to prevent multiple concurrent executions of the partition balancer's tick operation. The change introduces an early validation to detect when do_tick() is already running and immediately abort with an error rather than allowing potentially overlapping executions.
Key Changes:
- Added a guard check at the start of the tick lambda to detect concurrent tick execution attempts
    if (_tick_in_progress) {
        vlog(clusterlog.error, "tick already in progress!");
        throw balancer_tick_aborted_exception(
          "tick already in progress");
    }
The check for _tick_in_progress happens after the fiber has been spawned, which creates a race condition. If two ticks are scheduled concurrently, both fibers could read _tick_in_progress as false before either sets it to true. Move this check outside of spawn_with_gate_then or use proper synchronization (e.g., atomic compare-and-exchange) to prevent the race.
/ci-repeat 5
Make this air-tight.
ead50ec to 98fae50
/ci-repeat 5
CI test results
test results on build#75956
test results on build#75974
nice!
98fae50 to 0820cfb
cluster: partition_balancer multiple fibers in do_tick() check
    [this] {
        if (_tick_in_progress) {
            throw balancer_tick_aborted_exception(
              "tick already in progress");
This is going to give a confusing log message right? I wonder if we should just debug log and then return instead.
It will just be a log at INFO level, so I don't think it's too confusing, but it is perhaps a better fit for DEBUG.
Pushed an update to log at DEBUG and return early instead of using exception handling to log at INFO.
Check the value of `_tick_in_progress` before attempting a tick. Having multiple fibers in this function at once is an issue.
e7af181 replaces 0820cfb (0820cfb to e7af181)
/backport v25.2.x
/backport v25.1.x
Previously, we saw a SIGILL in CI that traces back to this lambda in `partition_balancer_backend`:
redpanda/src/v/cluster/partition_balancer_backend.cc
Lines 467 to 484 in 2ac5cb7
The issue here appears to be multiple fibers entering the continuation within `tick()`: this could easily result in `_tick_in_progress` being unset in one fiber's invocation while the other fiber is still working:
redpanda/src/v/cluster/partition_balancer_backend.cc
Lines 271 to 281 in 2ac5cb7
Prevent concurrent access in `partition_balancer_backend::do_tick()` by assigning and checking `_tick_in_progress` up front.
Fixes https://redpandadata.atlassian.net/browse/CORE-14581.
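For readers without the tree checked out, here is a minimal sketch of the check-and-assign-up-front pattern described above. It is not the code from `partition_balancer_backend.cc`: `balancer_sketch`, `reset_on_exit`, and `demo_reentrant_tick` are hypothetical names, and a plain re-entrant callback stands in for a second fiber arriving while the first one is suspended. It assumes a single cooperative thread, so no atomics are used.

```cpp
#include <cassert>
#include <functional>

// Hypothetical stand-in for the balancer backend; not the Redpanda class.
class balancer_sketch {
    // Clears the flag when the owning tick leaves scope, including on exceptions.
    struct reset_on_exit {
        bool& flag;
        ~reset_on_exit() { flag = false; }
    };

public:
    // Returns true if the tick body ran, false if the call was skipped
    // because another tick was already in progress.
    bool tick(const std::function<void()>& body) {
        if (_tick_in_progress) {
            // Corresponds to "log at DEBUG and return early" in the review.
            ++_skipped;
            return false;
        }
        // Assign the flag before any point where another caller could enter.
        _tick_in_progress = true;
        reset_on_exit guard{_tick_in_progress};
        body(); // stands in for the suspension points inside the real tick
        ++_completed;
        return true;
    }

    bool in_progress() const { return _tick_in_progress; }
    int completed() const { return _completed; }
    int skipped() const { return _skipped; }

private:
    bool _tick_in_progress = false;
    int _completed = 0;
    int _skipped = 0;
};

// Simulates a second caller entering tick() while the first is mid-tick.
// Returns true if the inner call was rejected and the outer call completed.
inline bool demo_reentrant_tick() {
    balancer_sketch b;
    bool inner_ran = true;
    bool outer_ran = b.tick([&] { inner_ran = b.tick([] {}); });
    assert(!b.in_progress()); // only the owning tick cleared the flag
    return outer_ran && !inner_ran;
}
```

Because only the call that set the flag owns the guard, a rejected caller never clears `_tick_in_progress` while the first tick is still working, which is the failure mode the description points at.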
Backports Required
Release Notes
Bug Fixes
partition_balancer_backend.