cluster: Add greedy topic aware balancer#29908
Conversation
There was a problem hiding this comment.
Pull request overview
Adds a new greedy topic-aware leader balancer mode intended to produce stable, per-topic leader distributions (useful for benchmarks and reducing run-to-run variance), and wires it into configuration and tests.
Changes:
- Introduce
greedyas a newleader_balancer_modeand route leader balancer execution to a new greedy strategy implementation. - Add extensive unit tests for greedy behavior (topic-level fairness, shard balancing, muted/skipped groups, and rack pinning interaction).
- Update rptest coverage/types to include a new ducktape test for greedy mode and tighten typing on
leadership_transfer_test.py.
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| tools/type-checking/type-check-strictness.json | Removes leadership_transfer_test.py from the non-strict list (it now passes strict type checking). |
| tests/rptest/tests/leadership_transfer_test.py | Refactors topic-aware rebalance testing and adds a greedy-mode ducktape test; improves typing/metadata parsing. |
| src/v/model/metadata.h | Adds leader_balancer_mode::greedy and string conversion support. |
| src/v/config/convert.h | Extends config parsing/validation to accept "greedy" for leader_balancer_mode. |
| src/v/config/configuration.cc | Allows greedy as a valid leader_balancer_mode value. |
| src/v/cluster/scheduling/leader_balancer_greedy.h | Declares the new greedy topic-aware strategy. |
| src/v/cluster/scheduling/leader_balancer_greedy.cc | Implements greedy pre-planned target assignment + move emission. |
| src/v/cluster/scheduling/leader_balancer.cc | Wires the new greedy mode into the balancer strategy selection. |
| src/v/cluster/tests/leader_balancer_greedy_test.cc | Adds unit test coverage for the greedy strategy. |
| src/v/cluster/tests/BUILD | Registers the new greedy gtest target. |
| src/v/cluster/BUILD | Adds greedy strategy sources/headers to the cluster library build. |
| TimeoutError: If the conditions are not met within timeout_sec | ||
| """ | ||
| excluded = set(dead_racks) if dead_racks else set() | ||
| excluded = set(dead_racks) if dead_racks else set[str]() |
There was a problem hiding this comment.
Using set[str]() here relies on calling a GenericAlias at runtime and is uncommon in the rest of the codebase. Prefer an explicit type annotation with set() (e.g., excluded: set[str] = set(dead_racks) if dead_racks else set()), which is clearer and avoids any runtime quirks across Python versions.
| excluded = set(dead_racks) if dead_racks else set[str]() | |
| excluded: set[str] = set(dead_racks) if dead_racks else set() |
7f66bb1 to
632460d
Compare
| build_target_assignment(); | ||
| } | ||
|
|
||
| double greedy_topic_aware_strategy::error() const { |
There was a problem hiding this comment.
Is this used during the balancing? It is not "topic only", not sure if that matters.
There was a problem hiding this comment.
It's only used during logging. Kinda odd I find
There was a problem hiding this comment.
Probably the balancers have outgrown the original abstraction which assumed they'd have a certain shape (and meaningful error() function, etc).
| continue; | ||
| } | ||
|
|
||
| partitions_by_topic[topic_iterator->second].push_back( |
There was a problem hiding this comment.
nit: prefer emplace back + direct construction
| } | ||
|
|
||
| // zero init such that we can get broker count easily | ||
| global_broker_counts[leader.node_id]; |
There was a problem hiding this comment.
This is suspicious. It is so global_broker_counts.size(); can be used below to get num_brokers? Amazingly, it doesn't seem like the info passed to the balancer actually contains a list of nodes anywhere, so any node which is not a leader for at least one group, will basically be ignored. It is what it is, I guess: I suppose the large number of system partitions (consumer offsets...) + luck means that each node usually leads at least 1 partition so this just works out.
There was a problem hiding this comment.
It is so global_broker_counts.size(); can be used below to get num_brokers
Yes correct
any node which is not a leader for at least one group, will basically be ignored
Indeed, though for the purposes of this it doesn't matter (can't put leadership on a node if it doesn't have any of the replicas).
There was a problem hiding this comment.
Unlike partition balancer, leader balancer cannot move replicas across nodes, as it can only select leaders out of existing replicas. If a node does not have any replicas there's nothing we can do with it here.
We can use replicas in the inner loop to detect nodes that host only non-leader replicas. It's not uncommon, e.g. a node may have been restarted recently.
There was a problem hiding this comment.
Indeed, though for the purposes of this it doesn't matter (can't put leadership on a node if it doesn't have any of the replicas).
Right, but the current node ignores nodes which don't have any leaders, even though there may be nodes that have replicas but no leaders? That seems wrong. Same comment as Bash effectively.
There was a problem hiding this comment.
Oh right, I see what you are saying now.
So fix is either to add:
// zero init such that we can get broker count easily
for (const auto& replica : replicas) {
global_broker_counts[replica.node_id];
}
into the inner loop there or just assume it will always work like you say.
There was a problem hiding this comment.
Fixed. A lot better now.
| std::optional<std::tuple< | ||
| size_t, | ||
| size_t, | ||
| size_t, | ||
| size_t, | ||
| size_t, | ||
| model::broker_shard>> |
There was a problem hiding this comment.
A faint but unmistakable scream of struuuuuuuuct was heard...
There was a problem hiding this comment.
It may be neater to evaluate each criteria for all replicas using a utility function akin std::ranges::max_element, but only returning a non-sentinel iterator when it's strictly better than others.
There was a problem hiding this comment.
Luckily I am only like 12000km away.
| size_t, | ||
| model::broker_shard>> | ||
| best_assignment; | ||
| for (const model::broker_shard& replica : partition.replicas) { |
There was a problem hiding this comment.
Not sure if you did any perf measurements, but I would expect a min-heap to be much faster here for high total shard counts. Something to keep in mind for the future, perhaps.
There was a problem hiding this comment.
Are we talking about a min-heap of all replicas of a partition, i.e. typically size 3?
There was a problem hiding this comment.
Yes nevermind, I was thinking of "all broker-shards" but of course we are only selecting from the ~3 replicas. My comment probably applies to a hypothetical review for a non-existent change where we are doing a similar thing to the partition allocator.
| for (const model::broker_shard& replica : partition.replicas) { | ||
| size_t broker_count = assigned_broker_counts[replica.node_id]; | ||
| // Lower-is-better on every element: | ||
| // 1) excess above per-topic floor (0 while at or below |
There was a problem hiding this comment.
Does 1 even do anything given that 2 is next in the list?
Given two topics t0 and t1, if t0 has lower (1) than t1, it would also have lower (2), I think?
There was a problem hiding this comment.
No because on a topic level you can be 1 1 1 1 but possibly brokers already are 2 2 1 1 or so (this is what TwoTopicsFourBrokersThreeShards covers).
Also important to keep in mind that because of the total order as soon as an earlier comparison ranks you smaller later ones don't matter anymore. So these additional features do is just help a little to better distribute in cases as discussed in standup. A second pass would equally help.
There was a problem hiding this comment.
No because on a topic level you can be 1 1 1 1 but possibly brokers already are 2 2 1 1 or so
Not sure how many partitions in the topic (1 1 1 1), but it must be at least 4, so excess is 0 here for all brokers and so (1) does nothing?
But I guess I see what it does: it considers only brokers (not broker-shards) so it enforces broker balance instead of just shard balance, since if you only did that you'd start overloading the lower brokers since they still have shards with 0 partitions.
| } | ||
| for (const auto& [shard, count] : assigned_shard_counts) { | ||
| global_shard_counts[shard] += count; | ||
| } |
There was a problem hiding this comment.
Yeah I guess the tie-breaking ended up as more than just a few lines of code. Sorry. What I had in mind was a different approach: not looking any global counts, but for each topic you just "rotate" the broker-shard that's first in the list.
E.g. for first topic, list is {broker 0}, 1, 2, 3 then next rotate it to 1, 2, 3, 0, etc (no actual rotation, the comparison operation just needs to handle it).
Not suggesting any change, just food for thought.
There was a problem hiding this comment.
Yeah some first version was like that until I simplified and handled more cases.
travisdowns
left a comment
There was a problem hiding this comment.
Consider removing the tuple in favor of struct.
bharathv
left a comment
There was a problem hiding this comment.
no more comments.. requested @bashtanov to take a look as well, he spent a lot of time on leader balancer recently.
| // `partitions_by_topic` is the input to the greedy planner grouped by | ||
| // topic. Each entry stores the raft groups for one topic along with the | ||
| // broker-shards that can legally host leadership for those groups. | ||
| std::map<topic_id_t, std::vector<partition_info>> partitions_by_topic; |
There was a problem hiding this comment.
std::map is non-contiguous so oversized allocs should be an issue (not 100% sure whether I get the comment).
There was a problem hiding this comment.
Sorry, I meant the std::vector<partition_info>, that seems like something prone to large allocations?
There was a problem hiding this comment.
Oh yes, good catch. Will fix.
Another example of LLMs just forgetting stuff even at mediocre context window sizes.
10a1c07
632460d to
10a1c07
Compare
|
I have pushed requested changes from Travis. Note the TwoTopicsFourBrokers had a bug (rf=4 - thanks Opus) so it actually also doesn't distribute perfectly (as does the new 6 broker tests). I have left the tests in to document that but won't improve on that as discussed today (not worth it, needs second pass). |
10a1c07 to
e197060
Compare
| size_t global_shard_counts; | ||
| model::broker_shard replica; | ||
|
|
||
| bool operator<(const leader_preference& other) const { |
There was a problem hiding this comment.
can't this just be replaced by = default; ?
There was a problem hiding this comment.
Oh yes indeed. Just needs operator<=> defaulting.
There was a problem hiding this comment.
Yes, you can also just default < too, though there is not much difference (since they are effectively "inlined" they don't get generated unless needed, so <=> is fine too, perhaps preferrable?).
e197060 to
59f8411
Compare
CI test resultstest results on build#82275
test results on build#82284
|
This patch adds a new greedy topic-aware strategy. Its goal is to distribute leadership across brokers and shards in a topic aware fashion meaning that a single topic will be distributed across all brokers independent of global total leaderships. The latter is only used as a tie breaker for topics with partition counts that don't perfectly divide across the broker count. It's different to existing strategies in that it's entirely greedy and pre-calculates the expected distribution in a stable fashion. This means: - It won't get stuck in local minima if additional moves don't improve the situtation. - It's not trying to minimize the moves required to get from the current state to the target state. It's primary purpose is to be used in benchmarking where we want to reduce noise at all costs and want to avoid different uneven leadership across different runs. An issue that we are currently seeing is that for example for a three broker setup and a topic with 18 partitions we would see leader distributions of 5/6/7 due to other existing topics like __consumer_offsets. We want a guaranteed behaviour across runs. However a similar problem exists in "production" when multiple user topics are at play. With two 18 partition topics you might get distributions of 4/6/8 and 8/6/4 which can cause a big load discrepancy if the topics are not equally loaded. This mode might help with that as well. The leader balancer currently runs every two minutes. This isn't great because the various indexes it builds in each run can get quite expensive with high partition counts. This is worse with the greedy balancer due to its nature of generating the full target layout upfront. This is something we'll want to fix separately at the leader balancer layer to avoid all these costs in general if nothing has changed.
d8bd369 to
fb0a1ab
Compare
Add unit tests for various partition distribution scenarios.
For the purposes of simplicity just take the preference always.
Add typing.
Adds a basic DT integration test for one of the scenarios where the other strategies would not get a perfect topic aware distribution.
The MultiTopicAutomaticLeadershipBalancingTest test was previously not very strict in what it enforces (well it wouldn't pass otherwise). Add some optional enforcement such that we check that the greedy strategy actually gets perfect distribution.
Now that we actually enforce the target leadership we need to make the test a lot more robust. Increase timeouts and add more targeted intermediate steps. Main source of flakiness here is that after restart some of the nodes get muted so the leadership stabilization is getting delayed. ``` task rp:run-bazel-ducktape-tests DUCKTAPE_ARGS="tests/rptest/tests/leadership_transfer_test.py::GreedyLeaderBalancingTest --repeat=20 --max-parallel 10" BUILD_TYPE=release ... SESSION REPORT (ALL TESTS) ducktape version: 0.12.0 session_id: 2026-03-20--009 run time: 6 minutes 11.530 seconds tests run: 20 passed: 20 flaky: 0 failed: 0 ignored: 0 ```
fb0a1ab to
8d9db53
Compare
This patch adds a new greedy topic-aware strategy. Its goal is to
distribute leadership across brokers and shards in a topic aware fashion
meaning that a single topic will be distributed across all brokers
independent of global total leaderships. The latter is only used as a
tie breaker for topics with partition counts that don't perfectly divide
across the broker count.
It's different to existing strategies in that it's entirely greedy and
pre-calculates the expected distribution in a stable fashion. This means:
the situtation.
current state to the target state.
It's primary purpose is to be used in benchmarking where we want to
reduce noise at all costs and want to avoid different uneven leadership
across different runs.
An issue that we are currently seeing is that for example for a three
broker setup and a topic with 18 partitions we would see leader
distributions of 5/6/7 due to other existing topics like
__consumer_offsets. We want a guaranteed behaviour across runs.
However a similar problem exists in "production" when multiple user topics
are at play. With two 18 partition topics you might get distributions of
4/6/8 and 8/6/4 which can cause a big load discrepancy if the topics are
not equally loaded. This mode might help with that as well.
The leader balancer currently runs every two minutes. This isn't great
because the various indexes it builds in each run can get quite
expensive with high partition counts. This is worse with the greedy
balancer due to its nature of generating the full target layout upfront.
This is something we'll want to fix separately at the leader balancer
layer to avoid all these costs in general if nothing has changed.
Backports Required
Release Notes