Skip to content

cluster: Add greedy topic aware balancer#29908

Merged
StephanDollberg merged 7 commits into
devfrom
stephan/greedy-leader-balancer
Mar 24, 2026
Merged

cluster: Add greedy topic aware balancer#29908
StephanDollberg merged 7 commits into
devfrom
stephan/greedy-leader-balancer

Conversation

@StephanDollberg

Copy link
Copy Markdown
Member

This patch adds a new greedy topic-aware strategy. Its goal is to
distribute leadership across brokers and shards in a topic aware fashion
meaning that a single topic will be distributed across all brokers
independent of global total leaderships. The latter is only used as a
tie breaker for topics with partition counts that don't perfectly divide
across the broker count.

It's different to existing strategies in that it's entirely greedy and
pre-calculates the expected distribution in a stable fashion. This means:

  • It won't get stuck in local minima if additional moves don't improve
    the situtation.
  • It's not trying to minimize the moves required to get from the
    current state to the target state.

It's primary purpose is to be used in benchmarking where we want to
reduce noise at all costs and want to avoid different uneven leadership
across different runs.

An issue that we are currently seeing is that for example for a three
broker setup and a topic with 18 partitions we would see leader
distributions of 5/6/7 due to other existing topics like
__consumer_offsets. We want a guaranteed behaviour across runs.

However a similar problem exists in "production" when multiple user topics
are at play. With two 18 partition topics you might get distributions of
4/6/8 and 8/6/4 which can cause a big load discrepancy if the topics are
not equally loaded. This mode might help with that as well.

The leader balancer currently runs every two minutes. This isn't great
because the various indexes it builds in each run can get quite
expensive with high partition counts. This is worse with the greedy
balancer due to its nature of generating the full target layout upfront.
This is something we'll want to fix separately at the leader balancer
layer to avoid all these costs in general if nothing has changed.

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v25.3.x
  • v25.2.x
  • v25.1.x

Release Notes

  • none

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a new greedy topic-aware leader balancer mode intended to produce stable, per-topic leader distributions (useful for benchmarks and reducing run-to-run variance), and wires it into configuration and tests.

Changes:

  • Introduce greedy as a new leader_balancer_mode and route leader balancer execution to a new greedy strategy implementation.
  • Add extensive unit tests for greedy behavior (topic-level fairness, shard balancing, muted/skipped groups, and rack pinning interaction).
  • Update rptest coverage/types to include a new ducktape test for greedy mode and tighten typing on leadership_transfer_test.py.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
tools/type-checking/type-check-strictness.json Removes leadership_transfer_test.py from the non-strict list (it now passes strict type checking).
tests/rptest/tests/leadership_transfer_test.py Refactors topic-aware rebalance testing and adds a greedy-mode ducktape test; improves typing/metadata parsing.
src/v/model/metadata.h Adds leader_balancer_mode::greedy and string conversion support.
src/v/config/convert.h Extends config parsing/validation to accept "greedy" for leader_balancer_mode.
src/v/config/configuration.cc Allows greedy as a valid leader_balancer_mode value.
src/v/cluster/scheduling/leader_balancer_greedy.h Declares the new greedy topic-aware strategy.
src/v/cluster/scheduling/leader_balancer_greedy.cc Implements greedy pre-planned target assignment + move emission.
src/v/cluster/scheduling/leader_balancer.cc Wires the new greedy mode into the balancer strategy selection.
src/v/cluster/tests/leader_balancer_greedy_test.cc Adds unit test coverage for the greedy strategy.
src/v/cluster/tests/BUILD Registers the new greedy gtest target.
src/v/cluster/BUILD Adds greedy strategy sources/headers to the cluster library build.

Comment thread src/v/cluster/scheduling/leader_balancer_greedy.cc
Comment thread src/v/cluster/scheduling/leader_balancer_greedy.cc
Comment thread src/v/config/configuration.cc
Comment thread src/v/config/convert.h
TimeoutError: If the conditions are not met within timeout_sec
"""
excluded = set(dead_racks) if dead_racks else set()
excluded = set(dead_racks) if dead_racks else set[str]()

Copilot AI Mar 20, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using set[str]() here relies on calling a GenericAlias at runtime and is uncommon in the rest of the codebase. Prefer an explicit type annotation with set() (e.g., excluded: set[str] = set(dead_racks) if dead_racks else set()), which is clearer and avoids any runtime quirks across Python versions.

Suggested change
excluded = set(dead_racks) if dead_racks else set[str]()
excluded: set[str] = set(dead_racks) if dead_racks else set()

Copilot uses AI. Check for mistakes.
@StephanDollberg StephanDollberg force-pushed the stephan/greedy-leader-balancer branch from 7f66bb1 to 632460d Compare March 23, 2026 09:55
build_target_assignment();
}

double greedy_topic_aware_strategy::error() const {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this used during the balancing? It is not "topic only", not sure if that matters.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's only used during logging. Kinda odd I find

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably the balancers have outgrown the original abstraction which assumed they'd have a certain shape (and meaningful error() function, etc).

Comment thread src/v/cluster/scheduling/leader_balancer_greedy.cc
continue;
}

partitions_by_topic[topic_iterator->second].push_back(

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: prefer emplace back + direct construction

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

}

// zero init such that we can get broker count easily
global_broker_counts[leader.node_id];

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is suspicious. It is so global_broker_counts.size(); can be used below to get num_brokers? Amazingly, it doesn't seem like the info passed to the balancer actually contains a list of nodes anywhere, so any node which is not a leader for at least one group, will basically be ignored. It is what it is, I guess: I suppose the large number of system partitions (consumer offsets...) + luck means that each node usually leads at least 1 partition so this just works out.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is so global_broker_counts.size(); can be used below to get num_brokers

Yes correct

any node which is not a leader for at least one group, will basically be ignored

Indeed, though for the purposes of this it doesn't matter (can't put leadership on a node if it doesn't have any of the replicas).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unlike partition balancer, leader balancer cannot move replicas across nodes, as it can only select leaders out of existing replicas. If a node does not have any replicas there's nothing we can do with it here.

We can use replicas in the inner loop to detect nodes that host only non-leader replicas. It's not uncommon, e.g. a node may have been restarted recently.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, though for the purposes of this it doesn't matter (can't put leadership on a node if it doesn't have any of the replicas).

Right, but the current node ignores nodes which don't have any leaders, even though there may be nodes that have replicas but no leaders? That seems wrong. Same comment as Bash effectively.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh right, I see what you are saying now.

So fix is either to add:

            // zero init such that we can get broker count easily
            for (const auto& replica : replicas) {
                global_broker_counts[replica.node_id];
            }

into the inner loop there or just assume it will always work like you say.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. A lot better now.

Comment on lines +131 to +137
std::optional<std::tuple<
size_t,
size_t,
size_t,
size_t,
size_t,
model::broker_shard>>

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A faint but unmistakable scream of struuuuuuuuct was heard...

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may be neater to evaluate each criteria for all replicas using a utility function akin std::ranges::max_element, but only returning a non-sentinel iterator when it's strictly better than others.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Luckily I am only like 12000km away.

size_t,
model::broker_shard>>
best_assignment;
for (const model::broker_shard& replica : partition.replicas) {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if you did any perf measurements, but I would expect a min-heap to be much faster here for high total shard counts. Something to keep in mind for the future, perhaps.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we talking about a min-heap of all replicas of a partition, i.e. typically size 3?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes nevermind, I was thinking of "all broker-shards" but of course we are only selecting from the ~3 replicas. My comment probably applies to a hypothetical review for a non-existent change where we are doing a similar thing to the partition allocator.

for (const model::broker_shard& replica : partition.replicas) {
size_t broker_count = assigned_broker_counts[replica.node_id];
// Lower-is-better on every element:
// 1) excess above per-topic floor (0 while at or below

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does 1 even do anything given that 2 is next in the list?

Given two topics t0 and t1, if t0 has lower (1) than t1, it would also have lower (2), I think?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No because on a topic level you can be 1 1 1 1 but possibly brokers already are 2 2 1 1 or so (this is what TwoTopicsFourBrokersThreeShards covers).

Also important to keep in mind that because of the total order as soon as an earlier comparison ranks you smaller later ones don't matter anymore. So these additional features do is just help a little to better distribute in cases as discussed in standup. A second pass would equally help.

@travisdowns travisdowns Mar 24, 2026

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No because on a topic level you can be 1 1 1 1 but possibly brokers already are 2 2 1 1 or so

Not sure how many partitions in the topic (1 1 1 1), but it must be at least 4, so excess is 0 here for all brokers and so (1) does nothing?

But I guess I see what it does: it considers only brokers (not broker-shards) so it enforces broker balance instead of just shard balance, since if you only did that you'd start overloading the lower brokers since they still have shards with 0 partitions.

Comment thread src/v/cluster/scheduling/leader_balancer_greedy.cc Outdated
}
for (const auto& [shard, count] : assigned_shard_counts) {
global_shard_counts[shard] += count;
}

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I guess the tie-breaking ended up as more than just a few lines of code. Sorry. What I had in mind was a different approach: not looking any global counts, but for each topic you just "rotate" the broker-shard that's first in the list.

E.g. for first topic, list is {broker 0}, 1, 2, 3 then next rotate it to 1, 2, 3, 0, etc (no actual rotation, the comparison operation just needs to handle it).

Not suggesting any change, just food for thought.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah some first version was like that until I simplified and handled more cases.

travisdowns
travisdowns previously approved these changes Mar 23, 2026
@travisdowns travisdowns self-requested a review March 23, 2026 15:35

@travisdowns travisdowns left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider removing the tuple in favor of struct.

@bharathv bharathv requested a review from bashtanov March 23, 2026 18:47
bharathv
bharathv previously approved these changes Mar 23, 2026

@bharathv bharathv left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no more comments.. requested @bashtanov to take a look as well, he spent a lot of time on leader balancer recently.

// `partitions_by_topic` is the input to the greedy planner grouped by
// topic. Each entry stores the raft groups for one topic along with the
// broker-shards that can legally host leadership for those groups.
std::map<topic_id_t, std::vector<partition_info>> partitions_by_topic;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

chunked versions?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

std::map is non-contiguous so oversized allocs should be an issue (not 100% sure whether I get the comment).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I meant the std::vector<partition_info>, that seems like something prone to large allocations?

@StephanDollberg StephanDollberg Mar 24, 2026

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yes, good catch. Will fix.

Another example of LLMs just forgetting stuff even at mediocre context window sizes.

@StephanDollberg StephanDollberg dismissed stale reviews from bharathv and travisdowns via 10a1c07 March 23, 2026 22:40
@StephanDollberg StephanDollberg force-pushed the stephan/greedy-leader-balancer branch from 632460d to 10a1c07 Compare March 23, 2026 22:40
@StephanDollberg

Copy link
Copy Markdown
Member Author

I have pushed requested changes from Travis.

Note the TwoTopicsFourBrokers had a bug (rf=4 - thanks Opus) so it actually also doesn't distribute perfectly (as does the new 6 broker tests). I have left the tests in to document that but won't improve on that as discussed today (not worth it, needs second pass).

@StephanDollberg StephanDollberg force-pushed the stephan/greedy-leader-balancer branch from 10a1c07 to e197060 Compare March 24, 2026 09:49
size_t global_shard_counts;
model::broker_shard replica;

bool operator<(const leader_preference& other) const {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can't this just be replaced by = default; ?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yes indeed. Just needs operator<=> defaulting.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you can also just default < too, though there is not much difference (since they are effectively "inlined" they don't get generated unless needed, so <=> is fine too, perhaps preferrable?).

travisdowns
travisdowns previously approved these changes Mar 24, 2026
travisdowns
travisdowns previously approved these changes Mar 24, 2026
@vbotbuildovich

vbotbuildovich commented Mar 24, 2026

Copy link
Copy Markdown
Collaborator

CI test results

test results on build#82275
test_class test_method test_arguments test_kind job_url test_status passed reason test_history
CloudTopicsL0GCAdminTest test_single_node_pause_unpause {"cloud_storage_type": 1} integration https://buildkite.com/redpanda/redpanda/builds/82275#019d200a-8593-459b-a0ac-ba47682e3b2e FLAKY 19/21 Test PASSES after retries.No significant increase in flaky rate(baseline=0.0241, p0=0.3867, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3917, trust_threshold=0.5000) https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=CloudTopicsL0GCAdminTest&test_method=test_single_node_pause_unpause
test results on build#82284
test_class test_method test_arguments test_kind job_url test_status passed reason test_history
RedpandaNodeOperationsSmokeTest test_node_ops_smoke_test {"cloud_storage_type": 1, "mixed_versions": false} integration https://buildkite.com/redpanda/redpanda/builds/82284#019d20b5-6dd4-4d1f-9349-0b751d34bdd1 FLAKY 10/11 Test PASSES after retries.No significant increase in flaky rate(baseline=0.0057, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000) https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=RedpandaNodeOperationsSmokeTest&test_method=test_node_ops_smoke_test
TxAtomicProduceConsumeTest test_basic_tx_consumer_transform_produce {"with_failures": true} integration https://buildkite.com/redpanda/redpanda/builds/82284#019d20b5-6dd5-480d-95df-05de1f1c1250 FLAKY 10/11 Test PASSES after retries.No significant increase in flaky rate(baseline=0.0062, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000) https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=TxAtomicProduceConsumeTest&test_method=test_basic_tx_consumer_transform_produce

This patch adds a new greedy topic-aware strategy. Its goal is to
distribute leadership across brokers and shards in a topic aware fashion
meaning that a single topic will be distributed across all brokers
independent of global total leaderships. The latter is only used as a
tie breaker for topics with partition counts that don't perfectly divide
across the broker count.

It's different to existing strategies in that it's entirely greedy and
pre-calculates the expected distribution in a stable fashion. This means:
 - It won't get stuck in local minima if additional moves don't improve
   the situtation.
 - It's not trying to minimize the moves required to get from the
   current state to the target state.

It's primary purpose is to be used in benchmarking where we want to
reduce noise at all costs and want to avoid different uneven leadership
across different runs.

An issue that we are currently seeing is that for example for a three
broker setup and a topic with 18 partitions we would see leader
distributions of 5/6/7 due to other existing topics like
__consumer_offsets. We want a guaranteed behaviour across runs.

However a similar problem exists in "production" when multiple user topics
are at play. With two 18 partition topics you might get distributions of
4/6/8 and 8/6/4 which can cause a big load discrepancy if the topics are
not equally loaded. This mode might help with that as well.

The leader balancer currently runs every two minutes. This isn't great
because the various indexes it builds in each run can get quite
expensive with high partition counts. This is worse with the greedy
balancer due to its nature of generating the full target layout upfront.
This is something we'll want to fix separately at the leader balancer
layer to avoid all these costs in general if nothing has changed.
@StephanDollberg StephanDollberg force-pushed the stephan/greedy-leader-balancer branch 2 times, most recently from d8bd369 to fb0a1ab Compare March 24, 2026 16:14
Add unit tests for various partition distribution scenarios.
For the purposes of simplicity just take the preference always.
Adds a basic DT integration test for one of the scenarios where the
other strategies would not get a perfect topic aware distribution.
The MultiTopicAutomaticLeadershipBalancingTest test was previously not
very strict in what it enforces (well it wouldn't pass otherwise).

Add some optional enforcement such that we check that the greedy
strategy actually gets perfect distribution.
Now that we actually enforce the target leadership we need to make the
test a lot more robust.

Increase timeouts and add more targeted intermediate steps.

Main source of flakiness here is that after restart some of the nodes
get muted so the leadership stabilization is getting delayed.

```
task rp:run-bazel-ducktape-tests DUCKTAPE_ARGS="tests/rptest/tests/leadership_transfer_test.py::GreedyLeaderBalancingTest --repeat=20 --max-parallel 10" BUILD_TYPE=release

...

SESSION REPORT (ALL TESTS)
ducktape version: 0.12.0
session_id:       2026-03-20--009
run time:         6 minutes 11.530 seconds
tests run:        20
passed:           20
flaky:            0
failed:           0
ignored:          0
```
@StephanDollberg StephanDollberg force-pushed the stephan/greedy-leader-balancer branch from fb0a1ab to 8d9db53 Compare March 24, 2026 16:17
@travisdowns travisdowns self-requested a review March 24, 2026 16:25

@travisdowns travisdowns left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, LGTM!

@StephanDollberg StephanDollberg merged commit ac8a0fc into dev Mar 24, 2026
21 checks passed
@StephanDollberg StephanDollberg deleted the stephan/greedy-leader-balancer branch March 24, 2026 18:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants