cluster: Add greedy topic aware balancer by StephanDollberg · Pull Request #29908 · redpanda-data/redpanda

StephanDollberg · 2026-03-20T22:24:47Z

This patch adds a new greedy topic-aware strategy. Its goal is to
distribute leadership across brokers and shards in a topic aware fashion
meaning that a single topic will be distributed across all brokers
independent of global total leaderships. The latter is only used as a
tie breaker for topics with partition counts that don't perfectly divide
across the broker count.

It's different to existing strategies in that it's entirely greedy and
pre-calculates the expected distribution in a stable fashion. This means:

It won't get stuck in local minima if additional moves don't improve
the situtation.
It's not trying to minimize the moves required to get from the
current state to the target state.

It's primary purpose is to be used in benchmarking where we want to
reduce noise at all costs and want to avoid different uneven leadership
across different runs.

An issue that we are currently seeing is that for example for a three
broker setup and a topic with 18 partitions we would see leader
distributions of 5/6/7 due to other existing topics like
__consumer_offsets. We want a guaranteed behaviour across runs.

However a similar problem exists in "production" when multiple user topics
are at play. With two 18 partition topics you might get distributions of
4/6/8 and 8/6/4 which can cause a big load discrepancy if the topics are
not equally loaded. This mode might help with that as well.

The leader balancer currently runs every two minutes. This isn't great
because the various indexes it builds in each run can get quite
expensive with high partition counts. This is worse with the greedy
balancer due to its nature of generating the full target layout upfront.
This is something we'll want to fix separately at the leader balancer
layer to avoid all these costs in general if nothing has changed.

Backports Required

Release Notes

none

Copilot

Pull request overview

Adds a new greedy topic-aware leader balancer mode intended to produce stable, per-topic leader distributions (useful for benchmarks and reducing run-to-run variance), and wires it into configuration and tests.

Changes:

Introduce greedy as a new leader_balancer_mode and route leader balancer execution to a new greedy strategy implementation.
Add extensive unit tests for greedy behavior (topic-level fairness, shard balancing, muted/skipped groups, and rack pinning interaction).
Update rptest coverage/types to include a new ducktape test for greedy mode and tighten typing on leadership_transfer_test.py.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 5 comments.

Show a summary per file

File	Description
tools/type-checking/type-check-strictness.json	Removes `leadership_transfer_test.py` from the non-strict list (it now passes strict type checking).
tests/rptest/tests/leadership_transfer_test.py	Refactors topic-aware rebalance testing and adds a greedy-mode ducktape test; improves typing/metadata parsing.
src/v/model/metadata.h	Adds `leader_balancer_mode::greedy` and string conversion support.
src/v/config/convert.h	Extends config parsing/validation to accept `"greedy"` for `leader_balancer_mode`.
src/v/config/configuration.cc	Allows `greedy` as a valid `leader_balancer_mode` value.
src/v/cluster/scheduling/leader_balancer_greedy.h	Declares the new greedy topic-aware strategy.
src/v/cluster/scheduling/leader_balancer_greedy.cc	Implements greedy pre-planned target assignment + move emission.
src/v/cluster/scheduling/leader_balancer.cc	Wires the new `greedy` mode into the balancer strategy selection.
src/v/cluster/tests/leader_balancer_greedy_test.cc	Adds unit test coverage for the greedy strategy.
src/v/cluster/tests/BUILD	Registers the new greedy gtest target.
src/v/cluster/BUILD	Adds greedy strategy sources/headers to the cluster library build.

Copilot · 2026-03-20T22:30:31Z

            TimeoutError: If the conditions are not met within timeout_sec
        """
-        excluded = set(dead_racks) if dead_racks else set()
+        excluded = set(dead_racks) if dead_racks else set[str]()


Using set[str]() here relies on calling a GenericAlias at runtime and is uncommon in the rest of the codebase. Prefer an explicit type annotation with set() (e.g., excluded: set[str] = set(dead_racks) if dead_racks else set()), which is clearer and avoids any runtime quirks across Python versions.

Suggested change

excluded = set(dead_racks) if dead_racks else set[str]()

excluded: set[str] = set(dead_racks) if dead_racks else set()

travisdowns · 2026-03-23T13:48:46Z

+    build_target_assignment();
+}
+
+double greedy_topic_aware_strategy::error() const {


Is this used during the balancing? It is not "topic only", not sure if that matters.

It's only used during logging. Kinda odd I find

Probably the balancers have outgrown the original abstraction which assumed they'd have a certain shape (and meaningful error() function, etc).

travisdowns · 2026-03-23T13:53:16Z

+                continue;
+            }
+
+            partitions_by_topic[topic_iterator->second].push_back(


nit: prefer emplace back + direct construction

travisdowns · 2026-03-23T14:20:50Z

+        }
+
+        // zero init such that we can get broker count easily
+        global_broker_counts[leader.node_id];


This is suspicious. It is so global_broker_counts.size(); can be used below to get num_brokers? Amazingly, it doesn't seem like the info passed to the balancer actually contains a list of nodes anywhere, so any node which is not a leader for at least one group, will basically be ignored. It is what it is, I guess: I suppose the large number of system partitions (consumer offsets...) + luck means that each node usually leads at least 1 partition so this just works out.

It is so global_broker_counts.size(); can be used below to get num_brokers

Yes correct

any node which is not a leader for at least one group, will basically be ignored

Indeed, though for the purposes of this it doesn't matter (can't put leadership on a node if it doesn't have any of the replicas).

Unlike partition balancer, leader balancer cannot move replicas across nodes, as it can only select leaders out of existing replicas. If a node does not have any replicas there's nothing we can do with it here.

We can use replicas in the inner loop to detect nodes that host only non-leader replicas. It's not uncommon, e.g. a node may have been restarted recently.

Indeed, though for the purposes of this it doesn't matter (can't put leadership on a node if it doesn't have any of the replicas).

Right, but the current node ignores nodes which don't have any leaders, even though there may be nodes that have replicas but no leaders? That seems wrong. Same comment as Bash effectively.

Oh right, I see what you are saying now.

So fix is either to add:

// zero init such that we can get broker count easily for (const auto& replica : replicas) { global_broker_counts[replica.node_id]; }

into the inner loop there or just assume it will always work like you say.

Fixed. A lot better now.

travisdowns · 2026-03-23T14:23:16Z

+            std::optional<std::tuple<
+              size_t,
+              size_t,
+              size_t,
+              size_t,
+              size_t,
+              model::broker_shard>>


A faint but unmistakable scream of struuuuuuuuct was heard...

It may be neater to evaluate each criteria for all replicas using a utility function akin std::ranges::max_element, but only returning a non-sentinel iterator when it's strictly better than others.

Luckily I am only like 12000km away.

travisdowns · 2026-03-23T14:29:42Z

+              size_t,
+              model::broker_shard>>
+              best_assignment;
+            for (const model::broker_shard& replica : partition.replicas) {


Not sure if you did any perf measurements, but I would expect a min-heap to be much faster here for high total shard counts. Something to keep in mind for the future, perhaps.

Are we talking about a min-heap of all replicas of a partition, i.e. typically size 3?

Yes nevermind, I was thinking of "all broker-shards" but of course we are only selecting from the ~3 replicas. My comment probably applies to a hypothetical review for a non-existent change where we are doing a similar thing to the partition allocator.

travisdowns · 2026-03-23T14:33:28Z

+            for (const model::broker_shard& replica : partition.replicas) {
+                size_t broker_count = assigned_broker_counts[replica.node_id];
+                // Lower-is-better on every element:
+                // 1) excess above per-topic floor (0 while at or below


Does 1 even do anything given that 2 is next in the list?

Given two topics t0 and t1, if t0 has lower (1) than t1, it would also have lower (2), I think?

No because on a topic level you can be 1 1 1 1 but possibly brokers already are 2 2 1 1 or so (this is what TwoTopicsFourBrokersThreeShards covers).

Also important to keep in mind that because of the total order as soon as an earlier comparison ranks you smaller later ones don't matter anymore. So these additional features do is just help a little to better distribute in cases as discussed in standup. A second pass would equally help.

No because on a topic level you can be 1 1 1 1 but possibly brokers already are 2 2 1 1 or so

Not sure how many partitions in the topic (1 1 1 1), but it must be at least 4, so excess is 0 here for all brokers and so (1) does nothing?

But I guess I see what it does: it considers only brokers (not broker-shards) so it enforces broker balance instead of just shard balance, since if you only did that you'd start overloading the lower brokers since they still have shards with 0 partitions.

travisdowns · 2026-03-23T15:35:14Z

+        }
+        for (const auto& [shard, count] : assigned_shard_counts) {
+            global_shard_counts[shard] += count;
+        }


Yeah I guess the tie-breaking ended up as more than just a few lines of code. Sorry. What I had in mind was a different approach: not looking any global counts, but for each topic you just "rotate" the broker-shard that's first in the list.

E.g. for first topic, list is {broker 0}, 1, 2, 3 then next rotate it to 1, 2, 3, 0, etc (no actual rotation, the comparison operation just needs to handle it).

Not suggesting any change, just food for thought.

Yeah some first version was like that until I simplified and handled more cases.

travisdowns

Consider removing the tuple in favor of struct.

bharathv

no more comments.. requested @bashtanov to take a look as well, he spent a lot of time on leader balancer recently.

bharathv · 2026-03-23T19:40:16Z

+    // `partitions_by_topic` is the input to the greedy planner grouped by
+    // topic. Each entry stores the raft groups for one topic along with the
+    // broker-shards that can legally host leadership for those groups.
+    std::map<topic_id_t, std::vector<partition_info>> partitions_by_topic;


chunked versions?

std::map is non-contiguous so oversized allocs should be an issue (not 100% sure whether I get the comment).

Sorry, I meant the std::vector<partition_info>, that seems like something prone to large allocations?

Oh yes, good catch. Will fix.

Another example of LLMs just forgetting stuff even at mediocre context window sizes.

StephanDollberg · 2026-03-23T22:42:01Z

I have pushed requested changes from Travis.

Note the TwoTopicsFourBrokers had a bug (rf=4 - thanks Opus) so it actually also doesn't distribute perfectly (as does the new 6 broker tests). I have left the tests in to document that but won't improve on that as discussed today (not worth it, needs second pass).

travisdowns · 2026-03-24T12:48:40Z

+    size_t global_shard_counts;
+    model::broker_shard replica;
+
+    bool operator<(const leader_preference& other) const {


can't this just be replaced by = default; ?

Oh yes indeed. Just needs operator<=> defaulting.

Yes, you can also just default < too, though there is not much difference (since they are effectively "inlined" they don't get generated unless needed, so <=> is fine too, perhaps preferrable?).

vbotbuildovich · 2026-03-24T14:31:58Z

CI test results

test results on build#82275

test_class	test_method	test_arguments	test_kind	job_url	test_status	passed	reason	test_history
CloudTopicsL0GCAdminTest	test_single_node_pause_unpause	{"cloud_storage_type": 1}	integration	https://buildkite.com/redpanda/redpanda/builds/82275#019d200a-8593-459b-a0ac-ba47682e3b2e	FLAKY	19/21	Test PASSES after retries.No significant increase in flaky rate(baseline=0.0241, p0=0.3867, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3917, trust_threshold=0.5000)	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=CloudTopicsL0GCAdminTest&test_method=test_single_node_pause_unpause

test results on build#82284

test_class	test_method	test_arguments	test_kind	job_url	test_status	passed	reason	test_history
RedpandaNodeOperationsSmokeTest	test_node_ops_smoke_test	{"cloud_storage_type": 1, "mixed_versions": false}	integration	https://buildkite.com/redpanda/redpanda/builds/82284#019d20b5-6dd4-4d1f-9349-0b751d34bdd1	FLAKY	10/11	Test PASSES after retries.No significant increase in flaky rate(baseline=0.0057, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000)	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=RedpandaNodeOperationsSmokeTest&test_method=test_node_ops_smoke_test
TxAtomicProduceConsumeTest	test_basic_tx_consumer_transform_produce	{"with_failures": true}	integration	https://buildkite.com/redpanda/redpanda/builds/82284#019d20b5-6dd5-480d-95df-05de1f1c1250	FLAKY	10/11	Test PASSES after retries.No significant increase in flaky rate(baseline=0.0062, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000)	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=TxAtomicProduceConsumeTest&test_method=test_basic_tx_consumer_transform_produce

This patch adds a new greedy topic-aware strategy. Its goal is to distribute leadership across brokers and shards in a topic aware fashion meaning that a single topic will be distributed across all brokers independent of global total leaderships. The latter is only used as a tie breaker for topics with partition counts that don't perfectly divide across the broker count. It's different to existing strategies in that it's entirely greedy and pre-calculates the expected distribution in a stable fashion. This means: - It won't get stuck in local minima if additional moves don't improve the situtation. - It's not trying to minimize the moves required to get from the current state to the target state. It's primary purpose is to be used in benchmarking where we want to reduce noise at all costs and want to avoid different uneven leadership across different runs. An issue that we are currently seeing is that for example for a three broker setup and a topic with 18 partitions we would see leader distributions of 5/6/7 due to other existing topics like __consumer_offsets. We want a guaranteed behaviour across runs. However a similar problem exists in "production" when multiple user topics are at play. With two 18 partition topics you might get distributions of 4/6/8 and 8/6/4 which can cause a big load discrepancy if the topics are not equally loaded. This mode might help with that as well. The leader balancer currently runs every two minutes. This isn't great because the various indexes it builds in each run can get quite expensive with high partition counts. This is worse with the greedy balancer due to its nature of generating the full target layout upfront. This is something we'll want to fix separately at the leader balancer layer to avoid all these costs in general if nothing has changed.

Add unit tests for various partition distribution scenarios.

For the purposes of simplicity just take the preference always.

Add typing.

Adds a basic DT integration test for one of the scenarios where the other strategies would not get a perfect topic aware distribution.

The MultiTopicAutomaticLeadershipBalancingTest test was previously not very strict in what it enforces (well it wouldn't pass otherwise). Add some optional enforcement such that we check that the greedy strategy actually gets perfect distribution.

Now that we actually enforce the target leadership we need to make the test a lot more robust. Increase timeouts and add more targeted intermediate steps. Main source of flakiness here is that after restart some of the nodes get muted so the leadership stabilization is getting delayed. ``` task rp:run-bazel-ducktape-tests DUCKTAPE_ARGS="tests/rptest/tests/leadership_transfer_test.py::GreedyLeaderBalancingTest --repeat=20 --max-parallel 10" BUILD_TYPE=release ... SESSION REPORT (ALL TESTS) ducktape version: 0.12.0 session_id: 2026-03-20--009 run time: 6 minutes 11.530 seconds tests run: 20 passed: 20 flaky: 0 failed: 0 ignored: 0 ```

travisdowns

Great, LGTM!

StephanDollberg requested review from ballard26, bharathv, joe-redpanda and travisdowns March 20, 2026 22:24

StephanDollberg requested a review from a team as a code owner March 20, 2026 22:24

Copilot AI review requested due to automatic review settings March 20, 2026 22:24

github-actions Bot added area/build area/redpanda labels Mar 20, 2026

Copilot started reviewing on behalf of StephanDollberg March 20, 2026 22:25 View session

Copilot AI reviewed Mar 20, 2026

View reviewed changes

StephanDollberg force-pushed the stephan/greedy-leader-balancer branch from 7f66bb1 to 632460d Compare March 23, 2026 09:55

travisdowns reviewed Mar 23, 2026

View reviewed changes

Comment thread src/v/cluster/scheduling/leader_balancer_greedy.cc

travisdowns reviewed Mar 23, 2026

View reviewed changes

Comment thread src/v/cluster/scheduling/leader_balancer_greedy.cc Outdated

travisdowns reviewed Mar 23, 2026

View reviewed changes

travisdowns previously approved these changes Mar 23, 2026

View reviewed changes

travisdowns self-requested a review March 23, 2026 15:35

travisdowns reviewed Mar 23, 2026

View reviewed changes

bharathv requested a review from bashtanov March 23, 2026 18:47

bharathv previously approved these changes Mar 23, 2026

View reviewed changes

StephanDollberg dismissed stale reviews from bharathv and travisdowns via 10a1c07 March 23, 2026 22:40

StephanDollberg force-pushed the stephan/greedy-leader-balancer branch from 632460d to 10a1c07 Compare March 23, 2026 22:40

StephanDollberg force-pushed the stephan/greedy-leader-balancer branch from 10a1c07 to e197060 Compare March 24, 2026 09:49

travisdowns reviewed Mar 24, 2026

View reviewed changes

travisdowns previously approved these changes Mar 24, 2026

View reviewed changes

StephanDollberg dismissed travisdowns’s stale review via 59f8411 March 24, 2026 13:15

StephanDollberg force-pushed the stephan/greedy-leader-balancer branch from e197060 to 59f8411 Compare March 24, 2026 13:15

travisdowns previously approved these changes Mar 24, 2026

View reviewed changes

StephanDollberg dismissed travisdowns’s stale review via d8bd369 March 24, 2026 16:09

StephanDollberg force-pushed the stephan/greedy-leader-balancer branch 2 times, most recently from d8bd369 to fb0a1ab Compare March 24, 2026 16:14

StephanDollberg added 6 commits March 24, 2026 16:17

cluster: Add unit tests greedy leader balancer tests

f07d7c0

Add unit tests for various partition distribution scenarios.

cluster: Add pinning support to the greedy leader balancer strategy

0784925

For the purposes of simplicity just take the preference always.

tests: Type leadership_transfer_test.py

c4c2da8

Add typing.

tests: Add DT test for the greedy leadership balancer

4c81574

Adds a basic DT integration test for one of the scenarios where the other strategies would not get a perfect topic aware distribution.

StephanDollberg force-pushed the stephan/greedy-leader-balancer branch from fb0a1ab to 8d9db53 Compare March 24, 2026 16:17

travisdowns self-requested a review March 24, 2026 16:25

travisdowns approved these changes Mar 24, 2026

View reviewed changes

StephanDollberg merged commit ac8a0fc into dev Mar 24, 2026
21 checks passed

StephanDollberg deleted the stephan/greedy-leader-balancer branch March 24, 2026 18:03

	excluded = set(dead_racks) if dead_racks else set[str]()
	excluded: set[str] = set(dead_racks) if dead_racks else set()

Conversation

StephanDollberg commented Mar 20, 2026

Backports Required

Release Notes

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI Mar 20, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

travisdowns Mar 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

travisdowns left a comment

Choose a reason for hiding this comment

Uh oh!

bharathv left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

StephanDollberg Mar 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

travisdowns Mar 24, 2026 •

edited

Loading

StephanDollberg Mar 24, 2026 •

edited

Loading

vbotbuildovich commented Mar 24, 2026 •

edited

Loading