cluster/observability: fix empty partition in decom / reconfiguration status ouputs by bharathv · Pull Request #29047 · redpanda-data/redpanda

bharathv · 2025-12-18T03:53:58Z

We often see output like this:

DECOMMISSION PROGRESS
=====================
PARTITION                        MOVING-TO  COMPLETION-%  PARTITION-SIZE
kafka/prod_observations/1        26         0             0
kafka/prod_observations/8        26         0             0
kafka/prod_observations/19       26         0             0
kafka/prod_observations/28       26         0             0

Sometimes the partition size info and completion percentage don’t show up even though the partition is actually moving just fine. This happens because the recovery state isn’t being polled from the leader; instead, it’s gathered from the shards local to whatever node you’re querying. If that node doesn’t have the replica or isn’t the leader, the info is missing. The fix is to route reconciliation state queries to the leader node instead.
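To make the difference concrete, here is a minimal Python sketch of the two lookup strategies. It is a toy model, not Redpanda code: `Node`, `Cluster`, `RecoveryState`, `query_local`, and `query_leader` are all hypothetical names, and the model assumes that only the partition leader holds recovery progress.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class RecoveryState:
    # Hypothetical stand-in for the per-partition recovery information.
    local_size: int
    completion_pct: int


@dataclass
class Node:
    node_id: int
    # Assumption of this model: progress is only tracked where the leader lives.
    leader_state: Dict[str, RecoveryState] = field(default_factory=dict)

    def local_recovery_state(self, ntp: str) -> Optional[RecoveryState]:
        return self.leader_state.get(ntp)


class Cluster:
    def __init__(self, nodes, leaders: Dict[str, int]):
        self.nodes = {n.node_id: n for n in nodes}
        self.leaders = leaders  # partition name -> leader node id

    def query_local(self, queried_node: int, ntp: str) -> Optional[RecoveryState]:
        # Old behaviour in this model: only look at the node that got the request.
        return self.nodes[queried_node].local_recovery_state(ntp)

    def query_leader(self, queried_node: int, ntp: str) -> Optional[RecoveryState]:
        # New behaviour in this model: forward the query to the partition leader.
        leader_id = self.leaders.get(ntp)
        if leader_id is None:
            return None
        return self.nodes[leader_id].local_recovery_state(ntp)


ntp = "kafka/prod_observations/1"
leader = Node(1, {ntp: RecoveryState(local_size=1024, completion_pct=40)})
bystander = Node(2)
cluster = Cluster([leader, bystander], {ntp: 1})

local_answer = cluster.query_local(2, ntp)    # node 2 knows nothing about the move
routed_answer = cluster.query_leader(2, ntp)  # same request, answered by the leader
```

In this model, asking node 2 directly yields no state at all, which corresponds to the zeroed columns in the status table, while the leader-routed query returns the same progress regardless of which node is asked.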

Fixes: https://redpandadata.atlassian.net/browse/CORE-14975

Backports Required

Release Notes

Bug Fixes

Fixes incorrect reporting of partition movement progress in decommission/reconfiguration status output.

Copilot

Pull request overview

This PR fixes an issue where decommissioning and reconfiguration status outputs show empty partition metadata (partition size and completion percentage) even when partitions are actively moving. The root cause was that reconciliation state was being queried from local shards instead of partition leaders, resulting in incomplete information when the queried node doesn't host the replica or isn't the leader.

Key changes:

Routes reconciliation state queries to partition leaders instead of local shards
Adds local_size to serialization fields in recovery_state
Adds comprehensive test coverage for decommission status reporting
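The serialization change matters once state is fetched from a remote leader: a field that is left out of the serialized field list never crosses the wire. The Python sketch below illustrates that general failure mode only. It is not Redpanda's serde implementation; `encode`, `decode`, and the `bytes_left` field are hypothetical, and whether this was the exact cause of the zero `PARTITION-SIZE` column is an assumption.

```python
from dataclasses import dataclass


@dataclass
class RecoveryState:
    local_size: int = 0
    bytes_left: int = 0  # hypothetical second field, for illustration only


def encode(state: RecoveryState, serde_fields) -> dict:
    # Only fields named in serde_fields are written to the payload.
    return {name: getattr(state, name) for name in serde_fields}


def decode(payload: dict) -> RecoveryState:
    # Fields missing from the payload fall back to their defaults.
    return RecoveryState(**payload)


on_leader = RecoveryState(local_size=4096, bytes_left=1024)

# Field list that forgets local_size: the receiver sees the default of 0.
broken = decode(encode(on_leader, ("bytes_left",)))

# Field list that includes local_size: the value survives the round trip.
fixed = decode(encode(on_leader, ("local_size", "bytes_left")))
```

With the incomplete field list, the receiving side reports a size of 0 even though the leader knows the real value.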

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.

File	Description
`src/v/cluster/controller_api.h`	Adds `partition_leaders_table` dependency and new method `get_partition_leader_reconciliation_state`
`src/v/cluster/controller_api.cc`	Implements leader-based reconciliation state queries and refactors partition reconfiguration state gathering to use concurrent queries
`src/v/cluster/types.h`	Updates `recovery_state` serde version and includes `local_size` in serialization
`src/v/cluster/types.cc`	Updates `recovery_state` output operator to include `local_size`
`src/v/redpanda/admin/partition.cc`	Changes admin handler to use new leader-based reconfiguration state API
`src/v/cluster/controller.cc`	Passes `partition_leaders` reference to `controller_api` constructor
`tests/rptest/tests/nodes_decommissioning_test.py`	Adds test validating decommission status is properly reported across all nodes

Copilot · 2025-12-18T03:55:00Z

+    co_await ss::max_concurrent_for_each(
+      partitions,
+      16,


The magic number 16 for max concurrency should be extracted as a named constant or configuration parameter to make its purpose clear and allow easier tuning.
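A rough Python analogue of this suggestion, using `asyncio` rather than Seastar: the limit gets a name and the fan-out is bounded by a semaphore. `MAX_CONCURRENT_STATE_FETCHES` and `fetch_all` are hypothetical names, and the sketch only mirrors the idea behind `ss::max_concurrent_for_each`, not its implementation.

```python
import asyncio

# Hypothetical named constant replacing the bare literal 16.
MAX_CONCURRENT_STATE_FETCHES = 16


async def fetch_all(partitions, fetch_one, limit=MAX_CONCURRENT_STATE_FETCHES):
    """Run fetch_one for every partition with at most `limit` calls in flight."""
    semaphore = asyncio.Semaphore(limit)
    results = {}

    async def worker(partition):
        # The semaphore caps how many fetches run at the same time.
        async with semaphore:
            results[partition] = await fetch_one(partition)

    await asyncio.gather(*(worker(p) for p in partitions))
    return results
```

Naming the limit documents why it exists (bounding the number of simultaneous requests to leaders) and gives a single place to tune it.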

Copilot · 2025-12-18T03:55:00Z

+            err_msg="Decommission status not reported as in_progress on all nodes",
+            retry_on_exc=True,
+        )
+        self._set_recovery_rate(2 << 30)


The magic number 2 << 30 (2GB) should be extracted as a named constant to clarify that this is setting a high recovery rate to allow decommissioning to complete.
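One way to follow this suggestion in the test is sketched below. The constant names are hypothetical, and the assumption that the rate is expressed in bytes per second is not confirmed by the snippet above.

```python
# Hypothetical named constants; the value is identical to the literal 2 << 30.
BYTES_PER_GIB = 1 << 30
# Assumed unit: bytes per second. High enough that recovery is effectively unthrottled.
UNTHROTTLED_RECOVERY_RATE = 2 * BYTES_PER_GIB
```

The call would then read `self._set_recovery_rate(UNTHROTTLED_RECOVERY_RATE)`. Strictly speaking, the value is 2 GiB (2,147,483,648 bytes) rather than 2 GB.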

vbotbuildovich · 2025-12-18T06:29:02Z

CI test results

test results on build#78085

test_class	test_method	test_arguments	test_kind	job_url	test_status	passed	reason	test_history
DatalakeDLQTest	test_dlq_table_for_mixed_records	{"catalog_type": "rest_jdbc", "cloud_storage_type": 1, "query_engine": "spark"}	integration	https://buildkite.com/redpanda/redpanda/builds/78085#019b2fc6-cb9d-4790-ad6e-282d192fb02e	FLAKY	50/51	Test PASSES after retries.Inconclusive result after max retries(baseline=0.0021, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.0064, p1=0.7241, trust_threshold=0.5000)	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=DatalakeDLQTest&test_method=test_dlq_table_for_mixed_records
MountUnmountIcebergTest	test_simple_remount	{"cloud_storage_type": 1}	integration	https://buildkite.com/redpanda/redpanda/builds/78085#019b2fcf-95dd-42b4-aa1e-3fd2d65f7fcb	FLAKY	7/11	Test PASSES after retries.No significant increase in flaky rate(baseline=0.1707, p0=0.2362, reject_threshold=0.0100. adj_baseline=0.4297, p1=0.3108, trust_threshold=0.5000)	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=MountUnmountIcebergTest&test_method=test_simple_remount

vbotbuildovich · 2025-12-18T15:32:43Z

/backport v25.3.x

vbotbuildovich · 2025-12-18T15:32:44Z

/backport v25.2.x

vbotbuildovich · 2025-12-18T15:33:59Z

Failed to create a backport PR to v25.2.x branch. I tried:

git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-29047-v25.2.x-549 remotes/upstream/v25.2.x
git cherry-pick -x 29c6050df1 124201459b f8c76ea882 cab8675039 bfd976166c da299be686 aa11692ada

Workflow run logs.

bharathv added 7 commits December 17, 2025 19:47

cluster/types: fix missing entry in serde_fields

29c6050

cluster/types: update ostream op

1242014

cluster/controller_api: wire up partition_leaders_table

f8c76ea

controller/api: introduce get_partition_leader_reconciliation_state

cab8675

controller/api: fetch reconfiguration states from leader

bfd9761

controller/api: parallelize reconcliation state fetches

da299be

cluster/decomm_status: simple test for validating status output

aa11692

Copilot AI review requested due to automatic review settings December 18, 2025 03:53

github-actions Bot added the area/redpanda label Dec 18, 2025

Copilot AI reviewed Dec 18, 2025

bharathv requested review from bashtanov, joe-redpanda and mmaslankaprv December 18, 2025 03:57

mmaslankaprv approved these changes Dec 18, 2025

bharathv merged commit 7ee2fcd into redpanda-data:dev Dec 18, 2025
22 of 23 checks passed

bharathv deleted the fix_decom_status branch December 18, 2025 15:32

This was referenced Dec 18, 2025

[v25.3.x] cluster/observability: fix empty partition in decom / reconfiguration status ouputs #29055

Merged

[v25.2.x] cluster/observability: fix empty partition in decom / reconfiguration status ouputs #29056

Open
