[CORE-14957] minor adjustments by michael-redpanda · Pull Request #28986 · redpanda-data/redpanda

michael-redpanda · 2025-12-15T14:38:39Z

Lower a log statement from info to trace to clean up log output
Switch the failover command to take a duration as a parameter rather than a timepoint
Reduce the link reconciliation period from 5s to 1s

Backports Required

Release Notes

Improvements

Improve the failover link timeout logic so it does not time out when processing thousands of shadow topics
Reduce log noise
Increase the frequency of the topic reconciliation loop to reduce the time it takes to fail over topics

Signed-off-by: Michael Boquard <michael@redpanda.com>

Failing over a link with thousands of shadow topics takes a very long time, in part because reconciliation is only checked every 5 seconds. Reduce the interval to 1 second. Signed-off-by: Michael Boquard <michael@redpanda.com>

When scaling to thousands of shadow topics, the failover link method was timing out. The method submits one command per shadow topic, and at that scale some commands would only start after the specified timepoint had already passed. Switching to a duration gives each command its own independent timeout, measured from when that command starts. Signed-off-by: Michael Boquard <michael@redpanda.com>
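
The difference between the two timeout styles described above can be sketched with plain `std::chrono`. This is a minimal illustration under stated assumptions, not the Redpanda implementation: the function names and the fixed per-command cost are hypothetical, time is simulated rather than measured, and commands are modeled as running strictly one after another.

```cpp
#include <chrono>
#include <cstddef>

using namespace std::chrono_literals;
using duration = std::chrono::milliseconds;

// Hypothetical model: commands are dispatched one after another and each
// takes `cost` to complete. Time is simulated; nothing sleeps.

// Timepoint style: one absolute deadline, computed once at the start of the
// batch and shared by every command. A command counts as expired if it
// finishes after that shared deadline.
inline std::size_t count_expired_with_shared_deadline(
  std::size_t commands, duration cost, duration timeout) {
    const duration deadline = timeout; // offset from the start of the batch
    duration now = 0ms;
    std::size_t expired = 0;
    for (std::size_t i = 0; i < commands; ++i) {
        now += cost;
        if (now > deadline) {
            ++expired;
        }
    }
    return expired;
}

// Duration style: each command derives its own deadline (start + timeout),
// so only the command's own cost counts against the timeout.
inline std::size_t count_expired_with_per_command_timeout(
  std::size_t commands, duration cost, duration timeout) {
    duration now = 0ms;
    std::size_t expired = 0;
    for (std::size_t i = 0; i < commands; ++i) {
        const duration deadline = now + timeout;
        now += cost;
        if (now > deadline) {
            ++expired;
        }
    }
    return expired;
}
```

In this model, 5000 commands at 10ms each with a 5s timeout leave 4500 commands expired under the shared deadline and none under the per-command timeout.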

Copilot

Pull request overview

This PR makes three targeted improvements to the cluster link failover functionality:

Changes the failover timeout parameter from a timepoint to a duration to improve timeout handling
Reduces log verbosity by changing a configuration update log from info to trace level
Decreases the link reconciliation interval from 5 seconds to 1 second for faster failover

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.

File	Description
src/v/cluster_link/deps.h	Updates interface to accept duration instead of timepoint
src/v/cluster_link/tests/deps.h	Adapts test implementation to use duration and converts to timepoint at call site
src/v/cluster_link/service.cc	Updates adapter to pass through duration parameter
src/v/cluster_link/replication/mux_remote_consumer.cc	Changes configuration update log from info to trace level
src/v/cluster_link/manager.cc	Renames timeout constant and passes duration to failover method
src/v/cluster_link/link_status_reconciler.cc	Reduces reconciliation interval from 5s to 1s
src/v/cluster/cluster_link/frontend.h	Updates public API signature to use duration
src/v/cluster/cluster_link/frontend.cc	Implements duration-based timeout with conversion at call site
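
The "conversion at call site" mentioned in the file summary can be sketched as follows. This is an assumption-laden illustration only: `run_until` and `run_with_timeout` are hypothetical names, and `std::chrono::steady_clock` stands in for whatever clock the real code uses.

```cpp
#include <chrono>

using clock_type = std::chrono::steady_clock;

// Hypothetical lower-level call that still wants an absolute deadline.
// Returns true if the deadline has not yet passed when it is invoked.
inline bool run_until(clock_type::time_point deadline) {
    return clock_type::now() <= deadline;
}

// Hypothetical duration-based entry point: the deadline is derived at the
// call site, at the moment this particular command starts.
inline bool run_with_timeout(clock_type::duration timeout) {
    return run_until(clock_type::now() + timeout);
}
```

Because the timepoint is computed inside `run_with_timeout`, a caller that queues many commands cannot accidentally hand a stale deadline to the later ones.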

Copilot · 2025-12-15T14:41:30Z

 ss::future<cl_result<model::metadata>>
 manager::failover_link_topics(model::name_t link_name) {
-    static constexpr auto model_timeout = 30s;
+    static constexpr auto failover_command_timeout = 5s;


The timeout constant has been renamed to failover_command_timeout and reduced from 30s to 5s, but there's a semantic mismatch. The original model_timeout suggested this was the timeout for model operations across all topics, whereas the new 5s timeout combined with duration-based API suggests this is per-operation. With thousands of shadow topics mentioned in the PR description, a 5s total timeout may be insufficient. Consider whether this should be a per-topic timeout or if the total timeout needs to be longer.

Suggested change

-    static constexpr auto failover_command_timeout = 5s;
+    static constexpr auto failover_command_timeout = 30s;

Copilot · 2025-12-15T14:41:30Z

 #include "ssx/future-util.h"

-static constexpr auto reconciliation_interval = std::chrono::seconds{5};
+static constexpr auto reconciliation_interval = std::chrono::seconds{1};


Reducing the reconciliation interval from 5s to 1s increases the frequency of the reconciliation loop by 5x. While this improves failover responsiveness, it also increases CPU usage and load on the system. Ensure this increased frequency has been tested under load with thousands of shadow topics to verify it doesn't negatively impact system performance.
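
The trade-off raised in this comment can be put in rough numbers with a simplified model. The sketch below assumes evenly spaced ticks at 0, interval, 2*interval, and so on, and does not model the work done per tick; the function names are hypothetical and not taken from the Redpanda code.

```cpp
#include <chrono>

using ms = std::chrono::milliseconds;

// Delay until the next tick of a fixed-interval loop whose ticks fall at
// 0, interval, 2*interval, ... An event landing exactly on a tick waits 0.
inline ms wait_for_next_tick(ms event_time, ms interval) {
    const auto remainder = event_time % interval;
    return remainder == ms{0} ? ms{0} : interval - remainder;
}

// Number of whole ticks the loop performs over a given window.
inline long long ticks_in(ms window, ms interval) {
    return window / interval;
}
```

In this model, a state change at 5100ms waits 4900ms for the next tick at a 5s interval but only 900ms at a 1s interval, while the number of ticks per minute rises from 12 to 60.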

vbotbuildovich · 2025-12-15T17:08:51Z

CI test results

test results on build#77862

test_class	test_method	test_arguments	test_kind	job_url	test_status	passed	reason	test_history
PartitionReassignmentsTest	test_reassignments_cancel	null	integration	https://buildkite.com/redpanda/redpanda/builds/77862#019b22a2-fc17-408d-9862-adf1a9433c39	FLAKY	9/11	Test PASSES after retries. No significant increase in flaky rate (baseline=0.0812, p0=0.5713, reject_threshold=0.0100. adj_baseline=0.2244, p1=0.3068, trust_threshold=0.5000)	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=PartitionReassignmentsTest&test_method=test_reassignments_cancel
WriteCachingFailureInjectionTest	test_unavoidable_data_loss	null	integration	https://buildkite.com/redpanda/redpanda/builds/77862#019b22ad-9520-4ab4-b774-132fc7151351	FLAKY	10/11	Test PASSES after retries. No significant increase in flaky rate (baseline=0.0606, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1709, p1=0.1535, trust_threshold=0.5000)	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=WriteCachingFailureInjectionTest&test_method=test_unavoidable_data_loss

vbotbuildovich · 2025-12-15T19:11:30Z

/backport v25.3.x

michael-redpanda added 3 commits December 15, 2025 08:50

cl/r: Lower log level in update_configuration

7e1001a

Signed-off-by: Michael Boquard <michael@redpanda.com>

sl: Reduce link status reconciler interval

04b4300

Failing over a link with thousands of shadow topics takes a very long time, in part because reconciliation is only checked every 5 seconds. Reduce the interval to 1 second. Signed-off-by: Michael Boquard <michael@redpanda.com>

michael-redpanda requested a review from bharathv December 15, 2025 14:38

michael-redpanda self-assigned this Dec 15, 2025

michael-redpanda marked this pull request as ready for review December 15, 2025 14:38

Copilot AI review requested due to automatic review settings December 15, 2025 14:38

github-actions Bot added the area/redpanda label Dec 15, 2025

Copilot AI reviewed Dec 15, 2025

bharathv approved these changes Dec 15, 2025

michael-redpanda merged commit b1a571c into redpanda-data:dev Dec 15, 2025
20 checks passed

vbotbuildovich mentioned this pull request Dec 15, 2025

[v25.3.x] [CORE-14957] minor adjustments #28997

Merged

michael-redpanda merged 3 commits into redpanda-data:dev from michael-redpanda:sl/core-14957-minor-adjustments
