[CORE-14957] minor adjustments #28986

Merged
michael-redpanda merged 3 commits into redpanda-data:dev from michael-redpanda:sl/core-14957-minor-adjustments on Dec 15, 2025

@michael-redpanda

  • Adjust a logger to be trace rather than info to clean up log output
  • Switch failover command to use a duration as a parameter rather than timepoint
  • Adjust link reconciliation period from 5s to 1s

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v25.3.x
  • v25.2.x
  • v25.1.x

Release Notes

Improvements

  • Improve timeout logic for failover link so that it does not time out when processing thousands of shadow topics
  • Reduce log noise
  • Increase frequency of topic reconciliation loop to reduce time it takes to failover topics

Signed-off-by: Michael Boquard <michael@redpanda.com>
Failing over a link with thousands of shadow topics takes a very long
time, in part because reconciliation is checked only every 5 seconds.
Reduce the check interval to 1 second.

Signed-off-by: Michael Boquard <michael@redpanda.com>
When scaling to thousands of Shadow Topics, the failover link method was
timing out.  The method submits one command per shadow topic, and once
scaled to thousands, some commands would start after the specified
timepoint had already passed.  Switching to a duration gives each
command its own independent timeout.

Signed-off-by: Michael Boquard <michael@redpanda.com>
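
The difference between a shared timepoint and a per-command duration can be sketched with a small simulation. This is a simplified model, not the Redpanda implementation: the names `sim_result`, `run_with_shared_deadline`, and `run_with_per_command_timeout` are hypothetical, commands are assumed to run strictly one after another, and each is assumed to take a fixed `cost` to complete.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

using namespace std::chrono_literals;
using sim_clock = std::chrono::steady_clock;

// Hypothetical tally of simulated command outcomes.
struct sim_result {
    std::size_t completed = 0;
    std::size_t timed_out = 0;
};

// Timepoint style: one absolute deadline is computed once, before any
// command is dispatched, and shared by all of them.
inline sim_result run_with_shared_deadline(
  std::size_t n, sim_clock::duration cost, sim_clock::duration timeout) {
    assert(cost >= sim_clock::duration::zero());
    sim_result r;
    sim_clock::time_point now{}; // simulated clock, starts at zero
    const auto deadline = now + timeout;
    for (std::size_t i = 0; i < n; ++i) {
        const auto finish = now + cost;
        if (finish <= deadline) {
            ++r.completed;
        } else {
            ++r.timed_out; // late commands inherit an expired deadline
        }
        now = finish;
    }
    return r;
}

// Duration style: each command derives its own deadline from the moment
// it actually starts.
inline sim_result run_with_per_command_timeout(
  std::size_t n, sim_clock::duration cost, sim_clock::duration timeout) {
    assert(cost >= sim_clock::duration::zero());
    sim_result r;
    sim_clock::time_point now{};
    for (std::size_t i = 0; i < n; ++i) {
        const auto deadline = now + timeout; // recomputed per command
        const auto finish = now + cost;
        if (finish <= deadline) {
            ++r.completed;
        } else {
            ++r.timed_out;
        }
        now = finish;
    }
    return r;
}
```

Under these assumptions, 1000 commands costing 10ms each against a 5s budget split evenly with a shared deadline (the first 500 finish in time, the rest do not), while all 1000 complete when each command gets its own 5s window.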
@michael-redpanda michael-redpanda self-assigned this Dec 15, 2025
@michael-redpanda michael-redpanda marked this pull request as ready for review December 15, 2025 14:38
Copilot AI review requested due to automatic review settings December 15, 2025 14:38

Copilot AI left a comment


Pull request overview

This PR makes three targeted improvements to the cluster link failover functionality:

  • Changes the failover timeout parameter from a timepoint to a duration to improve timeout handling
  • Reduces log verbosity by changing a configuration update log from info to trace level
  • Decreases the link reconciliation interval from 5 seconds to 1 second for faster failover

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| src/v/cluster_link/deps.h | Updates interface to accept duration instead of timepoint |
| src/v/cluster_link/tests/deps.h | Adapts test implementation to use duration and converts to timepoint at call site |
| src/v/cluster_link/service.cc | Updates adapter to pass through duration parameter |
| src/v/cluster_link/replication/mux_remote_consumer.cc | Changes configuration update log from info to trace level |
| src/v/cluster_link/manager.cc | Renames timeout constant and passes duration to failover method |
| src/v/cluster_link/link_status_reconciler.cc | Reduces reconciliation interval from 5s to 1s |
| src/v/cluster/cluster_link/frontend.h | Updates public API signature to use duration |
| src/v/cluster/cluster_link/frontend.cc | Implements duration-based timeout with conversion at call site |

  ss::future<cl_result<model::metadata>>
  manager::failover_link_topics(model::name_t link_name) {
-     static constexpr auto model_timeout = 30s;
+     static constexpr auto failover_command_timeout = 5s;

Copilot AI Dec 15, 2025


The timeout constant has been renamed to failover_command_timeout and reduced from 30s to 5s, but there's a semantic mismatch. The original model_timeout suggested this was the timeout for model operations across all topics, whereas the new 5s timeout combined with duration-based API suggests this is per-operation. With thousands of shadow topics mentioned in the PR description, a 5s total timeout may be insufficient. Consider whether this should be a per-topic timeout or if the total timeout needs to be longer.

Suggested change
- static constexpr auto failover_command_timeout = 5s;
+ static constexpr auto failover_command_timeout = 30s;

#include "ssx/future-util.h"

- static constexpr auto reconciliation_interval = std::chrono::seconds{5};
+ static constexpr auto reconciliation_interval = std::chrono::seconds{1};

Copilot AI Dec 15, 2025


Reducing the reconciliation interval from 5s to 1s increases the frequency of the reconciliation loop by 5x. While this improves failover responsiveness, it also increases CPU usage and load on the system. Ensure this increased frequency has been tested under load with thousands of shadow topics to verify it doesn't negatively impact system performance.
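
The trade-off described here comes down to simple arithmetic, sketched below. The helper names `passes_per_minute` and `worst_case_polling_delay` are hypothetical, and the model assumes a state change is only noticed on the next reconciliation pass, so each sequential round can wait up to one full interval. It says nothing about the actual CPU cost of a pass.

```cpp
#include <chrono>

using namespace std::chrono_literals;

// How many reconciliation passes run per minute at a given interval.
constexpr std::chrono::seconds::rep
passes_per_minute(std::chrono::seconds interval) {
    return std::chrono::seconds{60} / interval;
}

// Upper bound on time spent purely waiting for the next pass when a
// failover needs `rounds` sequential reconciliation passes (assumed model).
constexpr std::chrono::seconds
worst_case_polling_delay(std::chrono::seconds interval, int rounds) {
    return interval * rounds;
}

// 5s interval: 12 passes per minute; 1s interval: 60 passes per minute.
static_assert(passes_per_minute(5s) == 12);
static_assert(passes_per_minute(1s) == 60);
```

With three sequential rounds, for example, the polling wait alone is bounded by 15s at a 5s interval and by 3s at a 1s interval, at the price of five times as many passes.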

@vbotbuildovich


CI test results

test results on build #77862

| test_class | test_method | test_arguments | test_kind | job_url | test_status | passed | reason | test_history |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PartitionReassignmentsTest | test_reassignments_cancel | null | integration | https://buildkite.com/redpanda/redpanda/builds/77862#019b22a2-fc17-408d-9862-adf1a9433c39 | FLAKY | 9/11 | Test PASSES after retries. No significant increase in flaky rate (baseline=0.0812, p0=0.5713, reject_threshold=0.0100. adj_baseline=0.2244, p1=0.3068, trust_threshold=0.5000) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=PartitionReassignmentsTest&test_method=test_reassignments_cancel |
| WriteCachingFailureInjectionTest | test_unavoidable_data_loss | null | integration | https://buildkite.com/redpanda/redpanda/builds/77862#019b22ad-9520-4ab4-b774-132fc7151351 | FLAKY | 10/11 | Test PASSES after retries. No significant increase in flaky rate (baseline=0.0606, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1709, p1=0.1535, trust_threshold=0.5000) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=WriteCachingFailureInjectionTest&test_method=test_unavoidable_data_loss |

@michael-redpanda michael-redpanda merged commit b1a571c into redpanda-data:dev Dec 15, 2025
20 checks passed
@vbotbuildovich


/backport v25.3.x
