[CORE-14957] minor adjustments#28986
Conversation
Signed-off-by: Michael Boquard <michael@redpanda.com>
Failing over a link with 1000s of shadow topics is taking a very long time, in part due to only checking reconciliation every 5 seconds. Reduce this down to checking every one second. Signed-off-by: Michael Boquard <michael@redpanda.com>
When scaling to thousands of Shadow Topics, the failover link method was timing out. The method would submit a command per shadow topic and once scaled to thousands, some commands would start after the timepoint specified. Switching this to a duration would mean each command would have its own independent time out point. Signed-off-by: Michael Boquard <michael@redpanda.com>
There was a problem hiding this comment.
Pull request overview
This PR makes three targeted improvements to the cluster link failover functionality:
- Changes the failover timeout parameter from a timepoint to a duration to improve timeout handling
- Reduces log verbosity by changing a configuration update log from info to trace level
- Decreases the link reconciliation interval from 5 seconds to 1 second for faster failover
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| src/v/cluster_link/deps.h | Updates interface to accept duration instead of timepoint |
| src/v/cluster_link/tests/deps.h | Adapts test implementation to use duration and converts to timepoint at call site |
| src/v/cluster_link/service.cc | Updates adapter to pass through duration parameter |
| src/v/cluster_link/replication/mux_remote_consumer.cc | Changes configuration update log from info to trace level |
| src/v/cluster_link/manager.cc | Renames timeout constant and passes duration to failover method |
| src/v/cluster_link/link_status_reconciler.cc | Reduces reconciliation interval from 5s to 1s |
| src/v/cluster/cluster_link/frontend.h | Updates public API signature to use duration |
| src/v/cluster/cluster_link/frontend.cc | Implements duration-based timeout with conversion at call site |
| ss::future<cl_result<model::metadata>> | ||
| manager::failover_link_topics(model::name_t link_name) { | ||
| static constexpr auto model_timeout = 30s; | ||
| static constexpr auto failover_command_timeout = 5s; |
There was a problem hiding this comment.
The timeout constant has been renamed to failover_command_timeout and reduced from 30s to 5s, but there's a semantic mismatch. The original model_timeout suggested this was the timeout for model operations across all topics, whereas the new 5s timeout combined with duration-based API suggests this is per-operation. With thousands of shadow topics mentioned in the PR description, a 5s total timeout may be insufficient. Consider whether this should be a per-topic timeout or if the total timeout needs to be longer.
| static constexpr auto failover_command_timeout = 5s; | |
| static constexpr auto failover_command_timeout = 30s; |
| #include "ssx/future-util.h" | ||
|
|
||
| static constexpr auto reconciliation_interval = std::chrono::seconds{5}; | ||
| static constexpr auto reconciliation_interval = std::chrono::seconds{1}; |
There was a problem hiding this comment.
Reducing the reconciliation interval from 5s to 1s increases the frequency of the reconciliation loop by 5x. While this improves failover responsiveness, it also increases CPU usage and load on the system. Ensure this increased frequency has been tested under load with thousands of shadow topics to verify it doesn't negatively impact system performance.
|
/backport v25.3.x |
tracerather thaninfoto clean up log output5sto1sBackports Required
Release Notes
Improvements