[v26.1.x] [CORE-15812] ct/scale: raise fetch concurrency for cloud_topics MPT variants by vbotbuildovich · Pull Request #30867 · redpanda-data/redpanda

vbotbuildovich · 2026-06-23T01:05:41Z

Backport of PR #30861

Command: git cherry-pick -x d4a3eaa 5392acc
Commits backported: 2
Conflicts resolved: 1
Commits skipped (already on target): 0
Backport branch: ai-backport-pr-30861-v26.1.x-1782176639

Conflict details

d4a3eaa (tests/rptest/scale_tests/many_partitions_test.py): the commit adds a fetch_max_read_concurrency: 4 config under the if cloud_topics_enabled: guard, but the target branch already has a CLOUD_TOPICS_CONFIG_STR: True setting in that same guarded block (not present on dev). Merged both into a single if cloud_topics_enabled: block, keeping the target's existing setting and adding the new concurrency config.
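The merged guard can be sketched roughly as follows. This is an illustration only, not the actual contents of many_partitions_test.py: the helper name `build_extra_rp_conf` and the string value given to `CLOUD_TOPICS_CONFIG_STR` are assumptions made so the snippet is self-contained; only the `fetch_max_read_concurrency: 4` setting and the `CLOUD_TOPICS_CONFIG_STR: True` entry come from the conflict description above.

```python
# Assumed placeholder value; the real constant is defined in the rptest tree.
CLOUD_TOPICS_CONFIG_STR = "cloud_topics_enabled"


def build_extra_rp_conf(cloud_topics_enabled: bool) -> dict:
    """Hypothetical helper showing the single merged guard block."""
    conf = {}
    if cloud_topics_enabled:
        # Setting that already existed on the target branch (v26.1.x).
        conf[CLOUD_TOPICS_CONFIG_STR] = True
        # Setting added by the backported commit d4a3eaa.
        conf["fetch_max_read_concurrency"] = 4
    return conf


if __name__ == "__main__":
    print(build_extra_rp_conf(True))
    print(build_extra_rp_conf(False))
```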

The cloud_topics ManyPartitionsTest variants (regular and tiered_storage) time out at the default fetch_max_read_concurrency=1. Set it to 4 for the cloud_topics path.

A cloud-topic partition read is latency-bound: per L1 object it does a metastore lookup, a footer read, and -- on a cache miss -- an object-store GET to populate the local cache. That GET dominates (~70-90% of read wall-clock) and is a round-trip cost, not bandwidth: aggregate throughput with concurrency is ~170 MB/s, far below object-store bandwidth.

fetch_max_read_concurrency caps how many partition reads in a fetch run concurrently. At the default of 1 a broker reads its hundreds of partitions serially, so the per-read round-trips serialize; across tens of thousands of small objects that is tens of minutes of pure latency and the consumer never keeps up. The regular (local-log) topic in the same test drains fine because a local read is not a latency-bound network op. Running the reads concurrently pipelines the GETs and hides the latency.

Not the cause:

- Cache size: cranking the cloud cache (2.4->50 GiB) and the L1 reader cache (128->10k) did not help; the reads are round-trip-bound, not cache-size-bound, and a high-concurrency run drains with default caches.
- The L1 reader cache cannot warm here: each small partition is read to completion in one pass, so a finished reader has nothing to reuse.
- The max_bytes early-return was suspected, but reverting it still failed -- it is not the cause.

This is a scale limitation of the default serial read, not a regression.

Follow-up: the many-small-partition read pattern is inherently inefficient (each GET amortizes over little data, readers are uncacheable). A real optimization -- coalescing reads across partitions co-resident in one L1 object, or prefetching object downloads off the fetch critical path -- is tracked separately.

Signed-off-by: Oren Leiman <oren.leiman@redpanda.com>
(cherry picked from commit d4a3eaa)
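The effect of the concurrency cap on latency-bound reads can be illustrated with a small simulation. This is a sketch under stated assumptions, not Redpanda code: `read_partition` is a hypothetical stand-in that only sleeps for a fixed round-trip time, and the partition count and latency are arbitrary illustrative values rather than measurements from the test.

```python
import asyncio
import time


async def read_partition(latency_s: float) -> None:
    # Hypothetical stand-in for one latency-bound partition read
    # (metastore lookup, footer read, object-store GET on a cache miss).
    await asyncio.sleep(latency_s)


async def drain(num_partitions: int, concurrency: int, latency_s: float) -> float:
    """Read every partition with at most `concurrency` reads in flight."""
    sem = asyncio.Semaphore(concurrency)

    async def bounded_read() -> None:
        async with sem:
            await read_partition(latency_s)

    start = time.perf_counter()
    await asyncio.gather(*(bounded_read() for _ in range(num_partitions)))
    return time.perf_counter() - start


# Illustrative values only: 16 reads with a 10 ms round trip each.
serial = asyncio.run(drain(num_partitions=16, concurrency=1, latency_s=0.01))
pipelined = asyncio.run(drain(num_partitions=16, concurrency=4, latency_s=0.01))

if __name__ == "__main__":
    print(f"concurrency=1: {serial:.3f}s")
    print(f"concurrency=4: {pipelined:.3f}s")
```

With one read in flight the round trips add up one after another; with four in flight they overlap, so the same number of reads finishes in roughly a quarter of the time as long as each read is dominated by waiting rather than by bandwidth or CPU.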

SlowDown/InternalError/RequestTimeout are retried by the caller (matching send_request); logging them at error level tripped test log scanners in scale tests. Also treat reconnect std::system_error and ss::timed_out_error as retryable (warn), matching handle_client_transport_error.

Signed-off-by: Oren Leiman <oren.leiman@redpanda.com>
(cherry picked from commit 5392acc)
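The classification described in this commit can be sketched as below. The actual change is in Redpanda's C++ client code, which is not shown in this PR page, so everything here is an illustrative analogue: the function names and the `RETRYABLE_CODES` set are hypothetical, Python's `OSError` and `TimeoutError` stand in for std::system_error and ss::timed_out_error, and the sketch assumes both groups of retryable errors are logged at warn.

```python
import logging

# Error codes the commit message says the caller retries.
RETRYABLE_CODES = frozenset({"SlowDown", "InternalError", "RequestTimeout"})


def log_level_for_error_code(code: str) -> int:
    """Hypothetical helper: retryable codes log at warn, the rest at error."""
    return logging.WARNING if code in RETRYABLE_CODES else logging.ERROR


def log_level_for_exception(exc: BaseException) -> int:
    """Hypothetical helper: transport-style failures are treated as retryable.

    OSError and TimeoutError are Python analogues chosen for this sketch;
    they are not the types used in the real C++ code.
    """
    if isinstance(exc, (OSError, TimeoutError)):
        return logging.WARNING
    return logging.ERROR


if __name__ == "__main__":
    for code in ("SlowDown", "AccessDenied"):
        print(code, logging.getLevelName(log_level_for_error_code(code)))
```

Keeping retryable failures below error level means a log scanner that fails a test on any error line only fires on failures the caller will not retry.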

oleiman

clean backport. going to wait for a bit of feedback from the perf team before merging

oleiman added 2 commits June 23, 2026 01:05

vbotbuildovich added this to the v26.1.x-next milestone Jun 23, 2026

vbotbuildovich added the kind/backport PRs targeting a stable branch label Jun 23, 2026

vbotbuildovich requested a review from oleiman June 23, 2026 01:05

github-actions Bot added the area/redpanda label Jun 23, 2026

oleiman self-assigned this Jun 23, 2026

oleiman approved these changes Jun 23, 2026

oleiman merged commit 65f73e9 into redpanda-data:v26.1.x Jun 23, 2026
19 checks passed

tyson-redpanda modified the milestones: v26.1.x-next, v26.1.11 Jun 24, 2026
