ct/l1/metastore: adjustments to pre-opening and add configurations #30849

Open
andrwng wants to merge 5 commits into redpanda-data:dev from andrwng:lsm-preopen-and-configs

andrwng (Contributor) commented Jun 18, 2026

This PR makes a few tweaks to the opening of the metastore:

  • First, it makes the LSM open call a background operation rather than a synchronous operation. For the metastore, the LSM is opened upon becoming partition leader, and it waits for the open to complete before serving requests. Even with cold SSTs it's worth allowing the open to proceed, at least to avoid a window of metastore unavailability.
  • Actually make the metastore use the preopening feature. In a scaled cluster, I've seen this significantly help avoid high metastore latencies on cold starts (empirically, low minutes -> low tens of seconds).
  • Adds configurables for the memtable size and block cache size. The defaults are left as is, but they're useful to have as tunables at least for testing.
  • Related configs are also plumbed into the read replica databases

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v26.1.x
  • v25.3.x
  • v25.2.x

Release Notes

  • None

Copilot AI review requested due to automatic review settings June 18, 2026 22:47
@andrwng andrwng requested a review from a team as a code owner June 18, 2026 22:47

Copilot AI (Contributor) left a comment

Pull request overview

This PR refactors LSM “pre-open” behavior into an asynchronous background prewarm task and wires new cloud-topics L1 metastore configuration knobs (block cache size, write buffer size, and prewarm concurrency) into the LSM open options.

Changes:

  • Add version::contains() and use it during prewarm to avoid warming SSTs that have fallen out of the live version.
  • Move SST pre-opening from recover() to a best-effort background “prewarm” started after open (including readonly opens), and update the unit test accordingly.
  • Introduce new cloud_topics_metastore_* configuration properties and pass them into metastore/read-replica LSM open options.
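
To illustrate the liveness check summarized in the changes above, here is a minimal sketch of a `contains()` lookup implemented as a linear scan across levels and files. The type names (`file_handle`, `file_meta`, `version_sketch`) are hypothetical stand-ins and are not taken from the PR; the real `version` class in `src/v/lsm/db/version_set.h` may store its files differently.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical stand-in for an SST file handle (not the PR's actual type).
struct file_handle {
    std::uint64_t id;
    bool operator==(const file_handle& other) const { return id == other.id; }
};

// Hypothetical per-file metadata; only the handle matters for this sketch.
struct file_meta {
    file_handle handle;
};

// Hypothetical simplified version: one vector of files per LSM level.
class version_sketch {
public:
    explicit version_sketch(std::vector<std::vector<file_meta>> levels)
      : _levels(std::move(levels)) {}

    // Linear scan over every level and every file in that level.
    // Returns true if the handle is still part of this version.
    bool contains(file_handle h) const {
        for (const auto& level : _levels) {
            for (const auto& file : level) {
                if (file.handle == h) {
                    return true;
                }
            }
        }
        return false;
    }

private:
    std::vector<std::vector<file_meta>> _levels;
};
```

A prewarm loop could call `contains()` on the current live version before opening each SST and skip handles that compaction has already dropped. The scan is O(total files), which is presumably acceptable for a best-effort background task.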

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.

Show a summary per file
| File | Description |
|---|---|
| src/v/lsm/db/version_set.h | Adds version::contains() declaration for checking handle liveness in a version. |
| src/v/lsm/db/version_set.cc | Implements version::contains() via linear scan across levels/files. |
| src/v/lsm/db/impl.h | Declares new prewarm helpers (maybe_start_prewarm, prewarm). |
| src/v/lsm/db/impl.cc | Starts prewarm as a background task after open; adds skipping of non-live SSTs; adjusts readonly open path. |
| src/v/lsm/db/tests/impl_test.cc | Updates PreOpenFiles test to account for asynchronous prewarm (draining task queue). |
| src/v/config/configuration.h | Adds new cloud-topics metastore config properties. |
| src/v/config/configuration.cc | Defines new config properties (defaults + bounds + help text). |
| src/v/cloud_topics/read_replica/snapshot_manager.cc | Plumbs metastore prewarm + block cache config into readonly LSM open. |
| src/v/cloud_topics/level_one/metastore/lsm/replicated_db.cc | Plumbs metastore write buffer, prewarm, and block cache config into replicated LSM open. |

Comment thread src/v/lsm/db/impl.cc Outdated
Comment thread src/v/config/configuration.cc Outdated
Comment thread src/v/lsm/db/tests/impl_test.cc
@andrwng andrwng force-pushed the lsm-preopen-and-configs branch from 448abd8 to 15d0fcd Compare June 19, 2026 00:23
vbotbuildovich (Collaborator) commented Jun 19, 2026

CI test results

test results on build#86019

| test_status | test_class | test_method | test_arguments | test_kind | job_url | passed | reason | test_history |
|---|---|---|---|---|---|---|---|---|
| FLAKY(PASS) | NodeWiseRecoveryTest | test_recovery_local_data_missing | {"wait_for_final_manifest_uploads": true} | integration | https://buildkite.com/redpanda/redpanda/builds/86019#019edd4a-0f97-41c9-96fd-244305ca67ae | 10/11 | Test PASSES after retries. No significant increase in flaky rate (baseline=0.0455, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1304, p1=0.2474, trust_threshold=0.5000) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=NodeWiseRecoveryTest&test_method=test_recovery_local_data_missing |

test results on build#86110

| test_status | test_class | test_method | test_arguments | test_kind | job_url | passed | reason | test_history |
|---|---|---|---|---|---|---|---|---|
| FLAKY(PASS) | ShardPlacementTest | test_core_count_change | null | integration | https://buildkite.com/redpanda/redpanda/builds/86110#019ef0ef-baa8-40ca-9fb4-8356bda5b426 | 10/11 | Test PASSES after retries. No significant increase in flaky rate (baseline=0.0089, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShardPlacementTest&test_method=test_core_count_change |

andrwng added 5 commits June 22, 2026 12:34
recover() previously had the option to pre-open every SST synchronously.
For the metastore (using cloud cache persistence), this meant it would
download each SST and keep a file handle in the SST cache until it was
evicted.

At smaller scales this would be fine, but with larger SSTs, this
prerequisite to download everything can be unstable, since requests to
the metastore can't be served during this pre-opening phase.

This commit makes this background work instead, such that the initial
opening of the database is quick, but the work to pre-warm everything
else still happens, just in the background.
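
The background, bounded-concurrency pre-warm described in this commit message can be sketched roughly as follows. This is a standard-library approximation only: the PR uses Seastar fibers on a single shard, whereas this sketch swaps in `std::thread` workers and `std::async`. All names here (`prewarm_sketch`, `start_prewarm_sketch`, the `is_live` and `warm` callbacks) are hypothetical and do not come from the PR.

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <future>
#include <thread>
#include <utility>
#include <vector>

// Warms every live handle using at most `max_workers` concurrent workers.
// `is_live` models a liveness check against the current version and `warm`
// models opening an SST into the table cache; both are assumptions and must
// be safe to call from multiple threads. Returns the number of handles warmed.
std::size_t prewarm_sketch(
  const std::vector<std::uint64_t>& handles,
  std::size_t max_workers,
  const std::function<bool(std::uint64_t)>& is_live,
  const std::function<void(std::uint64_t)>& warm) {
    if (max_workers == 0 || handles.empty()) {
        return 0; // a limit of 0 disables pre-warming in this sketch
    }
    std::atomic<std::size_t> next{0};
    std::atomic<std::size_t> warmed{0};
    auto worker = [&] {
        for (;;) {
            const std::size_t i = next.fetch_add(1);
            if (i >= handles.size()) {
                return;
            }
            if (!is_live(handles[i])) {
                continue; // skip SSTs that fell out of the live version
            }
            try {
                warm(handles[i]);
                ++warmed;
            } catch (...) {
                // Best effort: a failed warm-up should not fail the open.
            }
        }
    };
    const std::size_t n = std::min(max_workers, handles.size());
    std::vector<std::thread> workers;
    workers.reserve(n);
    for (std::size_t w = 0; w < n; ++w) {
        workers.emplace_back(worker);
    }
    for (auto& t : workers) {
        t.join();
    }
    return warmed.load();
}

// Starts the pre-warm in the background so the caller (the database open
// path, in the PR's terms) can return without waiting for it.
std::future<std::size_t> start_prewarm_sketch(
  std::vector<std::uint64_t> handles,
  std::size_t max_workers,
  std::function<bool(std::uint64_t)> is_live,
  std::function<void(std::uint64_t)> warm) {
    return std::async(
      std::launch::async,
      [handles = std::move(handles),
       max_workers,
       is_live = std::move(is_live),
       warm = std::move(warm)] {
          return prewarm_sketch(handles, max_workers, is_live, warm);
      });
}
```

The caller keeps the returned future only to observe or drain the work, for example in a test that needs the pre-warm to finish before asserting on cache contents. The open path itself does not block on it.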

Adds cloud_topics_metastore_max_pre_open_fibers cluster config, wired
into the lsm::options used by replicated_database::open. When set to
N > 0, lsm::db::impl::recover walks every level's files and warms the
table cache via N concurrent fibers before returning from open.

Because a metastore replicated_database is constructed each time a
node takes leadership for a metastore_partition, this pays the SST
open cost up front on leadership transfer, so subsequent reads hit
warm file handles instead of contending on the per-handle
table_cache loader lock (the cold-path wait covered by
sst_loader_wait_*).

Default is 10 (somewhat arbitrarily chosen, but seems like a decently
balanced value).

Allow for configuring the write buffer size. I haven't used this in
practice, but it can be helpful to e.g. define target sizes within each
level if we need it.

Expose the LSM block cache size as a tunable. I left the default as its
existing default, but allowing it to be tunable makes testing easier and
also gives us flexibility to change it in the future in deployments that
could benefit from it.
Also does some test plumbing to supply a real cache to
snapshot_manager_test, since the pre-warming will require a cache.
@andrwng andrwng force-pushed the lsm-preopen-and-configs branch from 15d0fcd to 4537d17 Compare June 22, 2026 19:34