[CORE-14829] Cloud Topics: Admin RPCs GetEpochInfo and AdvanceEpoch #29535

Merged
oleiman merged 6 commits into redpanda-data:dev from oleiman:ct/core-14829/epoch-advance-admin-apis
Feb 19, 2026
Conversation

@oleiman

@oleiman oleiman commented Feb 4, 2026

Member

Builds on #29536

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v25.3.x
  • v25.2.x
  • v25.1.x

Release Notes

  • none

@oleiman oleiman self-assigned this Feb 4, 2026
@oleiman oleiman force-pushed the ct/core-14829/epoch-advance-admin-apis branch from 2e92733 to ca8cc51 Compare February 4, 2026 22:28
@oleiman oleiman changed the title from "Ct/core 14829/epoch advance admin apis" to "[CORE-14829] Cloud Topics: Admin RPCs GetEpochInfo and AdvanceEpoch" Feb 4, 2026
@oleiman oleiman force-pushed the ct/core-14829/epoch-advance-admin-apis branch from ca8cc51 to 0e11b3a Compare February 4, 2026 22:55
@oleiman oleiman marked this pull request as ready for review February 4, 2026 23:04
Copilot AI review requested due to automatic review settings February 4, 2026 23:04

Copilot AI left a comment

Contributor

Pull request overview

This PR adds admin RPCs for advancing epochs and querying epoch information in cloud topics, enabling manual control of GC progress on idle partitions.

Changes:

  • Adds AdvanceEpoch and GetEpochInfo admin RPCs to the level zero GC service
  • Implements advance_epoch command in the ctp_stm state machine
  • Adds frontend methods to expose epoch advancement and info retrieval

Reviewed changes

Copilot reviewed 30 out of 30 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| tests/rptest/tests/cloud_topics/l0_gc_test.py | Adds test for advance_epoch RPC functionality |
| tests/rptest/clients/admin/proto/.../level_zero_gc_pb2_connect.py | Generated protobuf client methods for new RPCs |
| tests/rptest/clients/admin/proto/.../level_zero_gc_pb2.pyi | Generated protobuf type stubs for new messages |
| tests/rptest/clients/admin/proto/.../level_zero_gc_pb2.py | Generated protobuf serialization code |
| src/v/redpanda/application_start.cc | Passes cluster services to ctp_stm_factory |
| src/v/redpanda/application_admin.cc | Provides partition manager and related services to GC service |
| src/v/redpanda/admin/services/internal/level_zero_gc.h | Adds method signatures for new RPCs |
| src/v/redpanda/admin/services/internal/level_zero_gc.cc | Implements advance_epoch and get_epoch_info RPCs with leader proxying |
| src/v/redpanda/admin/services/internal/BUILD | Adds dependencies for frontend and state accessors |
| src/v/cloud_topics/level_zero/stm/types.h | Adds advance_epoch command key enum value |
| src/v/cloud_topics/level_zero/stm/types.cc | Adds formatting for advance_epoch key |
| src/v/cloud_topics/level_zero/stm/tests/ctp_stm_test.cc | Adds tests for advance_epoch and sync_to_next_placeholder behavior |
| src/v/cloud_topics/level_zero/stm/ctp_stm_state.h | Exposes current_epoch_window_offset accessor |
| src/v/cloud_topics/level_zero/stm/ctp_stm_state.cc | Implements current_epoch_window_offset accessor |
| src/v/cloud_topics/level_zero/stm/ctp_stm_factory.h | Adds cluster_services member to factory |
| src/v/cloud_topics/level_zero/stm/ctp_stm_factory.cc | Passes cluster services to ctp_stm constructor |
| src/v/cloud_topics/level_zero/stm/ctp_stm_commands.h | Defines advance_epoch_cmd structure |
| src/v/cloud_topics/level_zero/stm/ctp_stm_api.h | Adds API methods for advance_epoch and sync_to_next_placeholder |
| src/v/cloud_topics/level_zero/stm/ctp_stm_api.cc | Implements advance_epoch and sync_to_next_placeholder methods |
| src/v/cloud_topics/level_zero/stm/ctp_stm.h | Adds cluster_services parameter to constructor |
| src/v/cloud_topics/level_zero/stm/ctp_stm.cc | Applies advance_epoch commands to state machine |
| src/v/cloud_topics/level_zero/stm/BUILD | Adds cluster_services dependency |
| src/v/cloud_topics/frontend/tests/frontend_test.cc | Adds test for frontend advance_epoch integration |
| src/v/cloud_topics/frontend/frontend.h | Adds epoch_info struct and advance_epoch method |
| src/v/cloud_topics/frontend/frontend.cc | Implements advance_epoch and get_epoch_info methods |
| src/v/cloud_topics/frontend/BUILD | Adds types dependency |
| src/v/cloud_topics/app.h | Adds cluster_services member |
| src/v/cloud_topics/app.cc | Constructs and exposes cluster_services |
| src/v/cloud_topics/BUILD | Adds cluster_services_impl dependency |
| proto/.../level_zero_gc.proto | Defines AdvanceEpoch and GetEpochInfo RPCs and messages |

Comment thread src/v/redpanda/admin/services/internal/level_zero_gc.cc Outdated
Comment thread src/v/cloud_topics/level_zero/stm/ctp_stm_api.cc
Comment thread src/v/cloud_topics/app.cc
@vbotbuildovich

vbotbuildovich commented Feb 5, 2026

Collaborator

CI test results

test results on build#80170
| test_class | test_method | test_arguments | test_kind | job_url | test_status | passed | reason | test_history |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MasterTestSuite | test_remote_partition_read_cached_index | | unit | https://buildkite.com/redpanda/redpanda/builds/80170#019c2b2b-ae1e-4070-b0b0-ccc14c9d6a60 | FAIL | 0/1 | | |
test results on build#80234
| test_class | test_method | test_arguments | test_kind | job_url | test_status | passed | reason | test_history |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CloudTopicsL0GCAdminTest | test_advance_epoch | {"cloud_storage_type": 1} | integration | https://buildkite.com/redpanda/redpanda/builds/80234#019c2f3b-171b-4b61-82d6-4ac7508ae2aa | FLAKY | 8/11 | The test was found to be new, and no failures are allowed | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=CloudTopicsL0GCAdminTest&test_method=test_advance_epoch |
| CloudTopicsL0GCAdminTest | test_advance_epoch | {"cloud_storage_type": 1} | integration | https://buildkite.com/redpanda/redpanda/builds/80234#019c2f48-5c3f-4fc5-b6cb-38a4f5729396 | FLAKY | 10/11 | The test was found to be new, and no failures are allowed | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=CloudTopicsL0GCAdminTest&test_method=test_advance_epoch |
| AutomaticLeadershipBalancingTest | test_automatic_rebalance | null | integration | https://buildkite.com/redpanda/redpanda/builds/80234#019c2f3b-1716-4565-bdc7-84a059c6ffef | FLAKY | 10/11 | Test PASSES after retries. No significant increase in flaky rate(baseline=0.0000, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=AutomaticLeadershipBalancingTest&test_method=test_automatic_rebalance |
test results on build#80249
| test_class | test_method | test_arguments | test_kind | job_url | test_status | passed | reason | test_history |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CloudTopicsL0GCAdminTest | test_advance_epoch | {"cloud_storage_type": 1} | integration | https://buildkite.com/redpanda/redpanda/builds/80249#019c2f9e-e083-4292-86bf-11c616317946 | FLAKY | 28/35 | The test was found to be new, and no failures are allowed | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=CloudTopicsL0GCAdminTest&test_method=test_advance_epoch |
| CloudTopicsL0GCAdminTest | test_advance_epoch | {"cloud_storage_type": 1} | integration | https://buildkite.com/redpanda/redpanda/builds/80249#019c2f9e-e083-4de1-b94b-9165a52483fc | FLAKY | 33/35 | The test was found to be new, and no failures are allowed | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=CloudTopicsL0GCAdminTest&test_method=test_advance_epoch |
| CloudTopicsL0GCAdminTest | test_advance_epoch | {"cloud_storage_type": 1} | integration | https://buildkite.com/redpanda/redpanda/builds/80249#019c2f9e-e089-45be-8407-aa5765c6bb76 | FLAKY | 29/35 | The test was found to be new, and no failures are allowed | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=CloudTopicsL0GCAdminTest&test_method=test_advance_epoch |
| CloudTopicsL0GCAdminTest | test_advance_epoch | {"cloud_storage_type": 1} | integration | https://buildkite.com/redpanda/redpanda/builds/80249#019c2f9e-e08a-44f2-9ab0-d5500e71991b | FLAKY | 28/35 | The test was found to be new, and no failures are allowed | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=CloudTopicsL0GCAdminTest&test_method=test_advance_epoch |
test results on build#80261
| test_class | test_method | test_arguments | test_kind | job_url | test_status | passed | reason | test_history |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| QuotaManagementUpgradeTest | test_upgrade | null | integration | https://buildkite.com/redpanda/redpanda/builds/80261#019c2ff9-ea4a-4a11-b917-9e507383b623 | FLAKY | 10/11 | Test PASSES after retries. No significant increase in flaky rate(baseline=0.0678, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1898, p1=0.1219, trust_threshold=0.5000) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=QuotaManagementUpgradeTest&test_method=test_upgrade |
test results on build#80294
| test_class | test_method | test_arguments | test_kind | job_url | test_status | passed | reason | test_history |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ShadowLinkTopicFailoverTests | test_link_failover | {"source_cluster_spec": {"cluster_type": "kafka", "kafka_quorum": "COMBINED_KRAFT", "kafka_version": "3.8.0"}, "with_failures": false} | integration | https://buildkite.com/redpanda/redpanda/builds/80294#019c31b1-1d57-4b62-871e-079812ad6b22 | FLAKY | 10/11 | Test PASSES after retries. No significant increase in flaky rate(baseline=0.0000, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkTopicFailoverTests&test_method=test_link_failover |
| QuotaManagementUpgradeTest | test_upgrade | null | integration | https://buildkite.com/redpanda/redpanda/builds/80294#019c31ad-b400-46fb-b590-c86a19d92fff | FLAKY | 9/11 | Test PASSES after retries. No significant increase in flaky rate(baseline=0.0687, p0=0.5094, reject_threshold=0.0100. adj_baseline=0.1924, p1=0.3993, trust_threshold=0.5000) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=QuotaManagementUpgradeTest&test_method=test_upgrade |
| RedpandaNodeOperationsSmokeTest | test_node_ops_smoke_test | {"cloud_storage_type": 1, "mixed_versions": false} | integration | https://buildkite.com/redpanda/redpanda/builds/80294#019c31ad-b402-498f-891b-9bd2e5f46acf | FLAKY | 6/11 | Test FAILS after retries. Significant increase in flaky rate(baseline=0.0256, p0=0.0001, reject_threshold=0.0100) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=RedpandaNodeOperationsSmokeTest&test_method=test_node_ops_smoke_test |
| WriteCachingFailureInjectionE2ETest | test_crash_all | {"use_transactions": false} | integration | https://buildkite.com/redpanda/redpanda/builds/80294#019c31b1-1d53-43a2-b2bb-587fce4b0261 | FLAKY | 15/21 | Test PASSES after retries. No significant increase in flaky rate(baseline=0.1125, p0=0.0661, reject_threshold=0.0100. adj_baseline=0.3010, p1=0.4124, trust_threshold=0.5000) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=WriteCachingFailureInjectionE2ETest&test_method=test_crash_all |
test results on build#80624
| test_class | test_method | test_arguments | test_kind | job_url | test_status | passed | reason | test_history |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| RedpandaNodeOperationsSmokeTest | test_node_ops_smoke_test | {"cloud_storage_type": 1, "mixed_versions": false} | integration | https://buildkite.com/redpanda/redpanda/builds/80624#019c6e08-d1df-4bb9-b627-446ad7acf4b5 | FLAKY | 9/11 | Test PASSES after retries. No significant increase in flaky rate(baseline=0.1160, p0=0.7085, reject_threshold=0.0100. adj_baseline=0.3091, p1=0.1356, trust_threshold=0.5000) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=RedpandaNodeOperationsSmokeTest&test_method=test_node_ops_smoke_test |
test results on build#80786
| test_class | test_method | test_arguments | test_kind | job_url | test_status | passed | reason | test_history |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| WriteCachingFailureInjectionE2ETest | test_crash_all | {"use_transactions": false} | integration | https://buildkite.com/redpanda/redpanda/builds/80786#019c7706-cd87-4c69-8bc4-a9caf44a5665 | FLAKY | 25/31 | Test PASSES after retries. No significant increase in flaky rate(baseline=0.0850, p0=0.1064, reject_threshold=0.0100. adj_baseline=0.2338, p1=0.2645, trust_threshold=0.5000) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=WriteCachingFailureInjectionE2ETest&test_method=test_crash_all |

@oleiman oleiman force-pushed the ct/core-14829/epoch-advance-admin-apis branch 2 times, most recently from 4e24003 to dd4f57a Compare February 5, 2026 17:23
@vbotbuildovich

vbotbuildovich commented Feb 5, 2026

Collaborator

Retry command for Build#80221

please wait until all jobs are finished before running the slash command

/ci-repeat 1
skip-redpanda-build
skip-units
skip-rebase
tests/rptest/tests/cloud_topics/l0_gc_test.py::CloudTopicsL0GCAdminTest.test_advance_epoch@{"cloud_storage_type":1}

@oleiman oleiman force-pushed the ct/core-14829/epoch-advance-admin-apis branch 2 times, most recently from a0efd19 to 3070332 Compare February 5, 2026 18:58
@vbotbuildovich

vbotbuildovich commented Feb 5, 2026

Collaborator

Retry command for Build#80234

please wait until all jobs are finished before running the slash command

/ci-repeat 1
skip-redpanda-build
skip-units
skip-rebase
tests/rptest/tests/cloud_topics/l0_gc_test.py::CloudTopicsL0GCAdminTest.test_advance_epoch@{"cloud_storage_type":1}

@oleiman

oleiman commented Feb 5, 2026

Member Author

/ci-repeat 2
skip-redpanda-build
skip-units
skip-rebase
dt-repeat=25
tests/rptest/tests/cloud_topics/l0_gc_test.py::CloudTopicsL0GCAdminTest.test_advance_epoch@{"cloud_storage_type":1}

@oleiman

oleiman commented Feb 5, 2026

Member Author

The new ducktape (DT) test is failing in CI in both modes, but I can't get it to fail locally.

@oleiman

oleiman commented Feb 5, 2026

Member Author

It seems we can hit a race between the cached cluster epoch on whichever node services the admin RPC and a higher value already consumed on the L0 write path. When that happens, we "advance" the epoch to a value that is still strictly less than the epoch on every existing L0 object, so GC still never makes progress.

Maybe it's better to force the epoch to a specific value anyway. That makes the RPC even more unsafe, but this is a break-glass style thing anyway.
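
To make the race described above concrete, here is a minimal sketch. It assumes, purely for illustration, that GC can only reclaim an L0 object whose epoch is less than or equal to the partition's epoch; the function name `collectible` and all the numbers are hypothetical and not taken from the PR.

```python
from typing import List


def collectible(object_epochs: List[int], partition_epoch: int) -> List[int]:
    """Epochs of L0 objects that GC could reclaim (assumed rule: epoch <= partition epoch)."""
    return [e for e in object_epochs if e <= partition_epoch]


# Stale value cached on the node that happens to serve the admin RPC (hypothetical).
cached_cluster_epoch = 4
# The L0 write path has already consumed higher epochs for existing objects (hypothetical).
object_epochs = [6, 7]

# "Advancing" to the stale cached value leaves every object out of reach.
stale = collectible(object_epochs, cached_cluster_epoch)

# Forcing the epoch to an explicit, high-enough value unblocks GC.
forced = collectible(object_epochs, max(object_epochs))
```

Under this assumed rule, the stale advance reclaims nothing, which matches the observation that GC never makes progress, while an explicitly chosen value does.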

@oleiman oleiman force-pushed the ct/core-14829/epoch-advance-admin-apis branch from 3070332 to a9d3ded Compare February 5, 2026 22:23
@oleiman

oleiman commented Feb 6, 2026

Member Author

ci-repeat 2
debug
skip-redpanda-build
skip-units
skip-rebase
dt-repeat=42
tests/rptest/tests/cloud_topics/l0_gc_test.py::CloudTopicsL0GCAdminTest.test_advance_epoch@{"cloud_storage_type":1}

@oleiman oleiman force-pushed the ct/core-14829/epoch-advance-admin-apis branch from a9d3ded to 2655ade Compare February 6, 2026 06:20
@vbotbuildovich

Collaborator

Retry command for Build#80294

please wait until all jobs are finished before running the slash command

/ci-repeat 1
skip-redpanda-build
skip-units
skip-rebase
tests/rptest/tests/random_node_operations_smoke_test.py::RedpandaNodeOperationsSmokeTest.test_node_ops_smoke_test@{"cloud_storage_type":1,"mixed_versions":false}

@oleiman

oleiman commented Feb 6, 2026

Member Author

/ci-repeat 1
debug
skip-redpanda-build
skip-units
skip-rebase
tests/rptest/tests/random_node_operations_smoke_test.py::RedpandaNodeOperationsSmokeTest.test_node_ops_smoke_test@{"cloud_storage_type":1,"mixed_versions":false}

@oleiman oleiman added the claude-review label (adding this label to a PR triggers a workflow that reviews the code using Claude) Feb 6, 2026
@oleiman oleiman force-pushed the ct/core-14829/epoch-advance-admin-apis branch from 2655ade to a85156f Compare February 17, 2026 22:59
@oleiman oleiman requested a review from rockwotj February 18, 2026 23:20
dotnwat
dotnwat previously approved these changes Feb 19, 2026

@dotnwat dotnwat left a comment

Member

lgtm. afaict Tyler's feedback is also addressed.

@oleiman oleiman removed the claude-review label Feb 19, 2026
@oleiman

oleiman commented Feb 19, 2026

Member Author

/ci-repeat 1
skip-redpanda-build
skip-units
skip-rebase

@oleiman

oleiman commented Feb 19, 2026

Member Author

/ci-repeat 1

@oleiman oleiman enabled auto-merge February 19, 2026 04:42
@oleiman

oleiman commented Feb 19, 2026

Member Author

/ci-repeat 1

@dotnwat

dotnwat commented Feb 19, 2026

Member

@oleiman merge conflict

- partition_leaders_table
- partition_manager
- shard_table

Signed-off-by: Oren Leiman <oren.leiman@redpanda.com>
Helper function to construct a cloud_topics::frontend instance on demand for
a specific partition.

Signed-off-by: Oren Leiman <oren.leiman@redpanda.com>
- if not leader, fwd to leader or 404
- look up the partition
- if not present, bail
- if not cloud topic, bail
- if cloud topics not initialized, bail
- create a cloud_topics::frontend
- call advance_epoch and return the result

Signed-off-by: Oren Leiman <oren.leiman@redpanda.com>
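
The AdvanceEpoch steps in the commit message above can be sketched roughly as follows. This is a simplified Python model, not the C++ handler: `Node`, `Partition`, `Frontend`, `handle_advance_epoch`, the explicit `new_epoch` argument, and the 400/503 status codes for the "bail" cases are all assumptions made for illustration (the source only specifies 404 when there is no leader to forward to).

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class Partition:
    is_cloud_topic: bool
    epoch: int = 0


@dataclass
class Node:
    node_id: int
    partitions: Dict[str, Partition] = field(default_factory=dict)
    cloud_topics_ready: bool = True


class Frontend:
    """Stand-in for cloud_topics::frontend, built on demand for one partition."""

    def __init__(self, partition: Partition):
        self._partition = partition

    def advance_epoch(self, new_epoch: int) -> int:
        # Assumption: the epoch never moves backwards.
        self._partition.epoch = max(self._partition.epoch, new_epoch)
        return self._partition.epoch


def handle_advance_epoch(
    node: Node,
    cluster: Dict[int, Node],
    leaders: Dict[str, int],
    tp: str,
    new_epoch: int,
) -> Tuple[int, Optional[int]]:
    """Return (status, resulting_epoch) for an AdvanceEpoch request arriving at `node`."""
    leader_id = leaders.get(tp)
    if leader_id != node.node_id:
        # Not the leader: forward to the leader, or 404 if none is known.
        if leader_id is None or leader_id not in cluster:
            return 404, None
        return handle_advance_epoch(cluster[leader_id], cluster, leaders, tp, new_epoch)

    partition = node.partitions.get(tp)
    if partition is None:
        return 404, None  # partition not present: bail
    if not partition.is_cloud_topic:
        return 400, None  # not a cloud topic: bail (status code is an assumption)
    if not node.cloud_topics_ready:
        return 503, None  # cloud topics not initialized: bail (status code is an assumption)
    return 200, Frontend(partition).advance_epoch(new_epoch)
```

In this model a request that lands on a follower is transparently proxied, so the caller sees the same result regardless of which node it contacted.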
- Takes a list of TopicPartitions
- Groups the input list by leader node
- For locally led TPs
  - On leader shard, request epoch_info from cloud_topics::frontend
- For remotely led TPs, dispatch request to leader node

Signed-off-by: Oren Leiman <oren.leiman@redpanda.com>
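
The GetEpochInfo fan-out described in the commit message above might look like the sketch below. It is a Python approximation under stated assumptions: `group_by_leader`, `get_epoch_info`, and the `remote_fetch` callback are hypothetical names, topic-partitions are plain strings, the per-shard hop on the leader node is omitted, and leaderless partitions are simply skipped.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Optional


def group_by_leader(
    tps: Iterable[str], leaders: Dict[str, int]
) -> Dict[Optional[int], List[str]]:
    """Group topic-partitions by the node id that currently leads them."""
    groups: Dict[Optional[int], List[str]] = defaultdict(list)
    for tp in tps:
        groups[leaders.get(tp)].append(tp)
    return dict(groups)


def get_epoch_info(
    self_id: int,
    tps: Iterable[str],
    leaders: Dict[str, int],
    local_epochs: Dict[str, int],
    remote_fetch: Callable[[int, List[str]], Dict[str, int]],
) -> Dict[str, int]:
    """Collect epoch info: answer locally led TPs here, dispatch the rest per leader."""
    result: Dict[str, int] = {}
    for leader, group in group_by_leader(tps, leaders).items():
        if leader is None:
            continue  # no known leader; skipped in this sketch
        if leader == self_id:
            for tp in group:
                result[tp] = local_epochs[tp]
        else:
            # One request per remote leader, carrying all of its TPs.
            result.update(remote_fetch(leader, group))
    return result
```

Grouping first means each remote leader receives a single batched request rather than one request per partition.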
Signed-off-by: Oren Leiman <oren.leiman@redpanda.com>
@oleiman oleiman force-pushed the ct/core-14829/epoch-advance-admin-apis branch from 6860356 to c54fc8e Compare February 19, 2026 16:50
@oleiman

oleiman commented Feb 19, 2026

Member Author

Force-pushed a rebase onto dev to fix the merge conflict.

@oleiman oleiman requested a review from dotnwat February 19, 2026 16:50
dotnwat
dotnwat previously approved these changes Feb 19, 2026
rockwotj
rockwotj previously approved these changes Feb 19, 2026
@oleiman oleiman disabled auto-merge February 19, 2026 17:03
@oleiman

oleiman commented Feb 19, 2026

Member Author

local test flake. cancelling these builds

@oleiman oleiman dismissed stale reviews from rockwotj and dotnwat via 2f49544 February 19, 2026 17:27
@oleiman oleiman force-pushed the ct/core-14829/epoch-advance-admin-apis branch from c54fc8e to 2f49544 Compare February 19, 2026 17:27
- Produce to a subset of extant cloud topics to ensure that GC won't progress
- Check that GetEpochInfo gives expected results
- Check that GC doesn't make progress
- AdvanceEpoch
- Check that GC kicks in eventually

Signed-off-by: Oren Leiman <oren.leiman@redpanda.com>
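
The test plan above can be mirrored by a toy model. The sketch assumes, as a simplification that the PR does not confirm, that GC may only reclaim L0 objects whose epoch does not exceed the smallest epoch across all partitions, so an idle partition pins GC until its epoch is advanced. `gc_threshold`, `reclaimable`, and the numbers are hypothetical.

```python
from typing import Dict, List


def gc_threshold(partition_epochs: Dict[str, int]) -> int:
    # Assumption: the slowest partition bounds what GC may reclaim.
    return min(partition_epochs.values())


def reclaimable(object_epochs: List[int], partition_epochs: Dict[str, int]) -> List[int]:
    """Epochs of objects GC could reclaim under the assumed min-epoch rule."""
    limit = gc_threshold(partition_epochs)
    return [e for e in object_epochs if e <= limit]


# Produce to a subset of partitions: "busy/0" moves ahead, "idle/0" stays behind.
partition_epochs = {"busy/0": 5, "idle/0": 1}
object_epochs = [2, 3, 4]

# GC does not make progress while the idle partition lags.
before = reclaimable(object_epochs, partition_epochs)

# AdvanceEpoch on the idle partition (target value is hypothetical).
partition_epochs["idle/0"] = 5

# GC kicks in once the idle partition has caught up.
after = reclaimable(object_epochs, partition_epochs)
```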
@oleiman oleiman force-pushed the ct/core-14829/epoch-advance-admin-apis branch from 2f49544 to 78cd6e2 Compare February 19, 2026 17:37

@dotnwat dotnwat left a comment

Member

looks like just some tweaks to the ducktape test?

@oleiman

oleiman commented Feb 19, 2026

Member Author

> looks like just some tweaks to the ducktape test?

yeah, forgot to summarize. the force push uses a different config override for the housekeeper, because that override has to go through SiSettings

@oleiman oleiman enabled auto-merge February 19, 2026 18:36
@oleiman oleiman merged commit c6ee51e into redpanda-data:dev Feb 19, 2026
30 checks passed

5 participants