[CORE-14829] Cloud Topics: Admin RPCs GetEpochInfo and AdvanceEpoch #29535
Pull request overview
This PR adds admin RPCs for advancing epochs and querying epoch information in cloud topics, enabling manual control of GC progress on idle partitions.
Changes:
- Adds `AdvanceEpoch` and `GetEpochInfo` admin RPCs to the level zero GC service
- Implements `advance_epoch` command in the ctp_stm state machine
- Adds frontend methods to expose epoch advancement and info retrieval
Reviewed changes
Copilot reviewed 30 out of 30 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| tests/rptest/tests/cloud_topics/l0_gc_test.py | Adds test for advance_epoch RPC functionality |
| tests/rptest/clients/admin/proto/.../level_zero_gc_pb2_connect.py | Generated protobuf client methods for new RPCs |
| tests/rptest/clients/admin/proto/.../level_zero_gc_pb2.pyi | Generated protobuf type stubs for new messages |
| tests/rptest/clients/admin/proto/.../level_zero_gc_pb2.py | Generated protobuf serialization code |
| src/v/redpanda/application_start.cc | Passes cluster services to ctp_stm_factory |
| src/v/redpanda/application_admin.cc | Provides partition manager and related services to GC service |
| src/v/redpanda/admin/services/internal/level_zero_gc.h | Adds method signatures for new RPCs |
| src/v/redpanda/admin/services/internal/level_zero_gc.cc | Implements advance_epoch and get_epoch_info RPCs with leader proxying |
| src/v/redpanda/admin/services/internal/BUILD | Adds dependencies for frontend and state accessors |
| src/v/cloud_topics/level_zero/stm/types.h | Adds advance_epoch command key enum value |
| src/v/cloud_topics/level_zero/stm/types.cc | Adds formatting for advance_epoch key |
| src/v/cloud_topics/level_zero/stm/tests/ctp_stm_test.cc | Adds tests for advance_epoch and sync_to_next_placeholder behavior |
| src/v/cloud_topics/level_zero/stm/ctp_stm_state.h | Exposes current_epoch_window_offset accessor |
| src/v/cloud_topics/level_zero/stm/ctp_stm_state.cc | Implements current_epoch_window_offset accessor |
| src/v/cloud_topics/level_zero/stm/ctp_stm_factory.h | Adds cluster_services member to factory |
| src/v/cloud_topics/level_zero/stm/ctp_stm_factory.cc | Passes cluster services to ctp_stm constructor |
| src/v/cloud_topics/level_zero/stm/ctp_stm_commands.h | Defines advance_epoch_cmd structure |
| src/v/cloud_topics/level_zero/stm/ctp_stm_api.h | Adds API methods for advance_epoch and sync_to_next_placeholder |
| src/v/cloud_topics/level_zero/stm/ctp_stm_api.cc | Implements advance_epoch and sync_to_next_placeholder methods |
| src/v/cloud_topics/level_zero/stm/ctp_stm.h | Adds cluster_services parameter to constructor |
| src/v/cloud_topics/level_zero/stm/ctp_stm.cc | Applies advance_epoch commands to state machine |
| src/v/cloud_topics/level_zero/stm/BUILD | Adds cluster_services dependency |
| src/v/cloud_topics/frontend/tests/frontend_test.cc | Adds test for frontend advance_epoch integration |
| src/v/cloud_topics/frontend/frontend.h | Adds epoch_info struct and advance_epoch method |
| src/v/cloud_topics/frontend/frontend.cc | Implements advance_epoch and get_epoch_info methods |
| src/v/cloud_topics/frontend/BUILD | Adds types dependency |
| src/v/cloud_topics/app.h | Adds cluster_services member |
| src/v/cloud_topics/app.cc | Constructs and exposes cluster_services |
| src/v/cloud_topics/BUILD | Adds cluster_services_impl dependency |
| proto/.../level_zero_gc.proto | Defines AdvanceEpoch and GetEpochInfo RPCs and messages |
New DT test is failing in both modes, for example, but I can't get it to fail locally.

seems we can enter a race of sorts between the cached cluster epoch on whatever node services the admin rpc and some higher value being consumed on the L0 write path, such that when we advance the epoch, we "advance" it to a value which is still strictly less than the epoch on any existing L0 object. in this case GC will still never make progress. maybe it's better to force the epoch to a specific value anyway. that makes the rpc even more unsafe, but this is a break-glass style thing anyway.
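To make the race above concrete, here is a toy model, not the actual implementation. It assumes GC may only remove an L0 object whose epoch is at or below the lowest epoch applied on any partition; that rule, the function names, and the numbers are all illustrative assumptions.

```python
from typing import List


def advance_from_cache(current_epoch: int, cached_cluster_epoch: int) -> int:
    """Advance to the serving node's cached cluster epoch, never backwards."""
    return max(current_epoch, cached_cluster_epoch)


def eligible_objects(object_epochs: List[int], partition_epochs: List[int]) -> List[int]:
    """Objects GC may remove under the ASSUMED rule: epoch <= min partition epoch."""
    floor = min(partition_epochs)
    return [e for e in object_epochs if e <= floor]


# Hypothetical numbers: an idle partition stuck at epoch 3, a busy one at 12,
# and L0 objects already written at epochs 10..12 by the write path.
object_epochs = [10, 11, 12]
stale_cached_epoch = 7  # cached value on the node that serves the admin RPC

idle_after_advance = advance_from_cache(3, stale_cached_epoch)
# The idle partition moved from 3 to 7, but every object is still above it.
stuck = eligible_objects(object_epochs, [idle_after_advance, 12])

# Forcing a specific value (here 12) unblocks GC in this model.
unblocked = eligible_objects(object_epochs, [12, 12])
```

In this model the cached advance leaves `stuck` empty, while the forced value makes every object eligible, which is the trade-off the comment describes.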
dotnwat left a comment
lgtm. afaict Tyler's feedback is also addressed.
@oleiman merge conflict |
- partition_leaders_table
- partition_manager
- shard_table

Signed-off-by: Oren Leiman <oren.leiman@redpanda.com>
Helper function to construct a cloud_topics::frontend instance on demand for a specific partition. Signed-off-by: Oren Leiman <oren.leiman@redpanda.com>
- if not leader, fwd to leader or 404
- look up the partition
- if not present, bail
- if not cloud topic, bail
- if cloud topics not initialized, bail
- create a cloud_topics::frontend
- call advance_epoch and return the result

Signed-off-by: Oren Leiman <oren.leiman@redpanda.com>
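The checks in this commit message can be read as a chain of early exits. The sketch below models only that ordering; `Outcome`, `PartitionView`, and `handle_advance_epoch` are hypothetical names for illustration and do not mirror the C++ service in `level_zero_gc.cc`.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Optional


class Outcome(Enum):
    FORWARDED = "forwarded to leader"
    NOT_FOUND = "not found (404)"
    NOT_CLOUD_TOPIC = "not a cloud topic"
    NOT_INITIALIZED = "cloud topics not initialized"
    ADVANCED = "advance_epoch called"


@dataclass
class PartitionView:
    """Hypothetical view of a locally hosted partition."""
    is_cloud_topic: bool


def handle_advance_epoch(
    self_node: int,
    leader_node: Optional[int],
    local_partitions: Dict[str, PartitionView],
    ntp: str,
    cloud_topics_ready: bool,
) -> Outcome:
    # Not the leader: forward if a leader is known, otherwise report 404.
    if leader_node is None:
        return Outcome.NOT_FOUND
    if leader_node != self_node:
        return Outcome.FORWARDED
    # We are the leader: look up the partition and bail on each failed check.
    partition = local_partitions.get(ntp)
    if partition is None:
        return Outcome.NOT_FOUND
    if not partition.is_cloud_topic:
        return Outcome.NOT_CLOUD_TOPIC
    if not cloud_topics_ready:
        return Outcome.NOT_INITIALIZED
    # In the real service this is where a frontend would be created and
    # advance_epoch invoked; here we only report that the path was reached.
    return Outcome.ADVANCED
```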
- Takes a list of TopicPartitions
- Groups the input list by leader node
- For locally led TPs
  - On leader shard, request epoch_info from cloud_topics::frontend
- For remotely led TPs, dispatch request to leader node

Signed-off-by: Oren Leiman <oren.leiman@redpanda.com>
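The grouping step described in this commit message might look roughly like the following. This is a minimal sketch: the `(topic, partition)` tuple representation, the `leaders` lookup table, and the separate bucket for partitions with no known leader are assumptions, not details taken from the PR.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

TopicPartition = Tuple[str, int]  # hypothetical (topic, partition) pair


def group_by_leader(
    tps: List[TopicPartition],
    leaders: Dict[TopicPartition, int],
) -> Dict[Optional[int], List[TopicPartition]]:
    """Bucket partitions by leader node id; None means no known leader."""
    groups: Dict[Optional[int], List[TopicPartition]] = defaultdict(list)
    for tp in tps:
        groups[leaders.get(tp)].append(tp)
    return dict(groups)


def plan_requests(
    tps: List[TopicPartition],
    leaders: Dict[TopicPartition, int],
    self_node: int,
):
    """Split the input into locally served, remotely dispatched, and leaderless."""
    groups = group_by_leader(tps, leaders)
    local = groups.pop(self_node, [])   # answered on this node's leader shard
    unknown = groups.pop(None, [])      # no leader to ask
    remote = groups                     # one dispatched request per leader node
    return local, remote, unknown
```

Grouping first means each remote leader receives a single request covering all of its partitions rather than one request per partition.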
Signed-off-by: Oren Leiman <oren.leiman@redpanda.com>
force push: rebased on dev to fix merge conflict
local test flake. cancelling these builds |
- Produce to a subset of extant cloud topics to ensure that GC won't progress
- Check that GetEpochInfo gives expected results
- Check that GC doesn't make progress
- AdvanceEpoch
- Check that GC kicks in eventually

Signed-off-by: Oren Leiman <oren.leiman@redpanda.com>
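The "doesn't make progress, then kicks in eventually" shape of this test can be sketched with a polling helper against a stand-in cluster. `FakeCluster` and this local `wait_until` are hypothetical and written only for illustration; they are not the rptest or ducktape APIs, and the real test talks to the admin RPCs instead.

```python
import time
from typing import Callable


def wait_until(condition: Callable[[], bool], timeout_sec: float, backoff_sec: float = 0.05) -> bool:
    """Poll `condition` until it returns True or the timeout elapses."""
    deadline = time.monotonic() + timeout_sec
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(backoff_sec)
    return condition()


class FakeCluster:
    """Hypothetical stand-in: GC only removes objects once the epoch is advanced."""

    def __init__(self, objects: int) -> None:
        self.objects = objects
        self.epoch_advanced = False

    def advance_epoch(self) -> None:
        self.epoch_advanced = True

    def gc_tick(self) -> None:
        # One GC pass removes at most one object, and only after the advance.
        if self.epoch_advanced and self.objects > 0:
            self.objects -= 1

    def poll_gc(self) -> bool:
        """Run one GC pass and report whether everything has been collected."""
        self.gc_tick()
        return self.objects == 0


cluster = FakeCluster(objects=3)
for _ in range(5):
    cluster.gc_tick()
objects_before_advance = cluster.objects  # GC is blocked, so nothing was removed

cluster.advance_epoch()
gc_finished = wait_until(cluster.poll_gc, timeout_sec=2, backoff_sec=0.01)
```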
dotnwat left a comment
looks like just some tweaks to the ducktape test?
yeah forgot to summarize. force push needed a different config override for the housekeeper because it has to go through
Builds on #29536
Backports Required
Release Notes