storage: manage priority logs in housekeeping_loop#29382
Conversation
|
Backporting to v25.3.x should be safe/clean. |
There was a problem hiding this comment.
Pull request overview
This PR introduces a priority housekeeping loop to prevent starvation of critical partitions like __consumer_offsets when they share a shard with heavy-weight compaction workloads. The solution implements a preemptive compaction mechanism with configurable wait times.
Changes:
- Added a new
priority_housekeeping_loop()that monitors and compacts priority partitions (e.g.,__consumer_offsets) independently - Introduced a
_compaction_mutexto serialize compaction operations and_compaction_preempt_asabort source to enable preemption of regular compaction - Added a new cluster configuration
log_compaction_max_priority_wait_ms(default: 30 minutes) to control when priority compaction preempts regular compaction
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| src/v/storage/log_manager.h | Added priority housekeeping infrastructure including new member variables, methods, and the _priority_logs_list |
| src/v/storage/log_manager.cc | Implemented priority housekeeping loop, preemption logic, and refactored housekeeping code into do_housekeeping() shared method |
| src/v/storage/tests/storage_e2e_test.cc | Added test verifying priority compaction preempts regular compaction after configured wait time |
| src/v/storage/tests/common.h | Added test accessor methods for _compaction_mutex and _compaction_preempt_as |
| src/v/ssx/mutex.h | Added get_units(duration timeout) overload to support timed mutex acquisition |
| src/v/config/configuration.h | Declared new log_compaction_max_priority_wait_ms configuration property |
| src/v/config/configuration.cc | Defined log_compaction_max_priority_wait_ms with default value of 30 minutes |
| src/v/storage/BUILD | Added dependency on ssx:abort_source |
40d55fc to
7f32600
Compare
Retry command for Build#79566please wait until all jobs are finished before running the slash command |
8293935 to
98a579a
Compare
Retry command for Build#79581please wait until all jobs are finished before running the slash command |
|
/ci-repeat 1 |
| // We clamp the offset up to which we can remove transactional control | ||
| // batches to the last snapshot taken by the transactional stm. This | ||
| // ensures that we do not remove control batches that may be needed to | ||
| // reconstruct the state machine during recovery. |
There was a problem hiding this comment.
Argh. Why did claude remove this comment when moving the function? I'll have to force push to fix this again- going to wait for other review comments to come in before doing so.
| } | ||
| } | ||
|
|
||
| ss::future<> log_manager::priority_housekeeping_loop() { |
There was a problem hiding this comment.
discussed offline a way to possibly simplify this without much code change to the existing loop
There was a problem hiding this comment.
Reworked. LMKWYT
This will be useful in a future commit for de-duplicating code.
98a579a to
dc02da6
Compare
The driving force behind this change is the fact that we have seen partitions in `__consumer_offsets` get placed on the same shard as heavy-weight compaction partitions which dominate run-time, leading to starvation and a intermitently large `__consumer_offsets` topic. To prevent this, constantly evaluate if priority partitions need compaction on the interval `log_compaction_max_priority_wait_ms`. When any priority partition requires compaction, the long running in-progress compaction will be preempted and the priority partitions can be compacted.
dc02da6 to
2beadb7
Compare
storage: add priority_housekeeping_loopstorage: manage priority logs in 'housekeeping_loop`
storage: manage priority logs in 'housekeeping_loop`storage: manage priority logs in housekeeping_loop
dotnwat
left a comment
There was a problem hiding this comment.
this looks pretty good. ill review in more detail tomorrow.
one question: should we preempt all compactions (even priority compactions) that have been running for a long time? that is, given more than one priority compaction, should we try to make progress across the priority compactions?
|
@rockwotj as an fyi in case he has an opinion |
I think, for now, given that we know
Ideally yes, if we should ever extend our "priority" definition. |
SGTM |
| // Create a composite abort source that triggers on shutdown or | ||
| // preemption. If no preempt_source is provided, just use _abort_source | ||
| // directly. | ||
| std::optional<ssx::composite_abort_source> composite_as; | ||
| auto& as = [this, &preempt_source, &composite_as]() -> ss::abort_source& { | ||
| if (preempt_source.has_value()) { | ||
| composite_as.emplace(_abort_source, preempt_source->get()); | ||
| return composite_as->as(); | ||
| } | ||
| return _abort_source; | ||
| }(); |
There was a problem hiding this comment.
afaict this commit is intended to be "factor out code for reuse", so this is net-new functionality and should probably be in the next commit?
| continue; | ||
| } | ||
|
|
||
| current_log.flags |= bflags::compacted; |
There was a problem hiding this comment.
nit: would one benefit of setting this flag before compaction ran is that if a problem occurred while compacting that housekeeping wouldn't reconsider it for a while?
There was a problem hiding this comment.
This flag is actually only considered when initializing the key-offset map starting on L299, it's not used for any scheduling decisions.
| log_meta.handle->config().min_cleanable_dirty_ratio(), | ||
| log_meta.handle->config().max_compaction_lag_ms()); | ||
| continue; | ||
| } |
There was a problem hiding this comment.
why doesn't the priority compaction handling here also do something equivalent to
if (is_not_set(current_log.flags, bflags::should_compact)) {
// Still perform gc() here on a regular `log_compaction_interval_ms`
// basis. Use `try_get_units()` to avoid concurrent garbage
// collection with `gc_loop()`- if we fail to obtain units,
// it is because urgent garbage collection is already underway for
// this log.
auto units = current_log.housekeeping_lock.try_get_units();
if (units.has_value()) {
co_await current_log.handle->gc(
gc_config(collection_threshold, _config.retention_bytes()));
}
continue;
}
which looks like it is applied to non-priority partitions in the main loop. i don't actually know if it should be done or not, but this is an example of why factoring out the "what should i work on next" is great, because then we can have unified handling of that choice and this question likely doesn't even arise.
There was a problem hiding this comment.
Yeah potentially a good point. Once again, because __consumer_offsets is the only priority topic, gc() should have no effect whatsoever here, but if we were to extend the definition of "priority", this would certainly need to be called.
but this is an example of why factoring out the "what should i work on next" is great, because then we can have unified handling of that choice and this question likely doesn't even arise.
Agreed, refactoring the entirety of housekeeping out into something more intelligent than a loop over the intrusive list would be great.
| // partitions can be serviced. If no priority partition needs | ||
| // compaction, rearm the timer to check on a shorter interval. |
There was a problem hiding this comment.
If no priority partition needs compaction, rearm the timer to check on a shorter interval.
I can see that it uses a shorter interval by looking at the code, but the comment doesn't help explain why a shorter interval is used.
There was a problem hiding this comment.
I can add more color inline here if you'd like, but the point is that once the timer is fired for the first time, we have already exceeded log_compaction_max_priority_wait_ms, and so at any point if a priority partition needs compaction, we need to bail (therefore we should rearm on a shorter time frame than log_compaction_max_priority_wait_ms)
There was a problem hiding this comment.
thanks, that explanation makes sense.
I can add more color inline here
up to you. if your intention was to highlight arming the timer at a smaller timeout then the comment needs to be updated. otherwise you could leave it or remove it.
|
/backport v25.3.x |
The driving force behind this change is the fact that we have seen partitions in
__consumer_offsetsget placed on the same shard as heavy-weight compaction partitions which dominate run-time, leading to starvation and a intermitently large__consumer_offsetstopic.To prevent this, constantly evaluate if priority partitions need compaction on the interval
log_compaction_max_priority_wait_ms. When any priority partition requires compaction, the long running in-progress compaction will be preempted and the priority partitions can be compacted.Backports Required
Release Notes
Improvements
__consumer_offsetspartitions when heavy-weight compactions are present on a broker.