[CORE-14856] storage: remove early return path in self_compact_segment() #28730
Merged
This is actually a bug that was revealed by transactional control
batch removal.
Take the following steps:
```
1a) 20.log:INFO 2025-11-24 16:15:04,147 [shard 0:lc ] storage-gc - segment_utils.cc:691 - Rebuilding index file... (/var/lib/redpanda/data/kafka/topic-jxifylkscz/0_55/79964-17-v1.compaction_index)
1b) 20.log:INFO 2025-11-24 16:15:04,157 [shard 0:lc ] storage-gc - segment_utils.cc:668 - tx reducer path: /var/lib/redpanda/data/kafka/topic-jxifylkscz/0_55/79964-17-v1.compaction_index stats { batches processed: 1273, aborted_txs: 76, discarded batches: 744 }
2) 20.log:TRACE 2025-11-24 16:15:04,157 [shard 0:lc ] storage-gc - segment_utils.cc:576 - self compacting segment /var/lib/redpanda/data/kafka/topic-jxifylkscz/0_55/79964-17-v1.log
3a) 20.log:DEBUG 2025-11-24 16:15:04,159 [shard 0:main] storage - disk_log_impl.cc:307 - closing log {...}
3b) 20.log:DEBUG 2025-11-24 16:15:04,159 [shard 0:lc ] storage-gc - disk_log_impl.cc:1393 - Cleaning up leftover file: /var/lib/redpanda/data/kafka/topic-jxifylkscz/0_55/79964-17-v1.log.staging
...
[X-shard partition move from shard 0 to shard 1]
...
4) 20.log:DEBUG 2025-11-24 16:15:05,437 [shard 1:lc ] storage-gc - segment_utils.cc:809 - detected /var/lib/redpanda/data/kafka/topic-jxifylkscz/0_55/79964-17-v1.compaction_index is already compacted
```
Here we have the following situation:
1. In a), the `compaction_index` for `segment` `79964-17-v1.log` is rebuilt
using the `tx_reducer`, which filters out aborted transactional data _but
only in the `.compaction_index` file_. This process also marks the footer
with the `self_compaction` footer flag, which is a terribly confusing and
potentially outdated bit of code, since the code wants to interpret this
as meaning the `segment` has also gone through a self compaction, which
is _not_ true.
2. We begin to self compact the `segment` itself. This is the necessary
step to ensure that aborted transactional data is removed, since those
offset deltas won't exist in the bitmap generated from the `compaction_index`.
A key detail to remember here is that window compaction does _not_
remove data that isn't indexed in the key-offset map generated from the
`compaction_index`, which means sliding window compaction _cannot_
remove aborted transactional data.
3. In a), the `log` is closed due to a cross-shard move of the partition.
In b), we can see that the replacement `segment` `79964-17-v1.log.staging`
undergoing self compaction gets removed before it can replace `79964-17-v1.log`.
4. On the next attempted self compaction of `79964-17-v1.log`, we detect that the
`self_compaction` footer flag has been set in the `.compaction_index`, and
interpret that as indicating `segment` `79964-17-v1.log` has been self
compacted already.
We then early exit, mark `79964-17-v1.log` as self compacted (even though it isn't),
and sliding window compaction can then:
1. Remove the `tx_fence` batch
2. Unset the transactional bits in the aborted raft data batches
3. Remove the `tx_abort` control batch
This leads to aborted data persisting and to a CI failure such as:
```
File "/root/tests/rptest/transactions/verifiers/compacted_verifier.py", line 238, in _remote_info
self.raise_on_violation(self._node)
rptest.transactions.verifiers.compacted_verifier.ConsistencyViolationException: 86784 102 0 violation read aborted key3=71009@85210
```
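The sequence in steps 1 through 4 can be replayed with a toy model. This is a minimal sketch, not Redpanda's code: `segment_model` and every function below are hypothetical stand-ins, and only the ordering of events is taken from the logs above.

```cpp
#include <cassert>
#include <optional>

// Hypothetical, simplified model of one segment; not Redpanda's real types.
struct segment_model {
    bool index_footer_self_compaction = false;  // footer flag in the .compaction_index
    bool staging_file_exists = false;           // <segment>.log.staging on disk
    bool log_rewritten = false;                 // true once the .log itself was replaced
    std::optional<long> self_compact_timestamp; // set when self compaction is recorded as finished
};

// Step 1: the index rebuild (with the tx reducer) only touches the index,
// yet it sets the footer flag.
void rebuild_compaction_index(segment_model& s) {
    s.index_footer_self_compaction = true;
}

// Step 2: self compaction starts writing the replacement into a staging file.
void begin_self_compaction(segment_model& s) {
    assert(!s.staging_file_exists); // model invariant: one staging file at a time
    s.staging_file_exists = true;
}

// Step 3: the log is closed mid-compaction; the staging file is cleaned up
// before it could replace the .log.
void close_log(segment_model& s) {
    s.staging_file_exists = false;
}

// Step 4, with the removed early return path: the footer flag alone is taken
// as proof that the segment was already compacted.
// Returns true if the .log was actually rewritten.
bool self_compact_with_early_return(segment_model& s, long now) {
    if (s.index_footer_self_compaction) {
        s.self_compact_timestamp = now; // marked finished without any rewrite
        return false;
    }
    s.log_rewritten = true;
    s.self_compact_timestamp = now;
    return true;
}
```

Replaying the four steps in order against a fresh `segment_model` leaves `self_compact_timestamp` set while `log_rewritten` is still false, which is the inconsistent state described above.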
To fix the race described in the steps above, simply remove the faulty check
that is based on the footer flag in the `compaction_index`. While it did
make sense to have this check in order to prevent unnecessary re-compaction
of data over upgrades, the `self_compact_timestamp` flag has been present
since `v25.2.1` and should be considered the source of truth for whether
a `segment` has been self compacted or not.
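A minimal sketch of the resulting decision, assuming a simplified `segment_meta` struct (hypothetical, not the real segment type): whether to skip is decided by `self_compact_timestamp` alone, and the footer flag is ignored.

```cpp
#include <cassert>
#include <optional>

// Hypothetical stand-in for the state consulted before self compaction.
struct segment_meta {
    bool index_footer_self_compaction = false;  // may be set by an index rebuild alone
    std::optional<long> self_compact_timestamp; // set only after the .log is replaced
};

// After the fix (as modeled here): only the timestamp counts.
bool already_self_compacted(const segment_meta& s) {
    return s.self_compact_timestamp.has_value();
}

// Returns true if the segment was rewritten by this call.
bool self_compact(segment_meta& s, long now) {
    if (already_self_compacted(s)) {
        return false;
    }
    // ... rewrite the .log via a staging file and swap it in (elided) ...
    s.self_compact_timestamp = now; // recorded only once the rewrite is done
    assert(already_self_compacted(s));
    return true;
}
```

In this model a segment whose index footer flag is set but whose timestamp is empty is still compacted, so an interrupted attempt is simply retried.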
This failure also proves that it _must_ be guaranteed that a `segment`
has undergone self compaction before sliding window compaction, or else
(due to the key detail mentioned above) we can get into a situation where
`tx_fence`, transactional bits, and control batches are removed before
the aborted transactional data itself.
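Why the ordering matters can be seen in a toy model of sliding window compaction. This is a sketch under stated assumptions, not the real algorithm: `batch`, `key_offset_map`, and `window_compact` are hypothetical, the map is assumed to hold the latest indexed offset per key, and control batches are assumed to be dropped unconditionally.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

enum class batch_kind { tx_fence, data, tx_abort };

// Hypothetical, flattened view of a batch.
struct batch {
    batch_kind kind;
    std::string key;    // empty for control batches in this model
    long offset;
    bool transactional; // the transactional bit
};

// key -> latest offset recorded in the compaction index (assumed shape).
using key_offset_map = std::map<std::string, long>;

std::vector<batch> window_compact(
  const std::vector<batch>& in, const key_offset_map& latest) {
    std::vector<batch> out;
    for (batch b : in) {
        // Model invariant: only data batches carry a key.
        assert(b.kind == batch_kind::data || b.key.empty());
        if (b.kind != batch_kind::data) {
            continue; // tx_fence and tx_abort are removed
        }
        auto it = latest.find(b.key);
        if (it != latest.end() && it->second > b.offset) {
            continue; // indexed and superseded by a newer offset: removed
        }
        // Not indexed (or latest for its key): kept, transactional bit unset.
        b.transactional = false;
        out.push_back(b);
    }
    return out;
}
```

With a segment holding a `tx_fence`, one aborted `data` batch, and a `tx_abort`, and a map that does not index the aborted key (the `tx_reducer` filtered it out of the index), the only surviving batch is the aborted data with its transactional bit cleared, so a reader can no longer tell it was aborted.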
Contributor
Pull request overview
This PR fixes a race condition bug where a segment could incorrectly skip self-compaction during log shutdown, potentially leaving aborted transactional data in the segment. The fix removes a faulty early return path that relied on the compaction index footer flag instead of the more reliable self_compact_timestamp.
Key changes:
- Removes the early return logic that checked `segment_already_compacted` based on the compaction index footer flag
- Ensures segments always undergo proper self-compaction verification using `self_compact_timestamp`
Collaborator
CI test results
test results on build#76934
Contributor
Author
/ci-repeat 1
bharathv
approved these changes
Nov 25, 2025
    co_await internal::mark_segment_as_finished_self_compaction(s, pb);
    co_return compaction_result{s->size_bytes()};
}
Contributor
eeks, "already_compacted" is soo misleading..
storage: remove early return path in self_compact_segment()
Member
🤯
Contributor
Author
/backport v25.3.x
Backports Required
Release Notes
Bug Fixes
- A `segment` could be incorrectly marked as having finished self compaction during a race with a `log` shutdown