[v25.3.x] [CORE-14856] storage: remove early return path in self_compact_segment()#28799
Merged
WillemKauf merged 1 commit intoDec 2, 2025
Conversation
This is actually a bug that was revealed by transactional control
batch removal.
Take the following steps:
```
1a) 20.log:INFO 2025-11-24 16:15:04,147 [shard 0:lc ] storage-gc - segment_utils.cc:691 - Rebuilding index file... (/var/lib/redpanda/data/kafka/topic-jxifylkscz/0_55/79964-17-v1.compaction_index)
1b) 20.log:INFO 2025-11-24 16:15:04,157 [shard 0:lc ] storage-gc - segment_utils.cc:668 - tx reducer path: /var/lib/redpanda/data/kafka/topic-jxifylkscz/0_55/79964-17-v1.compaction_index stats { batches processed: 1273, aborted_txs: 76, discarded batches: 744 }
2) 20.log:TRACE 2025-11-24 16:15:04,157 [shard 0:lc ] storage-gc - segment_utils.cc:576 - self compacting segment /var/lib/redpanda/data/kafka/topic-jxifylkscz/0_55/79964-17-v1.log
3a) 20.log:DEBUG 2025-11-24 16:15:04,159 [shard 0:main] storage - disk_log_impl.cc:307 - closing log {...}
3b) 20.log:DEBUG 2025-11-24 16:15:04,159 [shard 0:lc ] storage-gc - disk_log_impl.cc:1393 - Cleaning up leftover file: /var/lib/redpanda/data/kafka/topic-jxifylkscz/0_55/79964-17-v1.log.staging
...
[X-shard partition move from shard 0 to shard 1]
...
4) 20.log:DEBUG 2025-11-24 16:15:05,437 [shard 1:lc ] storage-gc - segment_utils.cc:809 - detected /var/lib/redpanda/data/kafka/topic-jxifylkscz/0_55/79964-17-v1.compaction_index is already compacted
```
Here we have the following situation:
1. In a), the `compaction_index` for `segment` `79964-17-v1.log` is rebuilt
using the `tx_reducer`, which filters out aborted transactional data _but
only in the `.compaction_index` file_. This process also marks the footer
with the `self_compaction` footer flag, which is a terribly confusing and
potentially outdated bit of code, since the code wants to interpret this
as meaning the `segment` has also gone through a self compaction, which
is _not_ true.
2. We begin to self compact the `segment` itself. This is the necessary
step to ensure that aborted transactional data is removed, since those
offset deltas won't exist in the bitmap generated from the `compaction_index`.
A key detail to remember here is that window compaction does _not_
remove data that isn't indexed in the key-offset map generated from the
`compaction_index`, which means sliding window compaction _cannot_
remove aborted transactional data.
3. In a), the `log` is closed due to a cross-shard move of the partition.
In b), we can see that the replacement `segment` `79964-17-v1.log.staging`
undergoing self compaction gets removed before it can replace `79964-17-v1.log`
4. On next attempted self compaction of `79964-17-v1.log`, we detect that the
`self_compaction` footer flag has been set in the `.compaction_index`, and
interpret that as indicating `segment` `79964-17-v1.log` has been self
compacted already.
We then early exit, mark `79964-17-v1.log` as self compacted (even though it isn't),
and sliding window compaction can then:
1. Remove the `tx_fence` batch
2. Unset the transactional bits in the aborted raft data batches
3. Remove the `tx_abort` control batch
Leading to a persistence of aborted data and a CI failure such as:
```
File "/root/tests/rptest/transactions/verifiers/compacted_verifier.py", line 238, in _remote_info
self.raise_on_violation(self._node)
rptest.transactions.verifiers.compacted_verifier.ConsistencyViolationException: 86784 102 0 violation read aborted key3=71009@85210
```
To fix the race described in the steps above, simply remove the faulty check
that is based on the footer flag in the `compaction_index`. While it did
make sense to have this check in order to prevent unnecessary re-compaction
of data over upgrades, the `self_compact_timestamp` flag has been present
since `v25.2.1` and should be considered the source of truth of whether
a `segment` has been self compacted or not.
This failure also proves that it _must_ be guaranteed that a `segment`
has undergone self compaction before sliding window compaction, or else
(due to the key detail mentioned above) we can get into a situation where
`tx_fence`, transactional bits, and control batches can be removed before
the aborted transactional data itself.
(cherry picked from commit 53f62d9)
Collaborator
Author
CI test resultstest results on build#77157
|
WillemKauf
approved these changes
Dec 2, 2025
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Backport of PR #28730