Skip to content

[v25.3.x] [CORE-14856] storage: remove early return path in self_compact_segment()#28799

Merged
WillemKauf merged 1 commit into
redpanda-data:v25.3.xfrom
vbotbuildovich:backport-pr-28730-v25.3.x-990
Dec 2, 2025
Merged

[v25.3.x] [CORE-14856] storage: remove early return path in self_compact_segment()#28799
WillemKauf merged 1 commit into
redpanda-data:v25.3.xfrom
vbotbuildovich:backport-pr-28730-v25.3.x-990

Conversation

@vbotbuildovich

Copy link
Copy Markdown
Collaborator

Backport of PR #28730

This is actually a bug that was revealed by transactional control
batch removal.

Take the following steps:

```
1a) 20.log:INFO  2025-11-24 16:15:04,147 [shard 0:lc  ] storage-gc - segment_utils.cc:691 - Rebuilding index file... (/var/lib/redpanda/data/kafka/topic-jxifylkscz/0_55/79964-17-v1.compaction_index)
1b) 20.log:INFO  2025-11-24 16:15:04,157 [shard 0:lc  ] storage-gc - segment_utils.cc:668 - tx reducer path: /var/lib/redpanda/data/kafka/topic-jxifylkscz/0_55/79964-17-v1.compaction_index stats { batches processed: 1273, aborted_txs: 76, discarded batches: 744 }
2) 20.log:TRACE 2025-11-24 16:15:04,157 [shard 0:lc  ] storage-gc - segment_utils.cc:576 - self compacting segment /var/lib/redpanda/data/kafka/topic-jxifylkscz/0_55/79964-17-v1.log
3a) 20.log:DEBUG 2025-11-24 16:15:04,159 [shard 0:main] storage - disk_log_impl.cc:307 - closing log {...}
3b) 20.log:DEBUG 2025-11-24 16:15:04,159 [shard 0:lc  ] storage-gc - disk_log_impl.cc:1393 - Cleaning up leftover file: /var/lib/redpanda/data/kafka/topic-jxifylkscz/0_55/79964-17-v1.log.staging
...
[X-shard partition move from shard 0 to shard 1]
...
4) 20.log:DEBUG 2025-11-24 16:15:05,437 [shard 1:lc  ] storage-gc - segment_utils.cc:809 - detected /var/lib/redpanda/data/kafka/topic-jxifylkscz/0_55/79964-17-v1.compaction_index is already compacted
```

Here we have the following situation:

1. In a), the `compaction_index` for `segment` `79964-17-v1.log` is rebuilt
   using the `tx_reducer`, which filters out aborted transactional data _but
   only in the `.compaction_index` file_. This process also marks the footer
   with the `self_compaction` footer flag, which is a terribly confusing and
   potentially outdated bit of code, since the code wants to interpret this
   as meaning the `segment` has also gone through a self compaction, which
   is _not_ true.
2. We begin to self compact the `segment` itself. This is the necessary
   step to ensure that aborted transactional data is removed, since those
   offset deltas won't exist in the bitmap generated from the `compaction_index`.
   A key detail to remember here is that window compaction does _not_
   remove data that isn't indexed in the key-offset map generated from the
   `compaction_index`, which means sliding window compaction _cannot_
   remove aborted transactional data.
3. In a), the `log` is closed due to a cross-shard move of the partition.
   In b), we can see that the replacement `segment` `79964-17-v1.log.staging`
   undergoing self compaction gets removed before it can replace `79964-17-v1.log`
4. On next attempted self compaction of `79964-17-v1.log`, we detect that the
   `self_compaction` footer flag has been set in the `.compaction_index`, and
   interpret that as indicating `segment` `79964-17-v1.log` has been self
   compacted already.

We then early exit, mark `79964-17-v1.log` as self compacted (even though it isn't),
and sliding window compaction can then:

1. Remove the `tx_fence` batch
2. Unset the transactional bits in the aborted raft data batches
3. Remove the `tx_abort` control batch

Leading to a persistence of aborted data and a CI failure such as:

```
  File "/root/tests/rptest/transactions/verifiers/compacted_verifier.py", line 238, in _remote_info
    self.raise_on_violation(self._node)
rptest.transactions.verifiers.compacted_verifier.ConsistencyViolationException: 86784	102	0	violation	read aborted key3=71009@85210
```

To fix the race described in the steps above, simply remove the faulty check
that is based on the footer flag in the `compaction_index`. While it did
make sense to have this check in order to prevent unnecessary re-compaction
of data over upgrades, the `self_compact_timestamp` flag has been present
since `v25.2.1` and should be considered the source of truth of whether
a `segment` has been self compacted or not.

This failure also proves that it _must_ be guaranteed that a `segment`
has undergone self compaction before sliding window compaction, or else
(due to the key detail mentioned above) we can get into a situation where
`tx_fence`, transactional bits, and control batches can be removed before
the aborted transactional data itself.

(cherry picked from commit 53f62d9)
@vbotbuildovich vbotbuildovich added this to the v25.3.x-next milestone Dec 1, 2025
@vbotbuildovich vbotbuildovich added the kind/backport PRs targeting a stable branch label Dec 1, 2025
@vbotbuildovich

Copy link
Copy Markdown
Collaborator Author

CI test results

test results on build#77157
test_class test_method test_arguments test_kind job_url test_status passed reason test_history
ControllerLogLimitMirrorMakerTests test_mirror_maker_with_limits null integration https://buildkite.com/redpanda/redpanda/builds/77157#019add35-552c-4dc6-91b0-3c418146c274 FLAKY 20/21 upstream reliability is '97.73462783171522'. current run reliability is '95.23809523809523'. drift is 2.49653 and the allowed drift is set to 50. The test should PASS https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ControllerLogLimitMirrorMakerTests&test_method=test_mirror_maker_with_limits
MountUnmountIcebergTest test_simple_remount {"cloud_storage_type": 1} integration https://buildkite.com/redpanda/redpanda/builds/77157#019add35-5529-401e-bdd1-faaba446e573 FLAKY 17/21 upstream reliability is '80.32220943613349'. current run reliability is '80.95238095238095'. drift is -0.63017 and the allowed drift is set to 50. The test should PASS https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=MountUnmountIcebergTest&test_method=test_simple_remount
SimpleEndToEndTest test_relaxed_acks {"write_caching": false} integration https://buildkite.com/redpanda/redpanda/builds/77157#019add35-5530-4922-a58a-697ee45912bf FLAKY 19/21 upstream reliability is '98.79518072289156'. current run reliability is '90.47619047619048'. drift is 8.31899 and the allowed drift is set to 50. The test should PASS https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=SimpleEndToEndTest&test_method=test_relaxed_acks
WriteCachingFailureInjectionE2ETest test_crash_all {"use_transactions": false} integration https://buildkite.com/redpanda/redpanda/builds/77157#019add35-5533-4e2b-b184-319707c34c12 FLAKY 17/21 upstream reliability is '89.96897621509824'. current run reliability is '80.95238095238095'. drift is 9.0166 and the allowed drift is set to 50. The test should PASS https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=WriteCachingFailureInjectionE2ETest&test_method=test_crash_all
src/v/storage/mvlog/tests/file_test src/v/storage/mvlog/tests/file_test unit https://buildkite.com/redpanda/redpanda/builds/77157#019adc29-064b-44b8-93c5-473ad06d3b19 FAIL 0/1
src/v/storage/mvlog/tests/file_test src/v/storage/mvlog/tests/file_test unit https://buildkite.com/redpanda/redpanda/builds/77157#019adc29-064c-4855-8f32-e3cdc20ddd10 FAIL 0/1

@WillemKauf WillemKauf merged commit 993006d into redpanda-data:v25.3.x Dec 2, 2025
19 checks passed
@tyson-redpanda tyson-redpanda modified the milestones: v25.3.x-next, v25.3.2 Dec 3, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/redpanda kind/backport PRs targeting a stable branch

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants