Skip to content

[CORE-14856] storage: remove early return path in self_compact_segment()#28730

Merged
WillemKauf merged 1 commit into
redpanda-data:devfrom
WillemKauf:self_compaction_fix
Nov 25, 2025
Merged

[CORE-14856] storage: remove early return path in self_compact_segment()#28730
WillemKauf merged 1 commit into
redpanda-data:devfrom
WillemKauf:self_compaction_fix

Conversation

@WillemKauf

@WillemKauf WillemKauf commented Nov 25, 2025

Copy link
Copy Markdown
Contributor

This is actually a bug that was revealed by transactional control batch removal.

Take the following steps:

1a) 20.log:INFO  2025-11-24 16:15:04,147 [shard 0:lc  ] storage-gc - segment_utils.cc:691 - Rebuilding index file... (/var/lib/redpanda/data/kafka/topic-jxifylkscz/0_55/79964-17-v1.compaction_index)
1b) 20.log:INFO  2025-11-24 16:15:04,157 [shard 0:lc  ] storage-gc - segment_utils.cc:668 - tx reducer path: /var/lib/redpanda/data/kafka/topic-jxifylkscz/0_55/79964-17-v1.compaction_index stats { batches processed: 1273, aborted_txs: 76, discarded batches: 744 }
2) 20.log:TRACE 2025-11-24 16:15:04,157 [shard 0:lc  ] storage-gc - segment_utils.cc:576 - self compacting segment /var/lib/redpanda/data/kafka/topic-jxifylkscz/0_55/79964-17-v1.log
3a) 20.log:DEBUG 2025-11-24 16:15:04,159 [shard 0:main] storage - disk_log_impl.cc:307 - closing log {...}
3b) 20.log:DEBUG 2025-11-24 16:15:04,159 [shard 0:lc  ] storage-gc - disk_log_impl.cc:1393 - Cleaning up leftover file: /var/lib/redpanda/data/kafka/topic-jxifylkscz/0_55/79964-17-v1.log.staging
...
[X-shard partition move from shard 0 to shard 1]
...
4) 20.log:DEBUG 2025-11-24 16:15:05,437 [shard 1:lc  ] storage-gc - segment_utils.cc:809 - detected /var/lib/redpanda/data/kafka/topic-jxifylkscz/0_55/79964-17-v1.compaction_index is already compacted

Here we have the following situation:

  1. In a)/b), the compaction_index for segment 79964-17-v1.log is rebuilt using the tx_reducer, which filters out aborted transactional data but only in the .compaction_index file. This process also marks the footer with the self_compaction footer flag, which is a terribly confusing and potentially outdated bit of code, since the code wants to interpret this as meaning the segment has also gone through a self compaction, which is not true.
  2. We begin to self compact the segment itself. This is the necessary step to ensure that aborted transactional data is removed, since those offset deltas won't exist in the bitmap generated from the compaction_index. A key detail to remember here is that window compaction does not remove data that isn't indexed in the key-offset map generated from the compaction_index, which means sliding window compaction cannot remove aborted transactional data.
  3. In a), the log is closed due to a cross-shard move of the partition. In b), we can see that the replacement segment 79964-17-v1.log.staging undergoing self compaction gets removed before it can replace 79964-17-v1.log
  4. On next attempted self compaction of 79964-17-v1.log, we detect that the self_compaction footer flag has been set in the .compaction_index, and interpret that as indicating segment 79964-17-v1.log has been self compacted already.

We then early exit, mark 79964-17-v1.log as self compacted (even though it isn't), and sliding window compaction can then:

  1. Remove the tx_fence batch
  2. Unset the transactional bits in the aborted raft data batches
  3. Remove the tx_abort control batch

Leading to a persistence of aborted data and a CI failure such as:

  File "/root/tests/rptest/transactions/verifiers/compacted_verifier.py", line 238, in _remote_info
    self.raise_on_violation(self._node)
rptest.transactions.verifiers.compacted_verifier.ConsistencyViolationException: 86784	102	0	violation	read aborted key3=71009@85210

To fix the race described in the steps above, simply remove the faulty check that is based on the footer flag in the compaction_index. While it did make sense to have this check in order to prevent unnecessary re-compaction of data over upgrades, the self_compact_timestamp flag has been present since v25.2.1 and should be considered the source of truth of whether a segment has been self compacted or not.

This failure also proves that it must be guaranteed that a segment has undergone self compaction before sliding window compaction, or else (due to the key detail mentioned above) we can get into a situation where tx_fence, transactional bits, and control batches can be removed before the aborted transactional data itself.

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v25.3.x
  • v25.2.x
  • v25.1.x
  • v24.3.x

Release Notes

Bug Fixes

  • Fixes a bug in which a segment could be incorrectly marked as having finished self compaction during a race with a log shutdown

This is actually a bug that was revealed by transactional control
batch removal.

Take the following steps:

```
1a) 20.log:INFO  2025-11-24 16:15:04,147 [shard 0:lc  ] storage-gc - segment_utils.cc:691 - Rebuilding index file... (/var/lib/redpanda/data/kafka/topic-jxifylkscz/0_55/79964-17-v1.compaction_index)
1b) 20.log:INFO  2025-11-24 16:15:04,157 [shard 0:lc  ] storage-gc - segment_utils.cc:668 - tx reducer path: /var/lib/redpanda/data/kafka/topic-jxifylkscz/0_55/79964-17-v1.compaction_index stats { batches processed: 1273, aborted_txs: 76, discarded batches: 744 }
2) 20.log:TRACE 2025-11-24 16:15:04,157 [shard 0:lc  ] storage-gc - segment_utils.cc:576 - self compacting segment /var/lib/redpanda/data/kafka/topic-jxifylkscz/0_55/79964-17-v1.log
3a) 20.log:DEBUG 2025-11-24 16:15:04,159 [shard 0:main] storage - disk_log_impl.cc:307 - closing log {...}
3b) 20.log:DEBUG 2025-11-24 16:15:04,159 [shard 0:lc  ] storage-gc - disk_log_impl.cc:1393 - Cleaning up leftover file: /var/lib/redpanda/data/kafka/topic-jxifylkscz/0_55/79964-17-v1.log.staging
...
[X-shard partition move from shard 0 to shard 1]
...
4) 20.log:DEBUG 2025-11-24 16:15:05,437 [shard 1:lc  ] storage-gc - segment_utils.cc:809 - detected /var/lib/redpanda/data/kafka/topic-jxifylkscz/0_55/79964-17-v1.compaction_index is already compacted
```

Here we have the following situation:

1. In a), the `compaction_index` for `segment` `79964-17-v1.log` is rebuilt
   using the `tx_reducer`, which filters out aborted transactional data _but
   only in the `.compaction_index` file_. This process also marks the footer
   with the `self_compaction` footer flag, which is a terribly confusing and
   potentially outdated bit of code, since the code wants to interpret this
   as meaning the `segment` has also gone through a self compaction, which
   is _not_ true.
2. We begin to self compact the `segment` itself. This is the necessary
   step to ensure that aborted transactional data is removed, since those
   offset deltas won't exist in the bitmap generated from the `compaction_index`.
   A key detail to remember here is that window compaction does _not_
   remove data that isn't indexed in the key-offset map generated from the
   `compaction_index`, which means sliding window compaction _cannot_
   remove aborted transactional data.
3. In a), the `log` is closed due to a cross-shard move of the partition.
   In b), we can see that the replacement `segment` `79964-17-v1.log.staging`
   undergoing self compaction gets removed before it can replace `79964-17-v1.log`
4. On next attempted self compaction of `79964-17-v1.log`, we detect that the
   `self_compaction` footer flag has been set in the `.compaction_index`, and
   interpret that as indicating `segment` `79964-17-v1.log` has been self
   compacted already.

We then early exit, mark `79964-17-v1.log` as self compacted (even though it isn't),
and sliding window compaction can then:

1. Remove the `tx_fence` batch
2. Unset the transactional bits in the aborted raft data batches
3. Remove the `tx_abort` control batch

Leading to a persistence of aborted data and a CI failure such as:

```
  File "/root/tests/rptest/transactions/verifiers/compacted_verifier.py", line 238, in _remote_info
    self.raise_on_violation(self._node)
rptest.transactions.verifiers.compacted_verifier.ConsistencyViolationException: 86784	102	0	violation	read aborted key3=71009@85210
```

To fix the race described in the steps above, simply remove the faulty check
that is based on the footer flag in the `compaction_index`. While it did
make sense to have this check in order to prevent unnecessary re-compaction
of data over upgrades, the `self_compact_timestamp` flag has been present
since `v25.2.1` and should be considered the source of truth of whether
a `segment` has been self compacted or not.

This failure also proves that it _must_ be guaranteed that a `segment`
has undergone self compaction before sliding window compaction, or else
(due to the key detail mentioned above) we can get into a situation where
`tx_fence`, transactional bits, and control batches can be removed before
the aborted transactional data itself.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes a race condition bug where a segment could incorrectly skip self-compaction during log shutdown, potentially leaving aborted transactional data in the segment. The fix removes a faulty early return path that relied on the compaction index footer flag instead of the more reliable self_compact_timestamp.

Key changes:

  • Removes the early return logic that checked segment_already_compacted based on the compaction index footer flag
  • Ensures segments always undergo proper self-compaction verification using self_compact_timestamp

@vbotbuildovich

Copy link
Copy Markdown
Collaborator

CI test results

test results on build#76934
test_class test_method test_arguments test_kind job_url test_status passed reason test_history
ShadowLinkingReplicationTests test_replication_with_failures null integration https://buildkite.com/redpanda/redpanda/builds/76934#019ab8f4-33c4-4727-aac2-f6fcb5e20b26 FLAKY 20/21 upstream reliability is '100.0'. current run reliability is '95.23809523809523'. drift is 4.7619 and the allowed drift is set to 50. The test should PASS https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkingReplicationTests&test_method=test_replication_with_failures
MultiRestartTest test_recovery_after_multiple_restarts {"cloud_storage_type": 1} integration https://buildkite.com/redpanda/redpanda/builds/76934#019ab91e-75db-4345-bb4f-2f088ce1f5b3 FLAKY 19/21 upstream reliability is '95.9119496855346'. current run reliability is '90.47619047619048'. drift is 5.43576 and the allowed drift is set to 50. The test should PASS https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=MultiRestartTest&test_method=test_recovery_after_multiple_restarts
ScalingUpTest test_moves_with_local_retention {"use_topic_property": false} integration https://buildkite.com/redpanda/redpanda/builds/76934#019ab91e-75c8-46b0-b5b4-9259336ab777 FLAKY 20/21 upstream reliability is '98.42271293375394'. current run reliability is '95.23809523809523'. drift is 3.18462 and the allowed drift is set to 50. The test should PASS https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ScalingUpTest&test_method=test_moves_with_local_retention

@WillemKauf

Copy link
Copy Markdown
Contributor Author

/ci-repeat 1

co_await internal::mark_segment_as_finished_self_compaction(s, pb);
co_return compaction_result{s->size_bytes()};
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

eeks, "already_compacted" is soo misleading..

@WillemKauf WillemKauf merged commit 310cbb7 into redpanda-data:dev Nov 25, 2025
22 checks passed
@WillemKauf WillemKauf changed the title storage: remove early return path in self_compact_segment() [CORE-14856] storage: remove early return path in self_compact_segment() Nov 26, 2025
@dotnwat

dotnwat commented Nov 27, 2025

Copy link
Copy Markdown
Member

🤯

@WillemKauf

Copy link
Copy Markdown
Contributor Author

/backport v25.3.x

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants