partition/timequery: Fix tiered storage timequery bug#28642
Conversation
There was a problem hiding this comment.
Pull Request Overview
This PR fixes a rare bug in tiered storage timequery logic where queries could incorrectly return no result despite matching offsets existing in cloud or local storage. The bug occurred when the local log was truncated to the high watermark but retained an active segment, causing the timequery to incorrectly reject queries when the start offset was after the max offset but the timestamp was within bounds.
Key changes:
- Refactored
partition::timequery()to properly handle edge cases when local log is truncated but has an active segment - Added early validation that
min_offset <= max_offsetbefore processing - Introduced
may_answer_from_locallogic that checks both timestamp and offset coverage - Added comprehensive regression test that reproduces the bug scenario with short retention and cloud data
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
src/v/cluster/partition.cc |
Refactored timequery logic to fix bug by adding offset range validation and improving local/cloud storage decision logic |
src/v/cloud_storage/tests/cloud_storage_e2e_test.cc |
Added regression test that reproduces the bug by setting up partition with short retention, uploading to cloud, and verifying timequery correctness |
CI test resultstest results on build#76683
test results on build#76900
test results on build#77366
|
| log()->from_log_offset(_raft->start_offset()), | ||
| local_query_cfg.min_offset); | ||
|
|
||
| co_return co_await local_timequery(local_query_cfg, false); |
There was a problem hiding this comment.
I have an intuition that this is unnecessary.
Equivalent and simpler to comprehend (sample size = 1)?
src/v/cluster/partition.cc | 15 ++-------------
1 file changed, 2 insertions(+), 13 deletions(-)
diff --git i/src/v/cluster/partition.cc w/src/v/cluster/partition.cc
index 81296fe36c..d6b7b0d866 100644
--- i/src/v/cluster/partition.cc
+++ w/src/v/cluster/partition.cc
@@ -620,7 +620,7 @@ partition::timequery(storage::timequery_config cfg) {
const bool local_covers_offsets = local_start_offset <= cfg.max_offset;
const bool may_answer_from_local = local_covers_timestamp
&& local_covers_offsets;
- if (may_answer_from_local) {
+ if (may_answer_from_local || !may_answer_from_cloud) {
// The query is ahead of the local data's start_timestamp and
// potentially overlaps with the local data offset range: this means it
// _might_ hit on local data: start_timestamp is not precise, so once we
@@ -650,18 +650,7 @@ partition::timequery(storage::timequery_config cfg) {
// 3. the local log start timestamp is after the timequery's timestamp.
// If 1 or 2 hold, there is no offset to return. Otherwise, fall back
// to a local timequery, which should return the start of the log.
- if (may_answer_from_local || !local_covers_offsets) {
- co_return std::nullopt;
- }
-
- // Adjust the lower bound for the local query as the min_offset
- // corresponds to the full log (including tiered storage).
- auto local_query_cfg = cfg;
- local_query_cfg.min_offset = std::max(
- log()->from_log_offset(_raft->start_offset()),
- local_query_cfg.min_offset);
-
- co_return co_await local_timequery(local_query_cfg, false);
+ co_return std::nullopt;
}
bool partition::may_read_from_cloud() const {There was a problem hiding this comment.
Yeah, I think this makes sense. Thanks.
nvartolomei
left a comment
There was a problem hiding this comment.
Approving. As far as the fix goes it seems correct. The comment is a suggestion.
| // The local storage hit a case where it needs to fall back | ||
| // to querying cloud storage. | ||
| co_return co_await cloud_storage_timequery(cfg); | ||
| } |
There was a problem hiding this comment.
nit: can you add a comment about the intentional fallthrough? And that a nullopt from the local_timequery is a signal to check cloud
There was a problem hiding this comment.
I think applying NV's suggestion makes this all clearer.
| local_query_cfg.min_offset = std::max( | ||
| log()->from_log_offset(_raft->start_offset()), | ||
| local_query_cfg.min_offset); | ||
|
|
||
| // If the min_offset is ahead of max_offset, the local log is empty | ||
| // or was truncated since the timequery_config was created. | ||
| if (local_query_cfg.min_offset > local_query_cfg.max_offset) { | ||
| co_return std::nullopt; | ||
| } | ||
| local_start_offset, local_query_cfg.min_offset); |
There was a problem hiding this comment.
In the previous code, it was clearer that we would only ever call local_timequery() or cloud_storage_timequery() once. Is it a bug that we aren't preserving that?
There was a problem hiding this comment.
Actually I guess that isn't true since there are cases where we called local_timequery() and then cloud_storage_timequery().
I'm finding the new code structure a bit confusing since it at first seems like we can call local_timequery() twice. But because the condition here is may_answer_from_local and we are conditioning on that below to return a nullopt, that isn't possible.
There was a problem hiding this comment.
With NV's suggestion we maintain this property.
| // If the min_offset is ahead of max_offset, the local log is empty | ||
| // or was truncated since the timequery_config was created. | ||
| if (local_query_cfg.min_offset > local_query_cfg.max_offset) { | ||
| co_return std::nullopt; |
There was a problem hiding this comment.
Just making sure I'm seeing the bug through the refactor, would an equivalent fix have been to replace this line with this?
if (may_answer_from_cloud) {
co_return co_await cloud_storage_timequery(cfg);
}
co_return std::nullopt;
There was a problem hiding this comment.
See NV's suggestion.
Retry command for Build#76900please wait until all jobs are finished before running the slash command |
|
|
I can swear this test passed locally when I proposed the patch. Anyhow: redpanda/src/v/storage/offset_translator_state.cc Lines 42 to 47 in 69e5210 We need a special case for empty log. I guess we found out what the original code was attempting to do ... but incorrectly. We need the special case. |
This change refactors timequery logic to fix a rare bug where a timequery would return no result despite an offset matching the timestamp existing in cloud storage and local storage. The bug is observed in the tiered storage model test when querying timestamps corresponding to offsets in the final, active segment of the log of a partition with tiered storage enabled and with aggressive cleanup settings. The partition gets into a state where the local log is truncated up to the high watermark, but the active segment remains. Querying for a timestamp within the bounds of the active segment meant that the timestamp was after the start timestamp for the local log, which comes from the base timestamp of the only segment, but the start offset of the log was after the max offset set by the timequery, which is the HWM and comes from the Kafka handler. This caused the timequery to return no result, incorrectly, without checking local or cloud storage. The refactor fixes the bug and hopefully cleans up the logic a bit. Regression test included.
2cc4c4e to
c0b2445
Compare
It can happen that a timequery hits an empty local log and its max offset is before the min offset, which is clamped to the start of the log. This will fail translation. This change adds extra early returns to local and cloud timequeries to handle this special case pre-translation.
|
Force push to rebase on dev, then a push to add a special case for the empty log, where max_offset can end up before min_offset (which is clamped to the start_offset). |
|
/backport v25.3.x |
|
/backport v25.2.x |
|
/backport v25.1.x |
|
/backport v24.3.x |
This change refactors timequery logic to fix a rare bug where a timequery would return no result despite an offset matching the timestamp existing in cloud storage and local storage.
The bug is observed in the tiered storage model test when querying timestamps corresponding to offsets in the final, active segment of the log of a partition with tiered storage enabled and with aggressive cleanup settings. The partition gets into a state where the local log is truncated up to the high watermark, but the active segment remains. Querying for a timestamp within the bounds of the active segment meant that the timestamp was after the start timestamp for the local log, which comes from the base timestamp of the only segment, but the start offset of the log was after the max offset set by the timequery, which is the HWM and comes from the Kafka handler. This caused the timequery to return no result, incorrectly, without checking local or cloud storage.
The refactor fixes the bug and hopefully cleans up the logic a bit. Regression test included.
Fixes CORE-14569
Backports Required
Release Notes
Bug Fixes