partition/timequery: Fix tiered storage timequery bug #28642

Merged
wdberkeley merged 3 commits into redpanda-data:dev from wdberkeley:timequery-bad
Dec 5, 2025

Conversation

@wdberkeley

This change refactors timequery logic to fix a rare bug where a timequery would return no result despite an offset matching the timestamp existing in cloud storage and local storage.

The bug is observed in the tiered storage model test when querying timestamps corresponding to offsets in the final, active segment of the log of a partition with tiered storage enabled and with aggressive cleanup settings. The partition gets into a state where the local log is truncated up to the high watermark, but the active segment remains. Querying for a timestamp within the bounds of the active segment meant that the timestamp was after the start timestamp for the local log, which comes from the base timestamp of the only segment, but the start offset of the log was after the max offset set by the timequery, which is the HWM and comes from the Kafka handler. This caused the timequery to return no result, incorrectly, without checking local or cloud storage.
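
The state described above can be modeled with a small sketch. This is only an illustration under stated assumptions: `query`, `local_log_state`, and the helper names are hypothetical stand-ins rather than Redpanda types, the offset comparison mirrors the `local_covers_offsets` expression quoted in the review diff, and the timestamp comparison is assumed from the description. The numbers in `example()` are made up.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified stand-ins; not the real Redpanda types.
struct query {
    int64_t max_offset; // upper bound from the Kafka handler (the HWM)
    int64_t timestamp;  // timestamp being searched for
};

struct local_log_state {
    int64_t start_offset;    // first offset still present locally
    int64_t start_timestamp; // base timestamp of the first local segment
};

// Assumed from the description: the query timestamp is at or after the
// local log's start timestamp.
inline bool local_covers_timestamp(const local_log_state& log, const query& q) {
    return q.timestamp >= log.start_timestamp;
}

// Mirrors `local_start_offset <= cfg.max_offset` from the review diff.
inline bool local_covers_offsets(const local_log_state& log, const query& q) {
    return log.start_offset <= q.max_offset;
}

// The problematic state: the timestamp check passes but the offset check fails.
inline bool is_bug_state(const local_log_state& log, const query& q) {
    return local_covers_timestamp(log, q) && !local_covers_offsets(log, q);
}

// Made-up values: the local log starts just past the HWM, yet the active
// segment's base timestamp is older than the queried timestamp.
inline void example() {
    const local_log_state log{101, 1000};
    const query q{100, 1500};
    assert(is_bug_state(log, q));
}
```

In this state a decision based on the timestamp alone would pick the local log even though no local offset falls inside the query's bounds, which is why the offset check has to be part of the decision.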

The refactor fixes the bug and hopefully cleans up the logic a bit. Regression test included.

Fixes CORE-14569

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v25.3.x
  • v25.2.x
  • v25.1.x
  • v24.3.x

Release Notes

Bug Fixes

  • Fixed a rare bug that causes timequeries against partitions using tiered storage to incorrectly return no result when the partition's local log is empty but retains an active segment.

Copilot AI review requested due to automatic review settings November 19, 2025 21:21

Copilot AI left a comment

Pull Request Overview

This PR fixes a rare bug in tiered storage timequery logic where queries could incorrectly return no result despite matching offsets existing in cloud or local storage. The bug occurred when the local log was truncated to the high watermark but retained an active segment, causing the timequery to incorrectly reject queries when the start offset was after the max offset but the timestamp was within bounds.

Key changes:

  • Refactored partition::timequery() to properly handle edge cases when local log is truncated but has an active segment
  • Added early validation that min_offset <= max_offset before processing
  • Introduced may_answer_from_local logic that checks both timestamp and offset coverage
  • Added comprehensive regression test that reproduces the bug scenario with short retention and cloud data

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

| File | Description |
|---|---|
| src/v/cluster/partition.cc | Refactored timequery logic to fix bug by adding offset range validation and improving local/cloud storage decision logic |
| src/v/cloud_storage/tests/cloud_storage_e2e_test.cc | Added regression test that reproduces the bug by setting up partition with short retention, uploading to cloud, and verifying timequery correctness |

@vbotbuildovich

vbotbuildovich commented Nov 19, 2025

CI test results

test results on build#76683

| test_class | test_method | test_arguments | test_kind | job_url | test_status | passed | reason | test_history |
|---|---|---|---|---|---|---|---|---|
| DataMigrationsApiTest | test_higher_level_migration_api | null | integration | https://buildkite.com/redpanda/redpanda/builds/76683#019a9e1b-6589-44f5-b422-f52f5bde80b2 | FLAKY | 20/21 | upstream reliability is '99.41747572815534'. current run reliability is '95.23809523809523'. drift is 4.17938 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=DataMigrationsApiTest&test_method=test_higher_level_migration_api |
| MountUnmountIcebergTest | test_simple_remount | {"cloud_storage_type": 1} | integration | https://buildkite.com/redpanda/redpanda/builds/76683#019a9e1d-ac19-465f-8f3e-32490d71688e | FLAKY | 16/21 | upstream reliability is '79.19227392449517'. current run reliability is '76.19047619047619'. drift is 3.0018 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=MountUnmountIcebergTest&test_method=test_simple_remount |
| MultiRestartTest | test_recovery_after_multiple_restarts | {"cloud_storage_type": 1} | integration | https://buildkite.com/redpanda/redpanda/builds/76683#019a9e1d-ac24-4e3d-8e88-351b78aa9c2e | FLAKY | 20/21 | upstream reliability is '100.0'. current run reliability is '95.23809523809523'. drift is 4.7619 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=MultiRestartTest&test_method=test_recovery_after_multiple_restarts |
| WriteCachingFailureInjectionE2ETest | test_crash_all | {"use_transactions": false} | integration | https://buildkite.com/redpanda/redpanda/builds/76683#019a9e1d-ac20-4c51-be59-b163a0fd46ee | FLAKY | 19/21 | upstream reliability is '91.27310061601642'. current run reliability is '90.47619047619048'. drift is 0.79691 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=WriteCachingFailureInjectionE2ETest&test_method=test_crash_all |

test results on build#76900

| test_class | test_method | test_arguments | test_kind | job_url | test_status | passed | reason | test_history |
|---|---|---|---|---|---|---|---|---|
| MultiRestartTest | test_recovery_after_multiple_restarts | {"cloud_storage_type": 1} | integration | https://buildkite.com/redpanda/redpanda/builds/76900#019ab729-d35a-42d4-9277-b901ed71f32f | FLAKY | 20/21 | upstream reliability is '95.89905362776025'. current run reliability is '95.23809523809523'. drift is 0.66096 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=MultiRestartTest&test_method=test_recovery_after_multiple_restarts |
| TimeQueryTest | test_timequery_empty_local_log | null | integration | https://buildkite.com/redpanda/redpanda/builds/76900#019ab729-3881-4fbd-b809-52cb49df1e98 | FLAKY | 6/21 | upstream reliability is '100.0'. current run reliability is '28.57142857142857'. drift is 71.42857 and the allowed drift is set to 50. The test should FAIL | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=TimeQueryTest&test_method=test_timequery_empty_local_log |
| TimeQueryTest | test_timequery_empty_local_log | null | integration | https://buildkite.com/redpanda/redpanda/builds/76900#019ab729-d359-48bd-a5b9-a8083ab8438d | FLAKY | 6/21 | upstream reliability is '100.0'. current run reliability is '28.57142857142857'. drift is 71.42857 and the allowed drift is set to 50. The test should FAIL | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=TimeQueryTest&test_method=test_timequery_empty_local_log |

test results on build#77366

| test_class | test_method | test_arguments | test_kind | job_url | test_status | passed | reason | test_history |
|---|---|---|---|---|---|---|---|---|
| JavaCompressionTest | test_upgrade_java_compression | {"compression_type": "gzip"} | integration | https://buildkite.com/redpanda/redpanda/builds/77366#019aeb99-9f85-49b7-85c5-d2f53bb579b5 | FLAKY | 20/21 | upstream reliability is '88.20224719101124'. current run reliability is '95.23809523809523'. drift is -7.03585 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=JavaCompressionTest&test_method=test_upgrade_java_compression |
| JavaCompressionTest | test_upgrade_java_compression | {"compression_type": "gzip"} | integration | https://buildkite.com/redpanda/redpanda/builds/77366#019aeb9a-2ee2-43b0-8f4f-d8c0077ad3b0 | FLAKY | 20/21 | upstream reliability is '88.20224719101124'. current run reliability is '95.23809523809523'. drift is -7.03585 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=JavaCompressionTest&test_method=test_upgrade_java_compression |
| JavaCompressionTest | test_upgrade_java_compression | {"compression_type": "lz4"} | integration | https://buildkite.com/redpanda/redpanda/builds/77366#019aeb99-9f87-4622-a59a-c6c09173769d | FLAKY | 20/21 | upstream reliability is '88.23529411764706'. current run reliability is '95.23809523809523'. drift is -7.0028 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=JavaCompressionTest&test_method=test_upgrade_java_compression |
| JavaCompressionTest | test_upgrade_java_compression | {"compression_type": "lz4"} | integration | https://buildkite.com/redpanda/redpanda/builds/77366#019aeb9a-2ee4-4b34-8ac3-7c952af48362 | FLAKY | 20/21 | upstream reliability is '88.23529411764706'. current run reliability is '95.23809523809523'. drift is -7.0028 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=JavaCompressionTest&test_method=test_upgrade_java_compression |
| JavaCompressionTest | test_upgrade_java_compression | {"compression_type": "snappy"} | integration | https://buildkite.com/redpanda/redpanda/builds/77366#019aeb99-9f88-4867-be14-f747b3ed8692 | FLAKY | 20/21 | upstream reliability is '88.23529411764706'. current run reliability is '95.23809523809523'. drift is -7.0028 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=JavaCompressionTest&test_method=test_upgrade_java_compression |
| JavaCompressionTest | test_upgrade_java_compression | {"compression_type": "snappy"} | integration | https://buildkite.com/redpanda/redpanda/builds/77366#019aeb9a-2ee5-4392-bfaf-261b3e833853 | FLAKY | 19/21 | upstream reliability is '88.23529411764706'. current run reliability is '90.47619047619048'. drift is -2.2409 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=JavaCompressionTest&test_method=test_upgrade_java_compression |
| JavaCompressionTest | test_upgrade_java_compression | {"compression_type": "zstd"} | integration | https://buildkite.com/redpanda/redpanda/builds/77366#019aeb99-9f8a-4e65-81b1-4fbb50a32e21 | FLAKY | 20/21 | upstream reliability is '88.23529411764706'. current run reliability is '95.23809523809523'. drift is -7.0028 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=JavaCompressionTest&test_method=test_upgrade_java_compression |
| JavaCompressionTest | test_upgrade_java_compression | {"compression_type": "zstd"} | integration | https://buildkite.com/redpanda/redpanda/builds/77366#019aeb9a-2ee7-4e3d-b834-478fdb10bc1b | FLAKY | 20/21 | upstream reliability is '88.23529411764706'. current run reliability is '95.23809523809523'. drift is -7.0028 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=JavaCompressionTest&test_method=test_upgrade_java_compression |
| NodesDecommissioningTest | test_decommissioning_rebalancing_node | {"shutdown_decommissioned": false} | integration | https://buildkite.com/redpanda/redpanda/builds/77366#019aeb9a-2ee7-4e3d-b834-478fdb10bc1b | FLAKY | 15/21 | upstream reliability is '93.27731092436974'. current run reliability is '71.42857142857143'. drift is 21.84874 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=NodesDecommissioningTest&test_method=test_decommissioning_rebalancing_node |
| RedpandaNodeOperationsSmokeTest | test_node_ops_smoke_test | {"cloud_storage_type": 1, "mixed_versions": true} | integration | https://buildkite.com/redpanda/redpanda/builds/77366#019aeb9a-2ee2-43b0-8f4f-d8c0077ad3b0 | FLAKY | 13/21 | upstream reliability is '89.82300884955751'. current run reliability is '61.904761904761905'. drift is 27.91825 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=RedpandaNodeOperationsSmokeTest&test_method=test_node_ops_smoke_test |

Comment thread src/v/cluster/partition.cc Outdated
log()->from_log_offset(_raft->start_offset()),
local_query_cfg.min_offset);

co_return co_await local_timequery(local_query_cfg, false);

@nvartolomei nvartolomei Nov 20, 2025

I have an intuition that this is unnecessary.

Equivalent and simpler to comprehend (sample size = 1)?

 src/v/cluster/partition.cc | 15 ++-------------
 1 file changed, 2 insertions(+), 13 deletions(-)

diff --git i/src/v/cluster/partition.cc w/src/v/cluster/partition.cc
index 81296fe36c..d6b7b0d866 100644
--- i/src/v/cluster/partition.cc
+++ w/src/v/cluster/partition.cc
@@ -620,7 +620,7 @@ partition::timequery(storage::timequery_config cfg) {
     const bool local_covers_offsets = local_start_offset <= cfg.max_offset;
     const bool may_answer_from_local = local_covers_timestamp
                                        && local_covers_offsets;
-    if (may_answer_from_local) {
+    if (may_answer_from_local || !may_answer_from_cloud) {
         // The query is ahead of the local data's start_timestamp and
         // potentially overlaps with the local data offset range: this means it
         // _might_ hit on local data: start_timestamp is not precise, so once we
@@ -650,18 +650,7 @@ partition::timequery(storage::timequery_config cfg) {
     // 3. the local log start timestamp is after the timequery's timestamp.
     // If 1 or 2 hold, there is no offset to return. Otherwise, fall back
     // to a local timequery, which should return the start of the log.
-    if (may_answer_from_local || !local_covers_offsets) {
-        co_return std::nullopt;
-    }
-
-    // Adjust the lower bound for the local query as the min_offset
-    // corresponds to the full log (including tiered storage).
-    auto local_query_cfg = cfg;
-    local_query_cfg.min_offset = std::max(
-      log()->from_log_offset(_raft->start_offset()),
-      local_query_cfg.min_offset);
-
-    co_return co_await local_timequery(local_query_cfg, false);
+    co_return std::nullopt;
 }

 bool partition::may_read_from_cloud() const {
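
The condition changed in the first hunk can be read as a choice of which store the query tries first. The sketch below models only that choice; `first_step` and `choose_first_step` are hypothetical names, and treating the remaining case as a cloud query is an assumption, since the code between the two hunks is not shown in this thread.

```cpp
#include <cassert>

// Hypothetical label for where the query goes first.
enum class first_step { local, cloud };

// Mirrors the suggested condition `may_answer_from_local || !may_answer_from_cloud`.
// Assumption: when the condition is false, the query goes to cloud storage.
inline first_step
choose_first_step(bool may_answer_from_local, bool may_answer_from_cloud) {
    if (may_answer_from_local || !may_answer_from_cloud) {
        return first_step::local;
    }
    return first_step::cloud;
}

inline void example() {
    // Local data may hold the answer: try local first.
    assert(choose_first_step(true, true) == first_step::local);
    // Cloud cannot answer: local is the only option left.
    assert(choose_first_step(false, false) == first_step::local);
    // Local cannot answer but cloud can: go to cloud.
    assert(choose_first_step(false, true) == first_step::cloud);
}
```

Under this reading, the local query is entered from a single branch, which is what allows the trailing fallback block to be replaced by a plain `co_return std::nullopt`.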

Contributor Author

Yeah, I think this makes sense. Thanks.

nvartolomei previously approved these changes Nov 20, 2025

@nvartolomei nvartolomei left a comment

Approving. As far as the fix goes it seems correct. The comment is a suggestion.

// The local storage hit a case where it needs to fall back
// to querying cloud storage.
co_return co_await cloud_storage_timequery(cfg);
}

Contributor

nit: can you add a comment about the intentional fallthrough? And that a nullopt from the local_timequery is a signal to check cloud

Contributor Author

I think applying NV's suggestion makes this all clearer.

Comment on lines 621 to +633
local_query_cfg.min_offset = std::max(
log()->from_log_offset(_raft->start_offset()),
local_query_cfg.min_offset);

// If the min_offset is ahead of max_offset, the local log is empty
// or was truncated since the timequery_config was created.
if (local_query_cfg.min_offset > local_query_cfg.max_offset) {
co_return std::nullopt;
}
local_start_offset, local_query_cfg.min_offset);

Contributor

In the previous code, it was clearer that we would only ever call local_timequery() or cloud_storage_timequery() once. Is it a bug that we aren't preserving that?

Contributor

Actually I guess that isn't true since there are cases where we called local_timequery() and then cloud_storage_timequery().

I'm finding the new code structure a bit confusing since it at first seems like we can call local_timequery() twice. But because the condition here is may_answer_from_local and we are conditioning on that below to return a nullopt, that isn't possible.

Contributor Author

With NV's suggestion we maintain this property.

// If the min_offset is ahead of max_offset, the local log is empty
// or was truncated since the timequery_config was created.
if (local_query_cfg.min_offset > local_query_cfg.max_offset) {
co_return std::nullopt;

Contributor

Just making sure I'm seeing the bug through the refactor, would an equivalent fix have been to replace this line with this?

if (may_answer_from_cloud) {
  co_return co_await cloud_storage_timequery(cfg);
}
co_return std::nullopt;

Contributor Author

See NV's suggestion.

@vbotbuildovich

vbotbuildovich commented Nov 24, 2025

Retry command for Build#76900

please wait until all jobs are finished before running the slash command

/ci-repeat 1
skip-redpanda-build
skip-units
skip-rebase
tests/rptest/tests/timequery_test.py::TimeQueryTest.test_timequery_empty_local_log

@nvartolomei

DEBUG 2025-11-24 19:18:20,762 [shard 1:kafk] cluster - partition.cc:683 - timequery (raft) {kafka/tqtopic/0} cfg(k)={min_offset: 3072, max_offset: 3071, time:{timestamp: 1764098234000}, type_filter:batch_type::raft_data}
WARN  2025-11-24 19:18:20,763 [shard 0:kafk] kafka - connection_context.cc:1117 - Error processing request: std::runtime_error (ntp {kafka/tqtopic/0}: data offset 3071 is outside the translation range (starting at 3072))

@nvartolomei

nvartolomei commented Nov 25, 2025

I can swear this test passed locally when I proposed the patch.

Anyhow:

// One common way to get this error is when the client code tries to
// translate the end offset of an empty log (which is by convention
// prev(start_offset) if start_offset >= 0, and therefore lies outside
// the translation range). In this case the client code should detect
// that the offset range is empty and manually set the end of the
// translated range to prev(translated(start_offset)).

We need a special case for an empty log. I guess we found out what the original code was attempting to do, but it did so incorrectly.

It can happen that a timequery hits an empty local log and its max
offset is before the min offset, which is clamped to the start of the
log. This will fail translation. This change adds extra early returns to
local and cloud timequeries to handle this special case pre-translation.
@wdberkeley

Force push to rebase on dev, then a push to add a special case for the empty log, where max_offset can end up before min_offset (which is clamped to the start_offset).
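
The empty-log special case can be sketched as a clamp followed by an emptiness check. This is a minimal illustration rather than the merged code: `offset_range` and `clamp_to_local_start` are hypothetical names, and the clamp and comparison follow the `std::max` and `min_offset > max_offset` lines quoted in the review thread. The values in `example()` are chosen to resemble the DEBUG log line above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical stand-in for the offset bounds of a timequery config.
struct offset_range {
    int64_t min_offset;
    int64_t max_offset;
};

// Clamp the lower bound to the local start offset. An empty range is reported
// as nullopt so the caller can return before attempting offset translation.
inline std::optional<offset_range>
clamp_to_local_start(offset_range range, int64_t local_start_offset) {
    range.min_offset = std::max(local_start_offset, range.min_offset);
    if (range.min_offset > range.max_offset) {
        return std::nullopt;
    }
    return range;
}

inline void example() {
    // Empty local log: the start offset is one past the query's max_offset.
    assert(!clamp_to_local_start(offset_range{0, 3071}, 3072).has_value());
    // Non-empty overlap: only the lower bound moves.
    const auto clamped = clamp_to_local_start(offset_range{0, 3071}, 10);
    assert(clamped.has_value() && clamped->min_offset == 10);
}
```

Checking for the empty range before translating avoids asking the translator about an offset that lies below the start of its range.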

@wdberkeley wdberkeley merged commit 0cc1bc2 into redpanda-data:dev Dec 5, 2025
19 checks passed
@vbotbuildovich

/backport v25.3.x

@vbotbuildovich

/backport v25.2.x

@vbotbuildovich

/backport v25.1.x

@vbotbuildovich

/backport v24.3.x
