partition/timequery: Fix tiered storage timequery bug by wdberkeley · Pull Request #28642 · redpanda-data/redpanda

wdberkeley · 2025-11-19T21:21:46Z

This change refactors the timequery logic to fix a rare bug in which a timequery returned no result even though an offset matching the timestamp existed in cloud or local storage.

The bug was observed in the tiered storage model test when querying timestamps that correspond to offsets in the final, active segment of a partition's log, with tiered storage enabled and aggressive cleanup settings. The partition can reach a state where the local log is truncated up to the high watermark (HWM) but the active segment remains. A timestamp within the bounds of the active segment is after the local log's start timestamp, which comes from the base timestamp of the only remaining segment. The start offset of the log, however, is after the timequery's max offset, which is the HWM and is set by the Kafka handler. In this state the timequery incorrectly returned no result without checking local or cloud storage.

The refactor fixes the bug and hopefully cleans up the logic a bit. Regression test included.
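The state described above can be modelled in a few lines. This is a minimal sketch, not the code in `partition.cc`: the types `timequery_config` and `local_log_state`, the `outcome` enum, and the function `old_route` are hypothetical stand-ins, and the routing is simplified to the single decision the description talks about.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical, simplified stand-ins for illustration only; these are not
// the real Redpanda types.
struct timequery_config {
    int64_t min_offset;
    int64_t max_offset; // the HWM, set by the Kafka handler
    int64_t timestamp;
};

struct local_log_state {
    int64_t start_offset;    // first offset still held locally
    int64_t start_timestamp; // base timestamp of the first local segment
};

enum class outcome { queried_local, queried_cloud, no_result_unchecked };

// Model of the pre-fix decision as described in the PR text: the timestamp
// check alone routes the query to the local log, the clamped offset range
// then turns out to be empty, and the query gives up without consulting
// either storage tier.
inline outcome
old_route(const timequery_config& cfg, const local_log_state& log) {
    assert(log.start_offset >= 0); // model assumption: non-negative offsets
    if (cfg.timestamp >= log.start_timestamp) {
        const int64_t min_offset = std::max(log.start_offset, cfg.min_offset);
        if (min_offset > cfg.max_offset) {
            return outcome::no_result_unchecked;
        }
        return outcome::queried_local;
    }
    return outcome::queried_cloud;
}
```

With a local log whose start offset is one past the HWM (for example start offset 3072 and max offset 3071) and a query timestamp at or after the local start timestamp, `old_route` returns `no_result_unchecked`, which is the incorrect outcome this PR addresses.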

Fixes CORE-14569

Backports Required

Release Notes

Bug Fixes

Fixed a rare bug that caused timequeries against partitions using tiered storage to incorrectly return no result when the partition's local log is empty but retains an active segment.

Copilot

Pull Request Overview

This PR fixes a rare bug in tiered storage timequery logic where queries could incorrectly return no result despite matching offsets existing in cloud or local storage. The bug occurred when the local log was truncated to the high watermark but retained an active segment, causing the timequery to incorrectly reject queries when the start offset was after the max offset but the timestamp was within bounds.

Key changes:

Refactored partition::timequery() to properly handle edge cases when local log is truncated but has an active segment
Added early validation that min_offset <= max_offset before processing
Introduced may_answer_from_local logic that checks both timestamp and offset coverage
Added comprehensive regression test that reproduces the bug scenario with short retention and cloud data
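The two checks named in the list above might look roughly like the following. This is a sketch under stated assumptions: the variable names `local_covers_timestamp`, `local_covers_offsets`, and `may_answer_from_local` appear in the diff quoted later in this thread, but the structs and the free functions `has_valid_range` and `may_answer_from_local` are hypothetical simplifications, and what happens after the check (local query, cloud query, or no result) is deliberately left out.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified stand-ins; not the real Redpanda types.
struct timequery_config {
    int64_t min_offset;
    int64_t max_offset;
    int64_t timestamp;
};

struct local_log_state {
    int64_t start_offset;    // first offset still held locally
    int64_t start_timestamp; // imprecise lower bound on local timestamps
};

// Early validation: an inverted range cannot match any offset.
inline bool has_valid_range(const timequery_config& cfg) {
    return cfg.min_offset <= cfg.max_offset;
}

// The local log is only a candidate if the query is at or after its start
// timestamp AND its start offset does not lie beyond the query's max offset.
inline bool may_answer_from_local(
  const timequery_config& cfg, const local_log_state& log) {
    assert(has_valid_range(cfg)); // early validation is expected to run first
    const bool local_covers_timestamp = cfg.timestamp >= log.start_timestamp;
    const bool local_covers_offsets = log.start_offset <= cfg.max_offset;
    return local_covers_timestamp && local_covers_offsets;
}
```

In the bug scenario (local start offset 3072, query max offset 3071) the timestamp check passes but the offset check fails, so the local log is ruled out up front instead of being queried with an empty range.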

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File	Description
`src/v/cluster/partition.cc`	Refactored timequery logic to fix bug by adding offset range validation and improving local/cloud storage decision logic
`src/v/cloud_storage/tests/cloud_storage_e2e_test.cc`	Added regression test that reproduces the bug by setting up partition with short retention, uploading to cloud, and verifying timequery correctness

vbotbuildovich · 2025-11-19T23:43:56Z

CI test results

test results on build#76683

test_class	test_method	test_arguments	test_kind	job_url	test_status	passed	reason	test_history
DataMigrationsApiTest	test_higher_level_migration_api	null	integration	https://buildkite.com/redpanda/redpanda/builds/76683#019a9e1b-6589-44f5-b422-f52f5bde80b2	FLAKY	20/21	upstream reliability is '99.41747572815534'. current run reliability is '95.23809523809523'. drift is 4.17938 and the allowed drift is set to 50. The test should PASS	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=DataMigrationsApiTest&test_method=test_higher_level_migration_api
MountUnmountIcebergTest	test_simple_remount	{"cloud_storage_type": 1}	integration	https://buildkite.com/redpanda/redpanda/builds/76683#019a9e1d-ac19-465f-8f3e-32490d71688e	FLAKY	16/21	upstream reliability is '79.19227392449517'. current run reliability is '76.19047619047619'. drift is 3.0018 and the allowed drift is set to 50. The test should PASS	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=MountUnmountIcebergTest&test_method=test_simple_remount
MultiRestartTest	test_recovery_after_multiple_restarts	{"cloud_storage_type": 1}	integration	https://buildkite.com/redpanda/redpanda/builds/76683#019a9e1d-ac24-4e3d-8e88-351b78aa9c2e	FLAKY	20/21	upstream reliability is '100.0'. current run reliability is '95.23809523809523'. drift is 4.7619 and the allowed drift is set to 50. The test should PASS	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=MultiRestartTest&test_method=test_recovery_after_multiple_restarts
WriteCachingFailureInjectionE2ETest	test_crash_all	{"use_transactions": false}	integration	https://buildkite.com/redpanda/redpanda/builds/76683#019a9e1d-ac20-4c51-be59-b163a0fd46ee	FLAKY	19/21	upstream reliability is '91.27310061601642'. current run reliability is '90.47619047619048'. drift is 0.79691 and the allowed drift is set to 50. The test should PASS	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=WriteCachingFailureInjectionE2ETest&test_method=test_crash_all

test results on build#76900

test_class	test_method	test_arguments	test_kind	job_url	test_status	passed	reason	test_history
MultiRestartTest	test_recovery_after_multiple_restarts	{"cloud_storage_type": 1}	integration	https://buildkite.com/redpanda/redpanda/builds/76900#019ab729-d35a-42d4-9277-b901ed71f32f	FLAKY	20/21	upstream reliability is '95.89905362776025'. current run reliability is '95.23809523809523'. drift is 0.66096 and the allowed drift is set to 50. The test should PASS	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=MultiRestartTest&test_method=test_recovery_after_multiple_restarts
TimeQueryTest	test_timequery_empty_local_log	null	integration	https://buildkite.com/redpanda/redpanda/builds/76900#019ab729-3881-4fbd-b809-52cb49df1e98	FLAKY	6/21	upstream reliability is '100.0'. current run reliability is '28.57142857142857'. drift is 71.42857 and the allowed drift is set to 50. The test should FAIL	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=TimeQueryTest&test_method=test_timequery_empty_local_log
TimeQueryTest	test_timequery_empty_local_log	null	integration	https://buildkite.com/redpanda/redpanda/builds/76900#019ab729-d359-48bd-a5b9-a8083ab8438d	FLAKY	6/21	upstream reliability is '100.0'. current run reliability is '28.57142857142857'. drift is 71.42857 and the allowed drift is set to 50. The test should FAIL	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=TimeQueryTest&test_method=test_timequery_empty_local_log

test results on build#77366

test_class	test_method	test_arguments	test_kind	job_url	test_status	passed	reason	test_history
JavaCompressionTest	test_upgrade_java_compression	{"compression_type": "gzip"}	integration	https://buildkite.com/redpanda/redpanda/builds/77366#019aeb99-9f85-49b7-85c5-d2f53bb579b5	FLAKY	20/21	upstream reliability is '88.20224719101124'. current run reliability is '95.23809523809523'. drift is -7.03585 and the allowed drift is set to 50. The test should PASS	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=JavaCompressionTest&test_method=test_upgrade_java_compression
JavaCompressionTest	test_upgrade_java_compression	{"compression_type": "gzip"}	integration	https://buildkite.com/redpanda/redpanda/builds/77366#019aeb9a-2ee2-43b0-8f4f-d8c0077ad3b0	FLAKY	20/21	upstream reliability is '88.20224719101124'. current run reliability is '95.23809523809523'. drift is -7.03585 and the allowed drift is set to 50. The test should PASS	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=JavaCompressionTest&test_method=test_upgrade_java_compression
JavaCompressionTest	test_upgrade_java_compression	{"compression_type": "lz4"}	integration	https://buildkite.com/redpanda/redpanda/builds/77366#019aeb99-9f87-4622-a59a-c6c09173769d	FLAKY	20/21	upstream reliability is '88.23529411764706'. current run reliability is '95.23809523809523'. drift is -7.0028 and the allowed drift is set to 50. The test should PASS	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=JavaCompressionTest&test_method=test_upgrade_java_compression
JavaCompressionTest	test_upgrade_java_compression	{"compression_type": "lz4"}	integration	https://buildkite.com/redpanda/redpanda/builds/77366#019aeb9a-2ee4-4b34-8ac3-7c952af48362	FLAKY	20/21	upstream reliability is '88.23529411764706'. current run reliability is '95.23809523809523'. drift is -7.0028 and the allowed drift is set to 50. The test should PASS	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=JavaCompressionTest&test_method=test_upgrade_java_compression
JavaCompressionTest	test_upgrade_java_compression	{"compression_type": "snappy"}	integration	https://buildkite.com/redpanda/redpanda/builds/77366#019aeb99-9f88-4867-be14-f747b3ed8692	FLAKY	20/21	upstream reliability is '88.23529411764706'. current run reliability is '95.23809523809523'. drift is -7.0028 and the allowed drift is set to 50. The test should PASS	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=JavaCompressionTest&test_method=test_upgrade_java_compression
JavaCompressionTest	test_upgrade_java_compression	{"compression_type": "snappy"}	integration	https://buildkite.com/redpanda/redpanda/builds/77366#019aeb9a-2ee5-4392-bfaf-261b3e833853	FLAKY	19/21	upstream reliability is '88.23529411764706'. current run reliability is '90.47619047619048'. drift is -2.2409 and the allowed drift is set to 50. The test should PASS	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=JavaCompressionTest&test_method=test_upgrade_java_compression
JavaCompressionTest	test_upgrade_java_compression	{"compression_type": "zstd"}	integration	https://buildkite.com/redpanda/redpanda/builds/77366#019aeb99-9f8a-4e65-81b1-4fbb50a32e21	FLAKY	20/21	upstream reliability is '88.23529411764706'. current run reliability is '95.23809523809523'. drift is -7.0028 and the allowed drift is set to 50. The test should PASS	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=JavaCompressionTest&test_method=test_upgrade_java_compression
JavaCompressionTest	test_upgrade_java_compression	{"compression_type": "zstd"}	integration	https://buildkite.com/redpanda/redpanda/builds/77366#019aeb9a-2ee7-4e3d-b834-478fdb10bc1b	FLAKY	20/21	upstream reliability is '88.23529411764706'. current run reliability is '95.23809523809523'. drift is -7.0028 and the allowed drift is set to 50. The test should PASS	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=JavaCompressionTest&test_method=test_upgrade_java_compression
NodesDecommissioningTest	test_decommissioning_rebalancing_node	{"shutdown_decommissioned": false}	integration	https://buildkite.com/redpanda/redpanda/builds/77366#019aeb9a-2ee7-4e3d-b834-478fdb10bc1b	FLAKY	15/21	upstream reliability is '93.27731092436974'. current run reliability is '71.42857142857143'. drift is 21.84874 and the allowed drift is set to 50. The test should PASS	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=NodesDecommissioningTest&test_method=test_decommissioning_rebalancing_node
RedpandaNodeOperationsSmokeTest	test_node_ops_smoke_test	{"cloud_storage_type": 1, "mixed_versions": true}	integration	https://buildkite.com/redpanda/redpanda/builds/77366#019aeb9a-2ee2-43b0-8f4f-d8c0077ad3b0	FLAKY	13/21	upstream reliability is '89.82300884955751'. current run reliability is '61.904761904761905'. drift is 27.91825 and the allowed drift is set to 50. The test should PASS	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=RedpandaNodeOperationsSmokeTest&test_method=test_node_ops_smoke_test

nvartolomei · 2025-11-20T21:09:43Z

+      log()->from_log_offset(_raft->start_offset()),
+      local_query_cfg.min_offset);
+
+    co_return co_await local_timequery(local_query_cfg, false);


I have an intuition that this is unnecessary.

Equivalent and simpler to comprehend (sample size = 1)?

 src/v/cluster/partition.cc | 15 ++-------------
 1 file changed, 2 insertions(+), 13 deletions(-)

diff --git i/src/v/cluster/partition.cc w/src/v/cluster/partition.cc
index 81296fe36c..d6b7b0d866 100644
--- i/src/v/cluster/partition.cc
+++ w/src/v/cluster/partition.cc
@@ -620,7 +620,7 @@ partition::timequery(storage::timequery_config cfg) {
     const bool local_covers_offsets = local_start_offset <= cfg.max_offset;
     const bool may_answer_from_local = local_covers_timestamp
                                        && local_covers_offsets;
-    if (may_answer_from_local) {
+    if (may_answer_from_local || !may_answer_from_cloud) {
         // The query is ahead of the local data's start_timestamp and
         // potentially overlaps with the local data offset range: this means it
         // _might_ hit on local data: start_timestamp is not precise, so once we
@@ -650,18 +650,7 @@ partition::timequery(storage::timequery_config cfg) {
     // 3. the local log start timestamp is after the timequery's timestamp.
     // If 1 or 2 hold, there is no offset to return. Otherwise, fall back
     // to a local timequery, which should return the start of the log.
-    if (may_answer_from_local || !local_covers_offsets) {
-        co_return std::nullopt;
-    }
-
-    // Adjust the lower bound for the local query as the min_offset
-    // corresponds to the full log (including tiered storage).
-    auto local_query_cfg = cfg;
-    local_query_cfg.min_offset = std::max(
-      log()->from_log_offset(_raft->start_offset()),
-      local_query_cfg.min_offset);
-
-    co_return co_await local_timequery(local_query_cfg, false);
+    co_return std::nullopt;
 }
 
 bool partition::may_read_from_cloud() const {

Yeah, I think this makes sense. Thanks.

nvartolomei

Approving. As far as the fix goes it seems correct. The comment is a suggestion.

andrwng · 2025-11-21T00:06:55Z

-            // The local storage hit a case where it needs to fall back
-            // to querying cloud storage.
-            co_return co_await cloud_storage_timequery(cfg);
        }


nit: can you add a comment about the intentional fallthrough, and note that a nullopt from local_timequery() is a signal to check cloud?

I think applying NV's suggestion makes this all clearer.

andrwng · 2025-11-21T00:10:20Z

        local_query_cfg.min_offset = std::max(
-          log()->from_log_offset(_raft->start_offset()),
-          local_query_cfg.min_offset);
-
-        // If the min_offset is ahead of max_offset, the local log is empty
-        // or was truncated since the timequery_config was created.
-        if (local_query_cfg.min_offset > local_query_cfg.max_offset) {
-            co_return std::nullopt;
-        }
+          local_start_offset, local_query_cfg.min_offset);


In the previous code, it was clearer that we would only ever call local_timequery() or cloud_storage_timequery() once. Is it a bug that we aren't preserving that?

Actually I guess that isn't true since there are cases where we called local_timequery() and then cloud_storage_timequery().

I'm finding the new code structure a bit confusing since it at first seems like we can call local_timequery() twice. But because the condition here is may_answer_from_local and we are conditioning on that below to return a nullopt, that isn't possible.

With NV's suggestion we maintain this property.

andrwng · 2025-11-21T00:12:19Z

-        // If the min_offset is ahead of max_offset, the local log is empty
-        // or was truncated since the timequery_config was created.
-        if (local_query_cfg.min_offset > local_query_cfg.max_offset) {
-            co_return std::nullopt;


Just making sure I'm seeing the bug through the refactor, would an equivalent fix have been to replace this line with this?

if (may_answer_from_cloud) {
    co_return co_await cloud_storage_timequery(cfg);
}
co_return std::nullopt;

See NV's suggestion.

vbotbuildovich · 2025-11-24T19:41:27Z

Retry command for Build#76900

please wait until all jobs are finished before running the slash command

/ci-repeat 1
skip-redpanda-build
skip-units
skip-rebase
tests/rptest/tests/timequery_test.py::TimeQueryTest.test_timequery_empty_local_log

nvartolomei · 2025-11-25T10:05:49Z

DEBUG 2025-11-24 19:18:20,762 [shard 1:kafk] cluster - partition.cc:683 - timequery (raft) {kafka/tqtopic/0} cfg(k)={min_offset: 3072, max_offset: 3071, time:{timestamp: 1764098234000}, type_filter:batch_type::raft_data}
WARN  2025-11-24 19:18:20,763 [shard 0:kafk] kafka - connection_context.cc:1117 - Error processing request: std::runtime_error (ntp {kafka/tqtopic/0}: data offset 3071 is outside the translation range (starting at 3072))

nvartolomei · 2025-11-25T10:12:42Z

I could swear this test passed locally when I proposed the patch.

Anyhow:

redpanda/src/v/storage/offset_translator_state.cc

Lines 42 to 47 in 69e5210

    // One common way to get this error is when the client code tries to
    // translate the end offset of an empty log (which is by convention
    // prev(start_offset) if start_offset >= 0, and therefore lies outside
    // the translation range). In this case the client code should detect
    // that the offset range is empty and manually set the end of the
    // translated range to prev(translated(start_offset)).

We need a special case for empty log. I guess we found out what the original code was attempting to do ... but incorrectly. We need the special case.

This change refactors timequery logic to fix a rare bug where a timequery would return no result despite an offset matching the timestamp existing in cloud storage and local storage. The bug is observed in the tiered storage model test when querying timestamps corresponding to offsets in the final, active segment of the log of a partition with tiered storage enabled and with aggressive cleanup settings. The partition gets into a state where the local log is truncated up to the high watermark, but the active segment remains. Querying for a timestamp within the bounds of the active segment meant that the timestamp was after the start timestamp for the local log, which comes from the base timestamp of the only segment, but the start offset of the log was after the max offset set by the timequery, which is the HWM and comes from the Kafka handler. This caused the timequery to return no result, incorrectly, without checking local or cloud storage. The refactor fixes the bug and hopefully cleans up the logic a bit. Regression test included.

It can happen that a timequery hits an empty local log and its max offset is before the min offset, which is clamped to the start of the log. This will fail translation. This change adds extra early returns to local and cloud timequeries to handle this special case pre-translation.
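The early return described above can be illustrated with a small model. This is a sketch under stated assumptions, not the code added by the commit: `offset_range`, `translate`, and `translate_range` are hypothetical names, and the translator here is an identity mapping that only reproduces the "outside the translation range" failure seen in the log excerpt earlier in this thread.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <stdexcept>

// Hypothetical stand-in for a timequery's offset bounds.
struct offset_range {
    int64_t min_offset;
    int64_t max_offset;
};

// Hypothetical translator: offsets before the start of the translation
// range cannot be translated. The mapping itself is the identity here.
inline int64_t translate(int64_t data_offset, int64_t range_start) {
    assert(range_start >= 0); // model assumption: non-negative offsets
    if (data_offset < range_start) {
        throw std::runtime_error(
          "data offset is outside the translation range");
    }
    return data_offset;
}

// Guarded translation: detect the empty range (max before min, as happens
// when min_offset is clamped to the start of an empty log) and return early
// instead of attempting to translate an untranslatable max_offset.
inline std::optional<offset_range>
translate_range(offset_range r, int64_t range_start) {
    if (r.max_offset < r.min_offset) {
        return std::nullopt;
    }
    return offset_range{
      translate(r.min_offset, range_start),
      translate(r.max_offset, range_start)};
}
```

For the values in the failing run (`min_offset: 3072`, `max_offset: 3071`, translation range starting at 3072), `translate_range` returns `std::nullopt`, whereas calling `translate(3071, 3072)` directly throws.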

wdberkeley · 2025-12-04T22:27:44Z

Force push to rebase on dev, then a push to add a special case for the empty log, where max_offset can end up before min_offset (which is clamped to the start_offset).

vbotbuildovich · 2025-12-05T18:44:43Z

/backport v25.3.x

vbotbuildovich · 2025-12-05T18:44:44Z

/backport v25.2.x

vbotbuildovich · 2025-12-05T18:44:45Z

/backport v25.1.x

vbotbuildovich · 2025-12-05T18:44:45Z

/backport v24.3.x

Copilot AI review requested due to automatic review settings November 19, 2025 21:21

github-actions Bot added the area/redpanda label Nov 19, 2025

Copilot AI reviewed Nov 19, 2025

wdberkeley requested review from andrwng and nvartolomei November 20, 2025 17:18

wdberkeley mentioned this pull request Nov 20, 2025

Add early return logging to timequeries #28485

Closed

8 tasks

nvartolomei reviewed Nov 20, 2025

nvartolomei previously approved these changes Nov 20, 2025

andrwng reviewed Nov 21, 2025

wdberkeley dismissed nvartolomei’s stale review via 2cc4c4e November 24, 2025 18:07

wdberkeley requested review from andrwng and nvartolomei November 24, 2025 18:07

wdberkeley added 2 commits December 4, 2025 14:25

partition/timequery: Simplify tq logic further

c0b2445

wdberkeley force-pushed the timequery-bad branch from 2cc4c4e to c0b2445 Compare December 4, 2025 22:25

andrwng approved these changes Dec 5, 2025

wdberkeley merged commit 0cc1bc2 into redpanda-data:dev Dec 5, 2025
19 checks passed

This was referenced Dec 5, 2025

[v25.2.x] partition/timequery: Fix tiered storage timequery bug #28870

Closed

[v25.3.x] partition/timequery: Fix tiered storage timequery bug #28871

Merged

[v25.1.x] partition/timequery: Fix tiered storage timequery bug #28872

Closed

vbotbuildovich mentioned this pull request Dec 5, 2025

[v24.3.x] partition/timequery: Fix tiered storage timequery bug #28873

Closed

This was referenced Dec 8, 2025

[v25.2.x] partition/timequery: Fix tiered storage timequery bug (MANUAL BACKPORT) #28898

Merged

[v25.1.x] partition/timequery: Fix tiered storage timequery bug (MANUAL BACKPORT) #28899

Merged
