kafka: catch shutdown errors in list_offsets and fetch by andrwng · Pull Request #28358 · redpanda-data/redpanda

andrwng · 2025-11-05T01:27:21Z

Catches a couple of throws that are possible in the Kafka layer when reaching into the storage layer, following #28328.

In both cases, we'll return an error code indicating we're no longer leader, which seems reasonable if we've shut the partition down.

I considered instead implementing a more invasive change that made the replicated_partition return a result type, but opted for this to avoid a bunch of refactoring.

Backports Required

Release Notes

None

Following [1], timequeries may throw. I managed to trigger the throw in a test by injecting some short sleeps; it results in the Kafka request timing out after the timequery throws. This commit updates the Kafka layer timequery call to explicitly handle shutdown errors with an error code indicating we're no longer leader, which seems reasonable if we're shutting down. I left other errors as they were before, since it isn't clear what a good error code would be.

[1] redpanda-data#28328

Following [1], make_reader() may throw if the underlying log is closed while processing the request (rather than asserting). As is, the throw in the handler causes the client request to time out. Instead, this updates the shutdown case to return an error code indicating that we are no longer leader (which seems reasonable if the partition is shutting down).

[1] redpanda-data#28328

Copilot

Pull Request Overview

This PR adds exception handling for shutdown scenarios in Kafka request handlers. When the storage layer throws exceptions due to partition shutdown, the handlers now catch these exceptions and return not_leader_for_partition error codes instead of propagating uncaught exceptions.

Key changes:

Wraps potentially-throwing storage operations in ss::coroutine::as_future() to catch exceptions
Checks for shutdown exceptions using ssx::is_shutdown_exception()
Returns appropriate Kafka error codes for shutdown scenarios
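
The pattern in the key changes above can be sketched in standard C++ without Seastar. This is only an illustration under stated assumptions, not the code from the PR: `error_code`, `shutdown_error`, `is_shutdown_exception` and `call_storage` are hypothetical stand-ins for `kafka::error_code`, the storage layer's shutdown exceptions, `ssx::is_shutdown_exception()` and the handler logic, and a plain try/catch stands in for `ss::coroutine::as_future()`.

```cpp
#include <cassert>
#include <exception>
#include <functional>
#include <stdexcept>

// Hypothetical stand-in for the Kafka error codes used by the handlers.
enum class error_code { none, not_leader_for_partition };

// Hypothetical stand-in for the exception types thrown on shutdown.
struct shutdown_error : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Simplified classifier: true only for the stand-in shutdown type.
bool is_shutdown_exception(const std::exception_ptr& ex) {
    assert(ex != nullptr);
    try {
        std::rethrow_exception(ex);
    } catch (const shutdown_error&) {
        return true;
    } catch (...) {
        return false;
    }
}

// Runs a storage-layer call, captures any exception instead of letting it
// propagate, maps shutdown errors to not_leader_for_partition, and rethrows
// every other error unchanged.
error_code call_storage(const std::function<void()>& storage_op) {
    std::exception_ptr ex;
    try {
        storage_op();
    } catch (...) {
        ex = std::current_exception();
    }
    if (!ex) {
        return error_code::none;
    }
    if (is_shutdown_exception(ex)) {
        return error_code::not_leader_for_partition;
    }
    std::rethrow_exception(ex);
}
```

In this sketch only the shutdown case is translated into an error code; any other exception is rethrown, which mirrors the description's choice to leave non-shutdown errors as they were.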

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File	Description
src/v/kafka/server/handlers/list_offsets.cc	Adds exception handling around `timequery()` call to catch shutdown exceptions and return not_leader_for_partition error
src/v/kafka/server/handlers/fetch.cc	Adds exception handling around `read_from_partition()` call to catch shutdown exceptions and return not_leader_for_partition error

Copilot · 2025-11-05T01:27:45Z

+        std::rethrow_exception(ex);
+    }

    // Note that units can be both increased and decreassed here. Increases


Corrected spelling of 'decreassed' to 'decreased'.

Suggested change

// Note that units can be both increased and decreassed here. Increases

// Note that units can be both increased and decreased here. Increases

vbotbuildovich · 2025-11-05T04:41:00Z

Retry command for Build#75614

please wait until all jobs are finished before running the slash command

/ci-repeat 1
tests/rptest/tests/random_node_operations_smoke_test.py::RedpandaNodeOperationsSmokeTest.test_node_ops_smoke_test@{"cloud_storage_type":1,"mixed_versions":true}

vbotbuildovich · 2025-11-05T05:24:20Z

CI test results

test results on build#75614

test_class	test_method	test_arguments	test_kind	job_url	test_status	passed	reason	test_history
DataMigrationsApiTest	test_creating_and_listing_migrations	null	integration	https://buildkite.com/redpanda/redpanda/builds/75614#019a51bc-161f-486e-8614-f18b41202eee	FLAKY	17/21	upstream reliability is '96.16497829232996'. current run reliability is '80.95238095238095'. drift is 15.2126 and the allowed drift is set to 50. The test should PASS	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=DataMigrationsApiTest&test_method=test_creating_and_listing_migrations
LogCompactionTxRemovalTest	test_tx_control_batch_removal	null	integration	https://buildkite.com/redpanda/redpanda/builds/75614#019a51bc-161d-4580-9564-588a6892bf74	FLAKY	13/21	upstream reliability is '86.74556213017752'. current run reliability is '61.904761904761905'. drift is 24.8408 and the allowed drift is set to 50. The test should PASS	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=LogCompactionTxRemovalTest&test_method=test_tx_control_batch_removal
NodesDecommissioningTest	test_decommissioning_finishes_after_manual_cancellation	{"cloud_topic": false, "delete_topic": false}	integration	https://buildkite.com/redpanda/redpanda/builds/75614#019a51bd-9551-445b-9228-1b113c497e8f	FLAKY	19/21	upstream reliability is '98.17444219066938'. current run reliability is '90.47619047619048'. drift is 7.69825 and the allowed drift is set to 50. The test should PASS	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=NodesDecommissioningTest&test_method=test_decommissioning_finishes_after_manual_cancellation
SIPartitionMovementTest	test_cross_shard	{"cloud_storage_type": 2, "num_to_upgrade": 2, "with_cloud_topics": false}	integration	https://buildkite.com/redpanda/redpanda/builds/75614#019a51bd-955a-4ac3-989a-97aff7b02340	FLAKY	18/21	upstream reliability is '100.0'. current run reliability is '85.71428571428571'. drift is 14.28571 and the allowed drift is set to 50. The test should PASS	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=SIPartitionMovementTest&test_method=test_cross_shard
RedpandaNodeOperationsSmokeTest	test_node_ops_smoke_test	{"cloud_storage_type": 1, "mixed_versions": false}	integration	https://buildkite.com/redpanda/redpanda/builds/75614#019a51bc-161e-4ec8-864c-db0817dfb196	FLAKY	12/21	upstream reliability is '100.0'. current run reliability is '57.14285714285714'. drift is 42.85714 and the allowed drift is set to 50. The test should PASS	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=RedpandaNodeOperationsSmokeTest&test_method=test_node_ops_smoke_test
RedpandaNodeOperationsSmokeTest	test_node_ops_smoke_test	{"cloud_storage_type": 1, "mixed_versions": true}	integration	https://buildkite.com/redpanda/redpanda/builds/75614#019a51bc-161f-486e-8614-f18b41202eee	FLAKY	8/21	upstream reliability is '97.57575757575758'. current run reliability is '38.095238095238095'. drift is 59.48052 and the allowed drift is set to 50. The test should FAIL	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=RedpandaNodeOperationsSmokeTest&test_method=test_node_ops_smoke_test
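
The numbers in the reason column are consistent with a simple calculation: current run reliability is the pass rate as a percentage, and drift is upstream reliability minus current run reliability. The sketch below reproduces that arithmetic. The function names are hypothetical, and the rule that a test should pass when its drift does not exceed the allowed drift is inferred from the rows above rather than taken from the bot's implementation.

```cpp
#include <cassert>
#include <cmath>

// Pass rate of the current run as a percentage, e.g. 17 passes out of 21.
double reliability_pct(int passed, int total) {
    assert(total > 0);
    return 100.0 * passed / total;
}

// Drift as it appears in the table: upstream minus current reliability.
double drift(double upstream_pct, double current_pct) {
    return upstream_pct - current_pct;
}

// Assumed rule: the test should pass while drift stays within the allowance.
bool should_pass(double drift_value, double allowed_drift) {
    return drift_value <= allowed_drift;
}
```

For the first row, 17/21 gives about 80.95, and 96.16 minus 80.95 gives a drift of about 15.21, which is under the allowed 50. For the last row, 8/21 gives about 38.10 and a drift of about 59.48, which exceeds it.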

vbotbuildovich · 2025-11-05T09:38:34Z

/backport v25.3.x

vbotbuildovich · 2025-11-05T09:38:35Z

/backport v25.2.x

vbotbuildovich · 2025-11-05T09:38:36Z

/backport v25.1.x

vbotbuildovich · 2025-11-05T09:38:37Z

/backport v24.3.x

vbotbuildovich · 2025-11-05T09:38:59Z

Branch name "v25.3.x" not found.

Workflow run logs.

vbotbuildovich · 2025-11-05T09:39:48Z

Failed to create a backport PR to v25.1.x branch. I tried:

git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-28358-v25.1.x-919 remotes/upstream/v25.1.x
git cherry-pick -x cb957c983f 1434640d9e

Workflow run logs.

vbotbuildovich · 2025-11-05T09:39:50Z

Failed to create a backport PR to v24.3.x branch. I tried:

git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-28358-v24.3.x-12 remotes/upstream/v24.3.x
git cherry-pick -x cb957c983f 1434640d9e

Workflow run logs.

vbotbuildovich · 2025-11-05T09:39:57Z

Failed to create a backport PR to v25.2.x branch. I tried:

git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-28358-v25.2.x-231 remotes/upstream/v25.2.x
git cherry-pick -x cb957c983f 1434640d9e

Workflow run logs.

andrwng added 2 commits November 4, 2025 17:08

Copilot AI review requested due to automatic review settings November 5, 2025 01:27

github-actions Bot added the area/redpanda label Nov 5, 2025

Copilot AI reviewed Nov 5, 2025

View reviewed changes

andrwng changed the title ~~kafka:~~ kafka: catch shutdown errors in list_offsets and fetch Nov 5, 2025

rockwotj approved these changes Nov 5, 2025

View reviewed changes

andrwng enabled auto-merge November 5, 2025 02:40

andrwng disabled auto-merge November 5, 2025 02:40

andrwng enabled auto-merge November 5, 2025 02:40

andrwng merged commit 79d3602 into redpanda-data:dev Nov 5, 2025
19 checks passed

This was referenced Nov 5, 2025

[v25.1.x] kafka: catch shutdown errors in list_offsets and fetch #28361

Open

[v24.3.x] kafka: catch shutdown errors in list_offsets and fetch #28362

Open

vbotbuildovich mentioned this pull request Nov 5, 2025

[v25.2.x] kafka: catch shutdown errors in list_offsets and fetch #28363

Open

andrwng merged 2 commits into redpanda-data:dev from andrwng:kafka-catch-shutdown-ex
