kafka: catch shutdown errors in list_offsets and fetch#28358

Merged: andrwng merged 2 commits into redpanda-data:dev from andrwng:kafka-catch-shutdown-ex on Nov 5, 2025

@andrwng andrwng commented Nov 5, 2025

Catches a couple of throws that are possible in the Kafka layer when reaching into the storage layer, following #28328.

In both cases, we'll return an error code indicating we're no longer leader, which seems reasonable if we've shut the partition down.

I considered instead implementing a more invasive change that made the replicated_partition return a result type, but opted for this to avoid a bunch of refactoring.
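For illustration only, the result-type alternative could look roughly like the sketch below. Everything in it (`sketch_error`, `reader`, `result`, `partition_sketch`) is a hypothetical stand-in written with the standard library, not the actual `replicated_partition` interface. The point is that shutdown would be reported through the return value, so every caller has to handle it, which is where the refactoring cost comes from.

```cpp
#include <cassert>
#include <string>
#include <variant>

// Hypothetical names throughout; this is not replicated_partition's interface.
enum class sketch_error { not_leader_for_partition };

// Placeholder for whatever the storage layer would hand back.
struct reader {
    std::string name;
};

// A minimal result type: either a value or an error code.
template <typename T>
using result = std::variant<T, sketch_error>;

struct partition_sketch {
    bool closed = false;

    // Reports shutdown through the return value instead of throwing.
    result<reader> make_reader() const {
        if (closed) {
            return sketch_error::not_leader_for_partition;
        }
        return reader{"log-reader"};
    }
};
```

A caller would check `std::holds_alternative<sketch_error>(...)` before using the reader, instead of relying on a catch block in the handler.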

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v25.3.x
  • v25.2.x
  • v25.1.x
  • v24.3.x

Release Notes

  • None

Following [1], timequeries may throw. I managed to trigger the throw in
a test by injecting some short sleeps; it results in the Kafka request
timing out after the timequery throws.

This commit updates the Kafka layer timequery call to explicitly handle
shutdown errors with an error code indicating we're no longer leader,
which seems reasonable if we're shutting down. I left other errors as
they were before, since it isn't clear what a good error code would be.

1. redpanda-data#28328

Following [1], make_reader() may throw if the underlying log is closed
while processing the request (rather than asserting). As is, the throw
in the handler causes the client request to time out. Instead, this
updates the shutdown case to return an error code indicating that we
are no longer leader (which seems reasonable if the partition is
shutting down).

1. redpanda-data#28328

Copilot AI review requested due to automatic review settings November 5, 2025 01:27

Copilot AI left a comment

Pull Request Overview

This PR adds exception handling for shutdown scenarios in Kafka request handlers. When the storage layer throws exceptions due to partition shutdown, the handlers now catch these exceptions and return not_leader_for_partition error codes instead of propagating uncaught exceptions.

Key changes:

  • Wraps potentially-throwing storage operations in ss::coroutine::as_future() to catch exceptions
  • Checks for shutdown exceptions using ssx::is_shutdown_exception()
  • Returns appropriate Kafka error codes for shutdown scenarios
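
The three steps above can be sketched with the standard library alone. This is a minimal analogue, not the PR's code: `shutdown_error`, `error_code`, `call_storage`, and the local `is_shutdown_exception` are hypothetical stand-ins, and a plain `try`/`catch` plays the role that `ss::coroutine::as_future()` plays in the coroutine handlers. Non-shutdown errors are rethrown unchanged, matching the commit message's note that other errors were left as they were.

```cpp
#include <cassert>
#include <exception>
#include <stdexcept>

// Hypothetical stand-ins; not Redpanda's or Seastar's actual types.
enum class error_code { none, not_leader_for_partition };

struct shutdown_error : std::runtime_error {
    shutdown_error() : std::runtime_error("partition is shutting down") {}
};

// Plays the role the PR gives to ssx::is_shutdown_exception().
// Expects a non-null exception_ptr.
inline bool is_shutdown_exception(const std::exception_ptr& ex) {
    try {
        std::rethrow_exception(ex);
    } catch (const shutdown_error&) {
        return true;
    } catch (...) {
        return false;
    }
}

// Runs a storage call and translates only shutdown errors into an error code.
template <typename Fn>
error_code call_storage(Fn&& storage_op) {
    // Capture any exception as a value instead of letting it escape.
    std::exception_ptr ex;
    try {
        storage_op();
    } catch (...) {
        ex = std::current_exception();
    }
    if (!ex) {
        return error_code::none;
    }
    if (is_shutdown_exception(ex)) {
        return error_code::not_leader_for_partition;
    }
    // Anything else propagates exactly as it did before.
    std::rethrow_exception(ex);
}
```

With this shape, a call that throws `shutdown_error` yields `error_code::not_leader_for_partition`, while a call that throws, say, `std::logic_error` still surfaces that exception to the caller.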

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| src/v/kafka/server/handlers/list_offsets.cc | Adds exception handling around timequery() call to catch shutdown exceptions and return not_leader_for_partition error |
| src/v/kafka/server/handlers/fetch.cc | Adds exception handling around read_from_partition() call to catch shutdown exceptions and return not_leader_for_partition error |

std::rethrow_exception(ex);
}

// Note that units can be both increased and decreassed here. Increases

Copilot AI Nov 5, 2025

Corrected spelling of 'decreassed' to 'decreased'.

Suggested change
- // Note that units can be both increased and decreassed here. Increases
+ // Note that units can be both increased and decreased here. Increases

@andrwng andrwng changed the title from "kafka:" to "kafka: catch shutdown errors in list_offsets and fetch" Nov 5, 2025
@andrwng andrwng enabled auto-merge November 5, 2025 02:40
@andrwng andrwng disabled auto-merge November 5, 2025 02:40
@andrwng andrwng enabled auto-merge November 5, 2025 02:40
@vbotbuildovich

Retry command for Build#75614

please wait until all jobs are finished before running the slash command

/ci-repeat 1
tests/rptest/tests/random_node_operations_smoke_test.py::RedpandaNodeOperationsSmokeTest.test_node_ops_smoke_test@{"cloud_storage_type":1,"mixed_versions":true}

@vbotbuildovich

CI test results

test results on build#75614
| test_class | test_method | test_arguments | test_kind | job_url | test_status | passed | reason | test_history |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DataMigrationsApiTest | test_creating_and_listing_migrations | null | integration | https://buildkite.com/redpanda/redpanda/builds/75614#019a51bc-161f-486e-8614-f18b41202eee | FLAKY | 17/21 | upstream reliability is '96.16497829232996'. current run reliability is '80.95238095238095'. drift is 15.2126 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=DataMigrationsApiTest&test_method=test_creating_and_listing_migrations |
| LogCompactionTxRemovalTest | test_tx_control_batch_removal | null | integration | https://buildkite.com/redpanda/redpanda/builds/75614#019a51bc-161d-4580-9564-588a6892bf74 | FLAKY | 13/21 | upstream reliability is '86.74556213017752'. current run reliability is '61.904761904761905'. drift is 24.8408 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=LogCompactionTxRemovalTest&test_method=test_tx_control_batch_removal |
| NodesDecommissioningTest | test_decommissioning_finishes_after_manual_cancellation | {"cloud_topic": false, "delete_topic": false} | integration | https://buildkite.com/redpanda/redpanda/builds/75614#019a51bd-9551-445b-9228-1b113c497e8f | FLAKY | 19/21 | upstream reliability is '98.17444219066938'. current run reliability is '90.47619047619048'. drift is 7.69825 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=NodesDecommissioningTest&test_method=test_decommissioning_finishes_after_manual_cancellation |
| SIPartitionMovementTest | test_cross_shard | {"cloud_storage_type": 2, "num_to_upgrade": 2, "with_cloud_topics": false} | integration | https://buildkite.com/redpanda/redpanda/builds/75614#019a51bd-955a-4ac3-989a-97aff7b02340 | FLAKY | 18/21 | upstream reliability is '100.0'. current run reliability is '85.71428571428571'. drift is 14.28571 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=SIPartitionMovementTest&test_method=test_cross_shard |
| RedpandaNodeOperationsSmokeTest | test_node_ops_smoke_test | {"cloud_storage_type": 1, "mixed_versions": false} | integration | https://buildkite.com/redpanda/redpanda/builds/75614#019a51bc-161e-4ec8-864c-db0817dfb196 | FLAKY | 12/21 | upstream reliability is '100.0'. current run reliability is '57.14285714285714'. drift is 42.85714 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=RedpandaNodeOperationsSmokeTest&test_method=test_node_ops_smoke_test |
| RedpandaNodeOperationsSmokeTest | test_node_ops_smoke_test | {"cloud_storage_type": 1, "mixed_versions": true} | integration | https://buildkite.com/redpanda/redpanda/builds/75614#019a51bc-161f-486e-8614-f18b41202eee | FLAKY | 8/21 | upstream reliability is '97.57575757575758'. current run reliability is '38.095238095238095'. drift is 59.48052 and the allowed drift is set to 50. The test should FAIL | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=RedpandaNodeOperationsSmokeTest&test_method=test_node_ops_smoke_test |

@andrwng andrwng merged commit 79d3602 into redpanda-data:dev Nov 5, 2025
19 checks passed
@vbotbuildovich

/backport v25.3.x

@vbotbuildovich

/backport v25.2.x

@vbotbuildovich

/backport v25.1.x

@vbotbuildovich

/backport v24.3.x

@vbotbuildovich

Branch name "v25.3.x" not found.

Workflow run logs.

@vbotbuildovich

Failed to create a backport PR to v25.1.x branch. I tried:

git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-28358-v25.1.x-919 remotes/upstream/v25.1.x
git cherry-pick -x cb957c983f 1434640d9e

Workflow run logs.

@vbotbuildovich

Failed to create a backport PR to v24.3.x branch. I tried:

git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-28358-v24.3.x-12 remotes/upstream/v24.3.x
git cherry-pick -x cb957c983f 1434640d9e

Workflow run logs.

@vbotbuildovich

Failed to create a backport PR to v25.2.x branch. I tried:

git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-28358-v25.2.x-231 remotes/upstream/v25.2.x
git cherry-pick -x cb957c983f 1434640d9e

Workflow run logs.
