kafka: catch shutdown errors in list_offsets and fetch#28358
Conversation
Following [1], timequeries may throw. I managed to trigger the throw in a test by injecting some short sleeps; it results in the Kafka request timing out after the timequery throws. This commit updates the Kafka layer timequery call to explicitly handle shutdown errors with an error code indicating we're no longer leader, which seems reasonable if we're shutting down. I left other errors as they were before, since it isn't clear what a good error code would be. 1. redpanda-data#28328
Following [1], make_reader() may throw if the underlying log is closed while processing the request (rather than asserting). As is, the throw in the handler causes the client request to time out. Instead, this updates the shutdown case to return an error code indicating that we are no longer leader (which seems reasonable if the partition is shutting down). 1. redpanda-data#28328
There was a problem hiding this comment.
Pull Request Overview
This PR adds exception handling for shutdown scenarios in Kafka request handlers. When the storage layer throws exceptions due to partition shutdown, the handlers now catch these exceptions and return not_leader_for_partition error codes instead of propagating uncaught exceptions.
Key changes:
- Wraps potentially-throwing storage operations in
ss::coroutine::as_future()to catch exceptions - Checks for shutdown exceptions using
ssx::is_shutdown_exception() - Returns appropriate Kafka error codes for shutdown scenarios
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| src/v/kafka/server/handlers/list_offsets.cc | Adds exception handling around timequery() call to catch shutdown exceptions and return not_leader_for_partition error |
| src/v/kafka/server/handlers/fetch.cc | Adds exception handling around read_from_partition() call to catch shutdown exceptions and return not_leader_for_partition error |
| std::rethrow_exception(ex); | ||
| } | ||
|
|
||
| // Note that units can be both increased and decreassed here. Increases |
There was a problem hiding this comment.
Corrected spelling of 'decreassed' to 'decreased'.
| // Note that units can be both increased and decreassed here. Increases | |
| // Note that units can be both increased and decreased here. Increases |
Retry command for Build#75614please wait until all jobs are finished before running the slash command |
CI test resultstest results on build#75614
|
|
/backport v25.3.x |
|
/backport v25.2.x |
|
/backport v25.1.x |
|
/backport v24.3.x |
|
Branch name "v25.3.x" not found. |
|
Failed to create a backport PR to v25.1.x branch. I tried: |
|
Failed to create a backport PR to v24.3.x branch. I tried: |
|
Failed to create a backport PR to v25.2.x branch. I tried: |
Catches a couple throws that are possible in the Kafka layer when reaching into the storage layer, following #28328.
In both cases, we'll return an error code indicating we're no longer leader, which seems reasonable if we've shut the partition down.
I considered instead implementing a more invasive change that made the replicated_partition return a result type, but opted for this to avoid a bunch of refactoring.
Backports Required
Release Notes