Deflake offset for leader epoch#28389
Conversation
There was a problem hiding this comment.
Pull Request Overview
This PR addresses a flaky test issue in the offset_for_leader_epoch handler by improving leadership validation and introducing concurrent processing. The main change replaces the cached is_leader() check with a linearizable_barrier() to ensure accurate, real-time leadership status at the moment the request is received, preventing stale data from causing test failures.
Key changes:
- Replaced cached leadership check with
linearizable_barrier()for accurate real-time validation - Introduced concurrent processing using
ss::max_concurrent_for_eachwith parallelism limit of 32 - Restructured the request processing loop into a lambda-based concurrent execution model
Retry command for Build#75708please wait until all jobs are finished before running the slash command |
CI test resultstest results on build#75708test results on build#75830
test results on build#76764
|
45bed9a to
de850ef
Compare
|
/ci-repeat 4 |
Describe paritition can occasionally return -1 aka UNKNOWN_EPOCH for a partition's term. This epoch is subsequently fed into offset_for_leader_epoch. This is an illegal value for offset_for_leader_epoch. This commit changes the logic for gathering partition data to require first that all parition descriptions returned have a valid high watermark and a valid epoch.
de850ef to
54b62b6
Compare
|
I researched this again and found the following: [DEBUG - 2025-10-29 16:51:31,697 - offset_for_leader_epoch_test - test_offset_for_leader_epoch_transfer - lineno:274]: Fetched offsets for epoch -1 : {... 30: 0, ...}, expected: {30: 612} relevant invocation of offfset_for_leader_epoch: where did we get -1 for the epoch? ... /redpanda/redpanda/vbuild/redpanda_installs/ci/bin/rpk', 'topic', '-X', 'brokers=docker-rp-7:9092,docker-rp-22:9092,docker-rp-17:9092,docker-rp-23:9092,docker-rp-18:9092', 'describe', 'topic-kggcsntgbk', '-p', '-X', 'globals.request_timeout_overhead=30s', '-v'] RPK returned -1 for describe topic which was fed into offset_for_leader_epoch, which is an invalid value to provide, this results in an expectation mismatch between the offset 612, and the default offset which gets returned for an invalid partition which is the log_start offset Should we check the provided epoch to guarantee that it is a valid epoch? |
|
@joe-redpanda can we wait in the test for valid set of epochs ? |
That's whats happening. def _get_offsets_and_epochs(self, rpk: RpkTool, topic_name: str):
offsets = []
def refresh():
result = rpk.describe_topic(topic_name)
offsets.clear()
offsets.extend(result)
def all_offsets_valid():
refresh()
# metadata request may return INVALID_EPOCH aka -1
# this should not be used because INVALID_EPOCH maps to latest available
# epoch in OffsetForLeaderEpochRequest
return all([p.high_watermark >= 0 and p.leader_epoch >= 0 for p in offsets])
wait_until(all_offsets_valid, 30, 1)
return offsetswe're doing a 30s wait until the condition is met where the condition is |
|
A good question, though, is should we strike down a user request with 'invalid_request' when epoch is < 0 |
I"m curious why epoch is -1 despite so many retries? |
previously the criteria was only p.high_watermark >= 0, with no epoch check heres the old 'all' command def all_offsets_valid():
refresh()
return all([p.high_watermark >= 0 for p in offsets])I bumped this to a local function because both tests are susceptible to this. |
|
/backport v25.3.x |
|
/backport v25.2.x |
|
/backport v25.1.x |
|
/backport v24.3.x |
Describe paritition can occasionally return -1 aka UNKNOWN_EPOCH for a
partition's term.
This epoch is subsequently fed into offset_for_leader_epoch.
This is an illegal value for offset_for_leader_epoch.
This commit changes the logic for gathering partition data to require
first that all parition descriptions returned have a valid high
watermark and a valid epoch.
Backports Required
Release Notes
Bug Fixes