Skip to content

DEVPROD-3638 fix(tests): add HTTP retry logic for transient connectio…#29238

Merged
cjayani merged 3 commits into
devfrom
DEVPROD-3638
Mar 3, 2026
Merged

DEVPROD-3638 fix(tests): add HTTP retry logic for transient connectio…#29238
cjayani merged 3 commits into
devfrom
DEVPROD-3638

Conversation

@cjayani

@cjayani cjayani commented Jan 13, 2026

Copy link
Copy Markdown
Contributor

Jira: DEVPROD-3638

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v25.3.x
  • v25.2.x
  • v25.1.x

Release Notes

Improvements

  • Adds automatic retry with backoff for transient HTTP connection errors in Cloud API and S3 package download calls, reducing intermittent CDT test failures caused by RemoteDisconnected errors.

@cjayani cjayani requested a review from rpdevmp January 13, 2026 14:21
@cjayani

cjayani commented Jan 13, 2026

Copy link
Copy Markdown
Contributor Author

/cdt

Comment thread tests/rptest/services/provider_clients/rpcloud_client.py Outdated
@vbotbuildovich

vbotbuildovich commented Jan 14, 2026

Copy link
Copy Markdown
Collaborator

CI test results

test results on build#79015
test_class test_method test_arguments test_kind job_url test_status passed reason test_history
MountUnmountIcebergTest test_simple_remount {"cloud_storage_type": 1} integration https://buildkite.com/redpanda/redpanda/builds/79015#019bbc48-b89c-4e1b-99b8-1888e0682e5f FLAKY 8/11 Test PASSES after retries.No significant increase in flaky rate(baseline=0.1982, p0=0.6188, reject_threshold=0.0100. adj_baseline=0.4846, p1=0.0664, trust_threshold=0.5000) https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=MountUnmountIcebergTest&test_method=test_simple_remount
test results on build#80981
test_class test_method test_arguments test_kind job_url test_status passed reason test_history
NodesDecommissioningTest test_decommission_status null integration https://buildkite.com/redpanda/redpanda/builds/80981#019c8f46-9a19-46b1-965c-0f3cc4609149 FLAKY 10/11 Test PASSES after retries.No significant increase in flaky rate(baseline=0.0468, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1340, p1=0.2371, trust_threshold=0.5000) https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=NodesDecommissioningTest&test_method=test_decommission_status
test results on build#80999
test_class test_method test_arguments test_kind job_url test_status passed reason test_history
HighThroughputTest test_decommission_and_add null integration https://buildkite.com/redpanda/redpanda/builds/80999#019c9080-2e58-4812-85cd-8e866b0a0776 FAIL 0/11 The test was found to be new, and no failures are allowed https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=HighThroughputTest&test_method=test_decommission_and_add
HighThroughputTest test_decommission_and_add null integration https://buildkite.com/redpanda/redpanda/builds/80999#019c9081-055f-44fa-b77f-c670a80c8909 FAIL 0/11 The test was found to be new, and no failures are allowed https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=HighThroughputTest&test_method=test_decommission_and_add
test results on build#81342
test_class test_method test_arguments test_kind job_url test_status passed reason test_history
RedpandaNodeOperationsSmokeTest test_node_ops_smoke_test {"cloud_storage_type": 1, "mixed_versions": false} integration https://buildkite.com/redpanda/redpanda/builds/81342#019cb32f-cc28-43e0-80d9-7c1d2d7a2c21 FLAKY 19/21 Test PASSES after retries.No significant increase in flaky rate(baseline=0.0394, p0=0.5522, reject_threshold=0.0100. adj_baseline=0.1135, p1=0.3199, trust_threshold=0.5000) https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=RedpandaNodeOperationsSmokeTest&test_method=test_node_ops_smoke_test

updating feature branch with all latest changes from dev branch
@vbotbuildovich

vbotbuildovich commented Feb 24, 2026

Copy link
Copy Markdown
Collaborator

Retry command for Build#80999

please wait until all jobs are finished before running the slash command

/ci-repeat 1
skip-redpanda-build
skip-units
skip-rebase
tests/rptest/redpanda_cloud_tests/high_throughput_test.py::HighThroughputTest.test_decommission_and_add

@cjayani cjayani added the claude-review Adding this label to a PR will trigger a workflow to review the code using claude. label Feb 26, 2026
Comment thread tests/rptest/services/redpanda_installer.py
Comment thread tests/rptest/services/redpanda_installer.py
Comment thread tests/rptest/services/provider_clients/rpcloud_client.py
@claude

claude Bot commented Feb 26, 2026

Copy link
Copy Markdown

Code Review Summary

The PR replaces ad-hoc HTTP calls and manual retry loops with requests.Session + urllib3.Retry, which is a clean approach that centralizes retry logic. The overall direction is good. A few issues to address:

Issues

1. Missing status_forcelist in redpanda_installer.py (Bug / Behavioral Regression)
The _create_http_session() in redpanda_installer.py has no status_forcelist, so it only retries on connection-level errors (TCP failures, read errors), not on HTTP 5xx responses. The old manual retry loop in _avail_for_download retried on any non-allowed status code including 500/502/503/504. This is a regression — transient S3 5xx errors will no longer be retried. Add status_forcelist=[502, 503, 504] to match the old behavior (the rpcloud_client.py config already has this, so the inconsistency appears to be an oversight).

2. Significantly weaker backoff timing in redpanda_installer.py
The old retry used sleep(5.0 ** (4 - num_retries)) producing delays of 5, 25, 125 seconds. The new backoff_factor=1.0 produces 0, 1, 2 seconds. If the long delays were intentional for S3 propagation issues, a higher backoff_factor may be needed.

3. _released_versions_json still uses bare requests.get()
Line 407 of redpanda_installer.py still calls requests.get(url) directly instead of self._http_session.get(url). This paginated GitHub API call would also benefit from the retry logic, especially since the PR motivation mentions reducing intermittent RemoteDisconnected errors.

Minor

4. Non-idempotent methods in rpcloud_client.py retry config
allowed_methods includes POST/PATCH/DELETE. For a test client this is likely fine, but worth being aware that retrying non-idempotent requests on 502/503/504 could cause duplicate side effects.

What looks good

  • Using requests.Session with urllib3.Retry is the idiomatic way to handle this in Python — much cleaner than manual loops.
  • rpcloud_client.py has a well-configured retry with appropriate status_forcelist.
  • Restricting installer retries to HEAD/GET only is a safe choice.
  • The from time import sleep removal is correct since its only usage was in the deleted retry loop.

@cjayani

cjayani commented Feb 27, 2026

Copy link
Copy Markdown
Contributor Author

Code Review Summary

The PR replaces ad-hoc HTTP calls and manual retry loops with requests.Session + urllib3.Retry, which is a clean approach that centralizes retry logic. The overall direction is good. A few issues to address:

Issues

1. Missing status_forcelist in redpanda_installer.py (Bug / Behavioral Regression) The _create_http_session() in redpanda_installer.py has no status_forcelist, so it only retries on connection-level errors (TCP failures, read errors), not on HTTP 5xx responses. The old manual retry loop in _avail_for_download retried on any non-allowed status code including 500/502/503/504. This is a regression — transient S3 5xx errors will no longer be retried. Add status_forcelist=[502, 503, 504] to match the old behavior (the rpcloud_client.py config already has this, so the inconsistency appears to be an oversight).

2. Significantly weaker backoff timing in redpanda_installer.py The old retry used sleep(5.0 ** (4 - num_retries)) producing delays of 5, 25, 125 seconds. The new backoff_factor=1.0 produces 0, 1, 2 seconds. If the long delays were intentional for S3 propagation issues, a higher backoff_factor may be needed.

3. _released_versions_json still uses bare requests.get() Line 407 of redpanda_installer.py still calls requests.get(url) directly instead of self._http_session.get(url). This paginated GitHub API call would also benefit from the retry logic, especially since the PR motivation mentions reducing intermittent RemoteDisconnected errors.

Minor

4. Non-idempotent methods in rpcloud_client.py retry config allowed_methods includes POST/PATCH/DELETE. For a test client this is likely fine, but worth being aware that retrying non-idempotent requests on 502/503/504 could cause duplicate side effects.

What looks good

  • Using requests.Session with urllib3.Retry is the idiomatic way to handle this in Python — much cleaner than manual loops.
  • rpcloud_client.py has a well-configured retry with appropriate status_forcelist.
  • Restricting installer retries to HEAD/GET only is a safe choice.
  • The from time import sleep removal is correct since its only usage was in the deleted retry loop.

Regarding point 3. the scope of this change is specifically the RemoteDisconnected errors hitting _avail_for_download (S3 HEAD requests). The GitHub API call in _released_versions_json is a separate code path that wasn't part of the reported issue.

@rpdevmp

rpdevmp commented Mar 3, 2026

Copy link
Copy Markdown
Contributor

Hi Chintan,

Thank you for addressing and adding comments to review. Please resolve all of the conversations since you adressed those.

I see that tests are passing:
https://buildkite.com/redpanda/redpanda/builds/81249#019ca0f3-1e30-4960-96ce-56648ac91212

But CI still shows as failed. We need to double check to make sure this would not brake other things

@rpdevmp rpdevmp left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM
Thank you for resolving all comments and CI is now passing as well

@cjayani cjayani merged commit ae5f867 into dev Mar 3, 2026
20 checks passed
@cjayani cjayani deleted the DEVPROD-3638 branch March 3, 2026 16:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

claude-review Adding this label to a PR will trigger a workflow to review the code using claude.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants