feat(channels/webhook): add retry logic with exponential backoff#5838
feat(channels/webhook): add retry logic with exponential backoff#5838mn13 wants to merge 7 commits into
Conversation
063a4f4 to
2111658
Compare
theonlyhennygod
left a comment
There was a problem hiding this comment.
Comprehension Summary
This PR adds retry logic with exponential backoff and jitter to WebhookChannel::send, closing #5761. When outbound webhook sends encounter transient failures (network errors, HTTP 429, or 5xx), the channel now retries up to max_retries times with configurable exponential backoff, honoring Retry-After headers on 429/503. Non-429 4xx responses still fail immediately. Three new optional fields on WebhookConfig (max_retries, retry_base_delay_ms, retry_max_delay_ms) make the behavior tunable, with max_retries=0 restoring fire-and-forget semantics. The blast radius is limited to the webhook channel's outbound send path; no other channels, the inbound listener, or shared infrastructure are affected.
What's Good
- Clean, well-structured implementation that follows the existing Notion channel retry pattern.
- Excellent test coverage: happy path, 5xx retry+success, 4xx no-retry, 429 exhaustion,
Retry-Afterhonor on both 429 and 503,max_retries=0opt-out, backoff cap, andparse_retry_after_msedge cases. - Proper use of
wiremockfor HTTP-level integration tests with request count assertions. - Config fields are
Optionwith#[serde(default)], preserving full backward compatibility. - Delay clamping to
>=1msprevents busy-retry on misconfigured zero values. - Documentation updated in both
enandvilocales. - All CI checks pass. All PR-introduced tests pass locally.
Security / Performance Assessment
- Security: No security impact identified. No new permissions, no credential handling changes, no attack surface expansion. The retry loop is bounded by
max_retriesandretry_max_delay_ms, preventing unbounded resource consumption. - Performance: The retry loop holds the async send task longer during transient failures, but
Channel::sendis already spawned viatokio::spawnat call sites, so this does not block the agent loop. Therand::random::<f64>()call for jitter is lightweight. No new allocations on the hot path beyond the existing HTTP request. No binary size concern -- no new dependencies added.
Findings
[blocking] PR template not filled out
The PR body contains the raw template placeholders. The Summary section bullets, Validation Evidence, Security & Privacy Impact, Compatibility, and Rollback sections are all unfilled -- they still show the template prompts rather than actual answers. Per the reviewer playbook (Section 3.1) and PR workflow (Section 5.1), the template must be fully completed before review proceeds.
Action: Fill out all required PR template sections with actual content. Specifically:
- Summary: Fill the "What changed and why" bullets, scope boundary, blast radius, and linked issues (the
Closes #5761is in the intro paragraph but not in the template field). - Validation Evidence: Paste the tail output of
cargo fmt,cargo clippy, andcargo testruns. - Security & Privacy Impact: Answer Yes/No for each item.
- Compatibility: Answer the backward-compatible and config surface questions.
- Rollback: Document the rollback path (e.g.,
max_retries = 0in config, orgit revert).
[blocking] Missing risk:* and size:* labels
The PR currently only has the docs label. It is missing risk:* and size:* labels. Based on the changes (555 additions across channels/webhook.rs, config/schema.rs, runtime/, onboard/wizard.rs, and docs), this should be labeled approximately size: M and risk: medium (behavior change in crates/zeroclaw-channels/src/**).
Action: Apply risk: medium and size: M labels via the sidebar.
[suggestion] Config test functions use #[test] async fn instead of #[tokio::test]
In crates/zeroclaw-config/src/schema.rs, the two new test functions webhook_config_retry_fields_default_to_none (line 13043) and webhook_config_retry_fields_roundtrip (line 13052) are marked #[test] async fn. These tests happen to work because they contain no .await calls, so the synchronous code in the async block executes before the future is returned and dropped. However, this is misleading -- either remove async (these are purely synchronous tests) or use #[tokio::test].
Action: Change async fn to fn on both test functions, since they perform no async operations.
[suggestion] zh-CN channels-reference not updated with retry documentation
The PR correctly updated docs/reference/api/channels-reference.md (en) and docs/i18n/vi/channels-reference.md (vi) with the new retry config fields and behavior notes. However, docs/i18n/zh-CN/reference/api/channels-reference.zh-CN.md also has a webhook section (around line 262-272) that was not updated. Per docs/contributing/docs-contract.md, localized runtime-contract docs should be updated where equivalents exist.
Action: Add the retry config fields and behavior documentation to the zh-CN channels-reference webhook section, mirroring the changes made to the en and vi versions.
[suggestion] incoming_payload_without_thread test uses non-project-scoped sender name
In crates/zeroclaw-channels/src/webhook.rs line 597, the test incoming_payload_without_thread uses "bob" as the sender name. Per docs/contributing/pr-discipline.md, test fixtures should use neutral, project-scoped placeholders like zeroclaw_user, user_a, or test_user.
Action: Replace "bob" with a project-scoped placeholder (e.g., "zeroclaw_user" or "user_a").
Verdict: Needs Author Action
Thank you @mn13 for this well-implemented feature. The code quality is high, test coverage is thorough, and the design follows established patterns in the codebase. The two blocking items are procedural (template completion and labels), and the three suggestions are minor. Once the template is filled and labels applied, this will be in good shape for a follow-up review pass.
Must fix before re-review:
- Complete all required PR template sections.
- Apply
risk: mediumandsize: Mlabels.
Should fix:
3. Remove async from the two #[test] config test functions.
4. Update zh-CN channels-reference with retry documentation.
5. Replace "bob" with a project-scoped test placeholder.
|
Thanks for the thorough pass, @theonlyhennygod β addressed below. Template filled. PR body now covers Summary/Scope/Blast radius, Validation Evidence (with literal tail output of Labels (
Commit: 22f1f2f on |
79ff0de to
d8ba2ab
Compare
|
@mn13 Can you do a formal review request for me so it gets in my queue? |
|
mind self-assigning? or should I flag a maintainer |
|
@singlerider friendly bump β CI's green and all 2026-04-18 review items are addressed (commit Labels. Review request. I can't set reviewers from the sidebar either, so the PR isn't in anyone's queue. Mind kicking off a formal request β either to yourself or whoever's appropriate? Still open for reviewer discretion: the |
a6e4a5a to
0bcbf3b
Compare
0bcbf3b to
d637a5c
Compare
d637a5c to
4ae0bf4
Compare
67e4cb6 to
e951944
Compare
Audacity88
left a comment
There was a problem hiding this comment.
I re-reviewed the current head e951944, the linked issue #5761, @theonlyhennygod's earlier changes-requested review, the author's follow-up comments, the current PR body, the current labels, green GitHub checks, and the changed source paths. I checked the retry/backoff logic in crates/zeroclaw-channels/src/webhook.rs, the new WebhookConfig fields in crates/zeroclaw-config/src/schema.rs, and the updated call sites. I did not rerun local cargo.
Because @theonlyhennygod's CHANGES_REQUESTED review is still active, I am posting this as a comment rather than approving over that block.
β Resolved β prior review items look addressed
The earlier procedural blockers look resolved from the current public state. The PR body is now filled in, risk: medium and size: M are applied, and the test fixture no longer uses the old personal-name placeholder.
The two disputed suggestions also look reasonably handled. The #[test] async fn shape follows the existing schema.rs test-module convention because that module imports tokio::test, and the stale docs/i18n/... channels-reference paths from the earlier review do not exist in the current repo layout. The authorβs current i18n section now explains that the retry fields are documented through schema doc-comments rather than a separate localized channels-reference page.
π’ What looks good β retry behavior is bounded and covered
The retry loop itself looks sound. max_retries = 0 makes exactly one attempt, final failures bail without sleeping again, non-429 4xx responses fail immediately, 429 and 5xx responses retry, and Retry-After is honored for 429/503 with a runtime cap.
The tests cover the load-bearing paths: defaults and overrides, 5xx then success, 4xx no-retry, 429 exhaustion, Retry-After for both 429 and 503, max_retries = 0, and config serde defaults/roundtrip. I do not see a code blocker in the retry/backoff mechanics.
π‘ Warning β PR body overstates send-task isolation
One PR-body sentence still reads too strongly: the Blast radius section says Channel::send is already spawned via tokio::spawn at call sites, so the agent loop is not blocked. The current public diff proves the retry loop is finite and bounded, but the body should avoid claiming universal caller isolation unless that is the contract being reviewed and maintained across call sites.
That does not make the retry loop unsafe. The retries are finite, delays are bounded, and extending the send lifetime is the featureβs main tradeoff. I would just tighten the body so it says the retry loop can extend the lifetime of the outbound send task/caller, while the configured retry count and delay cap bound that effect.
π‘ Warning β retry_max_delay_ms is not currently a strict max after jitter
The schema doc says retry_max_delay_ms is the maximum delay cap for any single retry wait, and the PR body says waits are capped by that value. In the exponential-backoff path, the implementation caps the base delay first and then applies Β±25% jitter, so an exponential-backoff wait can exceed the configured max by up to 25%. The code comment and test acknowledge this, but the user-facing config wording is stricter than the behavior.
I would either make the cap strict by applying jitter before the final min(retry_max_delay_ms), or adjust the public wording to say the cap is approximate before jitter. I do not think this needs to block the feature, but the config contract should not surprise operators.
No new code blockers from my pass. The remaining question is whether @theonlyhennygod wants to clear or update the earlier formal review.
e951944 to
df3bb21
Compare
β¦is strict Addresses Audacity88's 2026-05-13 warning on zeroclaw-labs#5838. The schema doc-comment promises `retry_max_delay_ms` is the maximum delay cap for any single retry wait, but `compute_backoff` capped the base delay first and then applied Β±25% jitter, so the returned wait could exceed the configured cap by up to 25%. Reorder to `apply_jitter(base).min(retry_max_delay_ms)` β jitter is now absorbed by the strict cap, and the public wording matches the behavior. Update the function doc-comment to state the cap is strict, tighten the existing `backoff_capped_by_max_delay` test to assert equality at the cap, and add `backoff_never_exceeds_max_delay_near_cap` (256-draw fuzz) covering the previously-broken case where base β cap allowed jitter to push the wait above the cap.
|
@Audacity88 β addressed both 2026-05-13 warnings on Strict
Blast-radius wording softened. The PR body no longer claims universal caller isolation. It now says the retry loop can extend the outbound send task/caller's lifetime (bounded by CI re-running. |
Audacity88
left a comment
There was a problem hiding this comment.
I re-reviewed the current head 5e572f1, the linked issue #5761, the prior @theonlyhennygod CHANGES_REQUESTED review, @mn13's follow-up replies, the current PR body, current labels, the current public check snapshot, and the changed source paths. I checked the retry loop in crates/zeroclaw-channels/src/webhook.rs, the WebhookConfig schema additions in crates/zeroclaw-config/src/schema.rs, the orchestrator wiring, and the updated runtime test call site. I did not rerun local cargo; I relied on the current PR CI plus source inspection for this pass.
Because @theonlyhennygod's earlier CHANGES_REQUESTED review is still active, I am keeping this as a comment rather than approving over that block.
β Resolved β May 13 warning on send-task isolation
The Blast radius wording is now accurate. It says the retry loop can extend the lifetime of the outbound send task/caller, and it treats existing tokio::spawn isolation as a caller-side contract instead of something this PR guarantees universally. That resolves my earlier concern about the PR body overstating the async isolation contract.
β
Resolved β May 13 warning on retry_max_delay_ms
The latest commit makes retry_max_delay_ms a strict cap in the exponential-backoff path. compute_backoff now applies jitter first and then clamps with min(retry_max_delay_ms), so the returned delay cannot exceed the configured cap. The updated backoff_capped_by_max_delay assertion and the new near-cap loop cover the case that previously could escape the cap.
π’ What looks good β retry behavior remains bounded
The retry behavior still looks bounded and targeted: max_retries = 0 preserves one-shot behavior, final failures do not sleep again, non-429 4xx responses fail immediately, transient network errors and 429/5xx responses retry, and numeric/delay-seconds Retry-After is honored for 429/503 with the runtime cap applied.
The config additions are optional with serde defaults, and the orchestrator passes the new fields through when constructing the webhook channel. I do not see a new webhook-code blocker from this pass.
π‘ Warning β current red Security check appears covered by #6662
The only red check I see on this PR is the Security job / CI Required Gate. I am not including the raw Security job log in this comment, but the available public #6657/#6662 context strongly points to that failure being the lettre 0.11.21 / RUSTSEC-2026-0141 advisory, not a dependency introduced by this webhook retry PR.
#6662 updates the workspace and channel crate manifest floor plus Cargo.lock to lettre 0.11.22, and its PR body reports cargo deny check passing after that update. So I would treat the current red Security status here as base-branch dependency fallout that should clear once #6662, or equivalent lettre remediation, lands and this branch is rebased or refreshed onto current master.
After #6662 lands and this PR gets a fresh green run, the remaining review-state question is whether @theonlyhennygod wants to clear or update the earlier formal review.
|
@theonlyhennygod β gentle ping for a re-review pass when you have time. Your 2026-04-18 Blocking items
Suggestions Two follow-up commits since your review also tightened Happy to address anything further you flag. |
67b3db9 to
b222221
Compare
β¦is strict Addresses Audacity88's 2026-05-13 warning on zeroclaw-labs#5838. The schema doc-comment promises `retry_max_delay_ms` is the maximum delay cap for any single retry wait, but `compute_backoff` capped the base delay first and then applied Β±25% jitter, so the returned wait could exceed the configured cap by up to 25%. Reorder to `apply_jitter(base).min(retry_max_delay_ms)` β jitter is now absorbed by the strict cap, and the public wording matches the behavior. Update the function doc-comment to state the cap is strict, tighten the existing `backoff_capped_by_max_delay` test to assert equality at the cap, and add `backoff_never_exceeds_max_delay_near_cap` (256-draw fuzz) covering the previously-broken case where base β cap allowed jitter to push the wait above the cap.
β¦utbound sends Closes zeroclaw-labs#5761. WebhookChannel::send now retries transient failures (network errors, 429, 5xx) with exponential backoff + jitter, honoring Retry-After on 429/503 and capping waits by retry_max_delay_ms. Non-429 4xx responses still fail immediately. Tunable via three optional WebhookConfig fields (max_retries=3, retry_base_delay_ms=500, retry_max_delay_ms=30000); set max_retries=0 to keep the previous fire-and-forget behavior.
Post-review cleanup on dd93422: - Drop dead `last_err` state and replace unreachable tail `bail!` with `unreachable!` β the loop always exits via return/bail on the final attempt, so the trailing error was never reached. - Expand comment above the 429/503 Retry-After branch to explain why 429 appears twice: once here (server-supplied delay) and once in the fallback exponential-backoff branch when no Retry-After header was sent. - Note in `compute_backoff` doc that the `retry_max_delay_ms` cap is approximate β Β±25% jitter is applied after capping. - Add `send_honors_retry_after_on_503` wiremock test mirroring the 429 case, closing a gap where only 429+Retry-After was covered. - Add schema tests for the 3 new `WebhookConfig` retry fields: `webhook_config_retry_fields_default_to_none` and `webhook_config_retry_fields_roundtrip` (JSON + TOML).
After the rebase onto master, two WebhookConfig { ... } struct literals
added in fe3ec58 (zeroclaw-labs#5799) no longer compile without the three retry
fields introduced in dd93422. Both are test-only call-sites, so `None`
is correct for each.
- Defer reading the response body until after the Retry-After early return. Hot 429/503 loops against servers that echo large error pages no longer pay the text() I/O cost per attempt. - Merge the two `impl WebhookChannel` blocks so `attempt_send` lives alongside the other helpers in a single block. - Clamp `retry_base_delay_ms` and `retry_max_delay_ms` to a 1ms minimum in `WebhookChannel::new` so a misconfigured `0` cannot busy-retry without yielding. Documented on the schema fields.
Per `docs/contributing/pr-discipline.md`, test fixtures should use neutral, project-scoped placeholders. Replace `bob` with `zeroclaw_user` in `incoming_payload_without_thread`.
β¦is strict Addresses Audacity88's 2026-05-13 warning on zeroclaw-labs#5838. The schema doc-comment promises `retry_max_delay_ms` is the maximum delay cap for any single retry wait, but `compute_backoff` capped the base delay first and then applied Β±25% jitter, so the returned wait could exceed the configured cap by up to 25%. Reorder to `apply_jitter(base).min(retry_max_delay_ms)` β jitter is now absorbed by the strict cap, and the public wording matches the behavior. Update the function doc-comment to state the cap is strict, tighten the existing `backoff_capped_by_max_delay` test to assert equality at the cap, and add `backoff_never_exceeds_max_delay_near_cap` (256-draw fuzz) covering the previously-broken case where base β cap allowed jitter to push the wait above the cap.
7f88441 to
1f91920
Compare
WareWolf-MoonWall
left a comment
There was a problem hiding this comment.
I read the full diff at current head (including the jitter-cap tightening in 5e572f1 and the announcement-wiring in 67b3db9), the PR body, the linked issue #5761, @theonlyhennygod's CHANGES_REQUESTED review (on 2111658), @Audacity88's two re-review comments (on e951944 and 5e572f1), and @mn13's latest ping. I also read FND-006 Β§4.1 and Β§4.3.
State check. @theonlyhennygod holds an active CHANGES_REQUESTED. I have no blocking items of my own; my findings below are acknowledgements and suggestions. Per protocol I am posting as a comment.
β Resolved β PR template, labels, and fixture placeholder
All five items from @theonlyhennygod's original review are accounted for in the current head: the PR body is fully completed, risk: medium and size: M labels are applied, the "bob" fixture is now "zeroclaw_user" (df3bb21), the #[test] async fn shape follows the schema.rs module convention where tokio::test is already in scope, and the disputed zh-CN channels-reference path does not exist in the current repo layout.
β
Resolved β retry_max_delay_ms cap strictness
@Audacity88's 2026-05-13 warning about jitter potentially exceeding the configured cap is addressed in 5e572f1. compute_backoff now applies jitter to the uncapped base delay and then clamps the jittered value with .min(self.retry_max_delay_ms), making the cap strict. The backoff_never_exceeds_max_delay_near_cap test runs 256 draws and asserts no draw exceeds the cap β the correct way to confirm this.
β Resolved β Blast radius wording
The PR body sentence about tokio::spawn caller isolation is now accurate: it presents the isolation as a caller-side contract rather than something this PR guarantees universally. @Audacity88's 2026-05-14 comment confirms this. No further action needed there.
π’ What looks good β Error discrimination is correct
Non-429 4xx responses β AttemptOutcome::Fatal (no retry). 429 and 5xx β AttemptOutcome::Retry (exponential backoff). 429/503 with Retry-After β AttemptOutcome::RetryAfter (server-supplied delay, capped). The test suite covers all four branches including Retry-After exhaustion. This is the right behavioral contract for a well-behaved HTTP client.
π’ What looks good β Bounds and defaults
max_retries.saturating_add(1) prevents the loop from running zero times when max_retries = 0 (one attempt, no sleep, which is the intended "restore prior behavior" semantics). The >=1ms clamp on delays prevents a misconfigured zero from busy-retrying without yielding. The three config fields are all Option with #[serde(default)], preserving backward compatibility for every existing deployment.
π’ What looks good β unreachable!() at loop tail
The unreachable!("send loop exits via return or bail on the final attempt") is correct usage per FND-006 Β§4.1. The loop invariant makes it logically impossible to fall through, and the message documents exactly why. This is the difference between documented judgment and deferred judgment.
π΅ Suggestion β #[allow(clippy::too_many_arguments)] on WebhookChannel::new
WebhookChannel::new now takes nine parameters. Suppressing clippy::too_many_arguments hides a real signal: the constructor is a candidate for a WebhookChannelConfig builder or a struct-literal approach, which would also make call sites in the orchestrator and delivery path easier to read. This is not blocking β the current shape is functional and the annotation is honest about what it does β but consider whether a config struct would be cleaner before this goes in, especially since the config fields already have a natural grouping in WebhookConfig.
π΅ Suggestion β #[allow(clippy::cast_possible_truncation, clippy::cast_sign_loss)] in apply_jitter
The cast from f64 to u64 in apply_jitter is safe here because delay_ms is bounded by retry_max_delay_ms (a u64) before jitter is applied, and the jitter factor [0.75, 1.25] cannot produce a value that overflows a u64 at any realistic delay. A one-line inline comment explaining that bound would make the suppression self-documenting for future readers: // Safe: delay_ms <= retry_max_delay_ms (u64); jitter factor [0.75, 1.25] cannot overflow. cast_possible_truncation was already present in this file at line 222; I am only noting the value of documenting the invariant.
Summary for @theonlyhennygod. From my read of the current head, all five items from your original review appear resolved. The two suggestions above are mine alone and are non-blocking. This PR is waiting on your re-review pass; once you update or dismiss your CHANGES_REQUESTED, it is ready to move forward.
Milestone note. feat(channels/webhook): maps to the channels scope. Flagging for @JordanTheJet to confirm milestone assignment.

Closes #5761.
WebhookChannel::send now retries transient failures (network errors, 429, 5xx) with exponential backoff + jitter, honoring Retry-After on 429/503 and capping waits by retry_max_delay_ms. Non-429 4xx responses still fail immediately. Tunable via three optional WebhookConfig fields (max_retries=3, retry_base_delay_ms=500, retry_max_delay_ms=30000); set max_retries=0 to keep the previous fire-and-forget behavior.
Summary
mastermax_retries,retry_base_delay_ms,retry_max_delay_mstoWebhookConfig(allOption,#[serde(default)]) so operators can tune or disable the behavior per-deployment.Retry-Afteris honored on 429/503 (integer seconds, clamped byretry_max_delay_ms) to avoid hammering rate-limited endpoints.max_retries = 0preserves the prior fire-and-forget contract byte-for-byte for anyone who prefers it.WebhookChannel::sendpath and its config surface.max_retriesandretry_max_delay_mscap bound that effect. Most existing call sites already wrapChannel::sendintokio::spawn, so they keep the agent loop off the wait β but that isolation is a caller-side contract, not something this PR guarantees universally.Validation Evidence (required)
wiremock-backed integration tests (happy path, 5xx-then-success, 4xx no-retry, 429 exhaustion,Retry-Afteron both 429 and 503,max_retries=0opt-out, backoff cap). Not verified: live delivery against a real upstream webhook receiver β mocked only.cargo testis redundant here since changes are confined tozeroclaw-channels::webhookand the three newWebhookConfigfields inzeroclaw-config; ran targeted suites for both plus fmt/clippy across all targets.Security & Privacy Impact (required)
NoNoβ same HTTP POST that already existed, just retried on transient failures.NoNoβ test fixture renamed from"bob"to"zeroclaw_user"to satisfy the project-scoped-placeholder rule.Yes, describe the risk and mitigation: N.A.Compatibility (required)
YesYesβ three newOptionfields onWebhookConfig(max_retries,retry_base_delay_ms,retry_max_delay_ms); all default toNoneand use#[serde(default)].max_retries=3,retry_base_delay_ms=500,retry_max_delay_ms=30000. To restore the previous single-shot behavior explicitly, setmax_retries = 0.Rollback (required for
risk: mediumandrisk: high)max_retries = 0in[channels_config.webhook]to disable retries without redeploying. Hard rollback:git revert <merge-sha>.max_retries = 0is the runtime kill-switch.WebhookChannel::send(tracing targetzeroclaw_channels::webhook). Watch for unusual tail latency on the sender task β a pathological combination ofmax_retriesandretry_max_delay_mscould stretch a single send to tens of seconds.i18n Follow-Through (required only when docs or user-facing wording change)
This PR adds no doc changes β
git diff origin/master..HEAD --stat -- docs/is empty. Retry tunables and semantics are documented inline as///doc-comments onWebhookConfigfields incrates/zeroclaw-config/src/schema.rs(lines 7480β7491), which feed--print-config/zc config showand the JSON schema export.README*,docs/README*, anddocs/SUMMARY.mdfor supported locales?N.A.β no doc pages added or renamed.N.A.β no equivalent files exist in this repo. The book uses mdbook with.po-file translations (docs/book/po/{es,fr,ja,zh-CN}.po), not separatedocs/i18n/<locale>/...doc trees;find docs -name "*channels-reference*"returns no results anddocs/i18n/does not exist. The original review's referenced paths (docs/reference/api/channels-reference.md,docs/i18n/vi/...,docs/i18n/zh-CN/...) are stale from a prior repo layout.docs/i18n/vi/**synced and compatibility shims underdocs/*.vi.mdvalidated?N.A.βdocs/i18n/vi/does not exist in this repo.N.A., explain scope decision: This PR is a behavior-and-config change with no user-facing wording changes; the doc surface is the schema doc-comments. The end-user-facing book entry atdocs/book/src/channels/webhook.mdis intentionally unchanged here β it currently describes a different config layout ([channels.webhooks.<name>]plural) than the schema ([channels.webhook]singular), and reconciling that pre-existing drift is out of scope for this retry-logic PR.