Skip to content

feat(channels/webhook): add retry logic with exponential backoff#5838

Open
mn13 wants to merge 7 commits into
zeroclaw-labs:masterfrom
mn13:feat/webhook-retry-logic
Open

feat(channels/webhook): add retry logic with exponential backoff#5838
mn13 wants to merge 7 commits into
zeroclaw-labs:masterfrom
mn13:feat/webhook-retry-logic

Conversation

@mn13
Copy link
Copy Markdown
Contributor

@mn13 mn13 commented Apr 17, 2026

Closes #5761.

WebhookChannel::send now retries transient failures (network errors, 429, 5xx) with exponential backoff + jitter, honoring Retry-After on 429/503 and capping waits by retry_max_delay_ms. Non-429 4xx responses still fail immediately. Tunable via three optional WebhookConfig fields (max_retries=3, retry_base_delay_ms=500, retry_max_delay_ms=30000); set max_retries=0 to keep the previous fire-and-forget behavior.

Summary

  • Base branch: master
  • What changed and why:
    • Outbound webhook sends silently dropped on transient upstream blips; added a bounded retry loop (exponential backoff + Β±25% jitter) so users get at-least-one-retry semantics without rewiring call sites.
    • Added max_retries, retry_base_delay_ms, retry_max_delay_ms to WebhookConfig (all Option, #[serde(default)]) so operators can tune or disable the behavior per-deployment.
    • Retry-After is honored on 429/503 (integer seconds, clamped by retry_max_delay_ms) to avoid hammering rate-limited endpoints.
    • max_retries = 0 preserves the prior fire-and-forget contract byte-for-byte for anyone who prefers it.
  • Scope boundary: Does not touch inbound webhook handling, other channels, or the gateway transport β€” only the outbound WebhookChannel::send path and its config surface.
  • Blast radius: Webhook outbound users only. The retry loop can extend the lifetime of the outbound send task/caller; the configured max_retries and retry_max_delay_ms cap bound that effect. Most existing call sites already wrap Channel::send in tokio::spawn, so they keep the agent loop off the wait β€” but that isolation is a caller-side contract, not something this PR guarantees universally.
  • Linked issue(s): Closes [Feature]: Add retry logic with exponential backoff to webhook channel outbound sendsΒ #5761.

Validation Evidence (required)

cargo fmt --all -- --check
cargo clippy --all-targets -- -D warnings
cargo test -p zeroclaw-channels --lib webhook::
cargo test -p zeroclaw-config --lib webhook_config
bash scripts/ci/docs_quality_gate.sh
  • Commands run and tail output:
$ cargo fmt --all -- --check
(clean β€” no output)

$ cargo clippy --all-targets -- -D warnings
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 57.04s

$ cargo test -p zeroclaw-channels --lib webhook::
test result: ok. 27 passed; 0 failed; 0 ignored; 0 measured; 914 filtered out; finished in 1.04s

$ cargo test -p zeroclaw-config --lib webhook_config
test result: ok. 4 passed; 0 failed; 0 ignored; 0 measured; 551 filtered out; finished in 0.02s

$ bash scripts/ci/docs_quality_gate.sh
No blocking markdown issues on changed lines.
  • Beyond CI β€” what did you manually verify? Exercised the new behavior with wiremock-backed integration tests (happy path, 5xx-then-success, 4xx no-retry, 429 exhaustion, Retry-After on both 429 and 503, max_retries=0 opt-out, backoff cap). Not verified: live delivery against a real upstream webhook receiver β€” mocked only.
  • If any command was intentionally skipped, why: Workspace-wide cargo test is redundant here since changes are confined to zeroclaw-channels::webhook and the three new WebhookConfig fields in zeroclaw-config; ran targeted suites for both plus fmt/clippy across all targets.

Security & Privacy Impact (required)

  • New permissions, capabilities, or file system access scope? No
  • New external network calls? No β€” same HTTP POST that already existed, just retried on transient failures.
  • Secrets / tokens / credentials handling changed? No
  • PII, real identities, or personal data in diff, tests, fixtures, or docs? No β€” test fixture renamed from "bob" to "zeroclaw_user" to satisfy the project-scoped-placeholder rule.
  • If any Yes, describe the risk and mitigation: N.A.

Compatibility (required)

  • Backward compatible? Yes
  • Config / env / CLI surface changed? Yes β€” three new Option fields on WebhookConfig (max_retries, retry_base_delay_ms, retry_max_delay_ms); all default to None and use #[serde(default)].
  • Upgrade steps: none. Existing configs continue to work; defaults are max_retries=3, retry_base_delay_ms=500, retry_max_delay_ms=30000. To restore the previous single-shot behavior explicitly, set max_retries = 0.

Rollback (required for risk: medium and risk: high)

  • Fast rollback command/path: Set max_retries = 0 in [channels_config.webhook] to disable retries without redeploying. Hard rollback: git revert <merge-sha>.
  • Feature flags or config toggles: max_retries = 0 is the runtime kill-switch.
  • Observable failure symptoms: Sustained webhook delivery failures would show as elevated error logs from WebhookChannel::send (tracing target zeroclaw_channels::webhook). Watch for unusual tail latency on the sender task β€” a pathological combination of max_retries and retry_max_delay_ms could stretch a single send to tens of seconds.

i18n Follow-Through (required only when docs or user-facing wording change)

This PR adds no doc changes β€” git diff origin/master..HEAD --stat -- docs/ is empty. Retry tunables and semantics are documented inline as /// doc-comments on WebhookConfig fields in crates/zeroclaw-config/src/schema.rs (lines 7480–7491), which feed --print-config / zc config show and the JSON schema export.

  • Locale navigation parity updated in README*, docs/README*, and docs/SUMMARY.md for supported locales? N.A. β€” no doc pages added or renamed.
  • Localized runtime-contract docs updated where equivalents exist? N.A. β€” no equivalent files exist in this repo. The book uses mdbook with .po-file translations (docs/book/po/{es,fr,ja,zh-CN}.po), not separate docs/i18n/<locale>/... doc trees; find docs -name "*channels-reference*" returns no results and docs/i18n/ does not exist. The original review's referenced paths (docs/reference/api/channels-reference.md, docs/i18n/vi/..., docs/i18n/zh-CN/...) are stale from a prior repo layout.
  • Vietnamese canonical docs under docs/i18n/vi/** synced and compatibility shims under docs/*.vi.md validated? N.A. β€” docs/i18n/vi/ does not exist in this repo.
  • If any N.A., explain scope decision: This PR is a behavior-and-config change with no user-facing wording changes; the doc surface is the schema doc-comments. The end-user-facing book entry at docs/book/src/channels/webhook.md is intentionally unchanged here β€” it currently describes a different config layout ([channels.webhooks.<name>] plural) than the schema ([channels.webhook] singular), and reconciling that pre-existing drift is out of scope for this retry-logic PR.

@github-actions github-actions Bot added the docs Auto scope: docs/markdown/template files changed. label Apr 17, 2026
@mn13 mn13 force-pushed the feat/webhook-retry-logic branch 2 times, most recently from 063a4f4 to 2111658 Compare April 17, 2026 15:01
@mn13
Copy link
Copy Markdown
Contributor Author

mn13 commented Apr 17, 2026

@singlerider

@theonlyhennygod theonlyhennygod self-assigned this Apr 18, 2026
Copy link
Copy Markdown
Collaborator

@theonlyhennygod theonlyhennygod left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comprehension Summary

This PR adds retry logic with exponential backoff and jitter to WebhookChannel::send, closing #5761. When outbound webhook sends encounter transient failures (network errors, HTTP 429, or 5xx), the channel now retries up to max_retries times with configurable exponential backoff, honoring Retry-After headers on 429/503. Non-429 4xx responses still fail immediately. Three new optional fields on WebhookConfig (max_retries, retry_base_delay_ms, retry_max_delay_ms) make the behavior tunable, with max_retries=0 restoring fire-and-forget semantics. The blast radius is limited to the webhook channel's outbound send path; no other channels, the inbound listener, or shared infrastructure are affected.

What's Good

  • Clean, well-structured implementation that follows the existing Notion channel retry pattern.
  • Excellent test coverage: happy path, 5xx retry+success, 4xx no-retry, 429 exhaustion, Retry-After honor on both 429 and 503, max_retries=0 opt-out, backoff cap, and parse_retry_after_ms edge cases.
  • Proper use of wiremock for HTTP-level integration tests with request count assertions.
  • Config fields are Option with #[serde(default)], preserving full backward compatibility.
  • Delay clamping to >=1ms prevents busy-retry on misconfigured zero values.
  • Documentation updated in both en and vi locales.
  • All CI checks pass. All PR-introduced tests pass locally.

Security / Performance Assessment

  • Security: No security impact identified. No new permissions, no credential handling changes, no attack surface expansion. The retry loop is bounded by max_retries and retry_max_delay_ms, preventing unbounded resource consumption.
  • Performance: The retry loop holds the async send task longer during transient failures, but Channel::send is already spawned via tokio::spawn at call sites, so this does not block the agent loop. The rand::random::<f64>() call for jitter is lightweight. No new allocations on the hot path beyond the existing HTTP request. No binary size concern -- no new dependencies added.

Findings

[blocking] PR template not filled out

The PR body contains the raw template placeholders. The Summary section bullets, Validation Evidence, Security & Privacy Impact, Compatibility, and Rollback sections are all unfilled -- they still show the template prompts rather than actual answers. Per the reviewer playbook (Section 3.1) and PR workflow (Section 5.1), the template must be fully completed before review proceeds.

Action: Fill out all required PR template sections with actual content. Specifically:

  1. Summary: Fill the "What changed and why" bullets, scope boundary, blast radius, and linked issues (the Closes #5761 is in the intro paragraph but not in the template field).
  2. Validation Evidence: Paste the tail output of cargo fmt, cargo clippy, and cargo test runs.
  3. Security & Privacy Impact: Answer Yes/No for each item.
  4. Compatibility: Answer the backward-compatible and config surface questions.
  5. Rollback: Document the rollback path (e.g., max_retries = 0 in config, or git revert).

[blocking] Missing risk:* and size:* labels

The PR currently only has the docs label. It is missing risk:* and size:* labels. Based on the changes (555 additions across channels/webhook.rs, config/schema.rs, runtime/, onboard/wizard.rs, and docs), this should be labeled approximately size: M and risk: medium (behavior change in crates/zeroclaw-channels/src/**).

Action: Apply risk: medium and size: M labels via the sidebar.

[suggestion] Config test functions use #[test] async fn instead of #[tokio::test]

In crates/zeroclaw-config/src/schema.rs, the two new test functions webhook_config_retry_fields_default_to_none (line 13043) and webhook_config_retry_fields_roundtrip (line 13052) are marked #[test] async fn. These tests happen to work because they contain no .await calls, so the synchronous code in the async block executes before the future is returned and dropped. However, this is misleading -- either remove async (these are purely synchronous tests) or use #[tokio::test].

Action: Change async fn to fn on both test functions, since they perform no async operations.

[suggestion] zh-CN channels-reference not updated with retry documentation

The PR correctly updated docs/reference/api/channels-reference.md (en) and docs/i18n/vi/channels-reference.md (vi) with the new retry config fields and behavior notes. However, docs/i18n/zh-CN/reference/api/channels-reference.zh-CN.md also has a webhook section (around line 262-272) that was not updated. Per docs/contributing/docs-contract.md, localized runtime-contract docs should be updated where equivalents exist.

Action: Add the retry config fields and behavior documentation to the zh-CN channels-reference webhook section, mirroring the changes made to the en and vi versions.

[suggestion] incoming_payload_without_thread test uses non-project-scoped sender name

In crates/zeroclaw-channels/src/webhook.rs line 597, the test incoming_payload_without_thread uses "bob" as the sender name. Per docs/contributing/pr-discipline.md, test fixtures should use neutral, project-scoped placeholders like zeroclaw_user, user_a, or test_user.

Action: Replace "bob" with a project-scoped placeholder (e.g., "zeroclaw_user" or "user_a").

Verdict: Needs Author Action

Thank you @mn13 for this well-implemented feature. The code quality is high, test coverage is thorough, and the design follows established patterns in the codebase. The two blocking items are procedural (template completion and labels), and the three suggestions are minor. Once the template is filled and labels applied, this will be in good shape for a follow-up review pass.

Must fix before re-review:

  1. Complete all required PR template sections.
  2. Apply risk: medium and size: M labels.

Should fix:
3. Remove async from the two #[test] config test functions.
4. Update zh-CN channels-reference with retry documentation.
5. Replace "bob" with a project-scoped test placeholder.

@mn13
Copy link
Copy Markdown
Contributor Author

mn13 commented Apr 18, 2026

Thanks for the thorough pass, @theonlyhennygod β€” addressed below.

Template filled. PR body now covers Summary/Scope/Blast radius, Validation Evidence (with literal tail output of fmt --check, clippy -D warnings, the two targeted cargo test suites, and the docs gate), Security/Privacy, Compatibility, Rollback, and i18n follow-through. Closes #5761 also lifted into the template's Linked issues field.

Labels (risk: medium, size: M). I tried gh pr edit --add-label but got mn13 does not have the correct permissions to execute AddLabelsToLabelable β€” could a maintainer apply these? Happy with risk: medium / size: M as the reviewer classification.

"bob" fixture β†’ renamed to "zeroclaw_user" in crates/zeroclaw-channels/src/webhook.rs (incoming_payload_without_thread, both the JSON input and the assertion).

zh-CN channels-reference. Retry tunables comment block and the five-bullet behavior notes added to docs/i18n/zh-CN/reference/api/channels-reference.zh-CN.md Β§4.8, mirroring the existing en and vi copies.

#[test] async fn on the two config tests β€” keeping as-is. The tests module at crates/zeroclaw-config/src/schema.rs:11212 has use tokio::test;, which shadows std's #[test]. So #[test] async fn in that module already is #[tokio::test] β€” dropping async fails to compile (error: the async keyword is missing from the function declaration). Every other test in that module follows the same #[test] async fn convention, so converting only these two to explicit #[tokio::test] would break the module's consistency rather than improve readability. Let me know if you'd prefer I migrate the whole module off the shadowing import as a follow-up β€” that's a bigger scope change than this PR warrants.

Commit: 22f1f2f on feat/webhook-retry-logic.

@mn13 mn13 requested a review from theonlyhennygod April 18, 2026 14:14
@mn13
Copy link
Copy Markdown
Contributor Author

mn13 commented Apr 20, 2026

@singlerider

@mn13 mn13 force-pushed the feat/webhook-retry-logic branch from 79ff0de to d8ba2ab Compare April 20, 2026 09:06
@singlerider
Copy link
Copy Markdown
Collaborator

@mn13 Can you do a formal review request for me so it gets in my queue?

@mn13
Copy link
Copy Markdown
Contributor Author

mn13 commented Apr 20, 2026

image I would be happy but I can't

@mn13
Copy link
Copy Markdown
Contributor Author

mn13 commented Apr 20, 2026

mind self-assigning? or should I flag a maintainer

@singlerider singlerider added size: M Auto size: 251-500 non-doc changed lines. risk: medium Auto risk: src/** or dependency/config changes. labels Apr 22, 2026
@singlerider singlerider self-assigned this Apr 22, 2026
@mn13
Copy link
Copy Markdown
Contributor Author

mn13 commented Apr 23, 2026

@singlerider friendly bump β€” CI's green and all 2026-04-18 review items are addressed (commit 22f1f2f7). Two things still need maintainer hands:

Labels. risk: medium + size: M per @theonlyhennygod's call β€” gh pr edit --add-label still returns mn13 does not have the correct permissions to execute AddLabelsToLabelable.

Review request. I can't set reviewers from the sidebar either, so the PR isn't in anyone's queue. Mind kicking off a formal request β€” either to yourself or whoever's appropriate?

Still open for reviewer discretion: the #[test] async fn pushback above β€” use tokio::test; at crates/zeroclaw-config/src/schema.rs:11212 shadows #[test] module-wide, so those two tests already are #[tokio::test] and dropping async fails to compile. Happy to migrate the whole module off the shadowing import as a follow-up, but that's outside this PR's scope. No action needed unless @theonlyhennygod wants to weigh in.

@mn13 mn13 force-pushed the feat/webhook-retry-logic branch 2 times, most recently from a6e4a5a to 0bcbf3b Compare April 25, 2026 11:50
@singlerider singlerider added enhancement New feature or request channel Auto scope: src/channels/** changed. needs-author-action Author action required before merge labels Apr 29, 2026
@mn13 mn13 force-pushed the feat/webhook-retry-logic branch from 0bcbf3b to d637a5c Compare May 4, 2026 08:00
@github-actions github-actions Bot removed docs Auto scope: docs/markdown/template files changed. channel Auto scope: src/channels/** changed. labels May 4, 2026
@mn13 mn13 force-pushed the feat/webhook-retry-logic branch from d637a5c to 4ae0bf4 Compare May 4, 2026 12:42
@mn13 mn13 force-pushed the feat/webhook-retry-logic branch from 67e4cb6 to e951944 Compare May 13, 2026 15:00
Copy link
Copy Markdown
Collaborator

@Audacity88 Audacity88 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I re-reviewed the current head e951944, the linked issue #5761, @theonlyhennygod's earlier changes-requested review, the author's follow-up comments, the current PR body, the current labels, green GitHub checks, and the changed source paths. I checked the retry/backoff logic in crates/zeroclaw-channels/src/webhook.rs, the new WebhookConfig fields in crates/zeroclaw-config/src/schema.rs, and the updated call sites. I did not rerun local cargo.

Because @theonlyhennygod's CHANGES_REQUESTED review is still active, I am posting this as a comment rather than approving over that block.

βœ… Resolved β€” prior review items look addressed

The earlier procedural blockers look resolved from the current public state. The PR body is now filled in, risk: medium and size: M are applied, and the test fixture no longer uses the old personal-name placeholder.

The two disputed suggestions also look reasonably handled. The #[test] async fn shape follows the existing schema.rs test-module convention because that module imports tokio::test, and the stale docs/i18n/... channels-reference paths from the earlier review do not exist in the current repo layout. The author’s current i18n section now explains that the retry fields are documented through schema doc-comments rather than a separate localized channels-reference page.

🟒 What looks good β€” retry behavior is bounded and covered

The retry loop itself looks sound. max_retries = 0 makes exactly one attempt, final failures bail without sleeping again, non-429 4xx responses fail immediately, 429 and 5xx responses retry, and Retry-After is honored for 429/503 with a runtime cap.

The tests cover the load-bearing paths: defaults and overrides, 5xx then success, 4xx no-retry, 429 exhaustion, Retry-After for both 429 and 503, max_retries = 0, and config serde defaults/roundtrip. I do not see a code blocker in the retry/backoff mechanics.

🟑 Warning β€” PR body overstates send-task isolation

One PR-body sentence still reads too strongly: the Blast radius section says Channel::send is already spawned via tokio::spawn at call sites, so the agent loop is not blocked. The current public diff proves the retry loop is finite and bounded, but the body should avoid claiming universal caller isolation unless that is the contract being reviewed and maintained across call sites.

That does not make the retry loop unsafe. The retries are finite, delays are bounded, and extending the send lifetime is the feature’s main tradeoff. I would just tighten the body so it says the retry loop can extend the lifetime of the outbound send task/caller, while the configured retry count and delay cap bound that effect.

🟑 Warning β€” retry_max_delay_ms is not currently a strict max after jitter

The schema doc says retry_max_delay_ms is the maximum delay cap for any single retry wait, and the PR body says waits are capped by that value. In the exponential-backoff path, the implementation caps the base delay first and then applies Β±25% jitter, so an exponential-backoff wait can exceed the configured max by up to 25%. The code comment and test acknowledge this, but the user-facing config wording is stricter than the behavior.

I would either make the cap strict by applying jitter before the final min(retry_max_delay_ms), or adjust the public wording to say the cap is approximate before jitter. I do not think this needs to block the feature, but the config contract should not surprise operators.

No new code blockers from my pass. The remaining question is whether @theonlyhennygod wants to clear or update the earlier formal review.

@Audacity88 Audacity88 removed the needs-author-action Author action required before merge label May 13, 2026
@mn13 mn13 force-pushed the feat/webhook-retry-logic branch from e951944 to df3bb21 Compare May 14, 2026 07:24
mn13 added a commit to mn13/zeroclaw that referenced this pull request May 14, 2026
…is strict

Addresses Audacity88's 2026-05-13 warning on zeroclaw-labs#5838.

The schema doc-comment promises `retry_max_delay_ms` is the maximum delay
cap for any single retry wait, but `compute_backoff` capped the base delay
first and then applied Β±25% jitter, so the returned wait could exceed the
configured cap by up to 25%.

Reorder to `apply_jitter(base).min(retry_max_delay_ms)` β€” jitter is now
absorbed by the strict cap, and the public wording matches the behavior.
Update the function doc-comment to state the cap is strict, tighten the
existing `backoff_capped_by_max_delay` test to assert equality at the cap,
and add `backoff_never_exceeds_max_delay_near_cap` (256-draw fuzz) covering
the previously-broken case where base β‰ˆ cap allowed jitter to push the
wait above the cap.
@mn13
Copy link
Copy Markdown
Contributor Author

mn13 commented May 14, 2026

@Audacity88 β€” addressed both 2026-05-13 warnings on 5e572f1a2.

Strict retry_max_delay_ms cap. Reordered compute_backoff to apply_jitter(base).min(retry_max_delay_ms) so jitter is absorbed by the cap rather than escaping it. The function doc-comment now states the cap is strict, and the cap tests are tightened:

  • backoff_capped_by_max_delay asserts equality at the cap (attempt=10 with base >> cap).
  • New backoff_never_exceeds_max_delay_near_cap fuzzes 256 draws at attempt=1 where base β‰ˆ cap β€” the case where jitter previously could push the wait above the cap.

Blast-radius wording softened. The PR body no longer claims universal caller isolation. It now says the retry loop can extend the outbound send task/caller's lifetime (bounded by max_retries and retry_max_delay_ms), and that the tokio::spawn wrap at existing call sites is a caller-side contract rather than something this PR guarantees.

CI re-running.

Copy link
Copy Markdown
Collaborator

@Audacity88 Audacity88 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I re-reviewed the current head 5e572f1, the linked issue #5761, the prior @theonlyhennygod CHANGES_REQUESTED review, @mn13's follow-up replies, the current PR body, current labels, the current public check snapshot, and the changed source paths. I checked the retry loop in crates/zeroclaw-channels/src/webhook.rs, the WebhookConfig schema additions in crates/zeroclaw-config/src/schema.rs, the orchestrator wiring, and the updated runtime test call site. I did not rerun local cargo; I relied on the current PR CI plus source inspection for this pass.

Because @theonlyhennygod's earlier CHANGES_REQUESTED review is still active, I am keeping this as a comment rather than approving over that block.

βœ… Resolved β€” May 13 warning on send-task isolation

The Blast radius wording is now accurate. It says the retry loop can extend the lifetime of the outbound send task/caller, and it treats existing tokio::spawn isolation as a caller-side contract instead of something this PR guarantees universally. That resolves my earlier concern about the PR body overstating the async isolation contract.

βœ… Resolved β€” May 13 warning on retry_max_delay_ms

The latest commit makes retry_max_delay_ms a strict cap in the exponential-backoff path. compute_backoff now applies jitter first and then clamps with min(retry_max_delay_ms), so the returned delay cannot exceed the configured cap. The updated backoff_capped_by_max_delay assertion and the new near-cap loop cover the case that previously could escape the cap.

🟒 What looks good β€” retry behavior remains bounded

The retry behavior still looks bounded and targeted: max_retries = 0 preserves one-shot behavior, final failures do not sleep again, non-429 4xx responses fail immediately, transient network errors and 429/5xx responses retry, and numeric/delay-seconds Retry-After is honored for 429/503 with the runtime cap applied.

The config additions are optional with serde defaults, and the orchestrator passes the new fields through when constructing the webhook channel. I do not see a new webhook-code blocker from this pass.

🟑 Warning β€” current red Security check appears covered by #6662

The only red check I see on this PR is the Security job / CI Required Gate. I am not including the raw Security job log in this comment, but the available public #6657/#6662 context strongly points to that failure being the lettre 0.11.21 / RUSTSEC-2026-0141 advisory, not a dependency introduced by this webhook retry PR.

#6662 updates the workspace and channel crate manifest floor plus Cargo.lock to lettre 0.11.22, and its PR body reports cargo deny check passing after that update. So I would treat the current red Security status here as base-branch dependency fallout that should clear once #6662, or equivalent lettre remediation, lands and this branch is rebased or refreshed onto current master.

After #6662 lands and this PR gets a fresh green run, the remaining review-state question is whether @theonlyhennygod wants to clear or update the earlier formal review.

@mn13
Copy link
Copy Markdown
Contributor Author

mn13 commented May 15, 2026

@theonlyhennygod β€” gentle ping for a re-review pass when you have time. Your 2026-04-18 CHANGES_REQUESTED review is the only remaining merge block; the head has moved through several follow-ups since then and the issues you raised look addressed:

Blocking items

  1. PR template fully filled β€” Summary, Validation Evidence, Security & Privacy, Compatibility, Rollback, i18n sections all completed in the body.
  2. risk: medium and size: M labels β€” applied.

Suggestions
3. async fn test functions β€” left as-is per schema.rs convention (the module imports tokio::test); @Audacity88's 2026-05-13 re-review accepts this.
4. zh-CN channels-reference β€” N/A; the referenced docs/i18n/zh-CN/reference/api/channels-reference.zh-CN.md path doesn't exist in the current repo layout. Retry fields are documented via schema doc-comments per the updated i18n section.
5. Test placeholder "bob" β†’ "zeroclaw_user" β€” done in df3bb2139.

Two follow-up commits since your review also tightened retry_max_delay_ms to a strict cap (5e572f1a) and threaded the retry config through announcements (67b3db98, by @Audacity88). All CI checks green. @Audacity88's latest re-review (2026-05-14) summarizes that no code blockers remain from their pass.

Happy to address anything further you flag.

@mn13 mn13 requested a review from Audacity88 May 15, 2026 07:24
@mn13 mn13 force-pushed the feat/webhook-retry-logic branch from 67b3db9 to b222221 Compare May 15, 2026 09:48
mn13 added a commit to mn13/zeroclaw that referenced this pull request May 15, 2026
…is strict

Addresses Audacity88's 2026-05-13 warning on zeroclaw-labs#5838.

The schema doc-comment promises `retry_max_delay_ms` is the maximum delay
cap for any single retry wait, but `compute_backoff` capped the base delay
first and then applied Β±25% jitter, so the returned wait could exceed the
configured cap by up to 25%.

Reorder to `apply_jitter(base).min(retry_max_delay_ms)` β€” jitter is now
absorbed by the strict cap, and the public wording matches the behavior.
Update the function doc-comment to state the cap is strict, tighten the
existing `backoff_capped_by_max_delay` test to assert equality at the cap,
and add `backoff_never_exceeds_max_delay_near_cap` (256-draw fuzz) covering
the previously-broken case where base β‰ˆ cap allowed jitter to push the
wait above the cap.
@mn13 mn13 requested a review from WareWolf-MoonWall as a code owner May 15, 2026 13:13
@github-actions github-actions Bot added the skills Auto scope: src/skills/** changed. label May 15, 2026
mn13 and others added 7 commits May 15, 2026 16:24
…utbound sends

Closes zeroclaw-labs#5761.

WebhookChannel::send now retries transient failures (network errors, 429,
5xx) with exponential backoff + jitter, honoring Retry-After on 429/503
and capping waits by retry_max_delay_ms. Non-429 4xx responses still fail
immediately. Tunable via three optional WebhookConfig fields
(max_retries=3, retry_base_delay_ms=500, retry_max_delay_ms=30000); set
max_retries=0 to keep the previous fire-and-forget behavior.
Post-review cleanup on dd93422:
- Drop dead `last_err` state and replace unreachable tail `bail!` with
  `unreachable!` β€” the loop always exits via return/bail on the final
  attempt, so the trailing error was never reached.
- Expand comment above the 429/503 Retry-After branch to explain why 429
  appears twice: once here (server-supplied delay) and once in the
  fallback exponential-backoff branch when no Retry-After header was sent.
- Note in `compute_backoff` doc that the `retry_max_delay_ms` cap is
  approximate β€” Β±25% jitter is applied after capping.
- Add `send_honors_retry_after_on_503` wiremock test mirroring the 429
  case, closing a gap where only 429+Retry-After was covered.
- Add schema tests for the 3 new `WebhookConfig` retry fields:
  `webhook_config_retry_fields_default_to_none` and
  `webhook_config_retry_fields_roundtrip` (JSON + TOML).
After the rebase onto master, two WebhookConfig { ... } struct literals
added in fe3ec58 (zeroclaw-labs#5799) no longer compile without the three retry
fields introduced in dd93422. Both are test-only call-sites, so `None`
is correct for each.
- Defer reading the response body until after the Retry-After early
  return. Hot 429/503 loops against servers that echo large error
  pages no longer pay the text() I/O cost per attempt.
- Merge the two `impl WebhookChannel` blocks so `attempt_send` lives
  alongside the other helpers in a single block.
- Clamp `retry_base_delay_ms` and `retry_max_delay_ms` to a 1ms
  minimum in `WebhookChannel::new` so a misconfigured `0` cannot
  busy-retry without yielding. Documented on the schema fields.
Per `docs/contributing/pr-discipline.md`, test fixtures should use
neutral, project-scoped placeholders. Replace `bob` with `zeroclaw_user`
in `incoming_payload_without_thread`.
…is strict

Addresses Audacity88's 2026-05-13 warning on zeroclaw-labs#5838.

The schema doc-comment promises `retry_max_delay_ms` is the maximum delay
cap for any single retry wait, but `compute_backoff` capped the base delay
first and then applied Β±25% jitter, so the returned wait could exceed the
configured cap by up to 25%.

Reorder to `apply_jitter(base).min(retry_max_delay_ms)` β€” jitter is now
absorbed by the strict cap, and the public wording matches the behavior.
Update the function doc-comment to state the cap is strict, tighten the
existing `backoff_capped_by_max_delay` test to assert equality at the cap,
and add `backoff_never_exceeds_max_delay_near_cap` (256-draw fuzz) covering
the previously-broken case where base β‰ˆ cap allowed jitter to push the
wait above the cap.
@mn13 mn13 force-pushed the feat/webhook-retry-logic branch from 7f88441 to 1f91920 Compare May 15, 2026 13:24
@github-actions github-actions Bot removed the skills Auto scope: src/skills/** changed. label May 15, 2026
Copy link
Copy Markdown
Collaborator

@WareWolf-MoonWall WareWolf-MoonWall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I read the full diff at current head (including the jitter-cap tightening in 5e572f1 and the announcement-wiring in 67b3db9), the PR body, the linked issue #5761, @theonlyhennygod's CHANGES_REQUESTED review (on 2111658), @Audacity88's two re-review comments (on e951944 and 5e572f1), and @mn13's latest ping. I also read FND-006 Β§4.1 and Β§4.3.

State check. @theonlyhennygod holds an active CHANGES_REQUESTED. I have no blocking items of my own; my findings below are acknowledgements and suggestions. Per protocol I am posting as a comment.


βœ… Resolved β€” PR template, labels, and fixture placeholder

All five items from @theonlyhennygod's original review are accounted for in the current head: the PR body is fully completed, risk: medium and size: M labels are applied, the "bob" fixture is now "zeroclaw_user" (df3bb21), the #[test] async fn shape follows the schema.rs module convention where tokio::test is already in scope, and the disputed zh-CN channels-reference path does not exist in the current repo layout.

βœ… Resolved β€” retry_max_delay_ms cap strictness

@Audacity88's 2026-05-13 warning about jitter potentially exceeding the configured cap is addressed in 5e572f1. compute_backoff now applies jitter to the uncapped base delay and then clamps the jittered value with .min(self.retry_max_delay_ms), making the cap strict. The backoff_never_exceeds_max_delay_near_cap test runs 256 draws and asserts no draw exceeds the cap β€” the correct way to confirm this.

βœ… Resolved β€” Blast radius wording

The PR body sentence about tokio::spawn caller isolation is now accurate: it presents the isolation as a caller-side contract rather than something this PR guarantees universally. @Audacity88's 2026-05-14 comment confirms this. No further action needed there.


🟒 What looks good β€” Error discrimination is correct

Non-429 4xx responses β†’ AttemptOutcome::Fatal (no retry). 429 and 5xx β†’ AttemptOutcome::Retry (exponential backoff). 429/503 with Retry-After β†’ AttemptOutcome::RetryAfter (server-supplied delay, capped). The test suite covers all four branches including Retry-After exhaustion. This is the right behavioral contract for a well-behaved HTTP client.

🟒 What looks good β€” Bounds and defaults

max_retries.saturating_add(1) prevents the loop from running zero times when max_retries = 0 (one attempt, no sleep, which is the intended "restore prior behavior" semantics). The >=1ms clamp on delays prevents a misconfigured zero from busy-retrying without yielding. The three config fields are all Option with #[serde(default)], preserving backward compatibility for every existing deployment.

🟒 What looks good β€” unreachable!() at loop tail

The unreachable!("send loop exits via return or bail on the final attempt") is correct usage per FND-006 Β§4.1. The loop invariant makes it logically impossible to fall through, and the message documents exactly why. This is the difference between documented judgment and deferred judgment.


πŸ”΅ Suggestion β€” #[allow(clippy::too_many_arguments)] on WebhookChannel::new

WebhookChannel::new now takes nine parameters. Suppressing clippy::too_many_arguments hides a real signal: the constructor is a candidate for a WebhookChannelConfig builder or a struct-literal approach, which would also make call sites in the orchestrator and delivery path easier to read. This is not blocking β€” the current shape is functional and the annotation is honest about what it does β€” but consider whether a config struct would be cleaner before this goes in, especially since the config fields already have a natural grouping in WebhookConfig.

πŸ”΅ Suggestion β€” #[allow(clippy::cast_possible_truncation, clippy::cast_sign_loss)] in apply_jitter

The cast from f64 to u64 in apply_jitter is safe here because delay_ms is bounded by retry_max_delay_ms (a u64) before jitter is applied, and the jitter factor [0.75, 1.25] cannot produce a value that overflows a u64 at any realistic delay. A one-line inline comment explaining that bound would make the suppression self-documenting for future readers: // Safe: delay_ms <= retry_max_delay_ms (u64); jitter factor [0.75, 1.25] cannot overflow. cast_possible_truncation was already present in this file at line 222; I am only noting the value of documenting the invariant.


Summary for @theonlyhennygod. From my read of the current head, all five items from your original review appear resolved. The two suggestions above are mine alone and are non-blocking. This PR is waiting on your re-review pass; once you update or dismiss your CHANGES_REQUESTED, it is ready to move forward.

Milestone note. feat(channels/webhook): maps to the channels scope. Flagging for @JordanTheJet to confirm milestone assignment.

@WareWolf-MoonWall WareWolf-MoonWall added this to the v0.8.1 milestone May 16, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

channel:webhook channel Auto scope: src/channels/** changed. config Auto scope: src/config/** changed. daemon Auto scope: src/daemon/** changed. enhancement New feature or request risk: medium Auto risk: src/** or dependency/config changes. runtime Auto scope: src/runtime/** changed. size: M Auto size: 251-500 non-doc changed lines.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feature]: Add retry logic with exponential backoff to webhook channel outbound sends

5 participants