feat(sdk): add retry-with-backoff for transient localization failures #2105

Merged
AndreyHirsa merged 1 commit into main from feat/retries on Jun 9, 2026
@AndreyHirsa AndreyHirsa commented Jun 9, 2026

Contributor

Summary

Add automatic retry-with-backoff for transient localization failures in the SDK, so a 5xx response or network blip no longer fails the whole request.

Changes

  • Add fetchWithRetry to LingoDotDevEngine, used by localizeChunk: retries on 5xx responses and network errors with exponential backoff + full jitter.
  • Add two config options — maxRetries (default 3) and retryDelayMs (default 500); maxRetries: 0 disables retries.
  • Retry decision is based on the HTTP status code (>= 500); 4xx responses and aborted requests (signal.aborted) are never retried.
  • Cancel discarded 5xx response bodies before retrying so the underlying connection is released back to the pool.
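
The backoff arithmetic behind these bullets can be sketched as follows. This is a minimal illustration, assuming the exponent starts at 0 for the first retry and that no upper cap is applied (the PR text does not state either). The helper names, the injectable `random` parameter, and `maxTotalBackoffMs` are hypothetical and are not taken from the SDK source.

```typescript
// Hypothetical helpers; the SDK's real implementation is not shown in this PR page.

// Full jitter: pick a uniform delay in [0, retryDelayMs * 2^attempt).
function backoffDelay(
  attempt: number, // 0 for the first retry, 1 for the second, ...
  retryDelayMs: number = 500, // default stated in the PR description
  random: () => number = Math.random, // injectable so tests can be deterministic
): number {
  const ceiling = retryDelayMs * 2 ** attempt;
  return Math.floor(random() * ceiling);
}

// Retry decision based on the status code: 5xx is retried, 4xx fails fast.
function isRetryableStatus(status: number): boolean {
  return status >= 500;
}

// Upper bound on the total time spent sleeping across all retries
// (sum of the jitter ceilings), under the assumptions above.
function maxTotalBackoffMs(maxRetries: number = 3, retryDelayMs: number = 500): number {
  return retryDelayMs * (2 ** maxRetries - 1);
}
```

Under these assumptions, the defaults (`maxRetries: 3`, `retryDelayMs: 500`) give jitter ceilings of 500, 1000, and 2000 ms, so a request would spend at most 3500 ms sleeping between attempts.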

Testing

Business logic tests added:

  • Retries a 5xx response and returns the result once a later attempt succeeds (asserts 3 fetch calls)
  • Throws after exhausting maxRetries on persistent 5xx (asserts initial attempt + N retries)
  • Does not retry 4xx responses (fails fast on Invalid request)
  • Retries transient network errors (rejected fetch) and recovers
  • maxRetries: 0 performs a single attempt with no retries
  • Does not retry an already-aborted request (no fetch issued)
  • All tests pass locally

Visuals

Required for UI/UX changes:

N/A

Checklist

  • Changeset added (if version bump needed)
  • Tests cover business logic (not just happy path)
  • No breaking changes (or documented below)

Closes N/A

Summary by CodeRabbit

  • New Features

    • Localization requests now automatically retry on transient failures (5xx errors and network issues) using exponential backoff.
    • Added configurable options: maxRetries (default: 3) and retryDelayMs (default: 500ms) to control retry behavior.
  • Tests

    • Added comprehensive test coverage for retry behavior across various failure scenarios.

coderabbitai Bot commented Jun 9, 2026

📝 Walkthrough

The PR adds configurable exponential backoff retry logic to the SDK engine for transient localization failures. New maxRetries and retryDelayMs engine parameters control retry behavior; a new fetchWithRetry helper implements the retry subsystem with backoff and jitter, and localizeChunk now uses it for /process/localize requests. Tests validate retry scenarios across 5xx, 4xx, network errors, and abort cases.
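
The retry flow described in this walkthrough can be sketched roughly as below. This is not the SDK's `fetchWithRetry`; it is a standalone approximation with several assumptions: the fetch function is injected (`doFetch`, a hypothetical parameter) so the loop can be exercised without a network, the final 5xx response is returned to the caller rather than thrown here, and a plain `setTimeout` stands in for the abort-aware `sleep` helper.

```typescript
// Sketch only: names and signatures here are assumptions, not the SDK's API.
type FetchLike = (url: string, init?: RequestInit) => Promise<Response>;

interface RetryOptions {
  maxRetries: number; // 0 means a single attempt with no retries
  retryDelayMs: number; // base delay for exponential backoff
}

async function fetchWithRetry(
  doFetch: FetchLike,
  url: string,
  init: RequestInit,
  { maxRetries, retryDelayMs }: RetryOptions,
): Promise<Response> {
  for (let attempt = 0; ; attempt++) {
    // Never issue a request for a signal that is already aborted.
    if (init.signal?.aborted) throw new Error("Operation was aborted");
    const isLastAttempt = attempt >= maxRetries;
    try {
      const response = await doFetch(url, init);
      // 2xx-4xx responses, or the final 5xx, go back to the caller as-is.
      if (response.status < 500 || isLastAttempt) return response;
      // Discard the unread 5xx body so the connection can be released.
      await response.body?.cancel();
    } catch (error) {
      // Network error: give up if aborted or out of retries, else fall through.
      if (init.signal?.aborted || isLastAttempt) throw error;
    }
    // Exponential backoff with full jitter (simplified: not abort-aware).
    const ceiling = retryDelayMs * 2 ** attempt;
    await new Promise((resolve) => setTimeout(resolve, Math.random() * ceiling));
  }
}
```

Injecting the fetch function is a choice made for this sketch only; it lets a test count attempts with a stub, much as the PR's tests assert on the number of fetch calls.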

Changes

Localization Retry Resilience

| Layer | File(s) | Summary |
| --- | --- | --- |
| Retry configuration and helpers | packages/sdk/src/index.ts | Engine parameters schema adds maxRetries (default 3) and retryDelayMs (default 500). New private helpers sleep, backoffDelay, and fetchWithRetry implement retry logic: transient failures (5xx, network errors) are retried with exponential backoff and full jitter, abort signals are respected, and response bodies are drained before retrying. |
| Localization request integration | packages/sdk/src/index.ts | localizeChunk switches from direct fetch to this.fetchWithRetry for /process/localize requests, preserving the provided AbortSignal and inheriting retry semantics. |
| Retry behavior validation | packages/sdk/src/index.spec.ts | New test suite covering retry scenarios: successful retries after 5xx, exhaustion throws after max retries, 4xx errors do not retry, network exceptions trigger retries, maxRetries: 0 disables retries, and aborted requests skip fetch invocation. Mock factories and setup included. |
| Release documentation | .changeset/yummy-snails-return.md | Changeset entry for @lingo.dev/_sdk patch release documenting retry behavior, engine parameters, and defaults. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • lingodotdev/lingo.dev#2051: Refactors localizeChunk error handling with centralized throwOnHttpError/extractErrorMessage, modifying the same network request path that this PR adds retry logic to.
  • lingodotdev/lingo.dev#2027: Changes the /process/localize endpoint and payload routing for vNext, affecting the same localizeChunk request flow that this PR enhances with retry behavior.

Suggested reviewers

  • vrcprl
  • cherkanovart

Poem

🐰 An SDK with resolve so true,
Retries five-hundreds with backoff anew,
Jitter and sleep keep the storms at bay,
Transient troubles? They'll pass away! ✨
Lingo bounces back, come what may.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately and concisely describes the main change: adding retry-with-backoff for transient localization failures, which is the primary focus of the PR. |
| Description check | ✅ Passed | The description follows the template structure with complete Summary, Changes, Testing sections with checked items, and a Checklist. All required sections are present and substantive. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
packages/sdk/src/index.spec.ts (1)

605-624: ⚡ Quick win

Add a regression test for abort occurring during retry backoff.

Current coverage checks already-aborted signals, but not cancellation that happens after the first failed attempt while waiting to retry. That case would guard the new backoff sleep path directly.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/sdk/src/index.spec.ts` around lines 605 - 624, Add a regression test
that simulates abort during retry backoff by exercising
LingoDotDevEngine.localizeObject: configure engine with retryDelayMs > 0, have
the first fetch attempt fail (mockFetch rejects), start the localization call,
then trigger controller.abort() while the engine is awaiting the backoff sleep
(use fake timers to advance time appropriately); assert the call rejects with an
"aborted" error and that mockFetch was not called a second time. This targets
the retry/backoff sleep path to ensure cancellation during backoff stops further
retries.
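
The scenario this nitpick asks for can be illustrated outside the SDK with a small standalone sketch. It reuses the shape of the `sleep` helper quoted in this PR, but everything else is an assumption made for illustration: `abortDuringBackoffDemo` and `failingFetch` are hypothetical names, the retry loop is a simplified stand-in for the engine's, and real short timers are used instead of the fake timers the comment recommends.

```typescript
// Abort-aware sleep, following the shape of the helper quoted in this PR.
function sleep(ms: number, signal?: AbortSignal): Promise<void> {
  return new Promise((resolve, reject) => {
    if (signal?.aborted) {
      reject(new Error("Operation was aborted"));
      return;
    }
    const onAbort = () => {
      clearTimeout(timer); // stop the pending timer so nothing fires later
      reject(new Error("Operation was aborted"));
    };
    const timer = setTimeout(() => {
      signal?.removeEventListener("abort", onAbort);
      resolve();
    }, ms);
    signal?.addEventListener("abort", onAbort, { once: true });
  });
}

// Hypothetical demo: the first attempt fails, then the caller aborts while
// the loop is waiting out its backoff. No second attempt should be made.
async function abortDuringBackoffDemo(): Promise<{ fetchCalls: number; error: string }> {
  const controller = new AbortController();
  let fetchCalls = 0;
  const failingFetch = async (): Promise<never> => {
    fetchCalls++;
    throw new Error("network down");
  };
  // Abort 10 ms in, well inside the 1000 ms backoff below.
  setTimeout(() => controller.abort(), 10);
  try {
    for (let attempt = 0; attempt <= 3; attempt++) {
      try {
        await failingFetch();
      } catch {
        // Simplified stand-in for the engine's backoff sleep.
        await sleep(1000, controller.signal);
      }
    }
    return { fetchCalls, error: "" };
  } catch (error) {
    return { fetchCalls, error: (error as Error).message };
  }
}
```

A real regression test would drive `LingoDotDevEngine.localizeObject` with a mocked fetch, as the comment describes; the sketch only shows the two assertions that matter: the call rejects with the abort error, and the attempt count stays at one.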
ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 4f23073a-203d-4da3-aeff-f8659b26ecf3

📥 Commits

Reviewing files that changed from the base of the PR and between ecfe7b1 and f8f974b.

📒 Files selected for processing (3)
  • .changeset/yummy-snails-return.md
  • packages/sdk/src/index.spec.ts
  • packages/sdk/src/index.ts

Comment thread packages/sdk/src/index.ts
Comment on lines +97 to +112
private static sleep(ms: number, signal?: AbortSignal): Promise<void> {
  return new Promise((resolve, reject) => {
    if (signal?.aborted) {
      reject(new Error("Operation was aborted"));
      return;
    }
    const onAbort = () => {
      clearTimeout(timer);
      reject(new Error("Operation was aborted"));
    };
    const timer = setTimeout(() => {
      signal?.removeEventListener("abort", onAbort);
      resolve();
    }, ms);
    signal?.addEventListener("abort", onAbort, { once: true });
  });

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fix abort-listener registration race in sleep.

An abort between the early signal.aborted check and addEventListener can be missed, causing delayed cancellation until the timeout elapses.

💡 Suggested patch
   private static sleep(ms: number, signal?: AbortSignal): Promise<void> {
     return new Promise((resolve, reject) => {
       if (signal?.aborted) {
         reject(new Error("Operation was aborted"));
         return;
       }
-      const onAbort = () => {
-        clearTimeout(timer);
-        reject(new Error("Operation was aborted"));
-      };
-      const timer = setTimeout(() => {
-        signal?.removeEventListener("abort", onAbort);
-        resolve();
-      }, ms);
-      signal?.addEventListener("abort", onAbort, { once: true });
+      let timer: ReturnType<typeof setTimeout> | undefined;
+      const onAbort = () => {
+        if (timer) clearTimeout(timer);
+        signal?.removeEventListener("abort", onAbort);
+        reject(new Error("Operation was aborted"));
+      };
+      signal?.addEventListener("abort", onAbort, { once: true });
+      if (signal?.aborted) {
+        onAbort();
+        return;
+      }
+      timer = setTimeout(() => {
+        signal?.removeEventListener("abort", onAbort);
+        resolve();
+      }, ms);
     });
   }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/sdk/src/index.ts` around lines 97 - 112, The sleep function can miss
an abort that happens between the early signal.aborted check and
addEventListener; fix it by registering the onAbort listener before checking
signal.aborted, then immediately check signal.aborted and if set call onAbort
(or reject) to avoid the race; ensure you still clear the timeout and remove the
listener in both the timer callback and the abort handler for proper cleanup
(references: sleep, onAbort, timer).

@AndreyHirsa AndreyHirsa merged commit 7925cb1 into main Jun 9, 2026
10 checks passed
@AndreyHirsa AndreyHirsa deleted the feat/retries branch June 9, 2026 11:54