FallbackChatClient: classify transport-level failures as transient #427

Merged: rockfordlhotka merged 1 commit into main from fix/fallback-classify-network-errors on May 20, 2026

@rockfordlhotka

Member

Summary

  • The OpenAI/Azure SDK's retry policy exhausts its retries on transport-level failures (DNS, TCP reset, "Resource temporarily unavailable") and surfaces a ClientResultException with Status == 0 (no HTTP response received). The previous classifier required a known HTTP status, mapped 0/null to Unknown, and re-threw immediately, so the Balanced tier never fell back to OpenRouter when Azure's endpoint was unreachable.
  • ClassifyException now treats four previously-Unknown shapes as Transient:
    • ClientResultException with Status == 0
    • HttpRequestException with StatusCode == null
    • Any exception whose inner chain contains SocketException or IOException
    • Message text matching "Retry failed after" or "temporarily unavailable" (catch-all wrapper)
  • Four regression tests added; the message-text test uses the exact shape observed in production (Retry failed after 4 tries. (Resource temporarily unavailable (rocky-ml1nznjr-eastus2.cognitiveservices.azure.com:443))).
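
For reviewers who want the rules above in one place, here is a minimal sketch of the transport-level checks. It is not the actual ClassifyException code. The names TransportFailureClassifier and FailureKind are hypothetical, and three choices are assumptions: including the outer exception in the inner-chain walk, matching only the top-level message, and matching case-insensitively. The ClientResultException Status == 0 rule is left out so the sketch depends only on the base class library.

```csharp
using System;
using System.IO;
using System.Net.Http;
using System.Net.Sockets;

// Hypothetical result type; the real classifier may have more categories.
public enum FailureKind { Transient, Unknown }

public static class TransportFailureClassifier
{
    // Illustrative stand-in for the transport-level rules described in this PR.
    public static FailureKind Classify(Exception ex)
    {
        // No status code means no HTTP response was ever received.
        if (ex is HttpRequestException { StatusCode: null })
            return FailureKind.Transient;

        // Walk the exception and its inner chain looking for socket or I/O failures.
        // (Assumption: the outer exception itself is included in the walk.)
        for (Exception? e = ex; e is not null; e = e.InnerException)
        {
            if (e is SocketException or IOException)
                return FailureKind.Transient;
        }

        // Catch-all for wrappers that only expose the failure through message text.
        // (Assumption: top-level message only, case-insensitive match.)
        if (ex.Message.Contains("Retry failed after", StringComparison.OrdinalIgnoreCase) ||
            ex.Message.Contains("temporarily unavailable", StringComparison.OrdinalIgnoreCase))
            return FailureKind.Transient;

        return FailureKind.Unknown;
    }
}
```

Under these assumptions, a plain Exception carrying the production message quoted above classifies as Transient, while an exception with an unrelated message and no socket or I/O exception in its chain stays Unknown.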

Background

A subagent fan-out + synthesis run failed end-to-end against the Balanced-tier Azure endpoint with the message above. OpenRouter is configured as the Balanced fallback but was never tried — both the subagent's own LLM call and the synthesis call gave up after the SDK's internal retries, because FallbackChatClient couldn't recognize the resulting exception as transient.
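
The failure mode described here can be sketched as a wrapper that tries the fallback only when the failure is judged transient. This is an illustration rather than the real FallbackChatClient: FallbackSketch, CallWithFallbackAsync, and the isTransient predicate are hypothetical names, and the real client presumably handles more than two providers.

```csharp
using System;
using System.Threading.Tasks;

public static class FallbackSketch
{
    // Hypothetical helper: runs the primary call, and runs the fallback only
    // when the predicate says the primary failure was transient.
    public static async Task<T> CallWithFallbackAsync<T>(
        Func<Task<T>> primary,
        Func<Task<T>> fallback,
        Func<Exception, bool> isTransient)
    {
        try
        {
            return await primary();
        }
        // Exception filter: a failure the predicate rejects never enters this
        // handler, so it propagates to the caller and the fallback is skipped.
        catch (Exception ex) when (isTransient(ex))
        {
            return await fallback();
        }
    }
}
```

With a predicate that returns false for an unrecognized exception, the fallback delegate is never invoked. That matches the behavior reported above, where OpenRouter was configured but never tried.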

Test plan

  • dotnet test tests/RockBot.Llm.Tests/RockBot.Llm.Tests.csproj — 134/134 pass
  • Watch for fallback log lines (FallbackChatClient: falling back from … to …) the next time Azure's Balanced endpoint is throttled or becomes unreachable

🤖 Generated with Claude Code

The SDK retry policy can give up on a transport-level failure (DNS, TCP
reset, "Resource temporarily unavailable") with "Retry failed after N
tries." — no HTTP response is received, so ClientResultException.Status
is 0 and HttpRequestException.StatusCode is null. The previous classifier
required a known HTTP status code and mapped 0/null to Unknown, which
re-throws immediately and skips the configured fallback model.

Now Status==0 ClientResultException, null-StatusCode HttpRequestException,
inner SocketException/IOException, and the "Retry failed after" /
"temporarily unavailable" message wrappers all classify as Transient, so
the Balanced tier actually falls through to OpenRouter when Azure's
endpoint is unreachable.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@rockfordlhotka rockfordlhotka merged commit 9f54ca4 into main May 20, 2026
2 checks passed
@rockfordlhotka rockfordlhotka deleted the fix/fallback-classify-network-errors branch May 20, 2026 06:04