
fix(providers): add circuit breaker for Responses API fallback #3205

Merged
chengyongru merged 3 commits into HKUDS:nightly from mohamed-elkholy95:fix/responses-api-circuit-breaker on Apr 19, 2026
@mohamed-elkholy95
Contributor

Summary

  • Add a proper circuit breaker for Responses API fallback in OpenAICompatProvider
  • After 3 consecutive compatibility errors for a (model, reasoning_effort) key, skip the Responses API and go straight to Chat Completions
  • Circuit probes again after 5 minutes (half-open state) — recovery is automatic
  • Success immediately resets the failure counter
  • Addresses the issue where a transient Responses API outage could permanently degrade output for the lifetime of the process
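
The behavior listed above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the class name `ResponsesCircuit`, its method names, the constant names, and the injectable `clock` parameter are all assumptions made for the example. Only the threshold (3), the probe interval (5 minutes), and the (model, reasoning_effort) key come from the summary.

```python
import time
from typing import Callable, Dict, Optional, Tuple

# Values come from the PR summary; the constant names are assumptions.
FAILURE_THRESHOLD = 3
PROBE_INTERVAL_S = 300.0

Key = Tuple[str, Optional[str]]  # (model, reasoning_effort)


class ResponsesCircuit:
    """Hypothetical per-key circuit breaker (illustrative, not the PR's code)."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._failures: Dict[Key, int] = {}
        self._tripped_at: Dict[Key, float] = {}

    def should_try(self, key: Key) -> bool:
        """Return True if the Responses API should be attempted for this key."""
        tripped = self._tripped_at.get(key)
        if tripped is None:
            return True  # closed: threshold not reached
        # Open: skip until the probe interval has elapsed (half-open afterwards).
        return self._clock() - tripped >= PROBE_INTERVAL_S

    def record_failure(self, key: Key) -> None:
        count = self._failures.get(key, 0) + 1
        self._failures[key] = count
        if count >= FAILURE_THRESHOLD:
            # Trip the circuit, or restart the cooldown after a failed probe.
            self._tripped_at[key] = self._clock()

    def record_success(self, key: Key) -> None:
        # Success clears all state for the key, closing the circuit.
        self._failures.pop(key, None)
        self._tripped_at.pop(key, None)
```

Passing the clock in as a parameter is a choice made for this sketch so that the 5-minute probe can be exercised without sleeping. Note that this simplified version lets every caller probe once the cooldown has elapsed, until the next failure is recorded; it does not limit the half-open state to a single in-flight request.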

Test plan

  • 7 new tests covering: default availability, threshold opening, model isolation, success reset, time-based probe, below-threshold pass, reasoning_effort key separation
  • All 209 existing provider tests pass (no regressions)
  • ruff check --select F401,F841 clean

🤖 Generated with Claude Code

When the Responses API fails repeatedly (3 consecutive compatibility
errors), skip it and fall back directly to Chat Completions.  Unlike a
permanent disable, the circuit re-probes after 5 minutes so recovery
is automatic when the API comes back.  Success resets the counter.

Keyed per (model, reasoning_effort) so a failure with one model does
not affect others.
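
As a rough illustration of the fallback flow the commit message describes, the sketch below tries the Responses API only while the per-key failure count is under the threshold, and otherwise goes straight to Chat Completions. Everything here is hypothetical: `ResponsesCompatError`, `complete_with_fallback`, and the two callables stand in for the provider's real calls, and the 5-minute re-probe is left out to keep the example short.

```python
from typing import Callable, Dict, Optional, Tuple

Key = Tuple[str, Optional[str]]  # (model, reasoning_effort)


class ResponsesCompatError(Exception):
    """Hypothetical stand-in for a Responses API compatibility error."""


def complete_with_fallback(
    key: Key,
    prompt: str,
    responses_call: Callable[[str], str],
    chat_call: Callable[[str], str],
    failures: Dict[Key, int],
    threshold: int = 3,
) -> str:
    """Try the Responses API unless the circuit for `key` is open."""
    if failures.get(key, 0) < threshold:
        try:
            result = responses_call(prompt)
        except ResponsesCompatError:
            # Count the consecutive failure, then fall through to Chat Completions.
            failures[key] = failures.get(key, 0) + 1
        else:
            # Success resets the counter for this key.
            failures.pop(key, None)
            return result
    return chat_call(prompt)
```

With a `responses_call` that always raises, the first three requests for a key each attempt the Responses API and then fall back; from the fourth request on, only `chat_call` is invoked.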
Collaborator

@chengyongru chengyongru left a comment


Review: fix(providers): add circuit breaker for Responses API fallback

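Verdict: Approve

This is a well-designed circuit breaker implementation.

Pros:

  • Keyed per (model, reasoning_effort), so a failure in one model does not affect other models
  • The threshold (3 failures) and probe interval (5 minutes) are clearly defined as module-level constants
  • The half-open state is designed correctly: one probe is allowed once the cooldown period has passed
  • Success immediately resets all counters, so recovery is fast
  • _record_responses_success/failure are correctly integrated in both the non-streaming and streaming code paths
  • Test coverage is thorough (7 tests) and covers all key scenarios

Minor suggestions (non-blocking):

  1. _record_responses_failure contains an inline from loguru import logger. Consider moving it to the top of the file, consistent with the other imports. This is not a hot path, and every call goes through the import cache, so there is no performance problem, but the top of the file is more conventional.

  2. The breaker's _responses_failures and _responses_tripped_at dicts grow as distinct keys appear (although in practice the key space is small). If the extreme case is a concern, confirm that _record_responses_success cleans up both dicts. The current implementation already calls pop, so this is handled.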

Good to merge.

@chengyongru
Collaborator

Review: fix(providers): add circuit breaker for Responses API fallback

Verdict: Approve

Well-designed circuit breaker implementation.

Pros:

  • Isolated per (model, reasoning_effort) key — one model's failures don't affect others
  • Threshold (3) and probe interval (5 min) are clearly defined as module-level constants
  • Correct half-open state: allows one probe attempt after the cooldown period
  • Success immediately resets all counters for quick recovery
  • Properly integrated in both non-streaming and streaming code paths
  • Adequate test coverage (7 tests) covering all key scenarios

Minor suggestions (non-blocking):

  1. _record_responses_failure has an inline from loguru import logger. Consider moving it to the top of the file with the other imports for consistency. This isn't a hot path so the import cache handles it fine, but top-level is more conventional.

  2. The _responses_failures and _responses_tripped_at dicts grow with distinct keys (though the key space is tiny in practice). _record_responses_success already does pop to clean up, so this is handled.

Good to merge.

@mohamed-elkholy95 mohamed-elkholy95 force-pushed the fix/responses-api-circuit-breaker branch from e9d7a85 to 6c4ba2d on April 18, 2026 23:50
@chengyongru chengyongru merged commit adcd3fe into HKUDS:nightly on Apr 19, 2026
3 checks passed