Issue: Team Mode Members Lack Model Fallback Retry — All Members Die on Primary Provider Quota Exhaustion
Current Behavior
When a team-mode member session encounters a retryable model error (rate limit, quota exhausted, model unavailable), the team-member-error-handler.ts only marks the member status as "errored". There is no fallback retry mechanism. The member stays dead for the remainder of the team run.
This is especially severe for categories that default to openai/gpt-5.5 (e.g., ultrabrain, deep) because:
- All 4 hyperplan members simultaneously hit the same provider quota
- Every member errors within seconds of each other
- The team run collapses with no recovery path
Expected Behavior
Team members should follow the same fallback retry strategy that background tasks (fallback-retry-handler.ts) and sync delegated tasks (sync-task-fallback.ts) already implement:
- Detect retryable errors via shouldRetryError()
- Walk the pre-computed fallbackChain for the member's resolved model
- Spawn a new session with the next fallback model
- Update the team registry so the new session retains membership
- Resume member execution transparently
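To make the first step concrete, here is a minimal sketch of retryable-error detection. It is not the actual shouldRetryError() implementation: the ModelErrorInfo shape, the isRetryableModelError name, the status codes, and the message patterns are all assumptions chosen to cover the three conditions named above (rate limit, quota exhausted, model unavailable).

```typescript
// Hypothetical error shape; the real errorInfo passed to shouldRetryError() may differ.
interface ModelErrorInfo {
  statusCode?: number
  message: string
}

// Message fragments assumed (not taken from the codebase) to signal a retryable condition.
const RETRYABLE_PATTERNS: RegExp[] = [
  /rate.?limit/i,
  /quota/i,
  /model.*(unavailable|not available)/i,
]

// Stand-in for shouldRetryError(): true for rate limit, quota exhaustion,
// or model unavailability; false for everything else (e.g. auth failures).
function isRetryableModelError(error: ModelErrorInfo): boolean {
  // HTTP 429 (Too Many Requests) and 503 (Service Unavailable) are treated as retryable.
  if (error.statusCode === 429 || error.statusCode === 503) return true
  return RETRYABLE_PATTERNS.some((pattern) => pattern.test(error.message))
}
```

Errors that fail this check (for example an invalid API key) would keep today's behavior and mark the member as "errored" immediately.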
Root Cause
src/hooks/team-session-events/team-member-error-handler.ts (lines 14-52):
    export function createTeamMemberErrorHandler(config: TeamModeConfig): HookImpl {
      return async ({ event }: HookInput): Promise<void> => {
        if (event.type !== "session.error") return
        // ...resolves member, loads runtime state...
        await transitionRuntimeState(runtimeState.teamRunId, (currentRuntimeState) => ({
          ...currentRuntimeState,
          members: currentRuntimeState.members.map((member) => (
            member.name === runtimeMember.memberName
              ? { ...member, status: "errored" } // ← Only marks as errored. No retry.
              : member
          )),
        }), config)
        // Logs error. Done.
      }
    }
Compare with src/features/background-agent/fallback-retry-handler.ts which:
- Checks shouldRetryError(errorInfo)
- Iterates fallbackChain via getNextFallback()
- Validates reachability with isReachable() + selectFallbackProvider()
- Skips no-op fallbacks
- Enqueues retry with scheduleRetryAttempt()
- Aborts old session, creates new session
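The chain-walking part of that list could look roughly like the sketch below. It does not reproduce getNextFallback() or isReachable() from fallback-retry-handler.ts: the FallbackEntry shape, the pickNextFallback name, and the injected isReachable callback are hypothetical, and "no-op" is assumed to mean "the same provider and model that just failed".

```typescript
// Hypothetical fallback entry; the real fallbackChain element type may carry more fields.
interface FallbackEntry {
  providerID: string
  modelID: string
}

function sameModel(a: FallbackEntry, b: FallbackEntry): boolean {
  return a.providerID === b.providerID && a.modelID === b.modelID
}

// Returns the first entry at or after startIndex that differs from the failed model
// and is reachable, plus the index to resume from if that entry fails too.
function pickNextFallback(
  chain: FallbackEntry[],
  startIndex: number,
  failed: FallbackEntry,
  isReachable: (entry: FallbackEntry) => boolean,
): { entry: FallbackEntry; nextIndex: number } | undefined {
  const entry = chain
    .slice(startIndex)
    .find((candidate) => !sameModel(candidate, failed) && isReachable(candidate))
  // Chain exhausted: the caller should fall back to marking the member "errored".
  if (!entry) return undefined
  return { entry, nextIndex: chain.indexOf(entry, startIndex) + 1 }
}
```

Returning nextIndex lets the caller persist how far the chain has been consumed, so a second failure continues from the following entry instead of restarting at the top.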
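Prior Art (Already Fixed for Background Tasks)
- b516d5d41e: Preserves teamRunId across background task fallback so the fallback session remains a team participant
- 2bf5038215: Surfaces member errors to the lead via mailbox messages
These are necessary but insufficient: they don't actually retry the member with a fallback model.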
Reproduction Steps
- Enable team mode (team_mode.enabled: true)
- Create a team with 4 category members (e.g., hyperplan roster)
- Have all members route to openai/gpt-5.5 (default for ultrabrain/deep)
- Exhaust OpenAI quota or trigger rate limits
- Observe: all members immediately transition to "errored"
- Observe: team_status shows errored members, no recovery attempted
Proposed Solution
Integrate fallback retry into team-member-error-handler.ts by:
- On session.error: After resolving the member, check whether the error is retryable via shouldRetryError()
- Retrieve fallback chain: The member's fallbackChain is already stored in runtime state via resolveMember() → ResolvedMember.fallbackChain
- Walk the chain: Use the same logic as fallback-retry-handler.ts to find the next reachable, non-no-op fallback
- Spawn replacement session: Create a new subagent session with the fallback model, using the existing teamRunId and member identity
- Update registry: Register the new session ID in the team-session-registry under the same member name
- Update runtime state: Replace the old session ID with the new one, set member status back to "running"
- Preserve context: Carry forward member pendingInjectedMessageIds and any accumulated mailbox messages
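As an illustration of the "Update runtime state" and "Preserve context" steps, the sketch below shows a pure updater of the kind that could be passed to transitionRuntimeState(). The RuntimeMember and RuntimeState shapes are assumptions (only status, teamRunId, and pendingInjectedMessageIds are named in this issue; sessionId and model are hypothetical field names), and applyFallbackSession is not an existing function.

```typescript
type MemberStatus = "running" | "errored"

// Hypothetical member record; the real runtime state likely has more fields.
interface RuntimeMember {
  name: string
  sessionId: string
  model: string
  status: MemberStatus
  pendingInjectedMessageIds: string[]
}

interface RuntimeState {
  teamRunId: string
  members: RuntimeMember[]
}

// Swaps the session and model for one member and marks it "running".
// Every other field, including pendingInjectedMessageIds, is carried over unchanged.
function applyFallbackSession(
  state: RuntimeState,
  memberName: string,
  newSessionId: string,
  fallbackModel: string,
): RuntimeState {
  return {
    ...state,
    members: state.members.map((member): RuntimeMember =>
      member.name === memberName
        ? { ...member, sessionId: newSessionId, model: fallbackModel, status: "running" }
        : member,
    ),
  }
}
```

Because the updater spreads the existing member instead of rebuilding it, any pending injected message IDs survive the session swap without extra bookkeeping. Mailbox messages stored outside runtime state would still need to be handled separately.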
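Acceptance Criteria
- team_status shows the member as "running" after successful fallback spawn
- [...] "errored" and surfaces the error to the lead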
Environment
Labels
bug, team-mode, fallback, resilience