
Team mode members lack model fallback retry — all members die on primary provider quota exhaustion #4420

@PeterPonyu

Issue: Team Mode Members Lack Model Fallback Retry — All Members Die on Primary Provider Quota Exhaustion

Current Behavior

When a team-mode member session encounters a retryable model error (rate limit, quota exhausted, model unavailable), team-member-error-handler.ts only marks the member's status as "errored". There is no fallback retry mechanism, so the member stays dead for the remainder of the team run.

This is especially severe for categories that default to openai/gpt-5.5 (e.g., ultrabrain, deep) because:

  1. All 4 hyperplan members simultaneously hit the same provider quota
  2. Every member errors within seconds of each other
  3. The team run collapses with no recovery path

Expected Behavior

Team members should follow the same fallback retry strategy that background tasks (fallback-retry-handler.ts) and sync delegated tasks (sync-task-fallback.ts) already implement:

  • Detect retryable errors via shouldRetryError()
  • Walk the pre-computed fallbackChain for the member's resolved model
  • Spawn a new session with the next fallback model
  • Update the team registry so the new session retains membership
  • Resume member execution transparently
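
To make the first bullet concrete, here is a minimal sketch of what a retryable-error check might look like. The `ModelErrorInfo` shape, the status codes, and the message patterns are assumptions for illustration only; the project's actual `shouldRetryError()` may take a different argument and use different signals.

```typescript
// Hypothetical error shape; the real errorInfo passed to shouldRetryError() may differ.
interface ModelErrorInfo {
  statusCode?: number
  message: string
}

// Assumed retryable signals: HTTP 429 / 503, or quota and rate-limit wording.
const RETRYABLE_STATUS = new Set([429, 503])
const RETRYABLE_PATTERN = /rate.?limit|quota|model.*unavailable/i

// Stand-in for shouldRetryError(); not the project's implementation.
function isRetryableModelError(error: ModelErrorInfo): boolean {
  if (error.statusCode !== undefined && RETRYABLE_STATUS.has(error.statusCode)) {
    return true
  }
  return RETRYABLE_PATTERN.test(error.message)
}
```

Under these assumptions a quota or rate-limit error is retried, while errors such as "context length exceeded" or "permission denied" fall through and are treated as fatal.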

Root Cause

src/hooks/team-session-events/team-member-error-handler.ts (lines 14-52):

export function createTeamMemberErrorHandler(config: TeamModeConfig): HookImpl {
  return async ({ event }: HookInput): Promise<void> => {
    if (event.type !== "session.error") return
    // ...resolves member, loads runtime state...
    await transitionRuntimeState(runtimeState.teamRunId, (currentRuntimeState) => ({
      ...currentRuntimeState,
      members: currentRuntimeState.members.map((member) => (
        member.name === runtimeMember.memberName
          ? { ...member, status: "errored" }  // ← Only marks as errored. No retry.
          : member
      )),
    }), config)
    // Logs error. Done.
  }
}

Compare with src/features/background-agent/fallback-retry-handler.ts which:

  1. Checks shouldRetryError(errorInfo)
  2. Iterates fallbackChain via getNextFallback()
  3. Validates reachability with isReachable() + selectFallbackProvider()
  4. Skips no-op fallbacks
  5. Enqueues retry with scheduleRetryAttempt()
  6. Aborts old session, creates new session
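
Steps 2 to 4 amount to a filtered walk over the chain. The sketch below assumes each chain entry is a provider/model pair and that reachability is a caller-supplied predicate; both are hypothetical, and `getNextFallback()`, `isReachable()`, and `selectFallbackProvider()` may well have different signatures in the codebase.

```typescript
// Hypothetical entry shape; real fallbackChain entries may carry more fields.
interface FallbackEntry {
  providerID: string
  modelID: string
}

interface FallbackPick {
  entry: FallbackEntry
  nextIndex: number // where a later retry should resume the walk
}

// Returns the first entry at or after startIndex that is reachable and is not
// the model that just failed, or undefined once the chain is exhausted.
function pickNextFallback(
  chain: readonly FallbackEntry[],
  startIndex: number,
  failing: FallbackEntry,
  isReachable: (entry: FallbackEntry) => boolean,
): FallbackPick | undefined {
  for (let index = startIndex; index < chain.length; index++) {
    const entry = chain[index]
    if (entry === undefined) continue
    // A fallback identical to the failing model would be a no-op retry.
    const isNoOp =
      entry.providerID === failing.providerID && entry.modelID === failing.modelID
    if (isNoOp || !isReachable(entry)) continue
    return { entry, nextIndex: index + 1 }
  }
  return undefined
}
```

Returning `nextIndex` alongside the entry lets the caller persist its position, so a second failure continues down the chain instead of restarting from the top.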

Prior Art (Already Fixed for Background Tasks)

These are necessary but insufficient — they don't actually retry the member with a fallback model.

Reproduction Steps

  1. Enable team mode (team_mode.enabled: true)
  2. Create a team with 4 category members (e.g., hyperplan roster)
  3. Have all members route to openai/gpt-5.5 (default for ultrabrain/deep)
  4. Exhaust OpenAI quota or trigger rate limits
  5. Observe: all members immediately transition to "errored"
  6. Observe: team_status shows errored members, no recovery attempted

Proposed Solution

Integrate fallback retry into team-member-error-handler.ts by:

  1. On session.error: After resolving the member, check if error is retryable via shouldRetryError()
  2. Retrieve fallback chain: The member's fallbackChain is already stored in runtime state via resolveMember() → ResolvedMember.fallbackChain
  3. Walk the chain: Use the same logic as fallback-retry-handler.ts to find the next reachable, non-no-op fallback
  4. Spawn replacement session: Create a new subagent session with the fallback model, using the existing teamRunId and member identity
  5. Update registry: Register the new session ID in the team-session-registry under the same member name
  6. Update runtime state: Replace the old session ID with the new one, set member status back to "running"
  7. Preserve context: Carry forward member pendingInjectedMessageIds and any accumulated mailbox messages
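
Steps 6 and 7 can be expressed as a pure state update, in the same style as the updater currently passed to `transitionRuntimeState`. This is only a sketch: the `sessionId` and `model` fields are assumed for illustration, and only `name`, `status`, and `pendingInjectedMessageIds` are named in this issue. Carrying over mailbox messages is not shown.

```typescript
type MemberStatus = "running" | "errored"

// Hypothetical member shape; sessionId and model are assumed field names.
interface RuntimeMember {
  name: string
  status: MemberStatus
  sessionId: string
  model: string
  pendingInjectedMessageIds: string[]
}

interface RuntimeState {
  teamRunId: string
  members: RuntimeMember[]
}

// Swaps in the replacement session for one member and marks it running again.
// The spread keeps pendingInjectedMessageIds and every other field intact.
function applyFallbackSession(
  state: RuntimeState,
  memberName: string,
  newSessionId: string,
  fallbackModel: string,
): RuntimeState {
  return {
    ...state,
    members: state.members.map((member): RuntimeMember =>
      member.name === memberName
        ? { ...member, status: "running", sessionId: newSessionId, model: fallbackModel }
        : member,
    ),
  }
}
```

Because the function returns a new state object instead of mutating its input, it could be used as the updater callback once the replacement session has been created.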

Acceptance Criteria

  • When a team member hits a retryable error, it automatically retries with the next fallback model
  • The fallback session is correctly registered as the same team member
  • team_status shows the member as "running" after successful fallback spawn
  • If all fallbacks are exhausted, the member correctly transitions to "errored" and surfaces the error to the lead
  • Non-retryable errors (e.g., context length, permission denied) still immediately mark the member as errored
  • Regression test: member error → fallback → member resumes with new session ID
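
The regression scenario in the last bullet could be exercised against an in-memory model of the handler, roughly as sketched below. Everything here is hypothetical: the flattened `Member` record, the injected `Deps`, the `handleMemberError` contract, and the placeholder model name `example/fallback-model` are illustrations, not the project's types or test harness.

```typescript
type Status = "running" | "errored"

// Hypothetical, flattened member record; the real runtime state is richer.
interface Member {
  name: string
  status: Status
  sessionId: string
  model: string
  fallbackChain: string[]
  nextFallbackIndex: number
}

// Injected collaborators so the scenario runs without a real provider.
interface Deps {
  isRetryable: (message: string) => boolean
  spawnSession: (memberName: string, model: string) => string
}

// Assumed contract: returns the member's next state after a session.error.
function handleMemberError(member: Member, message: string, deps: Deps): Member {
  if (!deps.isRetryable(message)) return { ...member, status: "errored" }
  for (let index = member.nextFallbackIndex; index < member.fallbackChain.length; index++) {
    const candidate = member.fallbackChain[index]
    if (candidate === undefined || candidate === member.model) continue // skip no-op fallbacks
    const sessionId = deps.spawnSession(member.name, candidate)
    return { ...member, status: "running", model: candidate, sessionId, nextFallbackIndex: index + 1 }
  }
  return { ...member, status: "errored" } // chain exhausted
}

function expectEqual<T>(actual: T, expected: T, label: string): void {
  if (actual !== expected) {
    throw new Error(`${label}: expected ${String(expected)}, got ${String(actual)}`)
  }
}

let spawned = 0
const deps: Deps = {
  isRetryable: (message) => /quota|rate limit/i.test(message),
  spawnSession: () => `session-${++spawned}`,
}
const initial: Member = {
  name: "planner-1",
  status: "running",
  sessionId: "session-0",
  model: "openai/gpt-5.5",
  fallbackChain: ["openai/gpt-5.5", "example/fallback-model"],
  nextFallbackIndex: 0,
}

// member error -> fallback -> member resumes with a new session ID
const retried = handleMemberError(initial, "quota exceeded", deps)
expectEqual(retried.status, "running", "status after fallback")
expectEqual(retried.sessionId, "session-1", "new session ID")
expectEqual(retried.model, "example/fallback-model", "fallback model")

// all fallbacks exhausted -> errored
const exhausted = handleMemberError(retried, "rate limit reached", deps)
expectEqual(exhausted.status, "errored", "status after exhausting the chain")

// non-retryable error -> errored immediately, no session spawned
const fatal = handleMemberError(initial, "context length exceeded", deps)
expectEqual(fatal.status, "errored", "non-retryable error")
expectEqual(spawned, 1, "sessions spawned")
```

A real test would replace the in-memory `Deps` with the project's session and registry mocks and would also assert that the new session ID is registered under the same member name.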

Environment

Labels

bug, team-mode, fallback, resilience
