Per-request report drops the response body when a streaming request fails on HTTP 200 #531

@Bslabe123

What happened

When per-request reporting is enabled (report.request_lifecycle.per_request: true), requests that fail while streaming an HTTP 200 response are recorded with an empty response field, so there is no way to tell why they failed from the per-request report. This defeats the main purpose of the per-request escape hatch.

We hit this with a model server that returned 200s but whose responses were marked as failed by inference-perf. Inspecting per_request_lifecycle_metrics.json gave no insight into the cause.

Root cause

In the streaming branch of OpenAIModelServerClient.process_request (inference_perf/client/modelserver/openai_client.py):

if self.client.api_config.streaming and response.status == 200:
    info = await data.process_response(...)                      # raises mid-stream
    response_content = info.extra_info.get("raw_response", "") if info else ""   # never runs

response_content is assigned only after process_response returns. If the stream breaks partway (a truncated SSE stream, a dropped connection, or a proxy that returns 200 and then sends an error page), parse_sse_stream raises, the raw bytes it had already accumulated are discarded, and response_content stays "". The non-streaming path does not have this problem because it reads the body (await response.text()) before parsing.
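As a rough illustration of the direction a fix could take, the sketch below accumulates chunks in a buffer that outlives the exception, so the partial body is still available when the error is recorded. It is not the inference-perf implementation: `broken_stream` and `read_stream_keeping_partial` are hypothetical names, and a plain `ConnectionError` stands in for `aiohttp.ClientPayloadError` so the example runs with the standard library only.

```python
import asyncio
from typing import AsyncIterator, List, Optional, Tuple


async def broken_stream() -> AsyncIterator[bytes]:
    """Hypothetical stand-in for a response body that fails partway through."""
    yield b'data: {"choices":[{"text":"Hello "}]}\n\n'
    yield b'data: {"choices":[{"text":"world "}]}\n\n'
    # Stand-in for aiohttp.ClientPayloadError raised mid-stream.
    raise ConnectionError("Response payload is not completed")


async def read_stream_keeping_partial(
    chunks: AsyncIterator[bytes],
) -> Tuple[str, Optional[Exception]]:
    """Consume a stream, returning whatever was received plus any error."""
    received: List[bytes] = []  # lives outside the try block, so it survives a failure
    error: Optional[Exception] = None
    try:
        async for chunk in chunks:
            received.append(chunk)
    except Exception as exc:
        # Keep the error for the report instead of losing the buffered bytes.
        error = exc
    body = b"".join(received).decode("utf-8", errors="replace")
    return body, error


def demo() -> Tuple[str, Optional[Exception]]:
    return asyncio.run(read_stream_keeping_partial(broken_stream()))
```

With this shape, the caller can write `body` into the per-request `response` field and still populate `error` from the captured exception.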

Reproduction

Point inference-perf (streaming completion) at a server that returns 200 OK with Content-Length larger than the bytes actually sent, then closes the connection (triggers aiohttp.ClientPayloadError mid-stream). The per-request entry looks like:

{
  "response": "",
  "error": {
    "error_type": "ClientPayloadError",
    "error_msg": "Response payload is not completed: ..."
  }
}

The bytes the server actually sent are gone.
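One way to stand up such a server for the reproduction is sketched below using only `asyncio` from the standard library. It is a minimal test fixture under stated assumptions, not part of inference-perf: the handler name, the payload, and the 100-byte shortfall are all illustrative, and `fetch_raw` is a bare socket client used only to show what actually arrives on the wire.

```python
import asyncio

# Illustrative payload; the declared length is deliberately larger than what is sent.
PARTIAL_BODY = b'data: {"choices":[{"text":"Hello "}]}\n\n'
DECLARED_LENGTH = len(PARTIAL_BODY) + 100


async def handle(reader: asyncio.StreamReader, writer: asyncio.StreamWriter) -> None:
    """Answer any request with 200 OK, a short body, then close the connection."""
    request_length = 0
    while True:
        line = await reader.readline()
        if line in (b"\r\n", b"\n", b""):
            break  # end of request headers (or client went away)
        name, _, value = line.decode("latin-1").partition(":")
        if name.strip().lower() == "content-length":
            request_length = int(value.strip())
    if request_length:
        await reader.readexactly(request_length)  # drain the request body

    head = (
        "HTTP/1.1 200 OK\r\n"
        "Content-Type: text/event-stream\r\n"
        f"Content-Length: {DECLARED_LENGTH}\r\n"
        "\r\n"
    ).encode("ascii")
    writer.write(head + PARTIAL_BODY)
    await writer.drain()
    writer.close()  # close before the declared length has been sent
    await writer.wait_closed()


async def fetch_raw(host: str, port: int) -> bytes:
    """Bare client that returns every byte received before EOF."""
    reader, writer = await asyncio.open_connection(host, port)
    writer.write(b"POST /v1/completions HTTP/1.1\r\nHost: test\r\nContent-Length: 0\r\n\r\n")
    await writer.drain()
    data = await reader.read()  # read until the server closes
    writer.close()
    await writer.wait_closed()
    return data


async def demo() -> bytes:
    server = await asyncio.start_server(handle, "127.0.0.1", 0)  # port 0: pick a free port
    port = server.sockets[0].getsockname()[1]
    async with server:
        return await fetch_raw("127.0.0.1", port)
```

Pointing a streaming client at the port chosen by `start_server` should surface the short read as a payload error, while the raw bytes returned by `fetch_raw` show what the server really sent.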

Expected

The per-request report should retain whatever bytes were received before the failure so the failure is diagnosable, e.g.:

{
  "response": "data: {\"choices\":[{\"text\":\"Hello \"}]}\n\ndata: {\"choices\":[{\"text\":\"world \"}]}\n\n",
  "error": { "error_type": "ClientPayloadError", "error_msg": "Response payload is not completed: ..." }
}

Related follow-ups (out of scope for the initial fix)

  • A 200 whose SSE body is an in-band error payload ({"error": ...} with no choices) is silently parsed to empty output and marked as a success rather than a failure.
  • Per-request entries omit stage_id and per-request latency (TTFT/ITL/TPOT), which are available at emit time.
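For the first follow-up, a check along the following lines could flag an in-band error payload. This is only a sketch: `find_in_band_error` is a hypothetical helper, and it assumes each SSE event arrives as a single `data:` line carrying a JSON object, with `[DONE]` as the end-of-stream sentinel.

```python
import json
from typing import Optional


def find_in_band_error(sse_body: str) -> Optional[dict]:
    """Return the first event that carries "error" and no choices, else None."""
    for line in sse_body.splitlines():
        if not line.startswith("data:"):
            continue  # skip blank lines, comments, and other SSE fields
        payload = line[len("data:"):].strip()
        if not payload or payload == "[DONE]":
            continue
        try:
            event = json.loads(payload)
        except json.JSONDecodeError:
            continue  # not JSON (for example an HTML error page); handled elsewhere
        if isinstance(event, dict) and "error" in event and not event.get("choices"):
            return event
    return None
```

A request whose body yields a non-None result here could then be marked failed instead of being reported as a success with empty output.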
