What happened
When per-request reporting is enabled (report.request_lifecycle.per_request: true), requests that fail while streaming an HTTP 200 response are recorded with an empty response field, so there is no way to tell why they failed from the per-request report. This defeats the main purpose of the per-request escape hatch.
We hit this with a model server that returned 200s but whose responses were marked as failed by inference-perf. Inspecting per_request_lifecycle_metrics.json gave no insight into the cause.
Root cause
In the streaming branch of OpenAIModelServerClient.process_request (inference_perf/client/modelserver/openai_client.py):
    if self.client.api_config.streaming and response.status == 200:
        info = await data.process_response(...)  # raises mid-stream
        response_content = info.extra_info.get("raw_response", "") if info else ""  # never runs
response_content is only assigned after process_response returns. If the stream breaks partway (truncated SSE, dropped connection, or a proxy that 200s then sends an error page), parse_sse_stream raises and the raw bytes it had already accumulated are discarded, so response_content stays "". The non-streaming path does not have this problem because it reads the body (await response.text()) before parsing.
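One possible shape for a fix is to keep the accumulated bytes in a buffer that outlives the exception. The sketch below is not the project's code: `BrokenStreamError`, `fake_sse_stream`, and `read_stream_keeping_partial` are hypothetical names, and the async generator only simulates a stream that breaks partway through.

```python
import asyncio
from typing import AsyncIterator, Optional, Tuple


class BrokenStreamError(Exception):
    """Hypothetical stand-in for aiohttp.ClientPayloadError."""


async def fake_sse_stream() -> AsyncIterator[bytes]:
    # Simulated server: two SSE events, then the connection drops.
    yield b'data: {"choices":[{"text":"Hello "}]}\n\n'
    yield b'data: {"choices":[{"text":"world "}]}\n\n'
    raise BrokenStreamError("Response payload is not completed")


async def read_stream_keeping_partial(
    chunks: AsyncIterator[bytes],
) -> Tuple[str, Optional[Exception]]:
    # The buffer is created outside the try block, so whatever was
    # received before the failure is still available afterwards.
    received = bytearray()
    error: Optional[Exception] = None
    try:
        async for chunk in chunks:
            received.extend(chunk)
    except Exception as exc:
        error = exc
    return received.decode("utf-8", errors="replace"), error


def run_demo() -> Tuple[str, Optional[Exception]]:
    return asyncio.run(read_stream_keeping_partial(fake_sse_stream()))
```

Calling `run_demo()` returns both SSE events as text together with the exception, which is the pair the per-request report would need to record.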
Reproduction
Point inference-perf (streaming completion) at a server that returns 200 OK with Content-Length larger than the bytes actually sent, then closes the connection (triggers aiohttp.ClientPayloadError mid-stream). The per-request entry looks like:
{
  "response": "",
  "error": {
    "error_type": "ClientPayloadError",
    "error_msg": "Response payload is not completed: ..."
  }
}
The bytes the server actually sent are gone.
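A server with this behaviour can be approximated with the standard library alone. This is a minimal sketch for local reproduction, not part of inference-perf: the port, the body, and the names `handle` and `serve` are assumptions. It declares a `Content-Length` larger than the bytes it writes and then closes the connection.

```python
import asyncio

BODY = b'data: {"choices":[{"text":"Hello "}]}\n\n'
DECLARED_LENGTH = len(BODY) + 100  # deliberately larger than what is sent


async def handle(reader: asyncio.StreamReader, writer: asyncio.StreamWriter) -> None:
    # Consume the request line and headers, noting the request body size.
    content_length = 0
    while True:
        line = await reader.readline()
        if line in (b"\r\n", b"\n", b""):
            break
        name, _, value = line.partition(b":")
        if name.strip().lower() == b"content-length":
            content_length = int(value.strip())
    # Drain the request body so the close below is a clean shutdown.
    if content_length:
        await reader.readexactly(content_length)

    head = (
        "HTTP/1.1 200 OK\r\n"
        "Content-Type: text/event-stream\r\n"
        f"Content-Length: {DECLARED_LENGTH}\r\n"
        "Connection: close\r\n"
        "\r\n"
    ).encode("ascii")
    writer.write(head + BODY)
    await writer.drain()
    writer.close()  # close early: the declared length is never reached
    await writer.wait_closed()


async def serve(host: str = "127.0.0.1", port: int = 8089) -> None:
    # Hypothetical host and port; adjust to match the benchmark config.
    server = await asyncio.start_server(handle, host, port)
    async with server:
        await server.serve_forever()
```

Running `asyncio.run(serve())` and pointing a streaming completion run at that address should produce the truncated-payload failure described above.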
Expected
The per-request report should retain whatever bytes were received before the failure so the failure is diagnosable, e.g.:
{
  "response": "data: {\"choices\":[{\"text\":\"Hello \"}]}\n\ndata: {\"choices\":[{\"text\":\"world \"}]}\n\n",
  "error": { "error_type": "ClientPayloadError", "error_msg": "Response payload is not completed: ..." }
}
Related follow-ups (out of scope for the initial fix)
- A 200 whose SSE body is an in-band error payload ({"error": ...} with no choices) is silently parsed to empty output and marked success, not failed.
- Per-request entries omit stage_id and per-request latency (TTFT/ITL/TPOT), which are available at emit time.
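
For the first follow-up, detection could be as simple as scanning the raw SSE text for a data payload that carries "error" and no "choices". The helper below is a hypothetical sketch, not existing inference-perf code, and it assumes each event's JSON fits on a single `data:` line.

```python
import json
from typing import Optional


def find_inband_error(raw_sse: str) -> Optional[dict]:
    """Return the first SSE data payload that has "error" and no "choices"."""
    for line in raw_sse.splitlines():
        if not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        # Skip empty payloads and the OpenAI-style stream terminator.
        if not payload or payload == "[DONE]":
            continue
        try:
            event = json.loads(payload)
        except json.JSONDecodeError:
            continue  # not JSON; leave it to the normal parser
        if isinstance(event, dict) and "error" in event and "choices" not in event:
            return event
    return None
```

A non-`None` result would be grounds for marking the request as failed instead of recording it as a success with empty output.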