[Bugfix] Fix bench_serve UTF-8 decode crash on split multi-byte chars#38732

Merged
ywang96 merged 1 commit into vllm-project:main from he-yufeng:fix/bench-utf8-split
Apr 16, 2026

Conversation

@he-yufeng
Contributor

Summary

StreamedResponseHandler.add_chunk() calls chunk_bytes.decode("utf-8") directly, which crashes when a multi-byte UTF-8 character (e.g. Chinese) is split across HTTP chunk boundaries:

UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 65527-65528: unexpected end of data

The reporter hit this at a ~0.4% rate (4 of 1000 requests) while running bench_serve on an H100x8 setup.

Replace the one-shot decode with codecs.IncrementalDecoder which buffers incomplete byte sequences across add_chunk() calls and decodes them once the next chunk completes the character. This is the standard Python approach for streaming byte-to-str conversion.

Fixes #38717

Test plan

  • ruff check and ruff format --check pass
  • Verified the crash traceback matches endpoint_request_func.py:32
  • IncrementalDecoder correctly handles split multi-byte sequences (tested locally with artificially split bytes of Chinese characters)
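The split-byte scenario from the last bullet can be reproduced in a few lines. This is a standalone sketch, not the benchmark code itself: it artificially splits the bytes of a CJK string mid-character and shows that the one-shot decode raises while the incremental decoder does not.

```python
# Standalone sketch: a 3-byte CJK character split across two chunks.
import codecs

data = "你好".encode("utf-8")  # 6 bytes: two 3-byte characters
chunks = [data[:4], data[4:]]  # boundary falls mid-character

# One-shot decode of the first chunk crashes, as in the bug report.
try:
    chunks[0].decode("utf-8")
except UnicodeDecodeError:
    pass  # "unexpected end of data"

# codecs' incremental decoder buffers the incomplete tail instead,
# emitting the character once the next chunk completes it.
decoder = codecs.getincrementaldecoder("utf-8")()
out = "".join(decoder.decode(c) for c in chunks)
print(out)  # 你好
```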

When streaming responses containing multi-byte UTF-8 characters
(e.g. Chinese text), HTTP chunk boundaries can split a character
across two chunks.  The direct bytes.decode("utf-8") call crashes:

  UnicodeDecodeError: 'utf-8' codec can't decode bytes in position
  65527-65528: unexpected end of data

Replace the one-shot decode with codecs.IncrementalDecoder which
buffers incomplete byte sequences across add_chunk() calls and
returns them once the next chunk completes the character.

Fixes vllm-project#38717

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request replaces direct UTF-8 decoding with an incremental decoder in StreamedResponseHandler so that multi-byte characters split across chunks are handled correctly. However, the decoder is never finalized, which risks data loss: a mechanism to flush the remaining buffer at the end of the stream is required.

"""Add a chunk of bytes to the buffer and return any complete
messages."""
chunk_str = chunk_bytes.decode("utf-8")
chunk_str = self._decoder.decode(chunk_bytes)

critical

The IncrementalDecoder can buffer incomplete byte sequences. If the stream ends with such an incomplete sequence, it will remain in the decoder's buffer and will be lost because there is no final call to flush the decoder. This can lead to data loss.

To fix this, you should add a mechanism to finalize the decoding process after the last chunk has been processed. This typically involves calling self._decoder.decode(b'', final=True) to flush any buffered data.

This would likely require adding a new method to StreamedResponseHandler, for example finalize(), and calling it from the request handling functions (e.g., async_request_openai_completions) after the streaming loop is complete.

Example of a finalize method:

def finalize(self) -> list[str]:
    """Flushes the decoder and processes any remaining buffered data."""
    final_chunk_str = self._decoder.decode(b'', final=True)
    if not final_chunk_str:
        return []
    
    self.buffer += final_chunk_str
    # It's best to refactor the message processing logic from add_chunk
    # into a private helper method to be reused here.
    messages = self._process_buffer()
    
    if self.buffer:
        # Handle or log any remaining incomplete message in the buffer
        # after final processing.
        pass

    return messages

The call sites in async_request_openai_completions, async_request_openai_chat_completions, and async_request_openai_audio would need to be updated to call this finalize method.
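A synchronous sketch of how the suggested finalize pattern fits together at a call site (hypothetical, simplified names; the real handler in vLLM also splits the buffer into SSE messages, which is omitted here):

```python
import codecs


class StreamedResponseHandler:
    """Simplified sketch of a streaming handler; message splitting
    from the real implementation is omitted."""

    def __init__(self) -> None:
        self._decoder = codecs.getincrementaldecoder("utf-8")()

    def add_chunk(self, chunk_bytes: bytes) -> str:
        # Incomplete trailing byte sequences stay buffered inside
        # the decoder until the next chunk completes them.
        return self._decoder.decode(chunk_bytes)

    def finalize(self) -> str:
        # Flush anything still buffered at end of stream.
        return self._decoder.decode(b"", final=True)


handler = StreamedResponseHandler()
# "你" (e4 bd a0) split across two chunks.
parts = [handler.add_chunk(c) for c in [b"ok \xe4\xbd", b"\xa0"]]
parts.append(handler.finalize())  # after the streaming loop
result = "".join(parts)
print(result)  # ok 你
```

Note that with the default strict error handling, `decode(b"", final=True)` raises UnicodeDecodeError if truncated bytes are still buffered at end of stream, so a genuinely malformed stream surfaces as an error rather than being silently dropped.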

Member


Actually @he-yufeng this looks like a valid point. Should we flush the incremental decoder?

@mergify mergify Bot added performance Performance-related issues bug Something isn't working labels Apr 1, 2026
Member

@njhill njhill left a comment


Thanks @he-yufeng!

@ZJY0516 ZJY0516 requested a review from DarkLight1337 April 1, 2026 17:10
@he-yufeng
Contributor Author

ping — bench_serve crashes on CJK output, the fix is straightforward (IncrementalDecoder).

@ywang96 ywang96 added the ready ONLY add when PR is ready to merge/full CI is needed label Apr 16, 2026
@ywang96
Member

ywang96 commented Apr 16, 2026

I think it's okay to have this in - merging

@ywang96 ywang96 merged commit de111f3 into vllm-project:main Apr 16, 2026
9 of 16 checks passed
avinashsingh77 pushed a commit to avinashsingh77/vllm that referenced this pull request Apr 27, 2026

Labels

bug Something isn't working performance Performance-related issues ready ONLY add when PR is ready to merge/full CI is needed

Development

Successfully merging this pull request may close these issues.

[Bug]: Bench Serve encounter utf-8 UnicodeDecodeError
