BGE-M3 /pooling endpoint crashes with split_with_sizes error after ~50-100 requests #37648

@hslee-lunit

Description

The /pooling endpoint for BGE-M3 (BgeM3EmbeddingModel) crashes after approximately 50-100 sequential requests. Both embed (dense) and token_classify (sparse) tasks are affected. The error kills the engine entirely (EngineDeadError), requiring a full container restart.

Single requests work fine; the crash appears only after the server has processed many requests in sequence.

Environment

  • vLLM versions tested: v0.15.1 and latest (v0.17.x as of 2026-03-20)
  • Model: BAAI/bge-m3 with --hf-overrides '{"architectures": ["BgeM3EmbeddingModel"]}'
  • GPU: NVIDIA B200 (183GB)
  • Docker image: vllm/vllm-openai:v0.15.1 and vllm/vllm-openai:latest

Steps to Reproduce

# Launch
docker run --gpus '"device=0"' -d --network host \
  --entrypoint vllm vllm/vllm-openai:v0.15.1 \
  serve BAAI/bge-m3 --host 0.0.0.0 --port 9001 \
  --trust-remote-code \
  --hf-overrides '{"architectures": ["BgeM3EmbeddingModel"]}'

# Wait for model to load, then send ~100 sequential requests
for i in $(seq 1 100); do
  curl -s http://localhost:9001/pooling \
    -H "Content-Type: application/json" \
    -d '{"model": "BAAI/bge-m3", "task": "embed", "input": ["test sentence number '$i'"]}' > /dev/null
done
# Server crashes around request 50-100
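
The same loop can be driven from Python. The snippet below is a minimal sketch (the endpoint, port, and request body mirror the launch command and curl call above, and are the only assumptions) that alternates between the embed and token_classify tasks, both of which hit the crash:

# Minimal Python reproduction sketch (assumes the server from the launch
# command above is listening on localhost:9001).
import requests

URL = "http://localhost:9001/pooling"

for i in range(1, 101):
    for task in ("embed", "token_classify"):
        resp = requests.post(
            URL,
            json={
                "model": "BAAI/bge-m3",
                "task": task,
                "input": [f"test sentence number {i}"],
            },
            timeout=30,
        )
        # Once the engine dies, every subsequent request returns HTTP 500.
        if resp.status_code != 200:
            print(f"request {i} ({task}) failed: {resp.status_code} {resp.text[:200]}")
            raise SystemExit(1)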

Error Log

(EngineCore_DP0 pid=405)   File ".../vllm/model_executor/layers/pooler/tokwise/methods.py", line 50, in forward
(EngineCore_DP0 pid=405)     hidden_states_all = hidden_states.split(
(EngineCore_DP0 pid=405)                         ^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=405) RuntimeError: split_with_sizes expects split_sizes to sum exactly to 4192 (input tensor's size at dimension 0), but got split_sizes=[74, 43, 230, 510, 646, 348, 205, 77]
(APIServer pid=1) ERROR [async_llm.py:708] AsyncLLM output_handler failed.
(APIServer pid=1) ERROR [async_llm.py:708] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue.
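
The failing line splits the flattened hidden-state tensor back into per-prompt chunks, so the per-prompt lengths the pooler receives must sum exactly to the number of token rows in the tensor; in the log above they sum to 2133 while the tensor has 4192 rows. The toy snippet below only illustrates that invariant and is not the actual vLLM code (the hidden size is illustrative):

# Illustrative sketch of the violated invariant: splitting a flattened
# [num_tokens, hidden] tensor into per-prompt chunks requires the chunk
# sizes to sum exactly to num_tokens.
import torch

hidden_states = torch.randn(4192, 1024)               # 4192 token rows, as in the log
prompt_lens = [74, 43, 230, 510, 646, 348, 205, 77]   # sums to 2133, not 4192

# Raises the same RuntimeError: split_with_sizes expects split_sizes to sum
# exactly to 4192 (input tensor's size at dimension 0).
chunks = hidden_states.split(prompt_lens, dim=0)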

Behavior

  • Single requests to /pooling (both embed and token_classify tasks) work correctly
  • After ~50-100 sequential single-item requests, the engine crashes with split_with_sizes mismatch
  • The crash is fatal — the engine enters EngineDeadError state and all subsequent requests return 500
  • Batch requests (multiple items in the input array) crash immediately on the first request; see the sketch after this list
  • Input text length does not matter — crash occurs even with very short texts (~10 tokens)
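
The batch-input case is the quickest way to trigger the failure. The sketch below (same assumed endpoint as the reproduction script) crashes the engine on the very first call:

# A single batch request (multiple items in "input") kills the engine immediately.
import requests

resp = requests.post(
    "http://localhost:9001/pooling",
    json={
        "model": "BAAI/bge-m3",
        "task": "embed",
        "input": ["first sentence", "second sentence", "third sentence"],
    },
    timeout=30,
)
print(resp.status_code, resp.text[:200])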

Workaround

Restart the vLLM container after every ~50 requests. Because point IDs are deterministic, the indexing job can simply be re-run after each restart and resume without creating duplicates (see the sketch below).
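
The workaround lives entirely on the client side. The snippet below is a hypothetical illustration of what "deterministic point IDs" means here: IDs derived with uuid.uuid5 from a stable document key, so re-running the indexing job after a container restart overwrites earlier points instead of duplicating them (the namespace string and helper name are made up):

# Hypothetical sketch: deterministic point IDs make the indexing run idempotent,
# so it can be restarted from the top after the vLLM container is bounced.
import uuid

NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "bge-m3-index")  # assumed namespace key

def point_id(doc_key: str) -> str:
    # Same document key -> same point ID on every run, so re-indexing after a
    # restart overwrites existing points instead of adding duplicates.
    return str(uuid.uuid5(NAMESPACE, doc_key))

print(point_id("corpus/doc-0001"))  # stable across runs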

