Description
The /pooling endpoint for BGE-M3 (BgeM3EmbeddingModel) crashes after approximately 50-100 sequential requests. Both embed (dense) and token_classify (sparse) tasks are affected. The error kills the engine entirely (EngineDeadError), requiring a full container restart.
Single requests work fine. The crash occurs only after accumulating multiple requests.
Environment
- vLLM versions tested: v0.15.1 and latest (v0.17.x as of 2026-03-20)
- Model: BAAI/bge-m3 with --hf-overrides '{"architectures": ["BgeM3EmbeddingModel"]}'
- GPU: NVIDIA B200 (183GB)
- Docker image: vllm/vllm-openai:v0.15.1 and vllm/vllm-openai:latest
Steps to Reproduce
# Launch
docker run --gpus '"device=0"' -d --network host \
  --entrypoint vllm vllm/vllm-openai:v0.15.1 \
  serve BAAI/bge-m3 --host 0.0.0.0 --port 9001 \
  --trust-remote-code \
  --hf-overrides '{"architectures": ["BgeM3EmbeddingModel"]}'
# Wait for model to load, then send ~100 sequential requests
for i in $(seq 1 100); do
  curl -s http://localhost:9001/pooling \
    -H "Content-Type: application/json" \
    -d '{"model": "BAAI/bge-m3", "task": "embed", "input": ["test sentence number '$i'"]}' > /dev/null
done
# Server crashes around request 50-100
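Since the report says the sparse task is affected as well, the same loop can be pointed at token_classify; swapping the task field is the only change. A single-request form of that variant (the curl failure is ignored so the snippet is safe to run without a live server):

```shell
# Sparse (token_classify) variant of the request body; per the report,
# it crashes the same way after repeated requests.
payload='{"model": "BAAI/bge-m3", "task": "token_classify", "input": ["test sentence"]}'

# Fire one request; discard the result if the server is not running.
curl -s --max-time 5 http://localhost:9001/pooling \
  -H "Content-Type: application/json" \
  -d "$payload" > /dev/null || true
```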
Error Log
(EngineCore_DP0 pid=405) File ".../vllm/model_executor/layers/pooler/tokwise/methods.py", line 50, in forward
(EngineCore_DP0 pid=405) hidden_states_all = hidden_states.split(
(EngineCore_DP0 pid=405) ^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=405) RuntimeError: split_with_sizes expects split_sizes to sum exactly to 4192 (input tensor's size at dimension 0), but got split_sizes=[74, 43, 230, 510, 646, 348, 205, 77]
(APIServer pid=1) ERROR [async_llm.py:708] AsyncLLM output_handler failed.
(APIServer pid=1) ERROR [async_llm.py:708] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue.
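The numbers in the traceback show the proximate failure directly: the per-request split_sizes sum to far less than the flattened hidden_states length. A quick check on the values from the log above (whether stale accumulated pooling metadata is the root cause is a guess on my part, not something the log proves):

```shell
# split_sizes reported in the traceback
sizes="74 43 230 510 646 348 205 77"
total=0
for s in $sizes; do total=$((total + s)); done

# hidden_states size at dim 0, from the same traceback line
expected=4192

# The eight sizes account for only 2133 of the 4192 rows, so the pooler's
# per-request token counts no longer match the batch it was handed.
echo "sum of split_sizes = $total (tensor dim 0 = $expected)"
```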
Behavior
- Single requests to /pooling (both embed and token_classify tasks) work correctly
- After ~50-100 sequential single-item requests, the engine crashes with a split_with_sizes mismatch
- The crash is fatal: the engine enters the EngineDeadError state and all subsequent requests return 500
- Batch requests (multiple items in the input array) crash immediately on the first request
- Input text length does not matter: the crash occurs even with very short texts (~10 tokens)
Workaround
Restart the vLLM container after every ~50 requests. Using deterministic point IDs makes the indexing job resumable after each restart.
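Until this is fixed upstream, the restart can be automated on the client side. A rough sketch; the container name vllm-bge, the 30-second warm-up sleep, and the helper names are assumptions for illustration, not part of vLLM:

```shell
CONTAINER=vllm-bge   # assumed container name

# A dead engine answers every request with a 5xx (per the report).
engine_dead() {
  [ "${1:-000}" -ge 500 ]
}

# Print the HTTP status code for one /pooling request.
send_one() {
  curl -s -o /dev/null -w '%{http_code}' --max-time 10 \
    http://localhost:9001/pooling \
    -H "Content-Type: application/json" \
    -d "{\"model\": \"BAAI/bge-m3\", \"task\": \"embed\", \"input\": [\"$1\"]}"
}

# Index each text; on an engine death, restart the container and retry once.
index_all() {
  for text in "$@"; do
    status=$(send_one "$text") || status=000
    if engine_dead "$status"; then
      docker restart "$CONTAINER" && sleep 30
      send_one "$text" > /dev/null || true
    fi
  done
}
```

Combined with deterministic point IDs, a killed run can simply be re-invoked and will overwrite rather than duplicate already-indexed documents.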
Related