BGE-M3 /pooling endpoint crashes with split_with_sizes error after ~50-100 requests #37648

@hslee-lunit

Description

The /pooling endpoint for BGE-M3 (BgeM3EmbeddingModel) crashes after approximately 50-100 sequential requests. Both embed (dense) and token_classify (sparse) tasks are affected. The error kills the engine entirely (EngineDeadError), requiring a full container restart.

Single requests work fine; the crash appears only after the server has processed many requests in sequence.

Environment

  • vLLM versions tested: v0.15.1 and latest (v0.17.x as of 2026-03-20)
  • Model: BAAI/bge-m3 with --hf-overrides '{"architectures": ["BgeM3EmbeddingModel"]}'
  • GPU: NVIDIA B200 (183GB)
  • Docker image: vllm/vllm-openai:v0.15.1 and vllm/vllm-openai:latest

Steps to Reproduce

# Launch
docker run --gpus '"device=0"' -d --network host \
  --entrypoint vllm vllm/vllm-openai:v0.15.1 \
  serve BAAI/bge-m3 --host 0.0.0.0 --port 9001 \
  --trust-remote-code \
  --hf-overrides '{"architectures": ["BgeM3EmbeddingModel"]}'

# Wait for model to load, then send ~100 sequential requests
for i in $(seq 1 100); do
  curl -s http://localhost:9001/pooling \
    -H "Content-Type: application/json" \
    -d '{"model": "BAAI/bge-m3", "task": "embed", "input": ["test sentence number '$i'"]}' > /dev/null
done
# Server crashes around request 50-100
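
The same loop can be driven from Python. The snippet below is a minimal sketch (the endpoint, port, and request body mirror the launch command and curl call above, and are the only assumptions) that alternates between the embed and token_classify tasks, both of which hit the crash:

# Minimal Python reproduction sketch (assumes the server from the launch
# command above is listening on localhost:9001).
import requests

URL = "http://localhost:9001/pooling"

for i in range(1, 101):
    for task in ("embed", "token_classify"):
        resp = requests.post(
            URL,
            json={
                "model": "BAAI/bge-m3",
                "task": task,
                "input": [f"test sentence number {i}"],
            },
            timeout=30,
        )
        # Once the engine dies, every subsequent request returns HTTP 500.
        if resp.status_code != 200:
            print(f"request {i} ({task}) failed: {resp.status_code} {resp.text[:200]}")
            raise SystemExit(1)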

Error Log

(EngineCore_DP0 pid=405)   File ".../vllm/model_executor/layers/pooler/tokwise/methods.py", line 50, in forward
(EngineCore_DP0 pid=405)     hidden_states_all = hidden_states.split(
(EngineCore_DP0 pid=405)                         ^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=405) RuntimeError: split_with_sizes expects split_sizes to sum exactly to 4192 (input tensor's size at dimension 0), but got split_sizes=[74, 43, 230, 510, 646, 348, 205, 77]
(APIServer pid=1) ERROR [async_llm.py:708] AsyncLLM output_handler failed.
(APIServer pid=1) ERROR [async_llm.py:708] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue.
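
The failing line splits the flattened hidden-state tensor back into per-prompt chunks, so the per-prompt lengths the pooler receives must sum exactly to the number of token rows in the tensor; in the log above they sum to 2133 while the tensor has 4192 rows. The toy snippet below only illustrates that invariant and is not the actual vLLM code (the hidden size is illustrative):

# Illustrative sketch of the violated invariant: splitting a flattened
# [num_tokens, hidden] tensor into per-prompt chunks requires the chunk
# sizes to sum exactly to num_tokens.
import torch

hidden_states = torch.randn(4192, 1024)               # 4192 token rows, as in the log
prompt_lens = [74, 43, 230, 510, 646, 348, 205, 77]   # sums to 2133, not 4192

# Raises the same RuntimeError: split_with_sizes expects split_sizes to sum
# exactly to 4192 (input tensor's size at dimension 0).
chunks = hidden_states.split(prompt_lens, dim=0)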

Behavior

  • Single requests to /pooling (both embed and token_classify tasks) work correctly
  • After ~50-100 sequential single-item requests, the engine crashes with split_with_sizes mismatch
  • The crash is fatal — the engine enters EngineDeadError state and all subsequent requests return 500
  • Batch requests (multiple items in the input array) crash immediately on the first request; see the sketch after this list
  • Input text length does not matter — crash occurs even with very short texts (~10 tokens)
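
The batch-input case is the quickest way to trigger the failure. The sketch below (same assumed endpoint as the reproduction script) crashes the engine on the very first call:

# A single batch request (multiple items in "input") kills the engine immediately.
import requests

resp = requests.post(
    "http://localhost:9001/pooling",
    json={
        "model": "BAAI/bge-m3",
        "task": "embed",
        "input": ["first sentence", "second sentence", "third sentence"],
    },
    timeout=30,
)
print(resp.status_code, resp.text[:200])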

Workaround

Restart the vLLM container after every ~50 requests. Because point IDs are deterministic, the indexing job can simply be re-run after each restart and resume without creating duplicates (see the sketch below).
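
The workaround lives entirely on the client side. The snippet below is a hypothetical illustration of what "deterministic point IDs" means here: IDs derived with uuid.uuid5 from a stable document key, so re-running the indexing job after a container restart overwrites earlier points instead of duplicating them (the namespace string and helper name are made up):

# Hypothetical sketch: deterministic point IDs make the indexing run idempotent,
# so it can be restarted from the top after the vLLM container is bounced.
import uuid

NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "bge-m3-index")  # assumed namespace key

def point_id(doc_key: str) -> str:
    # Same document key -> same point ID on every run, so re-indexing after a
    # restart overwrites existing points instead of adding duplicates.
    return str(uuid.uuid5(NAMESPACE, doc_key))

print(point_id("corpus/doc-0001"))  # stable across runs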

