[Bugfix] RoBERTa position_id accumulation in CUDA graph padding region#37873
Conversation
Position-ID accumulation: RoBERTa-based models (BGE-M3, XLM-RoBERTa) call `replace_roberta_positions` outside the CUDA graph, which does an in-place `position_ids += (padding_idx + 1)` on the full padded positions tensor. `copy_to_gpu` only refreshes the first `num_scheduled_tokens` entries, leaving the padding slots `[num_scheduled_tokens:num_input_tokens]` with stale values from the previous forward pass. Each request accumulates another offset in those slots until the value exceeds `max_position_embeddings` and triggers a device-side assertion. Fix: zero out the stale padding region in `_preprocess` before the model is called, so the offset always starts from 0 in those slots. Signed-off-by: dass90 <3053034939@qq.com>
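A minimal sketch of the failure mode and the fix, using a plain CPU tensor and the names from the description above (`num_scheduled_tokens`, `padding_idx`); the real model-runner buffers and call sites differ.

```python
import torch

# Persistent positions buffer reused across requests (padded to the CUDA-graph size).
positions = torch.zeros(8, dtype=torch.long)
padding_idx = 1  # RoBERTa padding index -> per-request offset of padding_idx + 1 == 2

def run_request(num_scheduled_tokens: int, zero_padding: bool) -> None:
    # copy_to_gpu refreshes only the first num_scheduled_tokens entries.
    positions[:num_scheduled_tokens] = torch.arange(num_scheduled_tokens)
    if zero_padding:
        # The fix: clear the stale padding slots before the model runs.
        positions[num_scheduled_tokens:] = 0
    # replace_roberta_positions then shifts the *entire* padded tensor in place.
    positions += padding_idx + 1

for _ in range(3):
    run_request(6, zero_padding=False)
print(positions)  # padding slots have accumulated to 6 after three requests

positions.zero_()
for _ in range(3):
    run_request(6, zero_padding=True)
print(positions)  # padding slots stay at 2 on every request
```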
Force-pushed from d882a74 to 24192be.
Code Review
The pull request fixes a critical bug causing crashes in RoBERTa-based models when CUDA graphs are enabled. The root cause is an accumulation of values in the padding region of the positions buffer across requests, leading to an out-of-bounds access. The fix correctly addresses this by explicitly zeroing out the padding region of the buffer for each request. This change is logical, well-targeted, and effectively prevents the bug as described. The provided context and test plan are thorough, confirming the correctness of the solution.
cc @noooop

cc @Isotr0py
Isotr0py left a comment:
Thanks! Looks reasonable to me.
But it seems #37884 would be a better fix, which doesn't touch the model runner?

#37873 (this PR) protects against this class of bug for models which use …
Fix a crash that affected all RoBERTa-based embedding models (BAAI/bge-m3, XLM-RoBERTa, stsb-roberta, bge-reranker-v2-m3) when CUDA graphs are enabled. After approximately `max_position_embeddings / 2` requests the server crashes with:

`Assertion 'index out of bounds: 0 <= tmp25 < 8194' failed.`

`gpu_model_runner` keeps a persistent GPU buffer `self.positions` that is reused across every request. Each request refreshes only the first `num_scheduled_tokens` entries via `copy_to_gpu`; the remaining padding slots `[num_scheduled_tokens : num_input_tokens_padded]` are not reset. RoBERTa-based models call `replace_roberta_positions` outside the CUDA graph (before `BertModel.forward`), which does an in-place `position_ids += padding_idx + 1` on the full padded tensor, including the stale padding slots. Because those slots are never reset by `copy_to_gpu`, each request adds another `+(padding_idx + 1)` to them.

For BAAI/bge-m3 (`max_position_embeddings = 8194`, `padding_idx = 1`, offset = 2) with short sentences (6 tokens) padded to 8, the padding slots grow by 2 on every request until they index past the embedding table (see the sketch below).

Fixes #37868
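A back-of-the-envelope check of the reported failure point (plain arithmetic, not vLLM code), using the numbers above:

```python
# Each request adds padding_idx + 1 == 2 to every stale padding slot.
max_position_embeddings = 8194
offset_per_request = 1 + 1  # padding_idx + 1 for BAAI/bge-m3
requests_until_assert = max_position_embeddings // offset_per_request
print(requests_until_assert)  # 4097 -> roughly max_position_embeddings / 2 requests
```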
Related #37648 #37868
Purpose
Test Plan
Reproduce the bug
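A hypothetical reproduction loop against a locally served BAAI/bge-m3 instance (the endpoint path, payload shape, and request count are illustrative assumptions; before the fix the server eventually hits the device-side assertion):

```python
import requests

# Assumes a local server started with something like: vllm serve BAAI/bge-m3
URL = "http://localhost:8000/v1/embeddings"

for i in range(5000):  # roughly max_position_embeddings / 2 short requests
    resp = requests.post(
        URL,
        json={"model": "BAAI/bge-m3", "input": "a short six token sentence"},
    )
    resp.raise_for_status()
```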
Existing tests
pytest tests/models/language/pooling/test_bge_m3.py -v -s
pytest tests/models/language/pooling/test_embedding.py -v -s -k "stsb-roberta"
pytest tests/models/language/pooling/test_multi_vector_retrieval.py -v -s
pytest tests/models/language/pooling/test_scoring.py -v -s

Test Result
Tested on RTX 5090 (Blackwell, sm_120), vLLM 0.18.1rc1.dev38+ga16133a0f:
- `/pooling` requests
- `test_bge_m3.py` (dense / sparse / ColBERT scores)
- `test_embedding.py` [stsb-roberta-base-v2]
- `test_multi_vector_retrieval.py`
- `test_scoring.py` (bge-reranker-v2-m3)

Essential Elements of an Effective PR Description Checklist
`supported_models.md` and `examples` for a new model.