
[Bugfix] RoBERTa position_id accumulation in CUDA graph padding region#37873

Merged
Isotr0py merged 3 commits into vllm-project:main from yanghui1-arch:fix/roberta-position-accumulation-cuda-graph
Mar 23, 2026

Conversation

@yanghui1-arch
Contributor

@yanghui1-arch yanghui1-arch commented Mar 23, 2026

Fix a crash that affected all RoBERTa-based embedding models (BAAI/bge-m3, XLM-RoBERTa, stsb-roberta, bge-reranker-v2-m3) when CUDA graphs are enabled. After approximately max_position_embeddings / 2 requests the server crashes with: Assertion 'index out of bounds: 0 <= tmp25 < 8194' failed.

  • gpu_model_runner keeps a persistent GPU buffer self.positions that is reused across every request. Each request refreshes only the first num_scheduled_tokens entries via copy_to_gpu; the remaining padding slots [num_scheduled_tokens : num_input_tokens_padded] are not reset.
  • RoBERTa models call replace_roberta_positions outside the CUDA graph (before BertModel.forward), which does an in-place position_ids += padding_idx + 1 on the full padded tensor, including the stale padding slots. Because those slots are never reset by copy_to_gpu, each request adds another +(padding_idx + 1) to them.

For BAAI/bge-m3 (max_position_embeddings = 8194, padding_idx = 1, offset = 2) with short sentences (6 tokens) padded to 8:

padding slot value after K requests  =  V_init + 2K
overflow when            V_init + 2K >= 8194
K                                    =  (8194 - V_init) / 2  ≈  3999
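The accumulation can be reproduced without a GPU. The toy sketch below (pure Python; buffer size, token counts, and names are illustrative, not vLLM's actual code) models the persistent positions buffer whose padding slots are never refreshed, with the in-place `+= (padding_idx + 1)` applied to the full padded tensor on every request:

```python
PADDING_IDX = 1          # BAAI/bge-m3 uses padding_idx = 1
MAX_POSITIONS = 8194     # bge-m3 max_position_embeddings

positions = [0] * 8      # persistent buffer: 6 real tokens padded to 8

def run_request() -> bool:
    # copy_to_gpu analogue: only the first 6 slots are refreshed.
    positions[:6] = list(range(6))
    # replace_roberta_positions analogue: in-place offset on the
    # full padded tensor, stale padding slots included.
    for i in range(8):
        positions[i] += PADDING_IDX + 1
    # The embedding lookup indexes with these values.
    return all(p < MAX_POSITIONS for p in positions)

k = 1
while run_request():
    k += 1
print(k)  # request index at which the lookup goes out of bounds
```

With a clean initial value (V_init = 0) this prints 4097; on a real server the padding slots already hold nonzero stale values from earlier padded forwards, which is why the observed crash lands near request 3999.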

Fixes #37868

Related #37648 #37868

Purpose

Test Plan

Reproduce the bug

CUDA_LAUNCH_BLOCKING=1 vllm serve BAAI/bge-m3 --port 9001 \
  --hf-overrides '{"architectures": ["BgeM3EmbeddingModel"]}' \
  --runner pooling

# Send 10000 sequential requests
for i in $(seq 1 10000); do
  response=$(curl -s -o /dev/null -w "%{http_code}" http://localhost:9001/pooling \
    -H "Content-Type: application/json" \
    -d '{"model": "BAAI/bge-m3", "task": "embed",
         "input": ["test sentence number '$i'"]}')
  if [ "$response" != "200" ]; then
    echo "FAILED at index $i (HTTP $response)"; break
  fi
done

Existing tests

pytest tests/models/language/pooling/test_bge_m3.py -v -s
pytest tests/models/language/pooling/test_embedding.py -v -s -k "stsb-roberta"
pytest tests/models/language/pooling/test_multi_vector_retrieval.py -v -s
pytest tests/models/language/pooling/test_scoring.py -v -s

Test Result

Tested on RTX 5090 (Blackwell, sm_120), vLLM 0.18.1rc1.dev38+ga16133a0f:

Test                                               Before fix             After fix
10000 sequential BGE-M3 /pooling requests          Crash at index ~3999   All pass
test_bge_m3.py (dense / sparse / ColBERT scores)   Pass                   Pass
test_embedding.py[stsb-roberta-base-v2]            Pass                   Pass
test_multi_vector_retrieval.py                     Pass                   Pass
test_scoring.py (bge-reranker-v2-m3)               Pass                   Pass

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@yanghui1-arch yanghui1-arch requested a review from njhill as a code owner March 23, 2026 09:25
@mergify mergify Bot added the nvidia, v1, and bug (Something isn't working) labels Mar 23, 2026
…ition-ID accumulation

RoBERTa-based models (BGE-M3, XLM-RoBERTa) call
replace_roberta_positions outside the CUDA graph, which does an
in-place position_ids += (padding_idx + 1) on the full padded
positions tensor. copy_to_gpu only refreshes the first
num_scheduled_tokens entries, leaving the padding slots
[num_scheduled_tokens:num_input_tokens] with stale values from the
previous forward pass. Each request accumulates another offset in
those slots until the value exceeds max_position_embeddings and
triggers a device-side assertion.

Fix: zero out the stale padding region in _preprocess before the
model is called so the offset always starts from 0 in those slots.

Signed-off-by: dass90 <3053034939@qq.com>
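The zeroing approach described in the commit message can be sketched as follows (pure Python; the class, method names, and buffer sizes are hypothetical stand-ins for the model runner's `_preprocess` / forward split, not vLLM's actual implementation):

```python
PADDING_IDX = 1          # BAAI/bge-m3 uses padding_idx = 1
MAX_POSITIONS = 8194     # bge-m3 max_position_embeddings

class ToyModelRunner:
    def __init__(self, buffer_size: int):
        # Persistent buffer reused across every request.
        self.positions = [0] * buffer_size

    def _preprocess(self, new_positions, num_padded: int):
        n = len(new_positions)
        # copy_to_gpu analogue: refresh only the scheduled tokens.
        self.positions[:n] = new_positions
        # The fix: reset the padding region so each request's
        # in-place offset starts from 0 in those slots.
        for i in range(n, num_padded):
            self.positions[i] = 0

    def forward(self, num_padded: int):
        # replace_roberta_positions analogue: in-place offset on the
        # full padded tensor, padding slots included.
        for i in range(num_padded):
            self.positions[i] += PADDING_IDX + 1
        assert all(p < MAX_POSITIONS for p in self.positions[:num_padded])

runner = ToyModelRunner(buffer_size=8)
for _ in range(10_000):
    runner._preprocess(list(range(6)), num_padded=8)
    runner.forward(num_padded=8)
```

With the reset in place, the padding slots carry a single offset of 2 after every request instead of an ever-growing sum, so the device-side assertion can never trip regardless of how many requests are served.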
@yanghui1-arch yanghui1-arch force-pushed the fix/roberta-position-accumulation-cuda-graph branch from d882a74 to 24192be Compare March 23, 2026 09:28
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

The pull request fixes a critical bug causing crashes in RoBERTa-based models when CUDA graphs are enabled. The root cause is an accumulation of values in the padding region of the positions buffer across requests, leading to an out-of-bounds access. The fix correctly addresses this by explicitly zeroing out the padding region of the buffer for each request. This change is logical, well-targeted, and effectively prevents the bug as described. The provided context and test plan are thorough, confirming the correctness of the solution.

@yanghui1-arch
Contributor Author

yanghui1-arch commented Mar 23, 2026

cc @noooop

@yanghui1-arch yanghui1-arch changed the title [Bugfix] RoBERTa position-ID accumulation in CUDA graph padding region [Bugfix] RoBERTa position_id accumulation in CUDA graph padding region Mar 23, 2026
@noooop noooop requested a review from Isotr0py March 23, 2026 10:46
@noooop
Collaborator

noooop commented Mar 23, 2026

cc @Isotr0py

Member

@Isotr0py Isotr0py left a comment


Thanks! Looks reasonable to me.

@github-project-automation github-project-automation Bot moved this to Ready in NVIDIA Mar 23, 2026
@Isotr0py Isotr0py enabled auto-merge (squash) March 23, 2026 11:47
@github-actions github-actions Bot added the ready ONLY add when PR is ready to merge/full CI is needed label Mar 23, 2026
@Isotr0py Isotr0py disabled auto-merge March 23, 2026 11:51
@Isotr0py
Member

But it seems #37884 would be a better fix, since it doesn't touch the model runner?

@yanghui1-arch
Contributor Author

But it seems #37884 would be a better fix, since it doesn't touch the model runner?

#37873 (this PR) protects against this whole class of bug: any model that applies positions += offset in place on the padded tensor is covered. #37884 can't prevent a future model that uses positions += offset from reintroducing the same issue, so it's possible the bug would reappear with new models.

@Isotr0py Isotr0py enabled auto-merge (squash) March 23, 2026 13:00
@Isotr0py Isotr0py merged commit 7151ae6 into vllm-project:main Mar 23, 2026
51 checks passed
@github-project-automation github-project-automation Bot moved this from Ready to Done in NVIDIA Mar 23, 2026
khairulkabir1661 pushed a commit to khairulkabir1661/vllm that referenced this pull request Mar 26, 2026
RhizoNymph pushed a commit to RhizoNymph/vllm that referenced this pull request Mar 26, 2026
khairulkabir1661 pushed a commit to khairulkabir1661/vllm that referenced this pull request Mar 27, 2026
nithinvc pushed a commit to nithinvc/vllm that referenced this pull request Mar 27, 2026
JiantaoXu pushed a commit to JiantaoXu/vllm that referenced this pull request Mar 28, 2026
mtparet pushed a commit to blackfuel-ai/vllm that referenced this pull request Apr 9, 2026

Labels

bug (Something isn't working), nvidia, ready (ONLY add when PR is ready to merge/full CI is needed), v1

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

[Bug]: bge-m3 /pooling endpoint breaks in the latest main branch

3 participants