[Bugfix] RoBERTa position_id accumulation in CUDA graph padding region#37873
Conversation
Position-ID accumulation: RoBERTa-based models (BGE-M3, XLM-RoBERTa) call `replace_roberta_positions` outside the CUDA graph, which does an in-place `position_ids += (padding_idx + 1)` on the full padded positions tensor. `copy_to_gpu` only refreshes the first `num_scheduled_tokens` entries, leaving the padding slots `[num_scheduled_tokens:num_input_tokens]` with stale values from the previous forward pass. Each request accumulates another offset in those slots until the value exceeds `max_position_embeddings` and triggers a device-side assertion. Fix: zero out the stale padding region in `_preprocess` before the model is called, so the offset always starts from 0 in those slots. Signed-off-by: dass90 <3053034939@qq.com>
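A minimal sketch of the failure mode and the fix, using a plain CPU tensor and the names from the description above (`num_scheduled_tokens`, `padding_idx`); the real model-runner buffers and call sites differ.

```python
import torch

# Persistent positions buffer reused across requests (padded to the CUDA-graph size).
positions = torch.zeros(8, dtype=torch.long)
padding_idx = 1  # RoBERTa padding index -> per-request offset of padding_idx + 1 == 2

def run_request(num_scheduled_tokens: int, zero_padding: bool) -> None:
    # copy_to_gpu refreshes only the first num_scheduled_tokens entries.
    positions[:num_scheduled_tokens] = torch.arange(num_scheduled_tokens)
    if zero_padding:
        # The fix: clear the stale padding slots before the model runs.
        positions[num_scheduled_tokens:] = 0
    # replace_roberta_positions then shifts the *entire* padded tensor in place.
    positions += padding_idx + 1

for _ in range(3):
    run_request(6, zero_padding=False)
print(positions)  # padding slots have accumulated to 6 after three requests

positions.zero_()
for _ in range(3):
    run_request(6, zero_padding=True)
print(positions)  # padding slots stay at 2 on every request
```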
Force-pushed from d882a74 to 24192be.
Code Review
The pull request fixes a critical bug causing crashes in RoBERTa-based models when CUDA graphs are enabled. The root cause is an accumulation of values in the padding region of the positions buffer across requests, leading to an out-of-bounds access. The fix correctly addresses this by explicitly zeroing out the padding region of the buffer for each request. This change is logical, well-targeted, and effectively prevents the bug as described. The provided context and test plan are thorough, confirming the correctness of the solution.
cc @noooop

cc @Isotr0py
Isotr0py left a comment:
Thanks! Looks reasonable to me.
But it seems #37884 would be a better fix, which doesn't touch the model runner?

#37873 (this PR) protects against this class of bug for models which use …
Fix a crash that affected all RoBERTa-based embedding models (BAAI/bge-m3, XLM-RoBERTa, stsb-roberta, bge-reranker-v2-m3) when CUDA graphs are enabled. After approximately `max_position_embeddings / 2` requests the server crashes with:

`Assertion 'index out of bounds: 0 <= tmp25 < 8194' failed.`

`gpu_model_runner` keeps a persistent GPU buffer `self.positions` that is reused across every request. Each request refreshes only the first `num_scheduled_tokens` entries via `copy_to_gpu`; the remaining padding slots `[num_scheduled_tokens : num_input_tokens_padded]` are not reset. RoBERTa-based models call `replace_roberta_positions` outside the CUDA graph (before `BertModel.forward`), which does an in-place `position_ids += padding_idx + 1` on the full padded tensor, including the stale padding slots. Because those slots are never reset by `copy_to_gpu`, each request adds another `+(padding_idx + 1)` to them.

For BAAI/bge-m3 (`max_position_embeddings = 8194`, `padding_idx = 1`, offset = 2) with short sentences (6 tokens) padded to 8, the padding slots grow by 2 on every request until they index past the embedding table (see the sketch below).

Fixes #37868
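A back-of-the-envelope check of the reported failure point (plain arithmetic, not vLLM code), using the numbers above:

```python
# Each request adds padding_idx + 1 == 2 to every stale padding slot.
max_position_embeddings = 8194
offset_per_request = 1 + 1  # padding_idx + 1 for BAAI/bge-m3
requests_until_assert = max_position_embeddings // offset_per_request
print(requests_until_assert)  # 4097 -> roughly max_position_embeddings / 2 requests
```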
Related #37648 #37868
Purpose
Test Plan
Reproduce the bug
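A hypothetical reproduction loop against a locally served BAAI/bge-m3 instance (the endpoint path, payload shape, and request count are illustrative assumptions; before the fix the server eventually hits the device-side assertion):

```python
import requests

# Assumes a local server started with something like: vllm serve BAAI/bge-m3
URL = "http://localhost:8000/v1/embeddings"

for i in range(5000):  # roughly max_position_embeddings / 2 short requests
    resp = requests.post(
        URL,
        json={"model": "BAAI/bge-m3", "input": "a short six token sentence"},
    )
    resp.raise_for_status()
```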
Existing tests
pytest tests/models/language/pooling/test_bge_m3.py -v -s
pytest tests/models/language/pooling/test_embedding.py -v -s -k "stsb-roberta"
pytest tests/models/language/pooling/test_multi_vector_retrieval.py -v -s
pytest tests/models/language/pooling/test_scoring.py -v -s

Test Result
Tested on RTX 5090 (Blackwell, sm_120), vLLM 0.18.1rc1.dev38+ga16133a0f:
- `/pooling` requests
- `test_bge_m3.py` (dense / sparse / ColBERT scores)
- `test_embedding.py` [stsb-roberta-base-v2]
- `test_multi_vector_retrieval.py`
- `test_scoring.py` (bge-reranker-v2-m3)

Essential Elements of an Effective PR Description Checklist
`supported_models.md` and `examples` for a new model.