[model] fix: Align Qwen VLM Ulysses + fused-kernel paths and harden multimodal position-id / vision-embed handling (#5948)
Open
jamindy wants to merge 3 commits into verl-project:main from
Conversation
Fixes multimodal runtime issues for `qwen2_vl`, `qwen3_vl`, and `qwen3_5` under FSDP2 + Ulysses SP + fused kernels. Key fixes:

- Align fused-kernel label handling after Ulysses slicing (avoid hidden/label mismatch).
- Fix the Qwen3-VL vision position-embedding path for `DTensor`/offload cases.
- Refactor shared bilinear vision pos-embed interpolation into a common helper.
- Remove hard-coded `rope_dim=4` assumptions and fix broken nested `position_ids` layout recovery.
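The fused-kernel label alignment mentioned above can be sketched as follows. This is a minimal illustration, not verl's actual implementation; `sp_rank`, `sp_size`, and the `-100` ignore index are assumptions:

```python
# Sketch: align shifted labels with the local Ulysses shard so the
# fused linear-cross-entropy kernel sees matching hidden/label shapes.
import torch

def slice_labels_for_ulysses(labels: torch.Tensor, sp_rank: int, sp_size: int) -> torch.Tensor:
    """Shift labels left by one (next-token prediction), then take the
    slice matching this rank's hidden-state shard."""
    # Shift once globally BEFORE slicing; shifting per-shard would drop
    # one label at every shard boundary and misalign hidden states.
    shifted = torch.roll(labels, shifts=-1, dims=-1)
    shifted[..., -1] = -100  # ignore_index for the final position
    seq_len = shifted.size(-1)
    assert seq_len % sp_size == 0, "Ulysses requires the padded length to divide evenly"
    chunk = seq_len // sp_size
    return shifted[..., sp_rank * chunk : (sp_rank + 1) * chunk]
```

The key invariant is that concatenating the per-rank slices reproduces the globally shifted labels, so the summed per-shard loss equals the full-sequence loss.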
Contributor
Code Review
This pull request introduces significant updates to support Qwen3-VL models within the verl framework. Key changes include the implementation of a fast bilinear interpolation utility for vision position embeddings, monkey patching for device-safe position embedding lookups, and improved handling of 3D position IDs and sequence parallelism for vision-language models. Additionally, the PR refines loss calculation logic by removing unnecessary label rolling and adds robust input handling for multimodal sequence parallelism. I have no feedback to provide as all changes appear to be well-structured and address the stated requirements.
@jamindy Why was this PR closed? How can these issues be solved?
- **Qwen3.5 vision patch alignment**
- Aligns Qwen3.5 `fast_pos_embed_interpolate` patch behavior with Qwen3-VL
- Adds DTensor/sharded-weight-safe embedding lookup (`full_tensor()` + device-local `F.embedding`)
- Switches monkey patch wiring to a dedicated, idempotent patch entrypoint
- **3D nested `position_ids` robustness**
- Improves `maybe_fix_3d_position_ids` repair logic for broken jagged 3D layouts after TensorDict serialize/deserialize flows
- Uses `input_ids` offsets as repair targets when valid; otherwise keeps safe fallback behavior
- **Tests**
- Adds focused unit coverage in `tests/utils/test_padding_on_cpu.py` for:
- successful repair
- invalid-offset skip paths
- warning path for invalid target offsets
- empty-batch behavior
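The `position_ids` repair described above might look roughly like this. The shapes and helper name are assumptions for illustration; verl's actual `maybe_fix_3d_position_ids` may differ:

```python
# Sketch: recover a (3, batch, seq) mRoPE position_ids layout that a
# TensorDict serialize/deserialize round-trip flattened or transposed.
import torch

def maybe_fix_3d_position_ids_sketch(position_ids: torch.Tensor,
                                     input_ids: torch.Tensor) -> torch.Tensor:
    bsz, seq = input_ids.shape
    if position_ids.shape == (3, bsz, seq):
        return position_ids  # already the expected nested layout
    if position_ids.shape == (bsz, 3, seq):
        # Batch-major layout after deserialization: move rope dim first.
        return position_ids.permute(1, 0, 2).contiguous()
    if position_ids.shape == (bsz, 3 * seq):
        # Fully flattened per sample: unflatten, then move rope dim first.
        return position_ids.view(bsz, 3, seq).permute(1, 0, 2).contiguous()
    # Unknown layout: safe fallback built from input_ids offsets only
    # (plain 0..seq-1 positions broadcast over the three rope sections).
    flat = torch.arange(seq, device=input_ids.device).expand(bsz, seq)
    return flat.unsqueeze(0).expand(3, bsz, seq).contiguous()
```

The fallback mirrors the "invalid-offset skip path" tested above: when the serialized layout cannot be trusted, the repair degrades to valid 1-D positions rather than propagating a jagged tensor.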
What does this PR do?
This PR fixes multiple multimodal runtime issues across `qwen2_vl`, `qwen3_vl`, and `qwen3_5` under Ulysses sequence parallelism + fused kernels. It includes:
- FSDP2 + multimodal execution
- Removing hard-coded `rope_dim=4` assumptions and repairing broken nested `position_ids` layout handling

Previously observed errors:

- `aten.index_select.default got mixed torch.Tensor and DTensor`
- `Expected all tensors to be on the same device, but got index is on cpu, different from other tensors on cuda:0`
- `The size of tensor a (xxx) must match the size of tensor b (4) at non-singleton dimension 2`

`qwen3_vl`/`qwen3_5` paths previously failed when SP > 1 in specific fused-kernel multimodal flows.

Root cause:
The monkey-patched Qwen VLM interpolation and SP/fused-kernel paths were not fully consistent with:
- FSDP2 sharded embedding weights (DTensor)

This led to tensor type/device mismatches and hidden/label shape mismatches.
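A sketch of the DTensor-safe lookup idea behind the shared interpolation helper follows. Function names here are illustrative, not verl's actual code:

```python
# Sketch: device- and DTensor-safe embedding lookup for bilinear
# vision pos-embed interpolation under FSDP2.
import torch
import torch.nn.functional as F

def gather_pos_embed(weight: torch.Tensor, idx: torch.Tensor) -> torch.Tensor:
    """Row lookup that tolerates FSDP2-sharded embedding weights.

    Indexing a DTensor weight with a plain tensor raises
    `aten.index_select.default got mixed torch.Tensor and DTensor`, and a
    CPU index against a CUDA weight raises the device-mismatch error
    quoted above. Materializing the full weight and moving the index to
    its device sidesteps both failure modes.
    """
    if hasattr(weight, "full_tensor"):  # DTensor API in torch.distributed
        weight = weight.full_tensor()
    return F.embedding(idx.to(weight.device), weight)

def bilinear_blend(weight: torch.Tensor,
                   corner_idx: torch.Tensor,
                   corner_w: torch.Tensor) -> torch.Tensor:
    """Blend the four neighbouring grid embeddings per target position.

    corner_idx: (4, N) long indices of the top-left/top-right/
    bottom-left/bottom-right neighbours; corner_w: (4, N) bilinear
    weights summing to 1 along dim 0.
    """
    embeds = gather_pos_embed(weight, corner_idx)        # (4, N, dim)
    return (embeds * corner_w.unsqueeze(-1)).sum(dim=0)  # (N, dim)
```

Materializing the full table costs memory but is only done for the small vision position-embedding grid, not the LM vocabulary.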
Related issues
Validation
On the latest `verl` with Transformers 5.5.0, we validated:

- `qwen2_vl`
- `qwen3_vl`
- `qwen3_5`

All tests passed with Ulysses sequence parallelism and fused kernels enabled.
Test script (example)
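The actual validation script is not included above. Purely as an illustration, a launch under the tested configuration might look like the following; every hydra key here is an assumption and may differ across verl versions and recipes:

```
# Hypothetical example — verify config keys against your verl version.
python3 -m verl.trainer.main_ppo \
    actor_rollout_ref.model.path=Qwen/Qwen2-VL-7B-Instruct \
    actor_rollout_ref.model.use_fused_kernels=True \
    actor_rollout_ref.actor.strategy=fsdp2 \
    actor_rollout_ref.actor.ulysses_sequence_parallel_size=2 \
    trainer.n_gpus_per_node=8
```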
Checklist Before Starting
- The PR title follows `[{modules}] {type}: {description}` (this will be checked by the CI)
  - `{modules}` include `fsdp`, `megatron`, `veomni`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`, `cfg`, `reward`, `fully_async`, `one_step_off`, like `[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
  - If the PR breaks any API, add `[BREAKING]` to the beginning of the title, e.g. `[BREAKING][fsdp, megatron] feat: dynamic batching`

Test
API and Usage Example
`# Add code snippet or script demonstrating how to use this`

Design & Code Changes
Checklist Before Submitting
Important
Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
- Run `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`.
- Request CI via the `ci-request` channel in the `verl` Slack workspace. (If not accessible, please try the Feishu group (飞书群).)
- If the PR changes the `recipe` submodule, also update the reference to the submodule commit via `git submodule update --remote` or `cd recipe && git pull origin main`.