[model] fix: Align Qwen VLM Ulysses + fused-kernel paths and harden multimodal position-id / vision-embed handling#5948

Open
jamindy wants to merge 3 commits into verl-project:main from jamindy:fix_qwen3_vl_vision_embedding

Conversation

@jamindy

@jamindy jamindy commented Apr 9, 2026

What does this PR do?

This PR fixes multiple multimodal runtime issues across qwen2_vl, qwen3_vl, and qwen3_5 under Ulysses sequence parallelism + fused kernels.

It includes:

  • fixing Qwen3-VL vision position-embedding behavior under FSDP2 + multimodal execution
  • refactoring shared bilinear vision position-embedding interpolation logic for Qwen3.5 and Qwen3-VL into a common helper
  • fixing Ulysses SP + fused-kernel label alignment errors in VLM forward paths
  • removing hard-coded rope_dim=4 assumptions and repairing broken nested position_ids layout handling
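As context for the interpolation refactor: the PR's shared helper is not reproduced here, but a minimal sketch of bilinearly resizing a learned `(grid*grid, dim)` vision position embedding to a new patch grid (function name and shapes are illustrative, not verl's actual API) could look like:

```python
import torch
import torch.nn.functional as F

def interpolate_pos_embed(pos_embed: torch.Tensor, h: int, w: int) -> torch.Tensor:
    """Bilinearly resize a (grid*grid, dim) position embedding to (h*w, dim)."""
    g = int(pos_embed.shape[0] ** 0.5)  # assume a square source grid
    # (grid*grid, dim) -> (1, dim, g, g) so F.interpolate treats it as an image
    grid = pos_embed.reshape(1, g, g, -1).permute(0, 3, 1, 2)
    grid = F.interpolate(grid, size=(h, w), mode="bilinear", align_corners=False)
    # back to (h*w, dim) token-major layout
    return grid.permute(0, 2, 3, 1).reshape(h * w, -1)
```

Centralizing this in one helper avoids the Qwen3.5 / Qwen3-VL copies drifting apart.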

Previously observed errors:

  1. aten.index_select.default got mixed torch.Tensor and DTensor
  2. Expected all tensors to be on the same device, but got index is on cpu, different from other tensors on cuda:0
  3. The size of tensor a (xxx) must match the size of tensor b (4) at non-singleton dimension 2
  4. qwen3_vl / qwen3_5 paths previously failed when SP > 1 in specific fused-kernel multimodal flows

Root cause:
The monkey-patched Qwen VLM interpolation and SP/fused-kernel paths were not fully consistent with:

  • FSDP2 sharded embedding weights (DTensor)
  • device/offload behavior
  • local-label alignment requirements after Ulysses slicing

This led to tensor type/device mismatches and hidden/label shape mismatches.
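The first two error classes (DTensor/plain-tensor mixing and CPU-built indices hitting CUDA weights) suggest a lookup shaped roughly like the following sketch. This is a hypothetical simplification, not verl's patched code: it only assumes that FSDP2-sharded DTensor weights expose `full_tensor()`.

```python
import torch
import torch.nn.functional as F

def safe_pos_embed_lookup(weight: torch.Tensor, idx: torch.Tensor) -> torch.Tensor:
    # DTensor weights can't be mixed with plain tensors in index ops;
    # materialize the full embedding table first.
    if hasattr(weight, "full_tensor"):
        weight = weight.full_tensor()
    # Indices built on CPU (e.g. from grid metadata) must be moved to the
    # weight's device before the lookup, or indexing raises a device mismatch.
    idx = idx.to(weight.device)
    return F.embedding(idx, weight)
```

With both guards in place, the same code path works for sharded, offloaded, and plain-module execution.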

Related issues

Validation

On the latest verl with Transformers 5.5.0, we validated:

  • qwen2_vl
  • qwen3_vl
  • qwen3_5

All tests passed with Ulysses sequence parallelism and fused kernels enabled.

Test script (example; `impl_backend` may be set to either `torch` or `triton`)

python3 -m verl.trainer.main_ppo \
  algorithm.adv_estimator=grpo \
  data.train_files=$train_files \
  data.val_files=$test_files \
  data.train_batch_size=4 \
  data.max_prompt_length=1024 \
  data.max_response_length=2048 \
  data.image_key=images \
  data.truncation=error \
  actor_rollout_ref.model.path=/models/Qwen3-VL-4B-Instruct \
  actor_rollout_ref.model.use_remove_padding=True \
  actor_rollout_ref.model.use_fused_kernels=True \
  actor_rollout_ref.model.fused_kernel_options.impl_backend=torch \
  actor_rollout_ref.actor.strategy=fsdp2 \
  actor_rollout_ref.actor.fsdp_config.model_dtype=bfloat16 \
  actor_rollout_ref.actor.fsdp_config.param_offload=True \
  actor_rollout_ref.actor.fsdp_config.optimizer_offload=True \
  actor_rollout_ref.actor.ulysses_sequence_parallel_size=2 \
  actor_rollout_ref.rollout.name=vllm \
  trainer.n_gpus_per_node=4 \
  trainer.nnodes=1 \
  trainer.total_epochs=1


Checklist Before Starting

  • Search for similar PRs. Paste at least one query link here: ...
  • Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
    • {modules} include fsdp, megatron, veomni, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data, cfg, reward, fully_async, one_step_off
    • If this PR involves multiple modules, separate them with , like [megatron, fsdp, doc]
    • {type} is in feat, fix, refactor, chore, test
    • If this PR breaks any API (CLI arguments, config, function signature, etc.), add [BREAKING] to the beginning of the title.
    • Example: [BREAKING][fsdp, megatron] feat: dynamic batching

Test

For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

API and Usage Example

Demonstrate how the API changes if any, and provide usage example(s) if possible.

# Add code snippet or script demonstrating how to use this

Design & Code Changes

Demonstrate the high-level design if this PR is complex, and list the specific changes.

Checklist Before Submitting

Important

Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

Fixes multimodal runtime issues for `qwen2_vl`, `qwen3_vl`, and `qwen3_5` under `FSDP2 + Ulysses SP + fused kernels`.

Key fixes:
- Align fused-kernel label handling after Ulysses slicing (avoid hidden/label mismatch).
- Fix Qwen3-VL vision position-embedding path for `DTensor`/offload cases.
- Refactor shared bilinear vision pos-embed interpolation into a common helper.
- Remove hard-coded `rope_dim=4` assumptions and fix broken nested `position_ids` layout recovery.
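The label-alignment fix comes down to slicing labels with the same per-rank offsets that Ulysses uses for hidden states. A simplified sketch of the shard arithmetic (not verl's actual helper; it assumes contiguous shards after padding the sequence to a multiple of the SP size):

```python
def ulysses_local_slice(seq_len: int, sp_size: int, sp_rank: int) -> tuple:
    """Return (start, end) of this rank's contiguous shard of a sequence
    padded up to a multiple of sp_size. Labels must be sliced with the
    same offsets as hidden states, or the loss sees mismatched shapes."""
    pad = (-seq_len) % sp_size          # padding needed to divide evenly
    shard = (seq_len + pad) // sp_size  # per-rank shard length
    start = sp_rank * shard
    end = min(start + shard, seq_len)   # last rank may hold only padding tail
    return start, end
```

If labels are rolled or sliced before this per-rank split while hidden states are sliced after (or vice versa), the fused loss kernel sees a hidden/label length mismatch, which matches error 3 above.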
@CLAassistant

CLAassistant commented Apr 9, 2026

CLA assistant check
All committers have signed the CLA.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces significant updates to support Qwen3-VL models within the verl framework. Key changes include the implementation of a fast bilinear interpolation utility for vision position embeddings, monkey patching for device-safe position embedding lookups, and improved handling of 3D position IDs and sequence parallelism for vision-language models. Additionally, the PR refines loss calculation logic by removing unnecessary label rolling and adds robust input handling for multimodal sequence parallelism. I have no feedback to provide as all changes appear to be well-structured and address the stated requirements.

@Kirrito-k423 Kirrito-k423 mentioned this pull request Apr 10, 2026
@jamindy jamindy closed this Apr 14, 2026
@jamindy jamindy deleted the fix_qwen3_vl_vision_embedding branch April 14, 2026 03:50
@longboat2010

@jamindy Why was this PR closed? How can these issues be solved?

@jamindy jamindy restored the fix_qwen3_vl_vision_embedding branch April 14, 2026 09:38
@jamindy jamindy reopened this Apr 14, 2026
@jamindy jamindy requested a review from wucong25 as a code owner April 14, 2026 09:50
- **Qwen3.5 vision patch alignment**
  - Aligns Qwen3.5 `fast_pos_embed_interpolate` patch behavior with Qwen3-VL
  - Adds DTensor/sharded-weight-safe embedding lookup (`full_tensor()` + device-local `F.embedding`)
  - Switches monkey patch wiring to a dedicated, idempotent patch entrypoint
- **3D nested `position_ids` robustness**
  - Improves `maybe_fix_3d_position_ids` repair logic for broken jagged 3D layouts after TensorDict serialize/deserialize flows
  - Uses `input_ids` offsets as repair targets when valid; otherwise keeps safe fallback behavior
- **Tests**
  - Adds focused unit coverage in `tests/utils/test_padding_on_cpu.py` for:
    - successful repair
    - invalid-offset skip paths
    - warning path for invalid target offsets
    - empty-batch behavior
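For a text-only sample, the repaired mrope rows are fully determined by the attention mask, which is why `input_ids` offsets can serve as repair targets. A hypothetical sketch (not the `maybe_fix_3d_position_ids` implementation; it assumes a 3-row t/h/w mrope layout and left padding):

```python
import torch

def rebuild_text_position_ids(attention_mask: torch.Tensor) -> torch.Tensor:
    """Rebuild (3, bsz, seq) text-only mrope position ids from a left-padded
    (bsz, seq) attention mask: padded slots get 0, valid slots 0..n_valid-1."""
    pos = (attention_mask.long().cumsum(-1) - 1).clamp(min=0)
    pos = pos.masked_fill(attention_mask == 0, 0)
    # text tokens share identical temporal/height/width rows
    return pos.unsqueeze(0).expand(3, -1, -1)
```

When the deserialized jagged layout disagrees with these offsets, rebuilding from the mask gives a consistent target; when offsets are invalid, the PR keeps the safe fallback path instead.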
