[model] feat: add audio data path and qwen3-omni model support. #6118

SanftMonster wants to merge 13 commits into
Conversation
jaxlliu seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account. You have signed the CLA already but the status is still pending? Let us recheck it.
Code Review
This pull request introduces support for the Qwen3-Omni model family, specifically the 'Thinker' sub-model, and extends the multimodal pipeline to support audio data alongside images and videos. Key changes include a monkey patch for the Qwen3-Omni MoE block to ensure gradient consistency under FSDP2, updates to agent loops and datasets for audio handling, and a new rule-based reward function for OmniInstruct. Review feedback identifies a potential runtime error in the MoE patch when top_k=1 and points out redundant weight normalization logic in the vLLM rollout worker.
```python
# a no-op, so the arithmetic is unchanged.
for expert_idx in range(self.num_experts):
    expert_layer = self.experts[expert_idx]
    idx, top_x = torch.where(expert_mask[expert_idx].squeeze(0))
```
The use of .squeeze(0) here will cause a ValueError during unpacking if top_k is equal to 1. When top_k=1, the tensor becomes 1D after squeezing, and torch.where returns a single tensor instead of the expected two. Removing .squeeze(0) makes the logic robust for any top_k >= 1 as torch.where on a 2D tensor always returns indices for both dimensions.
```diff
-        idx, top_x = torch.where(expert_mask[expert_idx].squeeze(0))
+        idx, top_x = torch.where(expert_mask[expert_idx])
```
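To make the failure concrete, here is a minimal standalone repro (not code from the PR); the mask shapes follow the Mixtral-style routing that the patched block appears to mirror:

```python
import torch

num_experts, top_k, num_tokens = 4, 1, 6  # top_k=1 is the problematic case
selected_experts = torch.randint(num_experts, (num_tokens, top_k))
# One-hot routing mask permuted to (num_experts, top_k, num_tokens),
# as in Mixtral-style MoE forward passes.
expert_mask = torch.nn.functional.one_hot(
    selected_experts, num_classes=num_experts
).permute(2, 1, 0)

# Without squeeze: torch.where on a 2D tensor always returns two index tensors.
idx, top_x = torch.where(expert_mask[0])  # works for any top_k >= 1

# With squeeze and top_k == 1, the mask collapses to 1D, torch.where returns
# a 1-tuple, and the two-name unpacking raises ValueError.
try:
    idx, top_x = torch.where(expert_mask[0].squeeze(0))
except ValueError as err:
    print(f"top_k=1 failure: {err}")
```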
```python
model = self.model_runner.model
if getattr(model, "__class__", type(model)).__name__ == "CUDAGraphWrapper" and hasattr(model, "unwrap"):
    model = model.unwrap()
if model.__class__.__name__ == "Qwen3OmniMoeThinkerForConditionalGeneration":
    weights = [
        (f"thinker.{name}" if not name.startswith("thinker.") else name, tensor) for name, tensor in weights
    ]
self.load_weights(weights)
```
This block is redundant and potentially inconsistent. Weight name normalization for Qwen3OmniMoeThinkerForConditionalGeneration is already handled by the call to self._normalize_weight_names(weights) at line 299. Additionally, the manual unwrapping of CUDAGraphWrapper is duplicated logic that is already encapsulated in self._get_weight_sync_model().
```diff
-        model = self.model_runner.model
-        if getattr(model, "__class__", type(model)).__name__ == "CUDAGraphWrapper" and hasattr(model, "unwrap"):
-            model = model.unwrap()
-        if model.__class__.__name__ == "Qwen3OmniMoeThinkerForConditionalGeneration":
-            weights = [
-                (f"thinker.{name}" if not name.startswith("thinker.") else name, tensor) for name, tensor in weights
-            ]
-        self.load_weights(weights)
+        self.load_weights(weights)
```
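For reference, a minimal sketch of the prefixing rule involved; `normalize_thinker_names` is a hypothetical standalone stand-in for the existing `_normalize_weight_names` helper, written here only to illustrate the duplicated logic:

```python
def normalize_thinker_names(weights):
    # Hypothetical stand-in: HF checkpoints name Thinker parameters without
    # the wrapper prefix, while the vLLM-side
    # Qwen3OmniMoeThinkerForConditionalGeneration expects them under "thinker.".
    return [
        (name if name.startswith("thinker.") else f"thinker.{name}", tensor)
        for name, tensor in weights
    ]

names = ["model.layers.0.mlp.gate.weight", "thinker.audio_tower.proj.weight"]
print([n for n, _ in normalize_thinker_names([(n, None) for n in names])])
# ['thinker.model.layers.0.mlp.gate.weight', 'thinker.audio_tower.proj.weight']
```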
Generative RL has been moved to verl-project/verl-omni |
Co-authored-by: OpenAI Codex <codex@openai.com>
@wuxibin89 OK, what about splitting this PR into two pieces? One is audio-data support in verl, the other is qwen3-omni thinker support in verl-omni.
What does this PR do?
Add audio-input support and Qwen3-Omni Thinker model support to verl.
This PR extends the multimodal data path from image/video-only to image/video/audio, including dataset parsing, processor kwargs propagation, agent/teacher loop plumbing, rollout server request handling, and Qwen3-Omni mRoPE position-id construction. It also adds Qwen3-Omni Thinker-specific compatibility patches for transformers/vLLM weight loading, multimodal placeholder deduplication, and FSDP input casting so audio rollout and actor forward paths stay aligned.
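For orientation, a hedged sketch of what one hop of this audio data path might look like; the loader, placeholder token, and request keys below are illustrative assumptions, not verl's exact schema:

```python
import librosa  # stand-in decoder; verl's dataset parsing may differ

# Chat-style sample carrying an audio segment, following the Qwen-Omni
# content format used by the processor.
messages = [{
    "role": "user",
    "content": [
        {"type": "audio", "audio": "sample.wav"},
        {"type": "text", "text": "What is being said in this clip?"},
    ],
}]

# Dataset parsing: decode the clip at the feature extractor's sampling rate
# (16 kHz is typical for Qwen audio towers) so it can travel through the
# processor kwargs mentioned above.
waveform, sampling_rate = librosa.load("sample.wav", sr=16000)

# Rollout server request: a vLLM-style multimodal payload. Key names are
# assumptions for illustration.
request = {
    "prompt": "...",  # templated text containing the model's audio placeholder
    "multi_modal_data": {"audio": [(waveform, sampling_rate)]},
}
```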
Regression coverage is added for audio processor inputs, Qwen3-Omni multimodal fields, nested 3D position_ids, and padding conversion behavior.

Verification
As evidence of implementation correctness, I ran GSPO training of the qwen3-omni-30B-A3B-instruct thinker, using DailyOmni as the validation set and omni-instruct as the training set.
First, the validation results show metrics similar to those in the Qwen3-Omni tech report.
Second, the rollout-trainer logprobs diff stays within a reasonable range, as shown below.
Finally, the actor and reward metrics look healthy over the 120-step training run, as shown below.
Checklist Before Starting
- Format the PR title as `[{modules}] {type}: {description}` (This will be checked by the CI)
  - `{modules}` include `fsdp`, `megatron`, `veomni`, `sglang`, `vllm`, `vllm_omni`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`, `cfg`, `reward`, `fully_async`, `one_step_off`
  - If this PR involves multiple modules, separate them with `,`, like `[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
  - If this PR breaks any API, prepend `[BREAKING]` to the beginning of the title
  - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

Test
API and Usage Example
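The PR leaves this section as a template placeholder; below is a hedged illustration of preparing a training record with an audio field. Column names (`prompt`, `data_source`, `reward_model`) follow common verl parquet layouts but are assumptions here, not a confirmed schema for this PR:

```python
from datasets import Dataset

# One illustrative training record; the audio segment sits next to the text
# inside the chat-format prompt. Schema is an assumption, not confirmed.
row = {
    "prompt": [{
        "role": "user",
        "content": [
            {"type": "audio", "audio": "clips/0001.wav"},
            {"type": "text", "text": "Answer the question asked in the clip."},
        ],
    }],
    "data_source": "omni-instruct",
    "reward_model": {"style": "rule", "ground_truth": "..."},
}

Dataset.from_list([row]).to_parquet("train_audio.parquet")
```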
Design & Code Changes
Checklist Before Submitting
Important
Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
- Apply pre-commit checks: `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- Once your PR is ready for CI, send a message in the `ci-request` channel in the `verl` Slack workspace. (If not accessible, please try the Feishu group (飞书群).)
- If your PR modifies the `recipe` submodule, please also update the reference to the submodule commit via `git submodule update --remote` or `cd recipe && git pull origin main`.