[data, rollout] feat: add audio data support#6276
Conversation
jaxlliu seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account. You have signed the CLA already but the status is still pending? Let us recheck it.
Code Review
This pull request introduces comprehensive support for audio inputs across the agent loop, rollout workers, and dataset processing. It adds configuration options for audio keys and processor-specific arguments (mm_processor_kwargs), alongside utility functions for multimodal input building and token ID resolution. Feedback identifies a potential TypeError in the vLLM rollout server when falling back on unsupported processor arguments and highlights a naming inconsistency between plural and singular keys for multimodal data across different system components.
```python
    prompt = TokensPrompt(**prompt_kwargs)
except TypeError:
    prompt = prompt_kwargs
```
If TokensPrompt fails with a TypeError, it indicates that the current vLLM version does not support mm_processor_kwargs. Falling back to prompt = prompt_kwargs without removing this key will likely cause a subsequent TypeError when the vLLM engine attempts to process the prompt dictionary. It is safer to remove the unsupported key in the except block. This in-place modification avoids an unnecessary defensive copy in an async method.
Suggested change:

```diff
  try:
      prompt = TokensPrompt(**prompt_kwargs)
  except TypeError:
+     prompt_kwargs.pop("mm_processor_kwargs", None)
      prompt = prompt_kwargs
```
References
- In an async method, avoid making an unnecessary defensive copy of a dictionary argument if it is guaranteed that the caller does not reuse the dictionary across concurrent calls.
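The suggested fallback can be sketched as a small standalone example. `TokensPrompt` here is a stand-in class that rejects unknown keyword arguments, mimicking an older vLLM version that does not know `mm_processor_kwargs`; real vLLM's `TokensPrompt` is a TypedDict, so this stub only illustrates the control flow, not vLLM's actual behavior:

```python
# Stand-in for vllm.inputs.TokensPrompt: this stub rejects unknown keyword
# arguments to mimic an older vLLM version without mm_processor_kwargs.
class TokensPrompt(dict):
    _allowed = {"prompt_token_ids", "multi_modal_data"}

    def __init__(self, **kwargs):
        unsupported = set(kwargs) - self._allowed
        if unsupported:
            raise TypeError(f"unexpected keyword arguments: {sorted(unsupported)}")
        super().__init__(**kwargs)


def build_prompt(prompt_kwargs: dict):
    """Prefer TokensPrompt; on TypeError, drop the unsupported key before
    falling back to the plain dict, so the engine never sees it."""
    try:
        return TokensPrompt(**prompt_kwargs)
    except TypeError:
        prompt_kwargs.pop("mm_processor_kwargs", None)
        return prompt_kwargs


prompt = build_prompt(
    {"prompt_token_ids": [1, 2, 3], "mm_processor_kwargs": {"use_audio_in_video": True}}
)
print("mm_processor_kwargs" in prompt)  # False: the key was removed in the fallback
```

The in-place `pop` matches the reviewer's note: since the caller does not reuse `prompt_kwargs` across concurrent calls, no defensive copy is needed.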
```python
    multi_modal_data["images"] = images
if videos is not None:
    multi_modal_data["videos"] = videos
if audios is not None:
    multi_modal_data["audios"] = audios
```
The keys used here ("images", "videos", "audios") are plural, but the rollout server and AsyncRolloutRequest use singular keys ("image", "video", "audio"). This inconsistency can lead to silent failures or missing data when multimodal payloads are passed between the agent loop and the rollout workers. Please unify the naming convention to use singular keys for internal multimodal data dictionaries, matching the rollout server and vLLM conventions.
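One way to enforce the singular convention is a small normalization helper applied at the hand-off boundary. The helper name and mapping below are illustrative, not code from this PR:

```python
# Hypothetical helper (not in the PR): normalize plural multimodal keys
# ("images", "videos", "audios") to the singular form ("image", "video",
# "audio") used by the rollout server and the vLLM request path.
_PLURAL_TO_SINGULAR = {"images": "image", "videos": "video", "audios": "audio"}


def normalize_multi_modal_keys(multi_modal_data: dict) -> dict:
    # Unknown keys pass through unchanged, so the helper is safe to apply
    # at every hand-off between the agent loop and the rollout workers.
    return {_PLURAL_TO_SINGULAR.get(k, k): v for k, v in multi_modal_data.items()}


data = normalize_multi_modal_keys({"images": ["img0"], "audios": ["wav0"]})
print(sorted(data))  # ['audio', 'image']
```

Renaming the keys at the source, as the review asks, avoids needing such a shim at all; the helper only illustrates the mismatch.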
I've just fixed the CI error. Could you please approve the workflows again? @vermouth1992 @wuxibin89
force-pushed from 571a576 to cdd12d6
@vermouth1992 @wuxibin89 I've resolved the conflict, could you please approve the workflow again?
Co-authored-by: OpenAI Codex <codex@openai.com>
force-pushed from cdd12d6 to 11df606
Summary
Split from #6118. This PR adds generic audio data plumbing without adding Qwen3-Omni model-specific behavior:
- Add `data.audio_key` support in `RLHFDataset` and carry audio payloads as `audio_data`
- Thread `audio_data` and `mm_processor_kwargs` through the agent loop, teacher loop, rollout schemas, and vLLM async server request paths
Rebase / CI Update
Rebased onto latest `main` at `d324b014` on 2026-05-12. The merge conflict was in `verl/workers/rollout/llm_server.py`; the resolution keeps the latest `main` rollout-client layout, where fully async resume generation lives in `verl/experimental/fully_async_policy/fully_async_rollouter.py`, while preserving this PR's `audio_data`/`mm_processor_kwargs` propagation through both the generic and fully async LLM server clients.

The old head `571a5760` had GitHub CI failures only in Ascend/self-hosted jobs. The relevant logs pointed to runner/environment/resource failures rather than this PR's audio path:
- `e2e_ascend`: one job could not resolve `setuptools>=61.0` during install; another failed with an HCCL TCPStore wait timeout / `ERR02005 DIST` internal error.
- `e2e_one_step_off_policy_ascend`: the Megatron Ascend job failed with an NPU OOM while allocating a 1 GiB checkpoint-transfer buffer.
- `e2e_ppo_trainer_megatron_vllm_2_ascend`: the FSDP/vLLM Ascend job exited with code 137 and then reported `container not found` during cleanup.

After the latest rebase push, GitHub currently reports the PR as mergeable. No checks have been reported yet for head `11df606c`; repository-side workflow approval may still be needed for this fork PR before CI executes.
Duplicate-Work Check
No linked issue is used for this split PR. Duplicate searches run before this update:
- `gh pr list --repo verl-project/verl --state open --search "audio data support in:title,body"`
- `gh pr list --repo verl-project/verl --state open --search "audio_data mm_processor_kwargs in:body"`
- `gh pr list --repo verl-project/verl --state open --search "qwen3 omni audio in:title,body"`

Result: the `audio_data mm_processor_kwargs` search only finds this PR. The broader searches find #6118, #6277, and #3297, but they are materially different: #6118 is the original mixed PR being split, #6277 is the dependent Qwen3-Omni thinker follow-up, and #3297 is an older WIP/draft model-specific Omni PR rather than this generic audio-data plumbing slice.
Validation
- `git diff --check`
- `PYTHONPATH=. python -m py_compile verl/experimental/fully_async_policy/fully_async_rollouter.py verl/workers/rollout/llm_server.py`
- `PYTHONPATH=. python tests/special_sanity/check_api_docs.py verl` (passed; optional dependency warnings only)
- `PYTHONPATH=. pytest -q tests/experimental/agent_loop/test_agent_loop_extra_fields_schema_on_cpu.py tests/experimental/agent_loop/test_audio_server_contract_on_cpu.py tests/utils/test_audio_input_support_on_cpu.py` (10 passed)
- `pre-commit run --files $(git diff --name-only upstream/main...HEAD)` (passed)

AI assistance was used to prepare and update this PR. The human submitter should review every changed line and the CI status before merge.