
[data, rollout] feat: add audio data support#6276

Open
SanftMonster wants to merge 2 commits into
verl-project:mainfrom
SanftMonster:codex/audio-data-support

Conversation

@SanftMonster (Contributor) commented May 8, 2026

Summary

Split from #6118. This PR adds generic audio data plumbing without adding Qwen3-Omni model-specific behavior:

  • add data.audio_key support in RLHFDataset and carry audio payloads as audio_data
  • thread audio_data and mm_processor_kwargs through agent loop, teacher loop, rollout schemas, and vLLM async server request paths
  • extend multimodal processor input construction to include audio payloads and processor sampling rate
  • make TRT-LLM fail explicitly for audio inputs because the async TRT-LLM path does not support audio yet
  • add CPU tests for audio dataset processing and the agent-loop server contract
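A minimal sketch of the dataset-side plumbing described above (the `data.audio_key` config name comes from this PR, but the helper and field names here are simplified stand-ins, not verl's actual `RLHFDataset` implementation):

```python
# Simplified stand-in for the data.audio_key plumbing; verl's real
# RLHFDataset interface is more involved than this illustrative helper.
def extract_audio_payload(row: dict, audio_key: str = "audios"):
    """Pull the audio payload out of a dataset row, if present, so it
    can be carried through the rollout path as `audio_data`."""
    audios = row.get(audio_key)
    if audios is None:
        return None
    # Carry the raw payload forward; downstream multimodal processors
    # receive it together with mm_processor_kwargs (e.g. sampling rate).
    return {"audio_data": audios}

sample = {"prompt": "describe the clip", "audios": [b"\x00\x01"]}
```

Rows without the configured audio key simply yield no `audio_data`, so text-only and image-only datasets are unaffected.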

Rebase / CI Update

Rebased onto latest main at d324b014 on 2026-05-12. The merge conflict was in verl/workers/rollout/llm_server.py; the resolution keeps the latest main rollout-client layout where fully async resume generation lives in verl/experimental/fully_async_policy/fully_async_rollouter.py, while preserving this PR's audio_data / mm_processor_kwargs propagation through both the generic and fully async LLM server clients.

The old head 571a5760 had GitHub CI failures only in Ascend/self-hosted jobs. The relevant logs pointed to runner/environment/resource failures rather than this PR's audio path:

  • e2e_ascend: one job could not resolve setuptools>=61.0 during install; another failed with HCCL TCPStore wait timeout / ERR02005 DIST internal error.
  • e2e_one_step_off_policy_ascend: Megatron Ascend job failed with NPU OOM while allocating a 1 GiB checkpoint-transfer buffer.
  • e2e_ppo_trainer_megatron_vllm_2_ascend: FSDP/vLLM Ascend job exited with code 137 and then reported container not found during cleanup.

After the latest rebase push, GitHub currently reports the PR as mergeable. No checks have been reported yet for head 11df606c; repository-side workflow approval may still be needed for this fork PR before CI executes.

Duplicate-Work Check

No linked issue is used for this split PR. Duplicate searches run before this update:

  • gh pr list --repo verl-project/verl --state open --search "audio data support in:title,body"
  • gh pr list --repo verl-project/verl --state open --search "audio_data mm_processor_kwargs in:body"
  • gh pr list --repo verl-project/verl --state open --search "qwen3 omni audio in:title,body"

Result: audio_data mm_processor_kwargs only finds this PR. The broader search finds #6118, #6277, and #3297, but they are materially different: #6118 is the original mixed PR being split, #6277 is the dependent Qwen3-Omni thinker follow-up, and #3297 is an older WIP/Draft model-specific Omni PR rather than this generic audio-data plumbing slice.

Validation

  • git diff --check
  • PYTHONPATH=. python -m py_compile verl/experimental/fully_async_policy/fully_async_rollouter.py verl/workers/rollout/llm_server.py
  • PYTHONPATH=. python tests/special_sanity/check_api_docs.py verl (passed; optional dependency warnings only)
  • PYTHONPATH=. pytest -q tests/experimental/agent_loop/test_agent_loop_extra_fields_schema_on_cpu.py tests/experimental/agent_loop/test_audio_server_contract_on_cpu.py tests/utils/test_audio_input_support_on_cpu.py (10 passed)
  • pre-commit run --files $(git diff --name-only upstream/main...HEAD) (passed)

AI assistance was used to prepare and update this PR. The human submitter should review every changed line and the CI status before merge.

@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.


jaxlliu does not appear to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

@gemini-code-assist (Bot) left a comment


Code Review

This pull request introduces comprehensive support for audio inputs across the agent loop, rollout workers, and dataset processing. It adds configuration options for audio keys and processor-specific arguments (mm_processor_kwargs), alongside utility functions for multimodal input building and token ID resolution. Feedback identifies a potential TypeError in the vLLM rollout server when falling back on unsupported processor arguments and highlights a naming inconsistency between plural and singular keys for multimodal data across different system components.

Comment on lines +506 to +508

```python
    prompt = TokensPrompt(**prompt_kwargs)
except TypeError:
    prompt = prompt_kwargs
```


Severity: high

If TokensPrompt fails with a TypeError, it indicates that the current vLLM version does not support mm_processor_kwargs. Falling back to prompt = prompt_kwargs without removing this key will likely cause a subsequent TypeError when the vLLM engine attempts to process the prompt dictionary. It is safer to remove the unsupported key in the except block. This in-place modification avoids an unnecessary defensive copy in an async method.

Suggested change:

```python
try:
    prompt = TokensPrompt(**prompt_kwargs)
except TypeError:
    prompt_kwargs.pop("mm_processor_kwargs", None)
    prompt = prompt_kwargs
```
References
  1. In an async method, avoid making an unnecessary defensive copy of a dictionary argument if it is guaranteed that the caller does not reuse the dictionary across concurrent calls.
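The fallback pattern the review recommends can be sketched in isolation. Here `legacy_tokens_prompt` is a hypothetical stand-in for an older `TokensPrompt` signature that rejects `mm_processor_kwargs`; it is not vLLM's actual class:

```python
# Hypothetical stand-in for an older vLLM TokensPrompt constructor that
# does not accept mm_processor_kwargs; names here are illustrative only.
def legacy_tokens_prompt(*, prompt_token_ids, multi_modal_data=None):
    return {"prompt_token_ids": prompt_token_ids,
            "multi_modal_data": multi_modal_data}

def build_prompt(prompt_kwargs: dict):
    """Try the typed constructor first; on TypeError, drop the
    unsupported key before falling back to the plain dict, so the
    engine never sees a key it cannot handle."""
    try:
        return legacy_tokens_prompt(**prompt_kwargs)
    except TypeError:
        prompt_kwargs.pop("mm_processor_kwargs", None)
        return prompt_kwargs

prompt = build_prompt({
    "prompt_token_ids": [1, 2],
    "mm_processor_kwargs": {"sampling_rate": 16000},
})
```

Because the unsupported key is removed inside the `except` block, the fallback dict is safe to hand to an engine that predates `mm_processor_kwargs`.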

Comment on lines 264 to +268

```python
    multi_modal_data["images"] = images
if videos is not None:
    multi_modal_data["videos"] = videos
if audios is not None:
    multi_modal_data["audios"] = audios
```


Severity: high

The keys used here ("images", "videos", "audios") are plural, but the rollout server and AsyncRolloutRequest use singular keys ("image", "video", "audio"). This inconsistency can lead to silent failures or missing data when multimodal payloads are passed between the agent loop and the rollout workers. Please unify the naming convention to use singular keys for internal multimodal data dictionaries, matching the rollout server and vLLM conventions.
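One way to resolve the naming inconsistency is a small normalization shim at the agent-loop / rollout boundary. This is a sketch only; the key lists are assumptions, not verl's actual schema:

```python
# Sketch of normalizing multimodal payload keys at the boundary between
# the agent loop (plural keys) and the rollout server (singular keys).
_PLURAL_TO_SINGULAR = {"images": "image", "videos": "video", "audios": "audio"}

def normalize_multi_modal_keys(multi_modal_data: dict) -> dict:
    """Return a copy of the payload using the singular keys the rollout
    server and AsyncRolloutRequest expect; unknown keys pass through."""
    return {_PLURAL_TO_SINGULAR.get(k, k): v
            for k, v in multi_modal_data.items()}
```

Applying this once at the hand-off point avoids silent data loss without touching every call site on either side.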

@SanftMonster (Contributor, Author)

I've just fixed the CI error. Could you please approve the workflows again? @vermouth1992 @wuxibin89

SanftMonster force-pushed the codex/audio-data-support branch from 571a576 to cdd12d6 on May 11, 2026 at 07:16
@SanftMonster (Contributor, Author)

@vermouth1992 @wuxibin89 I've resolved the conflict, could you please approve the workflow again?

jaxlliu and others added 2 commits May 12, 2026 14:25
Co-authored-by: OpenAI Codex <codex@openai.com>
Co-authored-by: OpenAI Codex <codex@openai.com>
SanftMonster force-pushed the codex/audio-data-support branch from cdd12d6 to 11df606 on May 12, 2026 at 06:28