
[data, rollout] feat: add audio data support#6276

Open
SanftMonster wants to merge 2 commits into
verl-project:mainfrom
SanftMonster:codex/audio-data-support

Conversation

@SanftMonster (Contributor) commented May 8, 2026

Summary

Split from #6118. This PR adds generic audio data plumbing without adding Qwen3-Omni model-specific behavior:

  • add data.audio_key support in RLHFDataset and carry audio payloads as audio_data
  • thread audio_data and mm_processor_kwargs through agent loop, teacher loop, rollout schemas, and vLLM async server request paths
  • extend multimodal processor input construction to include audio payloads and processor sampling rate
  • make TRT-LLM fail explicitly for audio inputs because the async TRT-LLM path does not support audio yet
  • add CPU tests for audio dataset processing and the agent-loop server contract
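A minimal sketch of the dataset-side plumbing described above (the `data.audio_key` config name comes from this PR, but the helper and field names here are simplified stand-ins, not verl's actual `RLHFDataset` implementation):

```python
# Simplified stand-in for the data.audio_key plumbing; verl's real
# RLHFDataset interface is more involved than this illustrative helper.
def extract_audio_payload(row: dict, audio_key: str = "audios"):
    """Pull the audio payload out of a dataset row, if present, so it
    can be carried through the rollout path as `audio_data`."""
    audios = row.get(audio_key)
    if audios is None:
        return None
    # Carry the raw payload forward; downstream multimodal processors
    # receive it together with mm_processor_kwargs (e.g. sampling rate).
    return {"audio_data": audios}

sample = {"prompt": "describe the clip", "audios": [b"\x00\x01"]}
```

Rows without the configured audio key simply yield no `audio_data`, so text-only and image-only datasets are unaffected.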

Rebase / CI Update

Rebased onto latest main at d324b014 on 2026-05-12. The merge conflict was in verl/workers/rollout/llm_server.py; the resolution keeps the latest main rollout-client layout where fully async resume generation lives in verl/experimental/fully_async_policy/fully_async_rollouter.py, while preserving this PR's audio_data / mm_processor_kwargs propagation through both the generic and fully async LLM server clients.

The old head 571a5760 had GitHub CI failures only in Ascend/self-hosted jobs. The relevant logs pointed to runner/environment/resource failures rather than this PR's audio path:

  • e2e_ascend: one job could not resolve setuptools>=61.0 during install; another failed with HCCL TCPStore wait timeout / ERR02005 DIST internal error.
  • e2e_one_step_off_policy_ascend: Megatron Ascend job failed with NPU OOM while allocating a 1 GiB checkpoint-transfer buffer.
  • e2e_ppo_trainer_megatron_vllm_2_ascend: FSDP/vLLM Ascend job exited with code 137 and then reported container not found during cleanup.

After the latest rebase push, GitHub currently reports the PR as mergeable. No checks have been reported yet for head 11df606c; repository-side workflow approval may still be needed for this fork PR before CI executes.

Duplicate-Work Check

No linked issue is used for this split PR. Duplicate searches run before this update:

  • gh pr list --repo verl-project/verl --state open --search "audio data support in:title,body"
  • gh pr list --repo verl-project/verl --state open --search "audio_data mm_processor_kwargs in:body"
  • gh pr list --repo verl-project/verl --state open --search "qwen3 omni audio in:title,body"

Result: audio_data mm_processor_kwargs only finds this PR. The broader search finds #6118, #6277, and #3297, but they are materially different: #6118 is the original mixed PR being split, #6277 is the dependent Qwen3-Omni thinker follow-up, and #3297 is an older WIP/Draft model-specific Omni PR rather than this generic audio-data plumbing slice.

Validation

  • git diff --check
  • PYTHONPATH=. python -m py_compile verl/experimental/fully_async_policy/fully_async_rollouter.py verl/workers/rollout/llm_server.py
  • PYTHONPATH=. python tests/special_sanity/check_api_docs.py verl (passed; optional dependency warnings only)
  • PYTHONPATH=. pytest -q tests/experimental/agent_loop/test_agent_loop_extra_fields_schema_on_cpu.py tests/experimental/agent_loop/test_audio_server_contract_on_cpu.py tests/utils/test_audio_input_support_on_cpu.py (10 passed)
  • pre-commit run --files $(git diff --name-only upstream/main...HEAD) (passed)

AI assistance was used to prepare and update this PR. The human submitter should review every changed line and the CI status before merge.

@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.


jaxlliu does not appear to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

@gemini-code-assist (Bot) left a comment


Code Review

This pull request introduces comprehensive support for audio inputs across the agent loop, rollout workers, and dataset processing. It adds configuration options for audio keys and processor-specific arguments (mm_processor_kwargs), alongside utility functions for multimodal input building and token ID resolution. Feedback identifies a potential TypeError in the vLLM rollout server when falling back on unsupported processor arguments and highlights a naming inconsistency between plural and singular keys for multimodal data across different system components.

Comment on lines +506 to +508

```python
    prompt = TokensPrompt(**prompt_kwargs)
except TypeError:
    prompt = prompt_kwargs
```


Severity: high

If TokensPrompt fails with a TypeError, it indicates that the current vLLM version does not support mm_processor_kwargs. Falling back to prompt = prompt_kwargs without removing this key will likely cause a subsequent TypeError when the vLLM engine attempts to process the prompt dictionary. It is safer to remove the unsupported key in the except block. This in-place modification avoids an unnecessary defensive copy in an async method.

Suggested change:

```python
try:
    prompt = TokensPrompt(**prompt_kwargs)
except TypeError:
    prompt_kwargs.pop("mm_processor_kwargs", None)
    prompt = prompt_kwargs
```
References
  1. In an async method, avoid making an unnecessary defensive copy of a dictionary argument if it is guaranteed that the caller does not reuse the dictionary across concurrent calls.
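The fallback pattern the review recommends can be sketched in isolation. Here `legacy_tokens_prompt` is a hypothetical stand-in for an older `TokensPrompt` signature that rejects `mm_processor_kwargs`; it is not vLLM's actual class:

```python
# Hypothetical stand-in for an older vLLM TokensPrompt constructor that
# does not accept mm_processor_kwargs; names here are illustrative only.
def legacy_tokens_prompt(*, prompt_token_ids, multi_modal_data=None):
    return {"prompt_token_ids": prompt_token_ids,
            "multi_modal_data": multi_modal_data}

def build_prompt(prompt_kwargs: dict):
    """Try the typed constructor first; on TypeError, drop the
    unsupported key before falling back to the plain dict, so the
    engine never sees a key it cannot handle."""
    try:
        return legacy_tokens_prompt(**prompt_kwargs)
    except TypeError:
        prompt_kwargs.pop("mm_processor_kwargs", None)
        return prompt_kwargs

prompt = build_prompt({
    "prompt_token_ids": [1, 2],
    "mm_processor_kwargs": {"sampling_rate": 16000},
})
```

Because the unsupported key is removed inside the `except` block, the fallback dict is safe to hand to an engine that predates `mm_processor_kwargs`.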

Comment on lines 264 to +268

```python
    multi_modal_data["images"] = images
if videos is not None:
    multi_modal_data["videos"] = videos
if audios is not None:
    multi_modal_data["audios"] = audios
```


Severity: high

The keys used here ("images", "videos", "audios") are plural, but the rollout server and AsyncRolloutRequest use singular keys ("image", "video", "audio"). This inconsistency can lead to silent failures or missing data when multimodal payloads are passed between the agent loop and the rollout workers. Please unify the naming convention to use singular keys for internal multimodal data dictionaries, matching the rollout server and vLLM conventions.
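One way to resolve the naming inconsistency is a small normalization shim at the agent-loop / rollout boundary. This is a sketch only; the key lists are assumptions, not verl's actual schema:

```python
# Sketch of normalizing multimodal payload keys at the boundary between
# the agent loop (plural keys) and the rollout server (singular keys).
_PLURAL_TO_SINGULAR = {"images": "image", "videos": "video", "audios": "audio"}

def normalize_multi_modal_keys(multi_modal_data: dict) -> dict:
    """Return a copy of the payload using the singular keys the rollout
    server and AsyncRolloutRequest expect; unknown keys pass through."""
    return {_PLURAL_TO_SINGULAR.get(k, k): v
            for k, v in multi_modal_data.items()}
```

Applying this once at the hand-off point avoids silent data loss without touching every call site on either side.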

@SanftMonster (Contributor, Author)

I've just fixed the CI error. Could you please approve the workflows again? @vermouth1992 @wuxibin89

SanftMonster force-pushed the codex/audio-data-support branch from 571a576 to cdd12d6 on May 11, 2026 at 07:16
@SanftMonster (Contributor, Author)

@vermouth1992 @wuxibin89 I've resolved the conflict, could you please approve the workflow again?

jaxlliu and others added 2 commits May 12, 2026 14:25
Co-authored-by: OpenAI Codex <codex@openai.com>
Co-authored-by: OpenAI Codex <codex@openai.com>
SanftMonster force-pushed the codex/audio-data-support branch from cdd12d6 to 11df606 on May 12, 2026 at 06:28