Commit 6ae196d
[ci, vllm] chore: update vllm-omni 0.18.0 official release and Miscellaneous (#5809)
### What does this PR do?

> Add **concise** overview of what this PR aims to achieve or accomplish. Reference related GitHub issues and PRs that help with the review.

- **vLLM / vllm-omni 0.18.0**: install the official release in the vLLM-Omni CI workflow (replacing a git SHA install); the release adds TP support.
- Aligns the FlowGRPO `QwenImagePipelineWithLogProb` example with upstream's test pipeline `__init__` pattern [in vllm-omni](https://github.com/vllm-project/vllm-omni/blob/v0.18.0/tests/e2e/offline_inference/custom_pipeline/qwen_image_pipeline_with_logprob.py).
- Updates Omni sampling tests to use `true_cfg_scale` instead of `guidance_scale` for Qwen-Image-style CFG.
- Enables `tensor_model_parallel_size = 2` in the diffusion agent loop test.

**Remark:** `tiny-random/Qwen-Image` has `num_attention_heads` set to 1 in its transformer config, so we create a temporary copy of the model with a TP-compatible head count in order to properly test TP behavior.

### Checklist Before Starting

- [x] Search for similar PRs. Paste at least one query link here: ...
- [x] Format the PR title as `[{modules}] {type}: {description}` (This will be checked by the CI)
  - `{modules}` include `fsdp`, `megatron`, `veomni`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`, `cfg`, `reward`, `fully_async`, `one_step_off`
  - If this PR involves multiple modules, separate them with `,` like `[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
  - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title.
    - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

### Test

`tests/experimental/agent_loop/test_diffusion_agent_loop.py` and `test_vllm_omni_generate.py` updated and passed:

```
tests/experimental/agent_loop/test_diffusion_agent_loop.py .                                [100%]
======================================== warnings summary =========================================
../../miniforge3/envs/vllm-omni-dev/lib/python3.12/site-packages/requests/__init__.py:113
  /scratch/fq9hpsac/mikecheung/miniforge3/envs/vllm-omni-dev/lib/python3.12/site-packages/requests/__init__.py:113: RequestsDependencyWarning: urllib3 (2.6.3) or chardet (6.0.0.post1)/charset_normalizer (3.4.4) doesn't match a supported version!
    warnings.warn(

../../miniforge3/envs/vllm-omni-dev/lib/python3.12/site-packages/ray/util/state/util.py:55
  /scratch/fq9hpsac/mikecheung/miniforge3/envs/vllm-omni-dev/lib/python3.12/site-packages/ray/util/state/util.py:55: DeprecationWarning: Ray state API is no longer experimental. Please import from `ray.util.state` instead. Importing from `ray.experimental` will be deprecated in future releases.
    warnings.warn(

<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: builtin type SwigPyPacked has no __module__ attribute

<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: builtin type SwigPyObject has no __module__ attribute

../../miniforge3/envs/vllm-omni-dev/lib/python3.12/site-packages/torch/jit/_script.py:362: 14 warnings
  /scratch/fq9hpsac/mikecheung/miniforge3/envs/vllm-omni-dev/lib/python3.12/site-packages/torch/jit/_script.py:362: DeprecationWarning: `torch.jit.script_method` is deprecated. Please switch to `torch.compile` or `torch.export`.
    warnings.warn(

tests/experimental/agent_loop/test_diffusion_agent_loop.py::test_single_turn
  /scratch/fq9hpsac/mikecheung/gitlocal/verl/tests/experimental/agent_loop/test_diffusion_agent_loop.py:63: UserWarning: The version_base parameter is not specified. Please specify a compatability version level, or None. Will assume defaults for version 1.1
    with initialize_config_dir(config_dir=os.path.abspath("verl/trainer/config")):

tests/experimental/agent_loop/test_diffusion_agent_loop.py::test_single_turn
  /scratch/fq9hpsac/mikecheung/miniforge3/envs/vllm-omni-dev/lib/python3.12/site-packages/ray/_private/worker.py:2052: FutureWarning: Tip: In future versions of Ray, Ray will no longer override accelerator visible devices env var if num_gpus=0 or num_gpus=None (default). To enable this behavior and turn off this error message, set RAY_ACCEL_ENV_VAR_OVERRIDE_ON_ZERO=0
    warnings.warn(

tests/experimental/agent_loop/test_diffusion_agent_loop.py::test_single_turn
  /scratch/fq9hpsac/mikecheung/miniforge3/envs/vllm-omni-dev/lib/python3.12/site-packages/pydub/utils.py:14: DeprecationWarning: 'audioop' is deprecated and slated for removal in Python 3.13
    import audioop

tests/experimental/agent_loop/test_diffusion_agent_loop.py::test_single_turn
  /scratch/fq9hpsac/mikecheung/miniforge3/envs/vllm-omni-dev/lib/python3.12/site-packages/pydub/utils.py:170: RuntimeWarning: Couldn't find ffmpeg or avconv - defaulting to ffmpeg, but may not work
    warn("Couldn't find ffmpeg or avconv - defaulting to ffmpeg, but may not work", RuntimeWarning)

tests/experimental/agent_loop/test_diffusion_agent_loop.py::test_single_turn
  /scratch/fq9hpsac/mikecheung/miniforge3/envs/vllm-omni-dev/lib/python3.12/site-packages/vllm_omni/entrypoints/openai/protocol/audio.py:112: PydanticDeprecatedSince20: Support for class-based `config` is deprecated, use ConfigDict instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.12/migration/
    class CreateAudio(BaseModel):

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
========================= 1 passed, 23 warnings in 105.69s (0:01:45) ==========================
sys:1: DeprecationWarning: builtin type swigvarlink has no __module__ attribute
```

> For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

### API and Usage Example

NA

> Demonstrate how the API changes if any, and provide usage example(s) if possible.

```python
# Add code snippet or script demonstrating how to use this
```

### Design & Code Changes

> Demonstrate the high-level design if this PR is complex, and list the specific changes.

NA

### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

- [x] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [x] Apply [pre-commit checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- [x] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs).
- [x] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ...
- [x] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ). (If not accessible, please try [the Feishu group (飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).)
- [x] If your PR is related to the `recipe` submodule, please also update the reference to the submodule commit via `git submodule update --remote` or `cd recipe && git pull origin main`.
1 parent 4b9c14f commit 6ae196d

File tree

4 files changed: +74 −79 lines changed

.github/workflows/vllm_omni.yml — 1 addition & 1 deletion

```diff
@@ -111,7 +111,7 @@ jobs:
           pip3 install --no-deps -e .
       - name: Install vllm-omni
         run: |
-          pip3 install git+https://github.com/vllm-project/vllm-omni.git@a90a769
+          pip3 install 'vllm-omni==0.18.0'
      - name: Test vLLM Omni generate
        run: |
          ray stop --force
```

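Pinning an exact release instead of a moving git SHA makes the CI environment reproducible and comparable across runs. As a minimal sketch of why exact pins are deterministic, the hypothetical helper below (not part of the workflow or repo) compares release version strings as integer tuples:

```python
# Hypothetical helper, not from this repo: turn a release string into a
# comparable tuple, illustrating that an exact pin like 'vllm-omni==0.18.0'
# selects one well-ordered point in the release sequence, unlike a git SHA.
def parse_version(version: str) -> tuple[int, ...]:
    return tuple(int(part) for part in version.split("."))

pinned = parse_version("0.18.0")
assert parse_version("0.17.9") < pinned < parse_version("0.18.1")
```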
examples/flowgrpo_trainer/vllm_omni/pipeline_qwenimage.py — 1 addition & 39 deletions

```diff
@@ -15,15 +15,10 @@
 from typing import Any, Literal
 
 import torch
-from diffusers.models.autoencoders.autoencoder_kl_qwenimage import AutoencoderKLQwenImage
-from transformers import Qwen2_5_VLForConditionalGeneration
 from vllm_omni.diffusion.data import DiffusionOutput, OmniDiffusionConfig
 from vllm_omni.diffusion.distributed.utils import get_local_device
-from vllm_omni.diffusion.model_loader.diffusers_loader import DiffusersPipelineLoader
 from vllm_omni.diffusion.models.qwen_image import QwenImagePipeline
-from vllm_omni.diffusion.models.qwen_image.qwen_image_transformer import QwenImageTransformer2DModel
 from vllm_omni.diffusion.request import OmniDiffusionRequest
-from vllm_omni.diffusion.utils.tf_utils import get_transformer_config_kwargs
 
 from ..scheduler import FlowMatchSDEDiscreteScheduler
 
@@ -38,19 +33,7 @@ def _maybe_to_cpu(v):
 # This is compatible with API of vllm-omni custom pipeline
 class QwenImagePipelineWithLogProb(QwenImagePipeline):
     def __init__(self, *, od_config: OmniDiffusionConfig, prefix: str = ""):
-        super(QwenImagePipeline, self).__init__()
-        self.od_config = od_config
-        self.parallel_config = od_config.parallel_config
-        self.weights_sources = [
-            DiffusersPipelineLoader.ComponentSource(
-                model_or_path=od_config.model,
-                subfolder="transformer",
-                revision=None,
-                prefix="transformer.",
-                fall_back_to_pt=True,
-            )
-        ]
-
+        super().__init__(od_config=od_config, prefix=prefix)
         self.device = get_local_device()
         model = od_config.model
         # Check if model is a local path
@@ -59,27 +42,6 @@ def __init__(self, *, od_config: OmniDiffusionConfig, prefix: str = ""):
         self.scheduler = FlowMatchSDEDiscreteScheduler.from_pretrained(
             model, subfolder="scheduler", local_files_only=local_files_only
         )
-        self.text_encoder = Qwen2_5_VLForConditionalGeneration.from_pretrained(
-            model, subfolder="text_encoder", local_files_only=local_files_only
-        )
-        self.vae = AutoencoderKLQwenImage.from_pretrained(model, subfolder="vae", local_files_only=local_files_only).to(
-            self.device
-        )
-        transformer_kwargs = get_transformer_config_kwargs(od_config.tf_model_config, QwenImageTransformer2DModel)
-
-        self.transformer = QwenImageTransformer2DModel(od_config=od_config, **transformer_kwargs)
-
-        self.stage = None
-
-        self.vae_scale_factor = 2 ** len(self.vae.temperal_downsample) if getattr(self, "vae", None) else 8
-        # QwenImage latents are turned into 2x2 patches and packed.
-        # This means the latent width and height has to be divisible
-        # by the patch size. So the vae scale factor is multiplied by the patch size to account for this
-        # self.image_processor = VaeImageProcessor(
-        #     vae_scale_factor=self.vae_scale_factor * 2
-        # )
-        self.prompt_template_encode_start_idx = 34
-        self.default_sample_size = 128
 
     def _get_qwen_prompt_embeds(
         self,
```

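The refactor above replaces hand-rolled component construction with a call to the parent `__init__`, then overrides only the scheduler. A minimal sketch of this delegate-then-override pattern, using stand-in classes rather than the real vllm-omni pipeline types:

```python
# Stand-in classes (not the real vllm-omni API) illustrating the
# delegate-then-override __init__ pattern used by the refactor.
class BasePipeline:
    def __init__(self, *, model: str, prefix: str = ""):
        self.model = model
        self.prefix = prefix
        self.scheduler = "default_scheduler"  # parent wires up all components


class PipelineWithLogProb(BasePipeline):
    def __init__(self, *, model: str, prefix: str = ""):
        # Let the parent build every component once...
        super().__init__(model=model, prefix=prefix)
        # ...then replace only the piece that must differ.
        self.scheduler = "flow_match_sde_scheduler"


pipe = PipelineWithLogProb(model="tiny-random/Qwen-Image")
assert pipe.scheduler == "flow_match_sde_scheduler"
assert pipe.model == "tiny-random/Qwen-Image"
```

Delegating keeps the subclass in sync with upstream: when the parent's component setup changes (as it did between the pinned git SHA and 0.18.0), the override only needs to track the one component it swaps.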
tests/experimental/agent_loop/test_diffusion_agent_loop.py — 69 additions & 36 deletions

```diff
@@ -12,6 +12,8 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 import os
+import shutil
+import tempfile
 
 import numpy as np
 import pytest
@@ -24,49 +26,80 @@
 pytestmark = pytest.mark.vllm_omni
 
 
+def _create_tp_compatible_model(parent_dir, src_model_path, num_attention_heads=2):
+    """Copy base model and recreate transformer on-the-fly with TP-compatible head count.
+
+    The tiny-random Qwen-Image model has num_attention_heads=1 in its transformer config,
+    which is not divisible by tensor_model_parallel_size=2. This helper copies the full
+    model directory (vae, text_encoder, tokenizer, scheduler) and overwrites only the
+    transformer component with a freshly-initialized one that has the desired head count.
+    """
+    from diffusers import QwenImageTransformer2DModel
+
+    dst = os.path.join(parent_dir, "Qwen-Image")
+    shutil.copytree(src_model_path, dst)
+
+    transformer = QwenImageTransformer2DModel(
+        num_attention_heads=num_attention_heads,
+        attention_head_dim=32,
+        num_layers=2,
+        in_channels=64,
+        out_channels=16,
+        patch_size=2,
+        joint_attention_dim=32,
+        axes_dims_rope=(8, 12, 12),
+        guidance_embeds=False,
+    )
+    transformer.save_pretrained(os.path.join(dst, "transformer"))
+
+    return dst
+
+
 @pytest.fixture
 def init_config() -> DictConfig:
     from hydra import compose, initialize_config_dir
 
     with initialize_config_dir(config_dir=os.path.abspath("verl/trainer/config")):
         config = compose(config_name="diffusion_trainer")
 
-    model_path = os.path.expanduser("~/models/tiny-random/Qwen-Image")
-    config.actor_rollout_ref.model.path = model_path
-    config.actor_rollout_ref.model.tokenizer_path = os.path.join(model_path, "tokenizer")
-    config.actor_rollout_ref.rollout.name = "vllm_omni"
-    config.actor_rollout_ref.rollout.mode = "async"
-    config.actor_rollout_ref.rollout.enforce_eager = True
-    config.actor_rollout_ref.rollout.n = 4
-    config.actor_rollout_ref.rollout.num_inference_steps = 10
-    config.actor_rollout_ref.rollout.calculate_log_probs = True
-    config.actor_rollout_ref.rollout.agent.num_workers = 2
-    config.actor_rollout_ref.rollout.agent.default_agent_loop = "diffusion_single_turn_agent"
-    tokenizer_max_length = 1024
-    prompt_template_encode_start_idx = 34
-    max_length = tokenizer_max_length + prompt_template_encode_start_idx
-
-    with open_dict(config.actor_rollout_ref.model.extra_configs):
-        config.actor_rollout_ref.model.extra_configs.true_cfg_scale = 4.0
-        config.actor_rollout_ref.model.extra_configs.max_sequence_length = max_length
-        config.actor_rollout_ref.model.extra_configs.noise_level = 1.0
-        config.actor_rollout_ref.model.extra_configs.sde_window_size = 2
-        config.actor_rollout_ref.model.extra_configs.sde_window_range = [0, 5]
-
-    config.actor_rollout_ref.rollout.nnodes = 1
-
-    qwen_pipeline = "examples.flowgrpo_trainer.vllm_omni.pipeline_qwenimage.QwenImagePipelineWithLogProb"
-    config.actor_rollout_ref.rollout.engine_kwargs.vllm_omni = {"custom_pipeline": qwen_pipeline}
-    config.reward.reward_manager.name = "image"
-    config.trainer.n_gpus_per_node = 4
-
-    config.data.apply_chat_template_kwargs = dict(max_length=max_length, padding=True, truncation=True)
-    config.data.max_prompt_length = max_length
-    config.actor_rollout_ref.rollout.max_model_len = max_length
-
-    # TODO (mike): test with TP later
-    config.actor_rollout_ref.rollout.tensor_model_parallel_size = 1
-    return config
+    base_model_path = os.path.expanduser("~/models/tiny-random/Qwen-Image")
+    with tempfile.TemporaryDirectory() as tmp_dir:
+        model_path = _create_tp_compatible_model(tmp_dir, base_model_path, num_attention_heads=2)
+        config.actor_rollout_ref.model.path = model_path
+        config.actor_rollout_ref.model.tokenizer_path = os.path.join(model_path, "tokenizer")
+        config.actor_rollout_ref.rollout.name = "vllm_omni"
+        config.actor_rollout_ref.rollout.mode = "async"
+        config.actor_rollout_ref.rollout.enforce_eager = True
+        config.actor_rollout_ref.rollout.n = 4
+        config.actor_rollout_ref.rollout.num_inference_steps = 10
+        config.actor_rollout_ref.rollout.calculate_log_probs = True
+        config.actor_rollout_ref.rollout.agent.num_workers = 2
+        config.actor_rollout_ref.rollout.agent.default_agent_loop = "diffusion_single_turn_agent"
+        tokenizer_max_length = 1024
+        prompt_template_encode_start_idx = 34
+        max_length = tokenizer_max_length + prompt_template_encode_start_idx
+
+        with open_dict(config.actor_rollout_ref.model.extra_configs):
+            config.actor_rollout_ref.model.extra_configs.true_cfg_scale = 4.0
+            config.actor_rollout_ref.model.extra_configs.max_sequence_length = max_length
+            config.actor_rollout_ref.model.extra_configs.noise_level = 1.0
+            config.actor_rollout_ref.model.extra_configs.sde_window_size = 2
+            config.actor_rollout_ref.model.extra_configs.sde_window_range = [0, 5]
+
+        config.actor_rollout_ref.rollout.nnodes = 1
+
+        qwen_pipeline = "examples.flowgrpo_trainer.vllm_omni.pipeline_qwenimage.QwenImagePipelineWithLogProb"
+        config.actor_rollout_ref.rollout.engine_kwargs.vllm_omni = {"custom_pipeline": qwen_pipeline}
+        config.reward.reward_manager.name = "image"
+        config.trainer.n_gpus_per_node = 4
+
+        config.data.apply_chat_template_kwargs = dict(max_length=max_length, padding=True, truncation=True)
+        config.data.max_prompt_length = max_length
+        config.actor_rollout_ref.rollout.max_model_len = max_length
+
+        config.actor_rollout_ref.rollout.tensor_model_parallel_size = 2
+
+        yield config
 
 
 def test_single_turn(init_config):
```

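The helper in the diff exists because tensor parallelism shards attention heads evenly across ranks, so the head count must be divisible by the TP size; `tiny-random/Qwen-Image` ships with a single head, which cannot be split across two ranks. A small illustrative check (hypothetical function, not from the codebase):

```python
# Hypothetical illustration of the TP constraint the test helper works around:
# attention heads are sharded evenly, so the count must divide by the TP size.
def heads_per_rank(num_attention_heads: int, tensor_model_parallel_size: int) -> int:
    if num_attention_heads % tensor_model_parallel_size != 0:
        raise ValueError(
            f"num_attention_heads={num_attention_heads} is not divisible by "
            f"tensor_model_parallel_size={tensor_model_parallel_size}"
        )
    return num_attention_heads // tensor_model_parallel_size

assert heads_per_rank(2, 2) == 1  # patched tiny-random model: 1 head per rank
# heads_per_rank(1, 2) would raise: the unpatched config has only 1 head
```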
tests/workers/rollout/rollout_vllm/test_vllm_omni_generate.py — 3 additions & 3 deletions

```diff
@@ -160,7 +160,7 @@ def test_generate(init_server):
         prompt_ids=prompt_ids,
         sampling_params={
             "num_inference_steps": 10,
-            "guidance_scale": 4.0,
+            "true_cfg_scale": 4.0,
             "height": 512,
             "width": 512,
         },
@@ -195,7 +195,7 @@ def test_generate_with_logprobs(init_server):
         prompt_ids=prompt_ids,
         sampling_params={
             "num_inference_steps": 10,
-            "guidance_scale": 4.0,
+            "true_cfg_scale": 4.0,
             "height": 512,
             "width": 512,
             "logprobs": True,
@@ -244,7 +244,7 @@ def test_generate_concurrent(init_server):
         prompt_ids=_tokenize_prompt(prompts[i]),
         sampling_params={
             "num_inference_steps": 10,
-            "guidance_scale": 4.0,
+            "true_cfg_scale": 4.0,
             "height": 512,
             "width": 512,
         },
```

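All three hunks perform the same key rename in the sampling params dict: Qwen-Image-style CFG reads `true_cfg_scale` rather than `guidance_scale`. A hypothetical migration shim (not part of the PR, shown only to make the rename concrete) for callers still passing the old key:

```python
# Hypothetical shim, not from this repo: rename the old `guidance_scale`
# key to `true_cfg_scale`, the key the updated tests pass for
# Qwen-Image-style CFG.
def migrate_sampling_params(params: dict) -> dict:
    out = dict(params)
    if "guidance_scale" in out and "true_cfg_scale" not in out:
        out["true_cfg_scale"] = out.pop("guidance_scale")
    return out

old = {"num_inference_steps": 10, "guidance_scale": 4.0, "height": 512, "width": 512}
new = migrate_sampling_params(old)
assert new["true_cfg_scale"] == 4.0 and "guidance_scale" not in new
```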