
bugfix: fix longcat-image cache dispatch#638

Merged
hsliuustc0106 merged 6 commits into vllm-project:main from xlite-dev:fix-cache-longcat
Jan 5, 2026

Conversation

@DefTruth (Contributor) commented Jan 5, 2026

Fixes #630.

A pipeline name mismatch led to the wrong cache dispatch for LongCat-Image.
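To illustrate the failure mode, here is a minimal sketch of name-based cache dispatch; all function and registry names below are assumptions for illustration, not the actual vllm-omni code:

```python
# Hypothetical sketch: cache enablers keyed by pipeline class name.
# If the registry key does not match the real class name (the bug),
# the pipeline silently falls through to the generic path.

def enable_longcat_cache(pipe_name: str) -> str:
    # Custom cache-dit enabler (hypothetical stand-in).
    return f"custom cache-dit enabler for {pipe_name}"

def enable_generic_cache(pipe_name: str) -> str:
    # Fallback used when no registered enabler matches.
    return f"generic cache-dit enabler for {pipe_name}"

# With the fix, the keys match the actual pipeline class names
# (LongCatImagePipeline / LongCatImageEditPipeline), so lookup succeeds.
CACHE_ENABLERS = {
    "LongCatImagePipeline": enable_longcat_cache,
    "LongCatImageEditPipeline": enable_longcat_cache,
}

def dispatch_cache(pipe_name: str) -> str:
    enabler = CACHE_ENABLERS.get(pipe_name, enable_generic_cache)
    return enabler(pipe_name)
```

With mismatched keys, `dispatch_cache("LongCatImagePipeline")` would return the generic enabler; with the corrected keys it hits the custom one, matching the `Using custom cache-dit enabler for model: LongCatImagePipeline` line in the logs below.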

With this PR:

  • w/ cache: 17.122s
python text_to_image.py \
    --model $LONGCAT_IMAGE_DIR \
    --prompt "Add a white art board written with colorful text 'vLLM-Omni' on grassland. Add a paintbrush in the bear's hands. position the bear standing in front of the art board as if painting" \
    --output output_image_edit.png \
    --num_inference_steps 50 \
    --cfg_scale 4.0 \
    --cache_backend cache_dit
WARNING 01-05 07:48:45 [mooncake_connector.py:18] Mooncake not available, MooncakeOmniConnector will not work
INFO 01-05 07:48:46 [omni.py:122] Initializing stages for model: /workspace/dev/vipdev/hf_models/LongCat-Image
INFO 01-05 07:48:46 [initialization.py:35] No OmniTransferConfig provided
INFO 01-05 07:48:46 [omni_stage.py:107] [OmniStage] stage_config: {'stage_id': 0, 'stage_type': 'diffusion', 'runtime': {'process': True, 'devices': '0', 'max_batch_size': 1}, 'engine_args': {'model': '/workspace/dev/vipdev/hf_models/LongCat-Image', 'vae_use_slicing': False, 'vae_use_tiling': False, 'cache_backend': 'cache_dit', 'cache_config': {'Fn_compute_blocks': 1, 'Bn_compute_blocks': 0, 'max_warmup_steps': 4, 'residual_diff_threshold': 0.24, 'max_continuous_cached_steps': 3, 'enable_taylorseer': False, 'taylorseer_order': 1, 'scm_steps_mask_policy': None, 'scm_steps_policy': 'dynamic'}, 'parallel_config': {'pipeline_parallel_size': 1, 'data_parallel_size': 1, 'tensor_parallel_size': 1, 'sequence_parallel_size': 1, 'ulysses_degree': 1, 'ring_degree': 1, 'cfg_parallel_size': 1}, 'model_stage': 'diffusion'}, 'final_output': True, 'final_output_type': 'image'}
INFO 01-05 07:48:46 [omni.py:297] [Orchestrator] Waiting for 1 stages to initialize (timeout: 300s)
[Stage-0] WARNING 01-05 07:48:54 [mooncake_connector.py:18] Mooncake not available, MooncakeOmniConnector will not work
[Stage-0] INFO 01-05 07:48:54 [omni_stage.py:434] Starting stage worker with model: /workspace/dev/vipdev/hf_models/LongCat-Image
[Stage-0] WARNING 01-05 07:48:55 [envs.py:194] Flash Attention library "flash_attn" not found, using pytorch attention implementation
[Stage-0] INFO 01-05 07:48:55 [diffusion_engine.py:213] Starting server...
[Stage-0] WARNING 01-05 07:49:03 [mooncake_connector.py:18] Mooncake not available, MooncakeOmniConnector will not work
[Stage-0] WARNING 01-05 07:49:04 [envs.py:194] Flash Attention library "flash_attn" not found, using pytorch attention implementation
[Stage-0] INFO 01-05 07:49:04 [gpu_worker.py:174] Worker 0 created result MessageQueue
[Stage-0] INFO 01-05 07:49:04 [scheduler.py:228] Chunked prefill is enabled with max_num_batched_tokens=2048.
[W105 07:49:04.723704639 socket.cpp:767] [c10d] The client socket cannot be initialized to connect to [localhost]:30089 (errno: 97 - Address family not supported by protocol).
[W105 07:49:04.724641872 socket.cpp:767] [c10d] The client socket cannot be initialized to connect to [10.189.108.254]:30089 (errno: 97 - Address family not supported by protocol).
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Stage-0] INFO 01-05 07:49:04 [gpu_worker.py:75] Worker 0: Initialized device and distributed environment.
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:04<00:00,  1.02it/s]
The image processor of type `Qwen2VLImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`. Note that this behavior will be extended to all models in a future release.
The tokenizer you are loading from '/workspace/dev/vipdev/hf_models/LongCat-Image' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.
The tokenizer you are loading from '/workspace/dev/vipdev/hf_models/LongCat-Image' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:03<00:00,  3.44s/it]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:03<00:00,  3.44s/it]

[Stage-0] INFO 01-05 07:49:15 [diffusers_loader.py:214] Loading weights took 3.87 seconds
[Stage-0] INFO 01-05 07:49:16 [gpu_worker.py:97] Model loading took 37.7358 GiB and 11.258677 seconds
[Stage-0] INFO 01-05 07:49:16 [gpu_worker.py:102] Worker 0: Model loaded successfully.
[Stage-0] INFO 01-05 07:49:16 [cache_dit_backend.py:468] Using custom cache-dit enabler for model: LongCatImagePipeline
[Stage-0] INFO 01-05 07:49:16 [cache_dit_backend.py:211] Enabling cache-dit on LongCatImage transformer with BlockAdapter: Fn=1, Bn=0, W=4,
WARNING 01-05 07:49:16 [block_adapters.py:131] pipe is None, use FakeDiffusionPipeline instead.
INFO 01-05 07:49:16 [block_adapters.py:220] Auto fill blocks_name: ['transformer_blocks', 'single_transformer_blocks'].
INFO 01-05 07:49:16 [block_adapters.py:153] Found transformer NOT from diffusers: vllm_omni.diffusion.models.longcat_image.longcat_image_transformer disable check_forward_pattern by default.
INFO 01-05 07:49:16 [cache_adapter.py:77] Adapting Cache Acceleration using custom BlockAdapter!
WARNING 01-05 07:49:16 [block_adapters.py:469] Skipped Forward Pattern Check: ForwardPattern.Pattern_1
WARNING 01-05 07:49:16 [block_adapters.py:469] Skipped Forward Pattern Check: ForwardPattern.Pattern_1
INFO 01-05 07:49:16 [cache_adapter.py:142] Use default 'enable_separate_cfg' from block adapter register: False, Pipeline: FakeDiffusionPipeline.
INFO 01-05 07:49:16 [cache_adapter.py:332] Collected Context Config: DBCache_F1B0_W4I1M0MC3_R0.24_CFG0, Calibrator Config: None
WARNING 01-05 07:49:16 [pattern_base.py:78] Skipped Forward Pattern Check: ForwardPattern.Pattern_1
INFO 01-05 07:49:16 [pattern_base.py:70] Match Blocks: CachedBlocks_Pattern_0_1_2, for transformer_blocks, cache_context: transformer_blocks_140031439747024, context_manager: FakeDiffusionPipeline_140031439852352.
WARNING 01-05 07:49:16 [pattern_base.py:78] Skipped Forward Pattern Check: ForwardPattern.Pattern_1
INFO 01-05 07:49:16 [pattern_base.py:70] Match Blocks: CachedBlocks_Pattern_0_1_2, for single_transformer_blocks, cache_context: single_transformer_blocks_140031439017008, context_manager: FakeDiffusionPipeline_140031439852352.
[Stage-0] INFO 01-05 07:49:16 [cache_dit_backend.py:475] Cache-dit enabled successfully on LongCatImagePipeline
[Stage-0] INFO 01-05 07:49:16 [gpu_worker.py:310] Worker 0: Scheduler loop started.
[Stage-0] INFO 01-05 07:49:16 [gpu_worker.py:233] Worker 0 ready to receive requests via shared memory
[Stage-0] INFO 01-05 07:49:16 [scheduler.py:46] SyncScheduler initialized result MessageQueue
[Stage-0] INFO 01-05 07:49:16 [omni_stage.py:627] Max batch size: 1
INFO 01-05 07:49:16 [omni.py:290] [Orchestrator] Stage-0 reported ready
INFO 01-05 07:49:16 [omni.py:316] [Orchestrator] All stages initialized successfully

============================================================
Generation Configuration:
  Model: /workspace/dev/vipdev/hf_models/LongCat-Image
  Inference steps: 50
  Cache backend: cache_dit
  Parallel configuration: ulysses_degree=1, ring_degree=1
  Image size: 1024x1024
============================================================

Adding requests:   0%| 0/1 [00:00<?, ?it/s]
[Stage-0] INFO 01-05 07:49:16 [omni_diffusion.py:112] Prepared 1 requests for generation.
[Stage-0] INFO 01-05 07:49:16 [cache_dit_backend.py:496] Refreshing cache context for transformer with num_inference_steps: 50
INFO 01-05 07:49:16 [cache_adapter.py:723] ✅ Refreshed cache context: transformer_blocks_140031439747024, Collected Context Config: DBCache_F1B0_W4I1M0MC3_R0.24_N50_CFG0, Calibrator Config: None
INFO 01-05 07:49:16 [cache_adapter.py:723] ✅ Refreshed cache context: single_transformer_blocks_140031439017008, Collected Context Config: DBCache_F1B0_W4I1M0MC3_R0.24_N50_CFG0, Calibrator Config: None
You're using a Qwen2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
[Stage-0] INFO 01-05 07:49:33 [diffusion_engine.py:86] Generation completed successfully.
[Stage-0] INFO 01-05 07:49:33 [diffusion_engine.py:109] Post-processing completed in 0.0955 seconds
INFO 01-05 07:49:33 [log_utils.py:549] {'type': 'request_level_metrics',
INFO 01-05 07:49:33 [log_utils.py:549]  'request_id': '0_45567883-3c2e-4cfa-a36e-9aa752e94fb0',
INFO 01-05 07:49:33 [log_utils.py:549]  'e2e_time_ms': 17122.5323677063,
INFO 01-05 07:49:33 [log_utils.py:549]  'e2e_tpt': 0.0,
INFO 01-05 07:49:33 [log_utils.py:549]  'e2e_total_tokens': 0,
INFO 01-05 07:49:33 [log_utils.py:549]  'transfers_total_time_ms': 0.0,
INFO 01-05 07:49:33 [log_utils.py:549]  'transfers_total_bytes': 0,
INFO 01-05 07:49:33 [log_utils.py:549]  'stages': {0: {'stage_gen_time_ms': 17087.39161491394,
INFO 01-05 07:49:33 [log_utils.py:549]                 'num_tokens_out': 0,
INFO 01-05 07:49:33 [log_utils.py:549]                 'num_tokens_in': 0}}}
Processed prompts: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:17<00:00, 17.12s/img, est. speed stage-0 img/s: 0.00, avg e2e_lat: 0.0ms]
INFO 01-05 07:49:33 [omni.py:711] [Summary] {'e2e_requests': 1,
INFO 01-05 07:49:33 [omni.py:711]  'e2e_total_time_ms': 17123.9755153656,
INFO 01-05 07:49:33 [omni.py:711]  'e2e_sum_time_ms': 17122.5323677063,
INFO 01-05 07:49:33 [omni.py:711]  'e2e_total_tokens': 0,
INFO 01-05 07:49:33 [omni.py:711]  'e2e_avg_time_per_request_ms': 17122.5323677063,
INFO 01-05 07:49:33 [omni.py:711]  'e2e_avg_tokens_per_s': 0.0,
INFO 01-05 07:49:33 [omni.py:711]  'wall_time_ms': 17123.9755153656,
INFO 01-05 07:49:33 [omni.py:711]  'final_stage_id': {'0_45567883-3c2e-4cfa-a36e-9aa752e94fb0': 0},
INFO 01-05 07:49:33 [omni.py:711]  'stages': [{'stage_id': 0,
INFO 01-05 07:49:33 [omni.py:711]              'requests': 1,
INFO 01-05 07:49:33 [omni.py:711]              'tokens': 0,
INFO 01-05 07:49:33 [omni.py:711]              'total_time_ms': 17122.780323028564,
INFO 01-05 07:49:33 [omni.py:711]              'avg_time_per_request_ms': 17122.780323028564,
INFO 01-05 07:49:33 [omni.py:711]              'avg_tokens_per_s': 0.0}],
INFO 01-05 07:49:33 [omni.py:711]  'transfers': []}
Adding requests:   0%|                                                                                                                                                                                                              | 0/1 [00:17<?, ?it/s]
[Stage-0] INFO 01-05 07:49:33 [omni_stage.py:636] Received shutdown signal
[Stage-0] INFO 01-05 07:49:33 [gpu_worker.py:265] Worker 0: Received shutdown message
[Stage-0] INFO 01-05 07:49:33 [gpu_worker.py:287] event loop terminated.
[Stage-0] INFO 01-05 07:49:33 [gpu_worker.py:318] Worker 0: Shutdown complete.
Total generation time: 21.5126 seconds (21512.61 ms)
INFO 01-05 07:49:37 [text_to_image.py:168] Outputs: [OmniRequestOutput(request_id='', finished=True, stage_id=0, final_output_type='image', request_output=[OmniRequestOutput(request_id='0_45567883-3c2e-4cfa-a36e-9aa752e94fb0', finished=True, stage_id=None, final_output_type='image', request_output=None, images=[1 PIL Images], prompt="Add a white art board written with colorful text 'vLLM-Omni' on grassland. Add a paintbrush in the bear's hands. position the bear standing in front of the art board as if painting", latents=None, metrics={})], images=[], prompt=None, latents=None, metrics={})]
Saved generated image to output_image_edit.png

  • w/o cache: 42.39s

python text_to_image.py \
    --model $LONGCAT_IMAGE_DIR \
    --prompt "Add a white art board written with colorful text 'vLLM-Omni' on grassland. Add a paintbrush in the bear's hands. position the bear standing in front of the art board as if painting" \
    --output output_image_edit_nocache.png \
    --num_inference_steps 50 \
    --cfg_scale 4.0
WARNING 01-05 07:52:18 [mooncake_connector.py:18] Mooncake not available, MooncakeOmniConnector will not work
INFO 01-05 07:52:18 [omni.py:122] Initializing stages for model: /workspace/dev/vipdev/hf_models/LongCat-Image
INFO 01-05 07:52:18 [initialization.py:35] No OmniTransferConfig provided
INFO 01-05 07:52:18 [omni_stage.py:107] [OmniStage] stage_config: {'stage_id': 0, 'stage_type': 'diffusion', 'runtime': {'process': True, 'devices': '0', 'max_batch_size': 1}, 'engine_args': {'model': '/workspace/dev/vipdev/hf_models/LongCat-Image', 'vae_use_slicing': False, 'vae_use_tiling': False, 'cache_backend': None, 'cache_config': None, 'parallel_config': {'pipeline_parallel_size': 1, 'data_parallel_size': 1, 'tensor_parallel_size': 1, 'sequence_parallel_size': 1, 'ulysses_degree': 1, 'ring_degree': 1, 'cfg_parallel_size': 1}, 'model_stage': 'diffusion'}, 'final_output': True, 'final_output_type': 'image'}
INFO 01-05 07:52:18 [omni.py:297] [Orchestrator] Waiting for 1 stages to initialize (timeout: 300s)
[Stage-0] WARNING 01-05 07:52:27 [mooncake_connector.py:18] Mooncake not available, MooncakeOmniConnector will not work
[Stage-0] INFO 01-05 07:52:27 [omni_stage.py:434] Starting stage worker with model: /workspace/dev/vipdev/hf_models/LongCat-Image
[Stage-0] WARNING 01-05 07:52:28 [envs.py:194] Flash Attention library "flash_attn" not found, using pytorch attention implementation
[Stage-0] INFO 01-05 07:52:28 [diffusion_engine.py:213] Starting server...
[Stage-0] WARNING 01-05 07:52:36 [mooncake_connector.py:18] Mooncake not available, MooncakeOmniConnector will not work
[Stage-0] WARNING 01-05 07:52:37 [envs.py:194] Flash Attention library "flash_attn" not found, using pytorch attention implementation
[Stage-0] INFO 01-05 07:52:37 [gpu_worker.py:174] Worker 0 created result MessageQueue
[Stage-0] INFO 01-05 07:52:37 [scheduler.py:228] Chunked prefill is enabled with max_num_batched_tokens=2048.
[W105 07:52:37.848827436 socket.cpp:767] [c10d] The client socket cannot be initialized to connect to [localhost]:30063 (errno: 97 - Address family not supported by protocol).
[W105 07:52:37.850094326 socket.cpp:767] [c10d] The client socket cannot be initialized to connect to [10.189.108.254]:30063 (errno: 97 - Address family not supported by protocol).
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Stage-0] INFO 01-05 07:52:37 [gpu_worker.py:75] Worker 0: Initialized device and distributed environment.
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:04<00:00,  1.09it/s]
The image processor of type `Qwen2VLImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`. Note that this behavior will be extended to all models in a future release.
The tokenizer you are loading from '/workspace/dev/vipdev/hf_models/LongCat-Image' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.
The tokenizer you are loading from '/workspace/dev/vipdev/hf_models/LongCat-Image' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:02<00:00,  2.90s/it]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:02<00:00,  2.90s/it]

[Stage-0] INFO 01-05 07:52:48 [diffusers_loader.py:214] Loading weights took 3.82 seconds
[Stage-0] INFO 01-05 07:52:48 [gpu_worker.py:97] Model loading took 37.7358 GiB and 10.921257 seconds
[Stage-0] INFO 01-05 07:52:48 [gpu_worker.py:102] Worker 0: Model loaded successfully.
[Stage-0] INFO 01-05 07:52:48 [gpu_worker.py:310] Worker 0: Scheduler loop started.
[Stage-0] INFO 01-05 07:52:48 [gpu_worker.py:233] Worker 0 ready to receive requests via shared memory
[Stage-0] INFO 01-05 07:52:48 [scheduler.py:46] SyncScheduler initialized result MessageQueue
[Stage-0] INFO 01-05 07:52:48 [omni_stage.py:627] Max batch size: 1
INFO 01-05 07:52:48 [omni.py:290] [Orchestrator] Stage-0 reported ready
INFO 01-05 07:52:48 [omni.py:316] [Orchestrator] All stages initialized successfully

============================================================
Generation Configuration:
  Model: /workspace/dev/vipdev/hf_models/LongCat-Image
  Inference steps: 50
  Cache backend: None (no acceleration)
  Parallel configuration: ulysses_degree=1, ring_degree=1
  Image size: 1024x1024
============================================================

Adding requests:   0%| 0/1 [00:00<?, ?it/s]
[Stage-0] INFO 01-05 07:52:48 [omni_diffusion.py:112] Prepared 1 requests for generation.
You're using a Qwen2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
[Stage-0] INFO 01-05 07:53:31 [diffusion_engine.py:86] Generation completed successfully.
[Stage-0] INFO 01-05 07:53:31 [diffusion_engine.py:109] Post-processing completed in 0.0922 seconds
INFO 01-05 07:53:31 [log_utils.py:549] {'type': 'request_level_metrics',
INFO 01-05 07:53:31 [log_utils.py:549]  'request_id': '0_2dc245b9-ac08-43d9-ae97-23e33938d612',
INFO 01-05 07:53:31 [log_utils.py:549]  'e2e_time_ms': 42394.21105384827,
INFO 01-05 07:53:31 [log_utils.py:549]  'e2e_tpt': 0.0,
INFO 01-05 07:53:31 [log_utils.py:549]  'e2e_total_tokens': 0,
INFO 01-05 07:53:31 [log_utils.py:549]  'transfers_total_time_ms': 0.0,
INFO 01-05 07:53:31 [log_utils.py:549]  'transfers_total_bytes': 0,
INFO 01-05 07:53:31 [log_utils.py:549]  'stages': {0: {'stage_gen_time_ms': 42355.54218292236,
INFO 01-05 07:53:31 [log_utils.py:549]                 'num_tokens_out': 0,
INFO 01-05 07:53:31 [log_utils.py:549]                 'num_tokens_in': 0}}}
Processed prompts: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:42<00:00, 42.39s/img, est. speed stage-0 img/s: 0.00, avg e2e_lat: 0.0ms]
INFO 01-05 07:53:31 [omni.py:711] [Summary] {'e2e_requests': 1,
INFO 01-05 07:53:31 [omni.py:711]  'e2e_total_time_ms': 42396.44718170166,
INFO 01-05 07:53:31 [omni.py:711]  'e2e_sum_time_ms': 42394.21105384827,
INFO 01-05 07:53:31 [omni.py:711]  'e2e_total_tokens': 0,
INFO 01-05 07:53:31 [omni.py:711]  'e2e_avg_time_per_request_ms': 42394.21105384827,
INFO 01-05 07:53:31 [omni.py:711]  'e2e_avg_tokens_per_s': 0.0,
INFO 01-05 07:53:31 [omni.py:711]  'wall_time_ms': 42396.44718170166,
INFO 01-05 07:53:31 [omni.py:711]  'final_stage_id': {'0_2dc245b9-ac08-43d9-ae97-23e33938d612': 0},
INFO 01-05 07:53:31 [omni.py:711]  'stages': [{'stage_id': 0,
INFO 01-05 07:53:31 [omni.py:711]              'requests': 1,
INFO 01-05 07:53:31 [omni.py:711]              'tokens': 0,
INFO 01-05 07:53:31 [omni.py:711]              'total_time_ms': 42394.585609436035,
INFO 01-05 07:53:31 [omni.py:711]              'avg_time_per_request_ms': 42394.585609436035,
INFO 01-05 07:53:31 [omni.py:711]              'avg_tokens_per_s': 0.0}],
INFO 01-05 07:53:31 [omni.py:711]  'transfers': []}
Adding requests:   0%|                                                                                                                                                                                                              | 0/1 [00:42<?, ?it/s]
[Stage-0] INFO 01-05 07:53:31 [omni_stage.py:636] Received shutdown signal
[Stage-0] INFO 01-05 07:53:31 [gpu_worker.py:265] Worker 0: Received shutdown message
[Stage-0] INFO 01-05 07:53:31 [gpu_worker.py:287] event loop terminated.
[Stage-0] INFO 01-05 07:53:31 [gpu_worker.py:318] Worker 0: Shutdown complete.
Total generation time: 45.9301 seconds (45930.09 ms)
INFO 01-05 07:53:34 [text_to_image.py:168] Outputs: [OmniRequestOutput(request_id='', finished=True, stage_id=0, final_output_type='image', request_output=[OmniRequestOutput(request_id='0_2dc245b9-ac08-43d9-ae97-23e33938d612', finished=True, stage_id=None, final_output_type='image', request_output=None, images=[1 PIL Images], prompt="Add a white art board written with colorful text 'vLLM-Omni' on grassland. Add a paintbrush in the bear's hands. position the bear standing in front of the art board as if painting", latents=None, metrics={})], images=[], prompt=None, latents=None, metrics={})]
Saved generated image to output_image_edit_nocache.png
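For reference, the `e2e_time_ms` values logged in the two runs above work out to roughly a 2.5x speedup from cache_dit:

```python
# Speedup implied by the e2e_time_ms values logged above.
with_cache_s = 17.122     # w/ --cache_backend cache_dit
without_cache_s = 42.394  # w/o cache
speedup = without_cache_s / with_cache_s
print(f"speedup: {speedup:.2f}x")  # roughly 2.48x
```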

  • longcat-image-edit w/ cache-dit: 80.8s

python3 image_edit.py \
    --image qwen_bear.png \
    --prompt "Add a white art board written with colorful text 'vLLM-Omni' on grassland. Add a paintbrush in the bear's hands. position the bear standing in front of the art board as if painting" \
    --output output_image_edit.png \
    --num_inference_steps 50 \
    --guidance_scale 4.5 \
    --seed 42 \
    --model $LONGCAT_IMAGE_EDIT_DIR \
    --cache_backend cache_dit \
    --cache_dit_max_continuous_cached_steps 2
WARNING 01-05 08:08:26 [mooncake_connector.py:18] Mooncake not available, MooncakeOmniConnector will not work
INFO 01-05 08:08:26 [omni.py:122] Initializing stages for model: /workspace/dev/vipdev/hf_models/LongCat-Image-Edit
INFO 01-05 08:08:26 [initialization.py:35] No OmniTransferConfig provided
INFO 01-05 08:08:26 [omni_stage.py:107] [OmniStage] stage_config: {'stage_id': 0, 'stage_type': 'diffusion', 'runtime': {'process': True, 'devices': '0', 'max_batch_size': 1}, 'engine_args': {'model': '/workspace/dev/vipdev/hf_models/LongCat-Image-Edit', 'vae_use_slicing': False, 'vae_use_tiling': False, 'cache_backend': 'cache_dit', 'cache_config': {'Fn_compute_blocks': 1, 'Bn_compute_blocks': 0, 'max_warmup_steps': 4, 'residual_diff_threshold': 0.24, 'max_continuous_cached_steps': 2, 'enable_taylorseer': False, 'taylorseer_order': 1, 'scm_steps_mask_policy': None, 'scm_steps_policy': 'dynamic'}, 'parallel_config': {'pipeline_parallel_size': 1, 'data_parallel_size': 1, 'tensor_parallel_size': 1, 'sequence_parallel_size': 1, 'ulysses_degree': 1, 'ring_degree': 1, 'cfg_parallel_size': 1}, 'model_stage': 'diffusion'}, 'final_output': True, 'final_output_type': 'image'}
INFO 01-05 08:08:26 [omni.py:297] [Orchestrator] Waiting for 1 stages to initialize (timeout: 300s)
[Stage-0] WARNING 01-05 08:08:34 [mooncake_connector.py:18] Mooncake not available, MooncakeOmniConnector will not work
[Stage-0] INFO 01-05 08:08:35 [omni_stage.py:434] Starting stage worker with model: /workspace/dev/vipdev/hf_models/LongCat-Image-Edit
[Stage-0] WARNING 01-05 08:08:35 [envs.py:194] Flash Attention library "flash_attn" not found, using pytorch attention implementation
[Stage-0] INFO 01-05 08:08:36 [diffusion_engine.py:213] Starting server...
[Stage-0] WARNING 01-05 08:08:44 [mooncake_connector.py:18] Mooncake not available, MooncakeOmniConnector will not work
[Stage-0] WARNING 01-05 08:08:45 [envs.py:194] Flash Attention library "flash_attn" not found, using pytorch attention implementation
[Stage-0] INFO 01-05 08:08:45 [gpu_worker.py:174] Worker 0 created result MessageQueue
[Stage-0] INFO 01-05 08:08:45 [scheduler.py:228] Chunked prefill is enabled with max_num_batched_tokens=2048.
[W105 08:08:45.572301128 socket.cpp:767] [c10d] The client socket cannot be initialized to connect to [localhost]:30008 (errno: 97 - Address family not supported by protocol).
[W105 08:08:45.573489541 socket.cpp:767] [c10d] The client socket cannot be initialized to connect to [10.189.108.254]:30008 (errno: 97 - Address family not supported by protocol).
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Stage-0] INFO 01-05 08:08:45 [gpu_worker.py:75] Worker 0: Initialized device and distributed environment.
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:05<00:00,  1.06s/it]
The image processor of type `Qwen2VLImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`. Note that this behavior will be extended to all models in a future release.
The tokenizer you are loading from '/workspace/dev/vipdev/hf_models/LongCat-Image-Edit' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.
The tokenizer you are loading from '/workspace/dev/vipdev/hf_models/LongCat-Image-Edit' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:03<00:00,  3.76s/it]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:03<00:00,  3.76s/it]

[Stage-0] INFO 01-05 08:08:57 [diffusers_loader.py:214] Loading weights took 4.41 seconds
[Stage-0] INFO 01-05 08:08:57 [gpu_worker.py:97] Model loading took 37.7358 GiB and 12.267342 seconds
[Stage-0] INFO 01-05 08:08:57 [gpu_worker.py:102] Worker 0: Model loaded successfully.
[Stage-0] INFO 01-05 08:08:57 [cache_dit_backend.py:468] Using custom cache-dit enabler for model: LongCatImageEditPipeline
[Stage-0] INFO 01-05 08:08:57 [cache_dit_backend.py:211] Enabling cache-dit on LongCatImage transformer with BlockAdapter: Fn=1, Bn=0, W=4,
WARNING 01-05 08:08:57 [block_adapters.py:131] pipe is None, use FakeDiffusionPipeline instead.
INFO 01-05 08:08:57 [block_adapters.py:220] Auto fill blocks_name: ['transformer_blocks', 'single_transformer_blocks'].
INFO 01-05 08:08:57 [block_adapters.py:153] Found transformer NOT from diffusers: vllm_omni.diffusion.models.longcat_image.longcat_image_transformer disable check_forward_pattern by default.
INFO 01-05 08:08:57 [cache_adapter.py:77] Adapting Cache Acceleration using custom BlockAdapter!
WARNING 01-05 08:08:57 [block_adapters.py:469] Skipped Forward Pattern Check: ForwardPattern.Pattern_1
WARNING 01-05 08:08:57 [block_adapters.py:469] Skipped Forward Pattern Check: ForwardPattern.Pattern_1
INFO 01-05 08:08:57 [cache_adapter.py:142] Use default 'enable_separate_cfg' from block adapter register: False, Pipeline: FakeDiffusionPipeline.
INFO 01-05 08:08:57 [cache_adapter.py:332] Collected Context Config: DBCache_F1B0_W4I1M0MC2_R0.24_CFG0, Calibrator Config: None
WARNING 01-05 08:08:57 [pattern_base.py:78] Skipped Forward Pattern Check: ForwardPattern.Pattern_1
INFO 01-05 08:08:57 [pattern_base.py:70] Match Blocks: CachedBlocks_Pattern_0_1_2, for transformer_blocks, cache_context: transformer_blocks_140345843095872, context_manager: FakeDiffusionPipeline_140346848452944.
WARNING 01-05 08:08:57 [pattern_base.py:78] Skipped Forward Pattern Check: ForwardPattern.Pattern_1
INFO 01-05 08:08:57 [pattern_base.py:70] Match Blocks: CachedBlocks_Pattern_0_1_2, for single_transformer_blocks, cache_context: single_transformer_blocks_140345842854144, context_manager: FakeDiffusionPipeline_140346848452944.
[Stage-0] INFO 01-05 08:08:57 [cache_dit_backend.py:475] Cache-dit enabled successfully on LongCatImageEditPipeline
[Stage-0] INFO 01-05 08:08:57 [gpu_worker.py:310] Worker 0: Scheduler loop started.
[Stage-0] INFO 01-05 08:08:57 [gpu_worker.py:233] Worker 0 ready to receive requests via shared memory
[Stage-0] INFO 01-05 08:08:57 [scheduler.py:46] SyncScheduler initialized result MessageQueue
[Stage-0] INFO 01-05 08:08:57 [omni_stage.py:627] Max batch size: 1
INFO 01-05 08:08:57 [omni.py:290] [Orchestrator] Stage-0 reported ready
INFO 01-05 08:08:57 [omni.py:316] [Orchestrator] All stages initialized successfully
Pipeline loaded

============================================================
Generation Configuration:
  Model: /workspace/dev/vipdev/hf_models/LongCat-Image-Edit
  Inference steps: 50
  Cache backend: cache_dit
  Input image size: (232, 282)
  Parallel configuration: ulysses_degree=1, ring_degree=1
============================================================

Adding requests:   0%| | 0/1 [00:00<?, ?it/s, est. speed input: 0.00 unit/s, output: 0.00 unit/s]
[Stage-0] INFO 01-05 08:08:57 [omni_diffusion.py:112] Prepared 1 requests for generation.
[Stage-0] INFO 01-05 08:08:58 [diffusion_engine.py:81] Pre-processing completed in 0.2521 seconds
[Stage-0] INFO 01-05 08:08:58 [cache_dit_backend.py:496] Refreshing cache context for transformer with num_inference_steps: 50
INFO 01-05 08:08:58 [cache_adapter.py:723] ✅ Refreshed cache context: transformer_blocks_140345843095872, Collected Context Config: DBCache_F1B0_W4I1M0MC2_R0.24_N50_CFG0, Calibrator Config: None
INFO 01-05 08:08:58 [cache_adapter.py:723] ✅ Refreshed cache context: single_transformer_blocks_140345842854144, Collected Context Config: DBCache_F1B0_W4I1M0MC2_R0.24_N50_CFG0, Calibrator Config: None
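The context tag `DBCache_F1B0_W4I1M0MC2_R0.24_N50_CFG0` in the refresh lines above packs the `cache_config` values from the stage config (note `N50` only appears once the context is refreshed with `num_inference_steps=50`). A hypothetical decoder for the safe-to-read fields — the layout is inferred from this log, not a documented cache-dit format, and the `I…M…MC…` fields are deliberately left undecoded:

```python
import re

# Hypothetical parser for the context tag printed by cache-dit above; the
# field layout is inferred from the log output, not from a documented format.
TAG_RE = re.compile(
    r"DBCache_F(?P<fn>\d+)B(?P<bn>\d+)"
    r"_W(?P<warmup>\d+)I\d+M\d+MC\d+"   # I/M/MC fields: meaning not inferred here
    r"_R(?P<rdt>[\d.]+)"
    r"(?:_N(?P<steps>\d+))?"            # present only after a refresh
    r"_CFG(?P<cfg>\d+)"
)

def decode_dbcache_tag(tag: str) -> dict:
    m = TAG_RE.match(tag)
    assert m, f"unrecognized tag: {tag}"
    return {
        "Fn_compute_blocks": int(m["fn"]),           # Fn=1 in the run above
        "Bn_compute_blocks": int(m["bn"]),           # Bn=0
        "max_warmup_steps": int(m["warmup"]),        # W=4
        "residual_diff_threshold": float(m["rdt"]),  # 0.24
        "num_inference_steps": int(m["steps"]) if m["steps"] else None,
        "enable_separate_cfg": bool(int(m["cfg"])),  # CFG0 -> False
    }

print(decode_dbcache_tag("DBCache_F1B0_W4I1M0MC2_R0.24_N50_CFG0"))
```

The decoded values line up with the `cache_config` dict in the stage config logged at startup.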
You're using a Qwen2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
[Stage-0] INFO 01-05 08:09:58 [shm_broadcast.py:501] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
[Stage-0] INFO 01-05 08:10:18 [diffusion_engine.py:86] Generation completed successfully.
[Stage-0] INFO 01-05 08:10:18 [diffusion_engine.py:109] Post-processing completed in 0.0752 seconds
INFO 01-05 08:10:18 [log_utils.py:549] {'type': 'request_level_metrics',
INFO 01-05 08:10:18 [log_utils.py:549]  'request_id': '0_c91d09e9-a7fb-4537-8927-39f6b1484cdb',
INFO 01-05 08:10:18 [log_utils.py:549]  'e2e_time_ms': 80810.90354919434,
INFO 01-05 08:10:18 [log_utils.py:549]  'e2e_tpt': 0.0,
INFO 01-05 08:10:18 [log_utils.py:549]  'e2e_total_tokens': 0,
INFO 01-05 08:10:18 [log_utils.py:549]  'transfers_total_time_ms': 0.0,
INFO 01-05 08:10:18 [log_utils.py:549]  'transfers_total_bytes': 0,
INFO 01-05 08:10:18 [log_utils.py:549]  'stages': {0: {'stage_gen_time_ms': 80774.80435371399,
INFO 01-05 08:10:18 [log_utils.py:549]                 'num_tokens_out': 0,
INFO 01-05 08:10:18 [log_utils.py:549]                 'num_tokens_in': 0}}}
Processed prompts: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [01:20<00:00, 80.81s/img, est. speed stage-0 img/s: 0.00, avg e2e_lat: 0.0ms]
INFO 01-05 08:10:18 [omni.py:711] [Summary] {'e2e_requests': 1,
INFO 01-05 08:10:18 [omni.py:711]  'e2e_total_time_ms': 80811.88321113586,
INFO 01-05 08:10:18 [omni.py:711]  'e2e_sum_time_ms': 80810.90354919434,
INFO 01-05 08:10:18 [omni.py:711]  'e2e_total_tokens': 0,
INFO 01-05 08:10:18 [omni.py:711]  'e2e_avg_time_per_request_ms': 80810.90354919434,
INFO 01-05 08:10:18 [omni.py:711]  'e2e_avg_tokens_per_s': 0.0,
INFO 01-05 08:10:18 [omni.py:711]  'wall_time_ms': 80811.88321113586,
INFO 01-05 08:10:18 [omni.py:711]  'final_stage_id': {'0_c91d09e9-a7fb-4537-8927-39f6b1484cdb': 0},
INFO 01-05 08:10:18 [omni.py:711]  'stages': [{'stage_id': 0,
INFO 01-05 08:10:18 [omni.py:711]              'requests': 1,
INFO 01-05 08:10:18 [omni.py:711]              'tokens': 0,
INFO 01-05 08:10:18 [omni.py:711]              'total_time_ms': 80811.10501289368,
INFO 01-05 08:10:18 [omni.py:711]              'avg_time_per_request_ms': 80811.10501289368,
INFO 01-05 08:10:18 [omni.py:711]              'avg_tokens_per_s': 0.0}],
INFO 01-05 08:10:18 [omni.py:711]  'transfers': []}
Adding requests:   0%|                                                                                                                                                                                                              | 0/1 [01:20<?, ?it/s]
[Stage-0] INFO 01-05 08:10:18 [omni_stage.py:636] Received shutdown signal
[Stage-0] INFO 01-05 08:10:18 [gpu_worker.py:265] Worker 0: Received shutdown message
[Stage-0] INFO 01-05 08:10:18 [gpu_worker.py:287] event loop terminated.
[Stage-0] INFO 01-05 08:10:18 [gpu_worker.py:318] Worker 0: Shutdown complete.
Total generation time: 85.0818 seconds (85081.80 ms)
INFO 01-05 08:10:23 [image_edit.py:349] Outputs: [OmniRequestOutput(request_id='', finished=True, stage_id=0, final_output_type='image', request_output=[OmniRequestOutput(request_id='0_c91d09e9-a7fb-4537-8927-39f6b1484cdb', finished=True, stage_id=None, final_output_type='image', request_output=None, images=[1 PIL Images], prompt="Add a white art board written with colorful text 'vLLM-Omni' on grassland. Add a paintbrush in the bear's hands. position the bear standing in front of the art board as if painting", latents=None, metrics={})], images=[], prompt=None, latents=None, metrics={})]
Saved edited image to /workspace/dev/vipshop/vllm-omni/examples/offline_inference/image_to_image/output_image_edit.png
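The bug this PR fixes was a pipeline-name mismatch: the edit pipeline's class name (`LongCatImageEditPipeline`, visible in the log above) didn't match what the cache backend expected, so requests dispatched to the wrong (or no) cache path. A minimal sketch of name-keyed dispatch under that fix — the registry, decorator, and function names here are illustrative, not the actual vLLM-Omni code:

```python
# Hypothetical registry keyed by pipeline class name; the real dispatch
# lives in vllm-omni's cache_dit backend and may be structured differently.
CACHE_ADAPTERS = {}

def register_cache_adapter(*pipeline_names):
    def wrap(fn):
        for name in pipeline_names:
            CACHE_ADAPTERS[name] = fn
        return fn
    return wrap

# The fix in spirit: register the adapter under both the text-to-image and
# the image-edit pipeline names, so either class dispatches to the cache.
@register_cache_adapter("LongCatImagePipeline", "LongCatImageEditPipeline")
def enable_longcat_cache(pipe):
    return f"cache-dit enabled on {type(pipe).__name__}"

class LongCatImageEditPipeline:  # stand-in for the real pipeline class
    pass

pipe = LongCatImageEditPipeline()
adapter = CACHE_ADAPTERS.get(type(pipe).__name__)
print(adapter(pipe) if adapter else "no cache adapter found")
```

Before the fix, the lookup for `LongCatImageEditPipeline` would miss and the transformer ran uncached every step, which matches the timings below.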

- longcat-image-edit w/o cache-dit: 196.4s

```
python3 image_edit.py \
    --image qwen_bear.png \
    --prompt "Add a white art board written with colorful text 'vLLM-Omni' on grassland. Add a paintbrush in the bear's hands. position the bear standing in front of the art board as if painting" \
    --output output_image_edit_nocache.png \
    --num_inference_steps 50 \
    --guidance_scale 4.5 \
    --seed 42 \
    --model $LONGCAT_IMAGE_EDIT_DIR
```
WARNING 01-05 08:11:04 [mooncake_connector.py:18] Mooncake not available, MooncakeOmniConnector will not work
INFO 01-05 08:11:05 [omni.py:122] Initializing stages for model: /workspace/dev/vipdev/hf_models/LongCat-Image-Edit
INFO 01-05 08:11:05 [initialization.py:35] No OmniTransferConfig provided
INFO 01-05 08:11:05 [omni_stage.py:107] [OmniStage] stage_config: {'stage_id': 0, 'stage_type': 'diffusion', 'runtime': {'process': True, 'devices': '0', 'max_batch_size': 1}, 'engine_args': {'model': '/workspace/dev/vipdev/hf_models/LongCat-Image-Edit', 'vae_use_slicing': False, 'vae_use_tiling': False, 'cache_backend': None, 'cache_config': None, 'parallel_config': {'pipeline_parallel_size': 1, 'data_parallel_size': 1, 'tensor_parallel_size': 1, 'sequence_parallel_size': 1, 'ulysses_degree': 1, 'ring_degree': 1, 'cfg_parallel_size': 1}, 'model_stage': 'diffusion'}, 'final_output': True, 'final_output_type': 'image'}
INFO 01-05 08:11:05 [omni.py:297] [Orchestrator] Waiting for 1 stages to initialize (timeout: 300s)
[Stage-0] WARNING 01-05 08:11:13 [mooncake_connector.py:18] Mooncake not available, MooncakeOmniConnector will not work
[Stage-0] INFO 01-05 08:11:13 [omni_stage.py:434] Starting stage worker with model: /workspace/dev/vipdev/hf_models/LongCat-Image-Edit
[Stage-0] WARNING 01-05 08:11:14 [envs.py:194] Flash Attention library "flash_attn" not found, using pytorch attention implementation
[Stage-0] INFO 01-05 08:11:15 [diffusion_engine.py:213] Starting server...
[Stage-0] WARNING 01-05 08:11:22 [mooncake_connector.py:18] Mooncake not available, MooncakeOmniConnector will not work
[Stage-0] WARNING 01-05 08:11:24 [envs.py:194] Flash Attention library "flash_attn" not found, using pytorch attention implementation
[Stage-0] INFO 01-05 08:11:24 [gpu_worker.py:174] Worker 0 created result MessageQueue
[Stage-0] INFO 01-05 08:11:24 [scheduler.py:228] Chunked prefill is enabled with max_num_batched_tokens=2048.
[W105 08:11:24.162905591 socket.cpp:767] [c10d] The client socket cannot be initialized to connect to [localhost]:30038 (errno: 97 - Address family not supported by protocol).
[W105 08:11:24.163894440 socket.cpp:767] [c10d] The client socket cannot be initialized to connect to [10.189.108.254]:30038 (errno: 97 - Address family not supported by protocol).
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Stage-0] INFO 01-05 08:11:24 [gpu_worker.py:75] Worker 0: Initialized device and distributed environment.
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:04<00:00,  1.18it/s]
The image processor of type `Qwen2VLImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`. Note that this behavior will be extended to all models in a future release.
The tokenizer you are loading from '/workspace/dev/vipdev/hf_models/LongCat-Image-Edit' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:02<00:00,  2.43s/it]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:02<00:00,  2.43s/it]

[Stage-0] INFO 01-05 08:11:33 [diffusers_loader.py:214] Loading weights took 3.47 seconds
[Stage-0] INFO 01-05 08:11:34 [gpu_worker.py:97] Model loading took 37.7358 GiB and 10.241416 seconds
[Stage-0] INFO 01-05 08:11:34 [gpu_worker.py:102] Worker 0: Model loaded successfully.
[Stage-0] INFO 01-05 08:11:34 [gpu_worker.py:310] Worker 0: Scheduler loop started.
[Stage-0] INFO 01-05 08:11:34 [gpu_worker.py:233] Worker 0 ready to receive requests via shared memory
[Stage-0] INFO 01-05 08:11:34 [scheduler.py:46] SyncScheduler initialized result MessageQueue
[Stage-0] INFO 01-05 08:11:34 [omni_stage.py:627] Max batch size: 1
INFO 01-05 08:11:34 [omni.py:290] [Orchestrator] Stage-0 reported ready
INFO 01-05 08:11:34 [omni.py:316] [Orchestrator] All stages initialized successfully
Pipeline loaded

============================================================
Generation Configuration:
  Model: /workspace/dev/vipdev/hf_models/LongCat-Image-Edit
  Inference steps: 50
  Cache backend: None (no acceleration)
  Input image size: (232, 282)
  Parallel configuration: ulysses_degree=1, ring_degree=1
============================================================

Adding requests:   0%| | 0/1 [00:00<?, ?it/s, est. speed input: 0.00 unit/s, output: 0.00 unit/s]
[Stage-0] INFO 01-05 08:11:34 [omni_diffusion.py:112] Prepared 1 requests for generation.
[Stage-0] INFO 01-05 08:11:34 [diffusion_engine.py:81] Pre-processing completed in 0.0889 seconds
You're using a Qwen2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
[Stage-0] INFO 01-05 08:12:34 [shm_broadcast.py:501] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
[Stage-0] INFO 01-05 08:13:34 [shm_broadcast.py:501] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
[Stage-0] INFO 01-05 08:14:34 [shm_broadcast.py:501] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
[Stage-0] INFO 01-05 08:14:50 [diffusion_engine.py:86] Generation completed successfully.
[Stage-0] INFO 01-05 08:14:50 [diffusion_engine.py:109] Post-processing completed in 0.0811 seconds
INFO 01-05 08:14:50 [log_utils.py:549] {'type': 'request_level_metrics',
INFO 01-05 08:14:50 [log_utils.py:549]  'request_id': '0_e356bce6-53d4-4d75-881c-dee30a6f6507',
INFO 01-05 08:14:50 [log_utils.py:549]  'e2e_time_ms': 196497.98941612244,
INFO 01-05 08:14:50 [log_utils.py:549]  'e2e_tpt': 0.0,
INFO 01-05 08:14:50 [log_utils.py:549]  'e2e_total_tokens': 0,
INFO 01-05 08:14:50 [log_utils.py:549]  'transfers_total_time_ms': 0.0,
INFO 01-05 08:14:50 [log_utils.py:549]  'transfers_total_bytes': 0,
INFO 01-05 08:14:50 [log_utils.py:549]  'stages': {0: {'stage_gen_time_ms': 196457.38887786865,
INFO 01-05 08:14:50 [log_utils.py:549]                 'num_tokens_out': 0,
INFO 01-05 08:14:50 [log_utils.py:549]                 'num_tokens_in': 0}}}
Processed prompts: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [03:16<00:00, 196.50s/img, est. speed stage-0 img/s: 0.00, avg e2e_lat: 0.0ms]
INFO 01-05 08:14:50 [omni.py:711] [Summary] {'e2e_requests': 1,
INFO 01-05 08:14:50 [omni.py:711]  'e2e_total_time_ms': 196499.73726272583,
INFO 01-05 08:14:50 [omni.py:711]  'e2e_sum_time_ms': 196497.98941612244,
INFO 01-05 08:14:50 [omni.py:711]  'e2e_total_tokens': 0,
INFO 01-05 08:14:50 [omni.py:711]  'e2e_avg_time_per_request_ms': 196497.98941612244,
INFO 01-05 08:14:50 [omni.py:711]  'e2e_avg_tokens_per_s': 0.0,
INFO 01-05 08:14:50 [omni.py:711]  'wall_time_ms': 196499.73726272583,
INFO 01-05 08:14:50 [omni.py:711]  'final_stage_id': {'0_e356bce6-53d4-4d75-881c-dee30a6f6507': 0},
INFO 01-05 08:14:50 [omni.py:711]  'stages': [{'stage_id': 0,
INFO 01-05 08:14:50 [omni.py:711]              'requests': 1,
INFO 01-05 08:14:50 [omni.py:711]              'tokens': 0,
INFO 01-05 08:14:50 [omni.py:711]              'total_time_ms': 196498.29936027527,
INFO 01-05 08:14:50 [omni.py:711]              'avg_time_per_request_ms': 196498.29936027527,
INFO 01-05 08:14:50 [omni.py:711]              'avg_tokens_per_s': 0.0}],
INFO 01-05 08:14:50 [omni.py:711]  'transfers': []}
Adding requests:   0%|                                                                                                                                                                                                              | 0/1 [03:16<?, ?it/s]
[Stage-0] INFO 01-05 08:14:50 [omni_stage.py:636] Received shutdown signal
[Stage-0] INFO 01-05 08:14:50 [gpu_worker.py:265] Worker 0: Received shutdown message
[Stage-0] INFO 01-05 08:14:50 [gpu_worker.py:287] event loop terminated.
[Stage-0] INFO 01-05 08:14:51 [gpu_worker.py:318] Worker 0: Shutdown complete.
Total generation time: 200.7092 seconds (200709.22 ms)
INFO 01-05 08:14:55 [image_edit.py:349] Outputs: [OmniRequestOutput(request_id='', finished=True, stage_id=0, final_output_type='image', request_output=[OmniRequestOutput(request_id='0_e356bce6-53d4-4d75-881c-dee30a6f6507', finished=True, stage_id=None, final_output_type='image', request_output=None, images=[1 PIL Images], prompt="Add a white art board written with colorful text 'vLLM-Omni' on grassland. Add a paintbrush in the bear's hands. position the bear standing in front of the art board as if painting", latents=None, metrics={})], images=[], prompt=None, latents=None, metrics={})]
Saved edited image to /workspace/dev/vipshop/vllm-omni/examples/offline_inference/image_to_image/output_image_edit_nocache.png
| NVIDIA L20x1: NO Cache | NVIDIA L20x1: w/ cache-dit |
| --- | --- |
| LongCat-Image: 42.39s | LongCat-Image: 17.12s |
| output_image_edit_nocache | output_image_edit |
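From the timings above, the cache-dit speedup on a single L20 works out to roughly 2.5x:

```python
# Timings taken from the benchmark table above (NVIDIA L20x1, LongCat-Image).
no_cache_s = 42.39  # no cache
cached_s = 17.12    # w/ cache-dit
speedup = no_cache_s / cached_s
print(f"speedup: {speedup:.2f}x")  # -> speedup: 2.48x
```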


DefTruth commented Jan 5, 2026

@e1ijah1 Can you also take a look?

david6666666 (Collaborator) commented:

fix DCO, and add test result

princepride and others added 5 commits January 5, 2026 07:09
Signed-off-by: princepride <wangzhipeng628@gmail.com>
Signed-off-by: DefTruth <qiustudent_r@163.com>
Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
Signed-off-by: Zhou Taichang <tzhouam@connect.ust.hk>
Signed-off-by: DefTruth <qiustudent_r@163.com>
Signed-off-by: samithuang <285365963@qq.com>
Signed-off-by: SamitHuang <285365963@qq.com>
Signed-off-by: Samit <285365963@qq.com>
Signed-off-by: Hongsheng Liu <liuhongsheng4@huawei.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>
Signed-off-by: DefTruth <qiustudent_r@163.com>
Signed-off-by: DefTruth <qiustudent_r@163.com>
… 'abort' (vllm-project#624)

Signed-off-by: ZeldaHuang <hzm414167@alibaba-inc.com>
Signed-off-by: DefTruth <qiustudent_r@163.com>

e1ijah1 commented Jan 5, 2026

We could also bump the cache-dit version.

@SamitHuang SamitHuang added the ready label to trigger buildkite CI label Jan 5, 2026
@david6666666 david6666666 enabled auto-merge (squash) January 5, 2026 07:51

DefTruth commented Jan 5, 2026

> fix DCO, and add test result

done

@hsliuustc0106 hsliuustc0106 disabled auto-merge January 5, 2026 08:25
@hsliuustc0106 hsliuustc0106 merged commit 927d952 into vllm-project:main Jan 5, 2026
7 checks passed
tzhouam added a commit to tzhouam/vllm-omni that referenced this pull request Jan 6, 2026
Signed-off-by: princepride <wangzhipeng628@gmail.com>
Signed-off-by: DefTruth <qiustudent_r@163.com>
Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
Signed-off-by: Zhou Taichang <tzhouam@connect.ust.hk>
Signed-off-by: samithuang <285365963@qq.com>
Signed-off-by: SamitHuang <285365963@qq.com>
Signed-off-by: Samit <285365963@qq.com>
Signed-off-by: Hongsheng Liu <liuhongsheng4@huawei.com>
Signed-off-by: ZeldaHuang <hzm414167@alibaba-inc.com>
Co-authored-by: 汪志鹏 <wangzhipeng628@gmail.com>
Co-authored-by: Zhou Taichang <tzhouam@connect.ust.hk>
Co-authored-by: Samit <285365963@qq.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>
Co-authored-by: Ziming Huang <hzm414167@alibaba-inc.com>
princepride added a commit to princepride/vllm-omni that referenced this pull request Jan 10, 2026
sniper35 pushed a commit to sniper35/vllm-omni that referenced this pull request Jan 10, 2026
ZJY0516 pushed a commit to LawJarp-A/vllm-omni that referenced this pull request Jan 10, 2026
@DefTruth DefTruth deleted the fix-cache-longcat branch January 20, 2026 06:55
Successfully merging this pull request may close these issues:
[Bug]: Cache DiT not supported by LongCat-Image

8 participants