
bugfix: fix longcat-image cache dispatch#638

Merged
hsliuustc0106 merged 6 commits into vllm-project:main from xlite-dev:fix-cache-longcat
Jan 5, 2026

Conversation

@DefTruth (Contributor) commented Jan 5, 2026

Fixes #630.

A pipeline name mismatch led to the wrong cache dispatch for LongCat-Image.
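To illustrate the failure mode, here is a minimal sketch of name-based cache dispatch; all function and registry names below are assumptions for illustration, not the actual vllm-omni code:

```python
# Hypothetical sketch: cache enablers keyed by pipeline class name.
# If the registry key does not match the real class name (the bug),
# the pipeline silently falls through to the generic path.

def enable_longcat_cache(pipe_name: str) -> str:
    # Custom cache-dit enabler (hypothetical stand-in).
    return f"custom cache-dit enabler for {pipe_name}"

def enable_generic_cache(pipe_name: str) -> str:
    # Fallback used when no registered enabler matches.
    return f"generic cache-dit enabler for {pipe_name}"

# With the fix, the keys match the actual pipeline class names
# (LongCatImagePipeline / LongCatImageEditPipeline), so lookup succeeds.
CACHE_ENABLERS = {
    "LongCatImagePipeline": enable_longcat_cache,
    "LongCatImageEditPipeline": enable_longcat_cache,
}

def dispatch_cache(pipe_name: str) -> str:
    enabler = CACHE_ENABLERS.get(pipe_name, enable_generic_cache)
    return enabler(pipe_name)
```

With mismatched keys, `dispatch_cache("LongCatImagePipeline")` would return the generic enabler; with the corrected keys it hits the custom one, matching the `Using custom cache-dit enabler for model: LongCatImagePipeline` line in the logs below.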

With this PR:

  • w/ cache: 17.122s
python text_to_image.py \
    --model $LONGCAT_IMAGE_DIR \
    --prompt "Add a white art board written with colorful text 'vLLM-Omni' on grassland. Add a paintbrush in the bear's hands. position the bear standing in front of the art board as if painting" \
    --output output_image_edit.png \
    --num_inference_steps 50 \
    --cfg_scale 4.0 \
    --cache_backend cache_dit
WARNING 01-05 07:48:45 [mooncake_connector.py:18] Mooncake not available, MooncakeOmniConnector will not work
INFO 01-05 07:48:46 [omni.py:122] Initializing stages for model: /workspace/dev/vipdev/hf_models/LongCat-Image
INFO 01-05 07:48:46 [initialization.py:35] No OmniTransferConfig provided
INFO 01-05 07:48:46 [omni_stage.py:107] [OmniStage] stage_config: {'stage_id': 0, 'stage_type': 'diffusion', 'runtime': {'process': True, 'devices': '0', 'max_batch_size': 1}, 'engine_args': {'model': '/workspace/dev/vipdev/hf_models/LongCat-Image', 'vae_use_slicing': False, 'vae_use_tiling': False, 'cache_backend': 'cache_dit', 'cache_config': {'Fn_compute_blocks': 1, 'Bn_compute_blocks': 0, 'max_warmup_steps': 4, 'residual_diff_threshold': 0.24, 'max_continuous_cached_steps': 3, 'enable_taylorseer': False, 'taylorseer_order': 1, 'scm_steps_mask_policy': None, 'scm_steps_policy': 'dynamic'}, 'parallel_config': {'pipeline_parallel_size': 1, 'data_parallel_size': 1, 'tensor_parallel_size': 1, 'sequence_parallel_size': 1, 'ulysses_degree': 1, 'ring_degree': 1, 'cfg_parallel_size': 1}, 'model_stage': 'diffusion'}, 'final_output': True, 'final_output_type': 'image'}
INFO 01-05 07:48:46 [omni.py:297] [Orchestrator] Waiting for 1 stages to initialize (timeout: 300s)
[Stage-0] WARNING 01-05 07:48:54 [mooncake_connector.py:18] Mooncake not available, MooncakeOmniConnector will not work
[Stage-0] INFO 01-05 07:48:54 [omni_stage.py:434] Starting stage worker with model: /workspace/dev/vipdev/hf_models/LongCat-Image
[Stage-0] WARNING 01-05 07:48:55 [envs.py:194] Flash Attention library "flash_attn" not found, using pytorch attention implementation
[Stage-0] INFO 01-05 07:48:55 [diffusion_engine.py:213] Starting server...
[Stage-0] WARNING 01-05 07:49:03 [mooncake_connector.py:18] Mooncake not available, MooncakeOmniConnector will not work
[Stage-0] WARNING 01-05 07:49:04 [envs.py:194] Flash Attention library "flash_attn" not found, using pytorch attention implementation
[Stage-0] INFO 01-05 07:49:04 [gpu_worker.py:174] Worker 0 created result MessageQueue
[Stage-0] INFO 01-05 07:49:04 [scheduler.py:228] Chunked prefill is enabled with max_num_batched_tokens=2048.
[W105 07:49:04.723704639 socket.cpp:767] [c10d] The client socket cannot be initialized to connect to [localhost]:30089 (errno: 97 - Address family not supported by protocol).
[W105 07:49:04.724641872 socket.cpp:767] [c10d] The client socket cannot be initialized to connect to [10.189.108.254]:30089 (errno: 97 - Address family not supported by protocol).
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Stage-0] INFO 01-05 07:49:04 [gpu_worker.py:75] Worker 0: Initialized device and distributed environment.
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:04<00:00,  1.02it/s]
The image processor of type `Qwen2VLImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`. Note that this behavior will be extended to all models in a future release.
The tokenizer you are loading from '/workspace/dev/vipdev/hf_models/LongCat-Image' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.
The tokenizer you are loading from '/workspace/dev/vipdev/hf_models/LongCat-Image' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:03<00:00,  3.44s/it]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:03<00:00,  3.44s/it]

[Stage-0] INFO 01-05 07:49:15 [diffusers_loader.py:214] Loading weights took 3.87 seconds
[Stage-0] INFO 01-05 07:49:16 [gpu_worker.py:97] Model loading took 37.7358 GiB and 11.258677 seconds
[Stage-0] INFO 01-05 07:49:16 [gpu_worker.py:102] Worker 0: Model loaded successfully.
[Stage-0] INFO 01-05 07:49:16 [cache_dit_backend.py:468] Using custom cache-dit enabler for model: LongCatImagePipeline
[Stage-0] INFO 01-05 07:49:16 [cache_dit_backend.py:211] Enabling cache-dit on LongCatImage transformer with BlockAdapter: Fn=1, Bn=0, W=4,
WARNING 01-05 07:49:16 [block_adapters.py:131] pipe is None, use FakeDiffusionPipeline instead.
INFO 01-05 07:49:16 [block_adapters.py:220] Auto fill blocks_name: ['transformer_blocks', 'single_transformer_blocks'].
INFO 01-05 07:49:16 [block_adapters.py:153] Found transformer NOT from diffusers: vllm_omni.diffusion.models.longcat_image.longcat_image_transformer disable check_forward_pattern by default.
INFO 01-05 07:49:16 [cache_adapter.py:77] Adapting Cache Acceleration using custom BlockAdapter!
WARNING 01-05 07:49:16 [block_adapters.py:469] Skipped Forward Pattern Check: ForwardPattern.Pattern_1
WARNING 01-05 07:49:16 [block_adapters.py:469] Skipped Forward Pattern Check: ForwardPattern.Pattern_1
INFO 01-05 07:49:16 [cache_adapter.py:142] Use default 'enable_separate_cfg' from block adapter register: False, Pipeline: FakeDiffusionPipeline.
INFO 01-05 07:49:16 [cache_adapter.py:332] Collected Context Config: DBCache_F1B0_W4I1M0MC3_R0.24_CFG0, Calibrator Config: None
WARNING 01-05 07:49:16 [pattern_base.py:78] Skipped Forward Pattern Check: ForwardPattern.Pattern_1
INFO 01-05 07:49:16 [pattern_base.py:70] Match Blocks: CachedBlocks_Pattern_0_1_2, for transformer_blocks, cache_context: transformer_blocks_140031439747024, context_manager: FakeDiffusionPipeline_140031439852352.
WARNING 01-05 07:49:16 [pattern_base.py:78] Skipped Forward Pattern Check: ForwardPattern.Pattern_1
INFO 01-05 07:49:16 [pattern_base.py:70] Match Blocks: CachedBlocks_Pattern_0_1_2, for single_transformer_blocks, cache_context: single_transformer_blocks_140031439017008, context_manager: FakeDiffusionPipeline_140031439852352.
[Stage-0] INFO 01-05 07:49:16 [cache_dit_backend.py:475] Cache-dit enabled successfully on LongCatImagePipeline
[Stage-0] INFO 01-05 07:49:16 [gpu_worker.py:310] Worker 0: Scheduler loop started.
[Stage-0] INFO 01-05 07:49:16 [gpu_worker.py:233] Worker 0 ready to receive requests via shared memory
[Stage-0] INFO 01-05 07:49:16 [scheduler.py:46] SyncScheduler initialized result MessageQueue
[Stage-0] INFO 01-05 07:49:16 [omni_stage.py:627] Max batch size: 1
INFO 01-05 07:49:16 [omni.py:290] [Orchestrator] Stage-0 reported ready
INFO 01-05 07:49:16 [omni.py:316] [Orchestrator] All stages initialized successfully

============================================================
Generation Configuration:
  Model: /workspace/dev/vipdev/hf_models/LongCat-Image
  Inference steps: 50
  Cache backend: cache_dit
  Parallel configuration: ulysses_degree=1, ring_degree=1
  Image size: 1024x1024
============================================================

Adding requests:   0%| 0/1 [00:00<?, ?it/s]
[Stage-0] INFO 01-05 07:49:16 [omni_diffusion.py:112] Prepared 1 requests for generation.
[Stage-0] INFO 01-05 07:49:16 [cache_dit_backend.py:496] Refreshing cache context for transformer with num_inference_steps: 50
INFO 01-05 07:49:16 [cache_adapter.py:723] ✅ Refreshed cache context: transformer_blocks_140031439747024, Collected Context Config: DBCache_F1B0_W4I1M0MC3_R0.24_N50_CFG0, Calibrator Config: None
INFO 01-05 07:49:16 [cache_adapter.py:723] ✅ Refreshed cache context: single_transformer_blocks_140031439017008, Collected Context Config: DBCache_F1B0_W4I1M0MC3_R0.24_N50_CFG0, Calibrator Config: None
You're using a Qwen2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
[Stage-0] INFO 01-05 07:49:33 [diffusion_engine.py:86] Generation completed successfully.
[Stage-0] INFO 01-05 07:49:33 [diffusion_engine.py:109] Post-processing completed in 0.0955 seconds
INFO 01-05 07:49:33 [log_utils.py:549] {'type': 'request_level_metrics',
INFO 01-05 07:49:33 [log_utils.py:549]  'request_id': '0_45567883-3c2e-4cfa-a36e-9aa752e94fb0',
INFO 01-05 07:49:33 [log_utils.py:549]  'e2e_time_ms': 17122.5323677063,
INFO 01-05 07:49:33 [log_utils.py:549]  'e2e_tpt': 0.0,
INFO 01-05 07:49:33 [log_utils.py:549]  'e2e_total_tokens': 0,
INFO 01-05 07:49:33 [log_utils.py:549]  'transfers_total_time_ms': 0.0,
INFO 01-05 07:49:33 [log_utils.py:549]  'transfers_total_bytes': 0,
INFO 01-05 07:49:33 [log_utils.py:549]  'stages': {0: {'stage_gen_time_ms': 17087.39161491394,
INFO 01-05 07:49:33 [log_utils.py:549]                 'num_tokens_out': 0,
INFO 01-05 07:49:33 [log_utils.py:549]                 'num_tokens_in': 0}}}
Processed prompts: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:17<00:00, 17.12s/img, est. speed stage-0 img/s: 0.00, avg e2e_lat: 0.0ms]
INFO 01-05 07:49:33 [omni.py:711] [Summary] {'e2e_requests': 1,
INFO 01-05 07:49:33 [omni.py:711]  'e2e_total_time_ms': 17123.9755153656,
INFO 01-05 07:49:33 [omni.py:711]  'e2e_sum_time_ms': 17122.5323677063,
INFO 01-05 07:49:33 [omni.py:711]  'e2e_total_tokens': 0,
INFO 01-05 07:49:33 [omni.py:711]  'e2e_avg_time_per_request_ms': 17122.5323677063,
INFO 01-05 07:49:33 [omni.py:711]  'e2e_avg_tokens_per_s': 0.0,
INFO 01-05 07:49:33 [omni.py:711]  'wall_time_ms': 17123.9755153656,
INFO 01-05 07:49:33 [omni.py:711]  'final_stage_id': {'0_45567883-3c2e-4cfa-a36e-9aa752e94fb0': 0},
INFO 01-05 07:49:33 [omni.py:711]  'stages': [{'stage_id': 0,
INFO 01-05 07:49:33 [omni.py:711]              'requests': 1,
INFO 01-05 07:49:33 [omni.py:711]              'tokens': 0,
INFO 01-05 07:49:33 [omni.py:711]              'total_time_ms': 17122.780323028564,
INFO 01-05 07:49:33 [omni.py:711]              'avg_time_per_request_ms': 17122.780323028564,
INFO 01-05 07:49:33 [omni.py:711]              'avg_tokens_per_s': 0.0}],
INFO 01-05 07:49:33 [omni.py:711]  'transfers': []}
Adding requests:   0%|                                                                                                                                                                                                              | 0/1 [00:17<?, ?it/s]
[Stage-0] INFO 01-05 07:49:33 [omni_stage.py:636] Received shutdown signal
[Stage-0] INFO 01-05 07:49:33 [gpu_worker.py:265] Worker 0: Received shutdown message
[Stage-0] INFO 01-05 07:49:33 [gpu_worker.py:287] event loop terminated.
[Stage-0] INFO 01-05 07:49:33 [gpu_worker.py:318] Worker 0: Shutdown complete.
Total generation time: 21.5126 seconds (21512.61 ms)
INFO 01-05 07:49:37 [text_to_image.py:168] Outputs: [OmniRequestOutput(request_id='', finished=True, stage_id=0, final_output_type='image', request_output=[OmniRequestOutput(request_id='0_45567883-3c2e-4cfa-a36e-9aa752e94fb0', finished=True, stage_id=None, final_output_type='image', request_output=None, images=[1 PIL Images], prompt="Add a white art board written with colorful text 'vLLM-Omni' on grassland. Add a paintbrush in the bear's hands. position the bear standing in front of the art board as if painting", latents=None, metrics={})], images=[], prompt=None, latents=None, metrics={})]
Saved generated image to output_image_edit.png

  • w/o cache: 42.39s

python text_to_image.py \
    --model $LONGCAT_IMAGE_DIR \
    --prompt "Add a white art board written with colorful text 'vLLM-Omni' on grassland. Add a paintbrush in the bear's hands. position the bear standing in front of the art board as if painting" \
    --output output_image_edit_nocache.png \
    --num_inference_steps 50 \
    --cfg_scale 4.0
WARNING 01-05 07:52:18 [mooncake_connector.py:18] Mooncake not available, MooncakeOmniConnector will not work
INFO 01-05 07:52:18 [omni.py:122] Initializing stages for model: /workspace/dev/vipdev/hf_models/LongCat-Image
INFO 01-05 07:52:18 [initialization.py:35] No OmniTransferConfig provided
INFO 01-05 07:52:18 [omni_stage.py:107] [OmniStage] stage_config: {'stage_id': 0, 'stage_type': 'diffusion', 'runtime': {'process': True, 'devices': '0', 'max_batch_size': 1}, 'engine_args': {'model': '/workspace/dev/vipdev/hf_models/LongCat-Image', 'vae_use_slicing': False, 'vae_use_tiling': False, 'cache_backend': None, 'cache_config': None, 'parallel_config': {'pipeline_parallel_size': 1, 'data_parallel_size': 1, 'tensor_parallel_size': 1, 'sequence_parallel_size': 1, 'ulysses_degree': 1, 'ring_degree': 1, 'cfg_parallel_size': 1}, 'model_stage': 'diffusion'}, 'final_output': True, 'final_output_type': 'image'}
INFO 01-05 07:52:18 [omni.py:297] [Orchestrator] Waiting for 1 stages to initialize (timeout: 300s)
[Stage-0] WARNING 01-05 07:52:27 [mooncake_connector.py:18] Mooncake not available, MooncakeOmniConnector will not work
[Stage-0] INFO 01-05 07:52:27 [omni_stage.py:434] Starting stage worker with model: /workspace/dev/vipdev/hf_models/LongCat-Image
[Stage-0] WARNING 01-05 07:52:28 [envs.py:194] Flash Attention library "flash_attn" not found, using pytorch attention implementation
[Stage-0] INFO 01-05 07:52:28 [diffusion_engine.py:213] Starting server...
[Stage-0] WARNING 01-05 07:52:36 [mooncake_connector.py:18] Mooncake not available, MooncakeOmniConnector will not work
[Stage-0] WARNING 01-05 07:52:37 [envs.py:194] Flash Attention library "flash_attn" not found, using pytorch attention implementation
[Stage-0] INFO 01-05 07:52:37 [gpu_worker.py:174] Worker 0 created result MessageQueue
[Stage-0] INFO 01-05 07:52:37 [scheduler.py:228] Chunked prefill is enabled with max_num_batched_tokens=2048.
[W105 07:52:37.848827436 socket.cpp:767] [c10d] The client socket cannot be initialized to connect to [localhost]:30063 (errno: 97 - Address family not supported by protocol).
[W105 07:52:37.850094326 socket.cpp:767] [c10d] The client socket cannot be initialized to connect to [10.189.108.254]:30063 (errno: 97 - Address family not supported by protocol).
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Stage-0] INFO 01-05 07:52:37 [gpu_worker.py:75] Worker 0: Initialized device and distributed environment.
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:04<00:00,  1.09it/s]
The image processor of type `Qwen2VLImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`. Note that this behavior will be extended to all models in a future release.
The tokenizer you are loading from '/workspace/dev/vipdev/hf_models/LongCat-Image' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.
The tokenizer you are loading from '/workspace/dev/vipdev/hf_models/LongCat-Image' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:02<00:00,  2.90s/it]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:02<00:00,  2.90s/it]

[Stage-0] INFO 01-05 07:52:48 [diffusers_loader.py:214] Loading weights took 3.82 seconds
[Stage-0] INFO 01-05 07:52:48 [gpu_worker.py:97] Model loading took 37.7358 GiB and 10.921257 seconds
[Stage-0] INFO 01-05 07:52:48 [gpu_worker.py:102] Worker 0: Model loaded successfully.
[Stage-0] INFO 01-05 07:52:48 [gpu_worker.py:310] Worker 0: Scheduler loop started.
[Stage-0] INFO 01-05 07:52:48 [gpu_worker.py:233] Worker 0 ready to receive requests via shared memory
[Stage-0] INFO 01-05 07:52:48 [scheduler.py:46] SyncScheduler initialized result MessageQueue
[Stage-0] INFO 01-05 07:52:48 [omni_stage.py:627] Max batch size: 1
INFO 01-05 07:52:48 [omni.py:290] [Orchestrator] Stage-0 reported ready
INFO 01-05 07:52:48 [omni.py:316] [Orchestrator] All stages initialized successfully

============================================================
Generation Configuration:
  Model: /workspace/dev/vipdev/hf_models/LongCat-Image
  Inference steps: 50
  Cache backend: None (no acceleration)
  Parallel configuration: ulysses_degree=1, ring_degree=1
  Image size: 1024x1024
============================================================

Adding requests:   0%| 0/1 [00:00<?, ?it/s]
[Stage-0] INFO 01-05 07:52:48 [omni_diffusion.py:112] Prepared 1 requests for generation.
You're using a Qwen2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
[Stage-0] INFO 01-05 07:53:31 [diffusion_engine.py:86] Generation completed successfully.
[Stage-0] INFO 01-05 07:53:31 [diffusion_engine.py:109] Post-processing completed in 0.0922 seconds
INFO 01-05 07:53:31 [log_utils.py:549] {'type': 'request_level_metrics',
INFO 01-05 07:53:31 [log_utils.py:549]  'request_id': '0_2dc245b9-ac08-43d9-ae97-23e33938d612',
INFO 01-05 07:53:31 [log_utils.py:549]  'e2e_time_ms': 42394.21105384827,
INFO 01-05 07:53:31 [log_utils.py:549]  'e2e_tpt': 0.0,
INFO 01-05 07:53:31 [log_utils.py:549]  'e2e_total_tokens': 0,
INFO 01-05 07:53:31 [log_utils.py:549]  'transfers_total_time_ms': 0.0,
INFO 01-05 07:53:31 [log_utils.py:549]  'transfers_total_bytes': 0,
INFO 01-05 07:53:31 [log_utils.py:549]  'stages': {0: {'stage_gen_time_ms': 42355.54218292236,
INFO 01-05 07:53:31 [log_utils.py:549]                 'num_tokens_out': 0,
INFO 01-05 07:53:31 [log_utils.py:549]                 'num_tokens_in': 0}}}
Processed prompts: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:42<00:00, 42.39s/img, est. speed stage-0 img/s: 0.00, avg e2e_lat: 0.0ms]
INFO 01-05 07:53:31 [omni.py:711] [Summary] {'e2e_requests': 1,
INFO 01-05 07:53:31 [omni.py:711]  'e2e_total_time_ms': 42396.44718170166,
INFO 01-05 07:53:31 [omni.py:711]  'e2e_sum_time_ms': 42394.21105384827,
INFO 01-05 07:53:31 [omni.py:711]  'e2e_total_tokens': 0,
INFO 01-05 07:53:31 [omni.py:711]  'e2e_avg_time_per_request_ms': 42394.21105384827,
INFO 01-05 07:53:31 [omni.py:711]  'e2e_avg_tokens_per_s': 0.0,
INFO 01-05 07:53:31 [omni.py:711]  'wall_time_ms': 42396.44718170166,
INFO 01-05 07:53:31 [omni.py:711]  'final_stage_id': {'0_2dc245b9-ac08-43d9-ae97-23e33938d612': 0},
INFO 01-05 07:53:31 [omni.py:711]  'stages': [{'stage_id': 0,
INFO 01-05 07:53:31 [omni.py:711]              'requests': 1,
INFO 01-05 07:53:31 [omni.py:711]              'tokens': 0,
INFO 01-05 07:53:31 [omni.py:711]              'total_time_ms': 42394.585609436035,
INFO 01-05 07:53:31 [omni.py:711]              'avg_time_per_request_ms': 42394.585609436035,
INFO 01-05 07:53:31 [omni.py:711]              'avg_tokens_per_s': 0.0}],
INFO 01-05 07:53:31 [omni.py:711]  'transfers': []}
Adding requests:   0%|                                                                                                                                                                                                              | 0/1 [00:42<?, ?it/s]
[Stage-0] INFO 01-05 07:53:31 [omni_stage.py:636] Received shutdown signal
[Stage-0] INFO 01-05 07:53:31 [gpu_worker.py:265] Worker 0: Received shutdown message
[Stage-0] INFO 01-05 07:53:31 [gpu_worker.py:287] event loop terminated.
[Stage-0] INFO 01-05 07:53:31 [gpu_worker.py:318] Worker 0: Shutdown complete.
Total generation time: 45.9301 seconds (45930.09 ms)
INFO 01-05 07:53:34 [text_to_image.py:168] Outputs: [OmniRequestOutput(request_id='', finished=True, stage_id=0, final_output_type='image', request_output=[OmniRequestOutput(request_id='0_2dc245b9-ac08-43d9-ae97-23e33938d612', finished=True, stage_id=None, final_output_type='image', request_output=None, images=[1 PIL Images], prompt="Add a white art board written with colorful text 'vLLM-Omni' on grassland. Add a paintbrush in the bear's hands. position the bear standing in front of the art board as if painting", latents=None, metrics={})], images=[], prompt=None, latents=None, metrics={})]
Saved generated image to output_image_edit_nocache.png
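For reference, the `e2e_time_ms` values logged in the two runs above work out to roughly a 2.5x speedup from cache_dit:

```python
# Speedup implied by the e2e_time_ms values logged above.
with_cache_s = 17.122     # w/ --cache_backend cache_dit
without_cache_s = 42.394  # w/o cache
speedup = without_cache_s / with_cache_s
print(f"speedup: {speedup:.2f}x")  # roughly 2.48x
```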

  • longcat-image-edit w/ cache-dit: 80.8s

python3 image_edit.py \
    --image qwen_bear.png \
    --prompt "Add a white art board written with colorful text 'vLLM-Omni' on grassland. Add a paintbrush in the bear's hands. position the bear standing in front of the art board as if painting" \
    --output output_image_edit.png \
    --num_inference_steps 50 \
    --guidance_scale 4.5 \
    --seed 42 \
    --model $LONGCAT_IMAGE_EDIT_DIR \
    --cache_backend cache_dit \
    --cache_dit_max_continuous_cached_steps 2
WARNING 01-05 08:08:26 [mooncake_connector.py:18] Mooncake not available, MooncakeOmniConnector will not work
INFO 01-05 08:08:26 [omni.py:122] Initializing stages for model: /workspace/dev/vipdev/hf_models/LongCat-Image-Edit
INFO 01-05 08:08:26 [initialization.py:35] No OmniTransferConfig provided
INFO 01-05 08:08:26 [omni_stage.py:107] [OmniStage] stage_config: {'stage_id': 0, 'stage_type': 'diffusion', 'runtime': {'process': True, 'devices': '0', 'max_batch_size': 1}, 'engine_args': {'model': '/workspace/dev/vipdev/hf_models/LongCat-Image-Edit', 'vae_use_slicing': False, 'vae_use_tiling': False, 'cache_backend': 'cache_dit', 'cache_config': {'Fn_compute_blocks': 1, 'Bn_compute_blocks': 0, 'max_warmup_steps': 4, 'residual_diff_threshold': 0.24, 'max_continuous_cached_steps': 2, 'enable_taylorseer': False, 'taylorseer_order': 1, 'scm_steps_mask_policy': None, 'scm_steps_policy': 'dynamic'}, 'parallel_config': {'pipeline_parallel_size': 1, 'data_parallel_size': 1, 'tensor_parallel_size': 1, 'sequence_parallel_size': 1, 'ulysses_degree': 1, 'ring_degree': 1, 'cfg_parallel_size': 1}, 'model_stage': 'diffusion'}, 'final_output': True, 'final_output_type': 'image'}
INFO 01-05 08:08:26 [omni.py:297] [Orchestrator] Waiting for 1 stages to initialize (timeout: 300s)
[Stage-0] WARNING 01-05 08:08:34 [mooncake_connector.py:18] Mooncake not available, MooncakeOmniConnector will not work
[Stage-0] INFO 01-05 08:08:35 [omni_stage.py:434] Starting stage worker with model: /workspace/dev/vipdev/hf_models/LongCat-Image-Edit
[Stage-0] WARNING 01-05 08:08:35 [envs.py:194] Flash Attention library "flash_attn" not found, using pytorch attention implementation
[Stage-0] INFO 01-05 08:08:36 [diffusion_engine.py:213] Starting server...
[Stage-0] WARNING 01-05 08:08:44 [mooncake_connector.py:18] Mooncake not available, MooncakeOmniConnector will not work
[Stage-0] WARNING 01-05 08:08:45 [envs.py:194] Flash Attention library "flash_attn" not found, using pytorch attention implementation
[Stage-0] INFO 01-05 08:08:45 [gpu_worker.py:174] Worker 0 created result MessageQueue
[Stage-0] INFO 01-05 08:08:45 [scheduler.py:228] Chunked prefill is enabled with max_num_batched_tokens=2048.
[W105 08:08:45.572301128 socket.cpp:767] [c10d] The client socket cannot be initialized to connect to [localhost]:30008 (errno: 97 - Address family not supported by protocol).
[W105 08:08:45.573489541 socket.cpp:767] [c10d] The client socket cannot be initialized to connect to [10.189.108.254]:30008 (errno: 97 - Address family not supported by protocol).
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Stage-0] INFO 01-05 08:08:45 [gpu_worker.py:75] Worker 0: Initialized device and distributed environment.
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:05<00:00,  1.06s/it]
The image processor of type `Qwen2VLImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`. Note that this behavior will be extended to all models in a future release.
The tokenizer you are loading from '/workspace/dev/vipdev/hf_models/LongCat-Image-Edit' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.
The tokenizer you are loading from '/workspace/dev/vipdev/hf_models/LongCat-Image-Edit' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:03<00:00,  3.76s/it]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:03<00:00,  3.76s/it]

[Stage-0] INFO 01-05 08:08:57 [diffusers_loader.py:214] Loading weights took 4.41 seconds
[Stage-0] INFO 01-05 08:08:57 [gpu_worker.py:97] Model loading took 37.7358 GiB and 12.267342 seconds
[Stage-0] INFO 01-05 08:08:57 [gpu_worker.py:102] Worker 0: Model loaded successfully.
[Stage-0] INFO 01-05 08:08:57 [cache_dit_backend.py:468] Using custom cache-dit enabler for model: LongCatImageEditPipeline
[Stage-0] INFO 01-05 08:08:57 [cache_dit_backend.py:211] Enabling cache-dit on LongCatImage transformer with BlockAdapter: Fn=1, Bn=0, W=4,
WARNING 01-05 08:08:57 [block_adapters.py:131] pipe is None, use FakeDiffusionPipeline instead.
INFO 01-05 08:08:57 [block_adapters.py:220] Auto fill blocks_name: ['transformer_blocks', 'single_transformer_blocks'].
INFO 01-05 08:08:57 [block_adapters.py:153] Found transformer NOT from diffusers: vllm_omni.diffusion.models.longcat_image.longcat_image_transformer disable check_forward_pattern by default.
INFO 01-05 08:08:57 [cache_adapter.py:77] Adapting Cache Acceleration using custom BlockAdapter!
WARNING 01-05 08:08:57 [block_adapters.py:469] Skipped Forward Pattern Check: ForwardPattern.Pattern_1
WARNING 01-05 08:08:57 [block_adapters.py:469] Skipped Forward Pattern Check: ForwardPattern.Pattern_1
INFO 01-05 08:08:57 [cache_adapter.py:142] Use default 'enable_separate_cfg' from block adapter register: False, Pipeline: FakeDiffusionPipeline.
INFO 01-05 08:08:57 [cache_adapter.py:332] Collected Context Config: DBCache_F1B0_W4I1M0MC2_R0.24_CFG0, Calibrator Config: None
WARNING 01-05 08:08:57 [pattern_base.py:78] Skipped Forward Pattern Check: ForwardPattern.Pattern_1
INFO 01-05 08:08:57 [pattern_base.py:70] Match Blocks: CachedBlocks_Pattern_0_1_2, for transformer_blocks, cache_context: transformer_blocks_140345843095872, context_manager: FakeDiffusionPipeline_140346848452944.
WARNING 01-05 08:08:57 [pattern_base.py:78] Skipped Forward Pattern Check: ForwardPattern.Pattern_1
INFO 01-05 08:08:57 [pattern_base.py:70] Match Blocks: CachedBlocks_Pattern_0_1_2, for single_transformer_blocks, cache_context: single_transformer_blocks_140345842854144, context_manager: FakeDiffusionPipeline_140346848452944.
[Stage-0] INFO 01-05 08:08:57 [cache_dit_backend.py:475] Cache-dit enabled successfully on LongCatImageEditPipeline
[Stage-0] INFO 01-05 08:08:57 [gpu_worker.py:310] Worker 0: Scheduler loop started.
[Stage-0] INFO 01-05 08:08:57 [gpu_worker.py:233] Worker 0 ready to receive requests via shared memory
[Stage-0] INFO 01-05 08:08:57 [scheduler.py:46] SyncScheduler initialized result MessageQueue
[Stage-0] INFO 01-05 08:08:57 [omni_stage.py:627] Max batch size: 1
INFO 01-05 08:08:57 [omni.py:290] [Orchestrator] Stage-0 reported ready
INFO 01-05 08:08:57 [omni.py:316] [Orchestrator] All stages initialized successfully
Pipeline loaded

============================================================
Generation Configuration:
  Model: /workspace/dev/vipdev/hf_models/LongCat-Image-Edit
  Inference steps: 50
  Cache backend: cache_dit
  Input image size: (232, 282)
  Parallel configuration: ulysses_degree=1, ring_degree=1
============================================================

Adding requests:   0%| | 0/1 [00:00<?, ?it/s, est. speed input: 0.00 unit/s, output: 0.00 unit/s]
[Stage-0] INFO 01-05 08:08:57 [omni_diffusion.py:112] Prepared 1 requests for generation.
[Stage-0] INFO 01-05 08:08:58 [diffusion_engine.py:81] Pre-processing completed in 0.2521 seconds
[Stage-0] INFO 01-05 08:08:58 [cache_dit_backend.py:496] Refreshing cache context for transformer with num_inference_steps: 50
INFO 01-05 08:08:58 [cache_adapter.py:723] ✅ Refreshed cache context: transformer_blocks_140345843095872, Collected Context Config: DBCache_F1B0_W4I1M0MC2_R0.24_N50_CFG0, Calibrator Config: None
INFO 01-05 08:08:58 [cache_adapter.py:723] ✅ Refreshed cache context: single_transformer_blocks_140345842854144, Collected Context Config: DBCache_F1B0_W4I1M0MC2_R0.24_N50_CFG0, Calibrator Config: None
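The context tag `DBCache_F1B0_W4I1M0MC2_R0.24_N50_CFG0` in the refresh lines above packs the `cache_config` values from the stage config (note `N50` only appears once the context is refreshed with `num_inference_steps=50`). A hypothetical decoder for the safe-to-read fields — the layout is inferred from this log, not a documented cache-dit format, and the `I…M…MC…` fields are deliberately left undecoded:

```python
import re

# Hypothetical parser for the context tag printed by cache-dit above; the
# field layout is inferred from the log output, not from a documented format.
TAG_RE = re.compile(
    r"DBCache_F(?P<fn>\d+)B(?P<bn>\d+)"
    r"_W(?P<warmup>\d+)I\d+M\d+MC\d+"   # I/M/MC fields: meaning not inferred here
    r"_R(?P<rdt>[\d.]+)"
    r"(?:_N(?P<steps>\d+))?"            # present only after a refresh
    r"_CFG(?P<cfg>\d+)"
)

def decode_dbcache_tag(tag: str) -> dict:
    m = TAG_RE.match(tag)
    assert m, f"unrecognized tag: {tag}"
    return {
        "Fn_compute_blocks": int(m["fn"]),           # Fn=1 in the run above
        "Bn_compute_blocks": int(m["bn"]),           # Bn=0
        "max_warmup_steps": int(m["warmup"]),        # W=4
        "residual_diff_threshold": float(m["rdt"]),  # 0.24
        "num_inference_steps": int(m["steps"]) if m["steps"] else None,
        "enable_separate_cfg": bool(int(m["cfg"])),  # CFG0 -> False
    }

print(decode_dbcache_tag("DBCache_F1B0_W4I1M0MC2_R0.24_N50_CFG0"))
```

The decoded values line up with the `cache_config` dict in the stage config logged at startup.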
You're using a Qwen2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
[Stage-0] INFO 01-05 08:09:58 [shm_broadcast.py:501] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
[Stage-0] INFO 01-05 08:10:18 [diffusion_engine.py:86] Generation completed successfully.
[Stage-0] INFO 01-05 08:10:18 [diffusion_engine.py:109] Post-processing completed in 0.0752 seconds
INFO 01-05 08:10:18 [log_utils.py:549] {'type': 'request_level_metrics',
INFO 01-05 08:10:18 [log_utils.py:549]  'request_id': '0_c91d09e9-a7fb-4537-8927-39f6b1484cdb',
INFO 01-05 08:10:18 [log_utils.py:549]  'e2e_time_ms': 80810.90354919434,
INFO 01-05 08:10:18 [log_utils.py:549]  'e2e_tpt': 0.0,
INFO 01-05 08:10:18 [log_utils.py:549]  'e2e_total_tokens': 0,
INFO 01-05 08:10:18 [log_utils.py:549]  'transfers_total_time_ms': 0.0,
INFO 01-05 08:10:18 [log_utils.py:549]  'transfers_total_bytes': 0,
INFO 01-05 08:10:18 [log_utils.py:549]  'stages': {0: {'stage_gen_time_ms': 80774.80435371399,
INFO 01-05 08:10:18 [log_utils.py:549]                 'num_tokens_out': 0,
INFO 01-05 08:10:18 [log_utils.py:549]                 'num_tokens_in': 0}}}
Processed prompts: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [01:20<00:00, 80.81s/img, est. speed stage-0 img/s: 0.00, avg e2e_lat: 0.0ms]
INFO 01-05 08:10:18 [omni.py:711] [Summary] {'e2e_requests': 1,
INFO 01-05 08:10:18 [omni.py:711]  'e2e_total_time_ms': 80811.88321113586,
INFO 01-05 08:10:18 [omni.py:711]  'e2e_sum_time_ms': 80810.90354919434,
INFO 01-05 08:10:18 [omni.py:711]  'e2e_total_tokens': 0,
INFO 01-05 08:10:18 [omni.py:711]  'e2e_avg_time_per_request_ms': 80810.90354919434,
INFO 01-05 08:10:18 [omni.py:711]  'e2e_avg_tokens_per_s': 0.0,
INFO 01-05 08:10:18 [omni.py:711]  'wall_time_ms': 80811.88321113586,
INFO 01-05 08:10:18 [omni.py:711]  'final_stage_id': {'0_c91d09e9-a7fb-4537-8927-39f6b1484cdb': 0},
INFO 01-05 08:10:18 [omni.py:711]  'stages': [{'stage_id': 0,
INFO 01-05 08:10:18 [omni.py:711]              'requests': 1,
INFO 01-05 08:10:18 [omni.py:711]              'tokens': 0,
INFO 01-05 08:10:18 [omni.py:711]              'total_time_ms': 80811.10501289368,
INFO 01-05 08:10:18 [omni.py:711]              'avg_time_per_request_ms': 80811.10501289368,
INFO 01-05 08:10:18 [omni.py:711]              'avg_tokens_per_s': 0.0}],
INFO 01-05 08:10:18 [omni.py:711]  'transfers': []}
Adding requests:   0%|                                                                                                                                                                                                              | 0/1 [01:20<?, ?it/s]
[Stage-0] INFO 01-05 08:10:18 [omni_stage.py:636] Received shutdown signal
[Stage-0] INFO 01-05 08:10:18 [gpu_worker.py:265] Worker 0: Received shutdown message
[Stage-0] INFO 01-05 08:10:18 [gpu_worker.py:287] event loop terminated.
[Stage-0] INFO 01-05 08:10:18 [gpu_worker.py:318] Worker 0: Shutdown complete.
Total generation time: 85.0818 seconds (85081.80 ms)
INFO 01-05 08:10:23 [image_edit.py:349] Outputs: [OmniRequestOutput(request_id='', finished=True, stage_id=0, final_output_type='image', request_output=[OmniRequestOutput(request_id='0_c91d09e9-a7fb-4537-8927-39f6b1484cdb', finished=True, stage_id=None, final_output_type='image', request_output=None, images=[1 PIL Images], prompt="Add a white art board written with colorful text 'vLLM-Omni' on grassland. Add a paintbrush in the bear's hands. position the bear standing in front of the art board as if painting", latents=None, metrics={})], images=[], prompt=None, latents=None, metrics={})]
Saved edited image to /workspace/dev/vipshop/vllm-omni/examples/offline_inference/image_to_image/output_image_edit.png
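The bug this PR fixes was a pipeline-name mismatch: the edit pipeline's class name (`LongCatImageEditPipeline`, visible in the log above) didn't match what the cache backend expected, so requests dispatched to the wrong (or no) cache path. A minimal sketch of name-keyed dispatch under that fix — the registry, decorator, and function names here are illustrative, not the actual vLLM-Omni code:

```python
# Hypothetical registry keyed by pipeline class name; the real dispatch
# lives in vllm-omni's cache_dit backend and may be structured differently.
CACHE_ADAPTERS = {}

def register_cache_adapter(*pipeline_names):
    def wrap(fn):
        for name in pipeline_names:
            CACHE_ADAPTERS[name] = fn
        return fn
    return wrap

# The fix in spirit: register the adapter under both the text-to-image and
# the image-edit pipeline names, so either class dispatches to the cache.
@register_cache_adapter("LongCatImagePipeline", "LongCatImageEditPipeline")
def enable_longcat_cache(pipe):
    return f"cache-dit enabled on {type(pipe).__name__}"

class LongCatImageEditPipeline:  # stand-in for the real pipeline class
    pass

pipe = LongCatImageEditPipeline()
adapter = CACHE_ADAPTERS.get(type(pipe).__name__)
print(adapter(pipe) if adapter else "no cache adapter found")
```

Before the fix, the lookup for `LongCatImageEditPipeline` would miss and the transformer ran uncached every step, which matches the timings below.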

- longcat-image-edit w/o cache-dit: 196.4s

```
python3 image_edit.py \
    --image qwen_bear.png \
    --prompt "Add a white art board written with colorful text 'vLLM-Omni' on grassland. Add a paintbrush in the bear's hands. position the bear standing in front of the art board as if painting" \
    --output output_image_edit_nocache.png \
    --num_inference_steps 50 \
    --guidance_scale 4.5 \
    --seed 42 \
    --model $LONGCAT_IMAGE_EDIT_DIR
```
WARNING 01-05 08:11:04 [mooncake_connector.py:18] Mooncake not available, MooncakeOmniConnector will not work
INFO 01-05 08:11:05 [omni.py:122] Initializing stages for model: /workspace/dev/vipdev/hf_models/LongCat-Image-Edit
INFO 01-05 08:11:05 [initialization.py:35] No OmniTransferConfig provided
INFO 01-05 08:11:05 [omni_stage.py:107] [OmniStage] stage_config: {'stage_id': 0, 'stage_type': 'diffusion', 'runtime': {'process': True, 'devices': '0', 'max_batch_size': 1}, 'engine_args': {'model': '/workspace/dev/vipdev/hf_models/LongCat-Image-Edit', 'vae_use_slicing': False, 'vae_use_tiling': False, 'cache_backend': None, 'cache_config': None, 'parallel_config': {'pipeline_parallel_size': 1, 'data_parallel_size': 1, 'tensor_parallel_size': 1, 'sequence_parallel_size': 1, 'ulysses_degree': 1, 'ring_degree': 1, 'cfg_parallel_size': 1}, 'model_stage': 'diffusion'}, 'final_output': True, 'final_output_type': 'image'}
INFO 01-05 08:11:05 [omni.py:297] [Orchestrator] Waiting for 1 stages to initialize (timeout: 300s)
[Stage-0] WARNING 01-05 08:11:13 [mooncake_connector.py:18] Mooncake not available, MooncakeOmniConnector will not work
[Stage-0] INFO 01-05 08:11:13 [omni_stage.py:434] Starting stage worker with model: /workspace/dev/vipdev/hf_models/LongCat-Image-Edit
[Stage-0] WARNING 01-05 08:11:14 [envs.py:194] Flash Attention library "flash_attn" not found, using pytorch attention implementation
[Stage-0] INFO 01-05 08:11:15 [diffusion_engine.py:213] Starting server...
[Stage-0] WARNING 01-05 08:11:22 [mooncake_connector.py:18] Mooncake not available, MooncakeOmniConnector will not work
[Stage-0] WARNING 01-05 08:11:24 [envs.py:194] Flash Attention library "flash_attn" not found, using pytorch attention implementation
[Stage-0] INFO 01-05 08:11:24 [gpu_worker.py:174] Worker 0 created result MessageQueue
[Stage-0] INFO 01-05 08:11:24 [scheduler.py:228] Chunked prefill is enabled with max_num_batched_tokens=2048.
[W105 08:11:24.162905591 socket.cpp:767] [c10d] The client socket cannot be initialized to connect to [localhost]:30038 (errno: 97 - Address family not supported by protocol).
[W105 08:11:24.163894440 socket.cpp:767] [c10d] The client socket cannot be initialized to connect to [10.189.108.254]:30038 (errno: 97 - Address family not supported by protocol).
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Stage-0] INFO 01-05 08:11:24 [gpu_worker.py:75] Worker 0: Initialized device and distributed environment.
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:04<00:00,  1.18it/s]
The image processor of type `Qwen2VLImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`. Note that this behavior will be extended to all models in a future release.
The tokenizer you are loading from '/workspace/dev/vipdev/hf_models/LongCat-Image-Edit' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:02<00:00,  2.43s/it]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:02<00:00,  2.43s/it]

[Stage-0] INFO 01-05 08:11:33 [diffusers_loader.py:214] Loading weights took 3.47 seconds
[Stage-0] INFO 01-05 08:11:34 [gpu_worker.py:97] Model loading took 37.7358 GiB and 10.241416 seconds
[Stage-0] INFO 01-05 08:11:34 [gpu_worker.py:102] Worker 0: Model loaded successfully.
[Stage-0] INFO 01-05 08:11:34 [gpu_worker.py:310] Worker 0: Scheduler loop started.
[Stage-0] INFO 01-05 08:11:34 [gpu_worker.py:233] Worker 0 ready to receive requests via shared memory
[Stage-0] INFO 01-05 08:11:34 [scheduler.py:46] SyncScheduler initialized result MessageQueue
[Stage-0] INFO 01-05 08:11:34 [omni_stage.py:627] Max batch size: 1
INFO 01-05 08:11:34 [omni.py:290] [Orchestrator] Stage-0 reported ready
INFO 01-05 08:11:34 [omni.py:316] [Orchestrator] All stages initialized successfully
Pipeline loaded

============================================================
Generation Configuration:
  Model: /workspace/dev/vipdev/hf_models/LongCat-Image-Edit
  Inference steps: 50
  Cache backend: None (no acceleration)
  Input image size: (232, 282)
  Parallel configuration: ulysses_degree=1, ring_degree=1
============================================================

Adding requests:   0%| | 0/1 [00:00<?, ?it/s, est. speed input: 0.00 unit/s, output: 0.00 unit/s]
[Stage-0] INFO 01-05 08:11:34 [omni_diffusion.py:112] Prepared 1 requests for generation.
[Stage-0] INFO 01-05 08:11:34 [diffusion_engine.py:81] Pre-processing completed in 0.0889 seconds
You're using a Qwen2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
[Stage-0] INFO 01-05 08:12:34 [shm_broadcast.py:501] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
[Stage-0] INFO 01-05 08:13:34 [shm_broadcast.py:501] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
[Stage-0] INFO 01-05 08:14:34 [shm_broadcast.py:501] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
[Stage-0] INFO 01-05 08:14:50 [diffusion_engine.py:86] Generation completed successfully.
[Stage-0] INFO 01-05 08:14:50 [diffusion_engine.py:109] Post-processing completed in 0.0811 seconds
INFO 01-05 08:14:50 [log_utils.py:549] {'type': 'request_level_metrics',
INFO 01-05 08:14:50 [log_utils.py:549]  'request_id': '0_e356bce6-53d4-4d75-881c-dee30a6f6507',
INFO 01-05 08:14:50 [log_utils.py:549]  'e2e_time_ms': 196497.98941612244,
INFO 01-05 08:14:50 [log_utils.py:549]  'e2e_tpt': 0.0,
INFO 01-05 08:14:50 [log_utils.py:549]  'e2e_total_tokens': 0,
INFO 01-05 08:14:50 [log_utils.py:549]  'transfers_total_time_ms': 0.0,
INFO 01-05 08:14:50 [log_utils.py:549]  'transfers_total_bytes': 0,
INFO 01-05 08:14:50 [log_utils.py:549]  'stages': {0: {'stage_gen_time_ms': 196457.38887786865,
INFO 01-05 08:14:50 [log_utils.py:549]                 'num_tokens_out': 0,
INFO 01-05 08:14:50 [log_utils.py:549]                 'num_tokens_in': 0}}}
Processed prompts: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [03:16<00:00, 196.50s/img, est. speed stage-0 img/s: 0.00, avg e2e_lat: 0.0ms]
INFO 01-05 08:14:50 [omni.py:711] [Summary] {'e2e_requests': 1,
INFO 01-05 08:14:50 [omni.py:711]  'e2e_total_time_ms': 196499.73726272583,
INFO 01-05 08:14:50 [omni.py:711]  'e2e_sum_time_ms': 196497.98941612244,
INFO 01-05 08:14:50 [omni.py:711]  'e2e_total_tokens': 0,
INFO 01-05 08:14:50 [omni.py:711]  'e2e_avg_time_per_request_ms': 196497.98941612244,
INFO 01-05 08:14:50 [omni.py:711]  'e2e_avg_tokens_per_s': 0.0,
INFO 01-05 08:14:50 [omni.py:711]  'wall_time_ms': 196499.73726272583,
INFO 01-05 08:14:50 [omni.py:711]  'final_stage_id': {'0_e356bce6-53d4-4d75-881c-dee30a6f6507': 0},
INFO 01-05 08:14:50 [omni.py:711]  'stages': [{'stage_id': 0,
INFO 01-05 08:14:50 [omni.py:711]              'requests': 1,
INFO 01-05 08:14:50 [omni.py:711]              'tokens': 0,
INFO 01-05 08:14:50 [omni.py:711]              'total_time_ms': 196498.29936027527,
INFO 01-05 08:14:50 [omni.py:711]              'avg_time_per_request_ms': 196498.29936027527,
INFO 01-05 08:14:50 [omni.py:711]              'avg_tokens_per_s': 0.0}],
INFO 01-05 08:14:50 [omni.py:711]  'transfers': []}
Adding requests:   0%|                                                                                                                                                                                                              | 0/1 [03:16<?, ?it/s]
[Stage-0] INFO 01-05 08:14:50 [omni_stage.py:636] Received shutdown signal
[Stage-0] INFO 01-05 08:14:50 [gpu_worker.py:265] Worker 0: Received shutdown message
[Stage-0] INFO 01-05 08:14:50 [gpu_worker.py:287] event loop terminated.
[Stage-0] INFO 01-05 08:14:51 [gpu_worker.py:318] Worker 0: Shutdown complete.
Total generation time: 200.7092 seconds (200709.22 ms)
INFO 01-05 08:14:55 [image_edit.py:349] Outputs: [OmniRequestOutput(request_id='', finished=True, stage_id=0, final_output_type='image', request_output=[OmniRequestOutput(request_id='0_e356bce6-53d4-4d75-881c-dee30a6f6507', finished=True, stage_id=None, final_output_type='image', request_output=None, images=[1 PIL Images], prompt="Add a white art board written with colorful text 'vLLM-Omni' on grassland. Add a paintbrush in the bear's hands. position the bear standing in front of the art board as if painting", latents=None, metrics={})], images=[], prompt=None, latents=None, metrics={})]
Saved edited image to /workspace/dev/vipshop/vllm-omni/examples/offline_inference/image_to_image/output_image_edit_nocache.png
| NVIDIA L20x1: NO Cache | NVIDIA L20x1: w/ cache-dit |
| --- | --- |
| LongCat-Image: 42.39s | LongCat-Image: 17.12s |
| output_image_edit_nocache | output_image_edit |
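From the timings above, the cache-dit speedup on a single L20 works out to roughly 2.5x:

```python
# Timings taken from the benchmark table above (NVIDIA L20x1, LongCat-Image).
no_cache_s = 42.39  # no cache
cached_s = 17.12    # w/ cache-dit
speedup = no_cache_s / cached_s
print(f"speedup: {speedup:.2f}x")  # -> speedup: 2.48x
```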


DefTruth commented Jan 5, 2026

@e1ijah1 Can you also take a look?

david6666666 (Collaborator) commented:

fix DCO, and add test result

princepride and others added 5 commits January 5, 2026 07:09
Signed-off-by: princepride <wangzhipeng628@gmail.com>
Signed-off-by: DefTruth <qiustudent_r@163.com>
Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
Signed-off-by: Zhou Taichang <tzhouam@connect.ust.hk>
Signed-off-by: DefTruth <qiustudent_r@163.com>
Signed-off-by: samithuang <285365963@qq.com>
Signed-off-by: SamitHuang <285365963@qq.com>
Signed-off-by: Samit <285365963@qq.com>
Signed-off-by: Hongsheng Liu <liuhongsheng4@huawei.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>
Signed-off-by: DefTruth <qiustudent_r@163.com>
Signed-off-by: DefTruth <qiustudent_r@163.com>
… 'abort' (vllm-project#624)

Signed-off-by: ZeldaHuang <hzm414167@alibaba-inc.com>
Signed-off-by: DefTruth <qiustudent_r@163.com>

e1ijah1 commented Jan 5, 2026

We could also bump the cache-dit version.

@SamitHuang SamitHuang added the ready label to trigger buildkite CI label Jan 5, 2026
@david6666666 david6666666 enabled auto-merge (squash) January 5, 2026 07:51

DefTruth commented Jan 5, 2026

> fix DCO, and add test result

done

@hsliuustc0106 hsliuustc0106 disabled auto-merge January 5, 2026 08:25
@hsliuustc0106 hsliuustc0106 merged commit 927d952 into vllm-project:main Jan 5, 2026
7 checks passed
tzhouam added a commit to tzhouam/vllm-omni that referenced this pull request Jan 6, 2026
Signed-off-by: princepride <wangzhipeng628@gmail.com>
Signed-off-by: DefTruth <qiustudent_r@163.com>
Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
Signed-off-by: Zhou Taichang <tzhouam@connect.ust.hk>
Signed-off-by: samithuang <285365963@qq.com>
Signed-off-by: SamitHuang <285365963@qq.com>
Signed-off-by: Samit <285365963@qq.com>
Signed-off-by: Hongsheng Liu <liuhongsheng4@huawei.com>
Signed-off-by: ZeldaHuang <hzm414167@alibaba-inc.com>
Co-authored-by: 汪志鹏 <wangzhipeng628@gmail.com>
Co-authored-by: Zhou Taichang <tzhouam@connect.ust.hk>
Co-authored-by: Samit <285365963@qq.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>
Co-authored-by: Ziming Huang <hzm414167@alibaba-inc.com>
princepride added a commit to princepride/vllm-omni that referenced this pull request Jan 10, 2026
sniper35 pushed a commit to sniper35/vllm-omni that referenced this pull request Jan 10, 2026
ZJY0516 pushed a commit to LawJarp-A/vllm-omni that referenced this pull request Jan 10, 2026
@DefTruth DefTruth deleted the fix-cache-longcat branch January 20, 2026 06:55
Successfully merging this pull request may close these issues:
[Bug]: Cache DiT not supported by LongCat-Image

8 participants