
[Feature] teacache integration#179

Merged
ZJY0516 merged 44 commits into vllm-project:main from LawJarp-A:feature/teacache-integration
Dec 12, 2025

Conversation

@LawJarp-A
Contributor

@LawJarp-A LawJarp-A commented Dec 3, 2025

For #175

Purpose

Integrate TeaCache (Timestep Embedding Aware Cache) into vllm-omni to speed up diffusion inference (~1.5–2x) with minimal quality loss by reusing transformer block computations when consecutive timestep embeddings are similar.

Design

Architecture

vllm_omni/diffusion/
├── cache/teacache/
│   ├── config.py         # TeaCacheConfig (thresholds, coefficients)
│   ├── state.py          # Cache state management
│   ├── extractors.py     # Model-specific extractor registry
│   ├── hook.py           # Forward pass interception
│   └── adapter.py        # CacheAdapter implementation
├── hooks.py              # Hook infrastructure
└── models/qwen_image/
    └── pipeline_qwen_image.py  # Cache setup & reset

How it works:

  1. Hook intercepts transformer forward pass (no model changes needed)
  2. Extract modulated input from first transformer block
  3. Compute L1 distance between consecutive timesteps
  4. Decision: Below threshold → reuse cache; Above → compute & update cache
  5. CFG-aware: Separate states for positive/negative prompts
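
The threshold decision in steps 3–4 can be sketched roughly as follows. This is an illustrative helper, not the PR's actual hook code: in the real hook the inputs are tensors, and TeaCache typically rescales the relative L1 distance with fitted polynomial coefficients (the coefficients in config.py), which are omitted here for clarity.

```python
def should_reuse_cache(prev, curr, accumulated, rel_l1_thresh=0.2):
    """Decide whether to reuse the cached transformer output.

    prev/curr are the modulated inputs of the first transformer block at
    consecutive timesteps (flattened to plain floats here for clarity).
    Returns (reuse, new_accumulated_distance).
    """
    if prev is None:  # first step: nothing cached yet, must compute
        return False, 0.0
    # Relative L1 distance between consecutive modulated inputs
    l1 = sum(abs(c - p) for c, p in zip(curr, prev)) / len(curr)
    scale = sum(abs(p) for p in prev) / len(prev)
    accumulated += l1 / scale
    if accumulated < rel_l1_thresh:
        return True, accumulated  # below threshold: reuse cached residual
    return False, 0.0  # above threshold: recompute and reset accumulator
```

For the CFG-aware part (step 5), the pipeline would hold one accumulated distance per branch (positive and negative prompt) so the two streams never share state.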

Adding New Models

Only about five lines are needed to support a new model:

# vllm_omni/diffusion/cache/teacache/extractors.py

def extract_flux_modulated_input(module, hidden_states, temb):
    """Extract modulated input for FLUX models."""
    return module.transformer_blocks[0].norm1(hidden_states, emb=temb)[0]

# Register it
EXTRACTOR_REGISTRY["FluxPipeline"] = extract_flux_modulated_input
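
As a rough sketch of how such a registry could be consumed at hook time (illustrative only — register_extractor and get_extractor are assumed names, not the PR's actual extractors.py API):

```python
from typing import Callable

# Illustrative mirror of the registry; keys are pipeline class names.
EXTRACTOR_REGISTRY: dict[str, Callable] = {}

def register_extractor(pipeline_cls_name: str):
    """Decorator that registers an extractor under a pipeline class name."""
    def decorator(fn: Callable) -> Callable:
        EXTRACTOR_REGISTRY[pipeline_cls_name] = fn
        return fn
    return decorator

def get_extractor(pipeline_cls_name: str) -> Callable:
    """Exact-match lookup; fail loudly when a model has no extractor."""
    try:
        return EXTRACTOR_REGISTRY[pipeline_cls_name]
    except KeyError:
        raise ValueError(
            f"No TeaCache extractor registered for {pipeline_cls_name!r}; "
            f"known: {sorted(EXTRACTOR_REGISTRY)}"
        )

@register_extractor("FluxPipeline")
def extract_flux_modulated_input(module, hidden_states, temb):
    return module.transformer_blocks[0].norm1(hidden_states, emb=temb)[0]
```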

Test Plan

  • Functional: Verify correctness with/without cache
  • Performance: Benchmark across thresholds (0.2, 0.4, 0.6)
  • Quality: Visual comparison of generated images

Test Results

Performance (CUDA, Qwen/Qwen-Image, 50 steps, 512×512)

| Configuration           | Time          | Speedup | Quality        |
|-------------------------|---------------|---------|----------------|
| Baseline (no cache)     | 6.52s ± 0.01s | 1.00x   | Reference      |
| thresh=0.2 (balanced)   | 5.16s ± 0.01s | 1.26x   | ✓ Minimal loss |
| thresh=0.4 (aggressive) | 3.23s ± 0.00s | 2.02x   | △ Slight loss  |

Key Findings:

  • 2.0x speedup achieved (50% time reduction)
  • 1.3x speedup with minimal quality impact

Usage

from vllm_omni.diffusion.data import OmniDiffusionConfig
from vllm_omni.entrypoints.omni import Omni

config = OmniDiffusionConfig(
    model="Qwen/Qwen-Image",
    cache_adapter="tea_cache",
    cache_config={"rel_l1_thresh": 0.2}  # 0.2=balanced, 0.4=fast
)

omni = Omni(od_config=config)
images = omni.generate("a cat", num_inference_steps=50)  # 1.3-2x faster!

With prompt: "An apple and a princess"

[Generated sample image attached in the original PR]

@hsliuustc0106
Collaborator

could you display the pngs w/o teacache?

QwenImageTransformer2DModel,
)
from vllm_omni.diffusion.request import OmniDiffusionRequest
from vllm_omni.diffusion.cache.teacache import TeaCacheConfig, apply_teacache
Collaborator

@SamitHuang SamitHuang Dec 3, 2025

Instead of using a fixed TeaCache, I think we should allow users to select different cache methods either via omni_diffusion_config or an environment variable. This would align user behavior with how the attention backend is selected in #115

My initial idea is:

  1. user select the cache method by export DIFFUSION_CACHE_ADAPTER=TEA_CACHE (default no cache), the customized metadata for cache (like max_warmup_steps) can be parsed via omni_diffusion_config.cache_config
  2. each cache method inherits a base cache class named as CacheAdapter, which supports feature retrieval, state management, skip-compute judgement, etc.
  3. model developer can easily integrate cache ability by some interface like:
    maybe_apply_cache(self.transformer, cache_config)
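
The proposal above could look roughly like this as code. This is a hedged sketch only: CacheAdapter, maybe_apply_cache, and DIFFUSION_CACHE_ADAPTER follow the comment's wording, while the method names (should_skip, update) and everything else are assumptions.

```python
from __future__ import annotations

import os
from abc import ABC, abstractmethod

class CacheAdapter(ABC):
    """Base class each cache method (e.g. TeaCache) would inherit."""

    def __init__(self, cache_config: dict | None = None):
        # Customized metadata (like max_warmup_steps) parsed from
        # omni_diffusion_config.cache_config.
        self.cache_config = cache_config or {}

    @abstractmethod
    def should_skip(self, features) -> bool:
        """Skip-compute judgement for the current step."""

    @abstractmethod
    def update(self, features, output) -> None:
        """State management: record features/output after a real compute."""

# Registry mapping the env-var value to an adapter class (illustrative).
ADAPTERS: dict[str, type[CacheAdapter]] = {}

def maybe_apply_cache(transformer, cache_config: dict | None):
    """Attach the adapter selected via DIFFUSION_CACHE_ADAPTER (default: none)."""
    name = os.environ.get("DIFFUSION_CACHE_ADAPTER", "").strip()
    if not name:
        return None  # default: no cache
    return ADAPTERS[name](cache_config)
```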

Contributor Author

Yeah, I am working on something similar now actually. Taking some inspiration from https://github.com/huggingface/diffusers/blob/main/src/diffusers/hooks/faster_cache.py hooks from huggingface diffusers. I am testing the changes, I'll push it soon.

Contributor Author

@LawJarp-A LawJarp-A Dec 3, 2025

So I have abstracted it to work something like this, with separate extractors for each model; the model file is not touched. We can deprecate the enable_teacache flag. This is extensible to other models.

teacache_config = OmniDiffusionConfig(
    model="Qwen/Qwen-Image",
    cache_adapter="tea_cache",
    cache_config={"rel_l1_thresh": 0.2, "model_type": "QwenImagePipeline"},
)
omni_cached = OmniDiffusion(od_config=teacache_config)

@david6666666
Collaborator

could you display the pngs w/o teacache?

I think it would be best to provide data such as LPIPS, and Pareto curves would be even better.

Member

@ZJY0516 ZJY0516 left a comment

The performance improvement is exciting


# Registry for model-specific extractors
# Key: pipeline/model architecture name
EXTRACTOR_REGISTRY: dict[str, Callable] = {
Member

We should use OmniDiffusionConfig.model_class_name (i.e. QwenImagePipeline) as the key here.

"""
Get extractor function for given model or model type.

This function auto-detects the appropriate extractor based on:
Member

exact match is enough

import torch.nn as nn


def extract_qwen_modulated_input(
Member

I think we should put it in model files.

You can refer to get_qwen_image_post_process_func in pipeline_qwen_image.py and how we import it in vllm_omni/diffusion/registry.py

f"Please add a handler method for this model."
)

def _handle_qwen_forward(
Member

This means we need to write model specific forward, right?

@ZJY0516 ZJY0516 requested a review from SamitHuang December 5, 2025 10:18
@ZJY0516
Member

ZJY0516 commented Dec 5, 2025

LGTM. I'll try to run it locally

cache_adapter="tea_cache",
cache_config={"rel_l1_thresh": 0.2}
)
omni = OmniDiffusion(od_config=config)
Collaborator

Suggested change:
- omni = OmniDiffusion(od_config=config)
+ omni = Omni(od_config=config)

@hsliuustc0106
Collaborator

any progress on this PR? let's get it done asap. @LawJarp-A
@ZJY0516 have you tried locally?

@ZJY0516
Member

ZJY0516 commented Dec 9, 2025

any progress on this PR? let's get it done asap. @LawJarp-A @ZJY0516 have you tried locally?

Yes, but there are a few issues I'm currently fixing. The target is to have it ready tomorrow.

@ZJY0516
Member

ZJY0516 commented Dec 9, 2025

from vllm_omni.diffusion.data import OmniDiffusionConfig
from vllm_omni.entrypoints.omni import Omni

if __name__ == "__main__":
    omni = Omni(
        model="Qwen/Qwen-Image",
        cache_adapter="tea_cache",
        cache_config={"rel_l1_thresh": 0.2},
    )
    import time
    start = time.perf_counter()
    images = omni.generate("a cat", num_inference_steps=50) 
    end = time.perf_counter()
    print(f"Generation took {end - start:.2f} seconds")
    images[0].save("qwen_image_teacache_example.png")

Tested on H20
66.35s -> 31.26s

@ZJY0516
Member

ZJY0516 commented Dec 11, 2025

One last problem:

/home/zjy/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/multiprocessing/resource_tracker.py:279: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
/home/zjy/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/multiprocessing/resource_tracker.py:279: UserWarning: resource_tracker: There appear to be 2 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
WARNING:vllm_omni.diffusion.diffusion_engine:Failed to send shutdown signal: 'NoneType' object has no attribute 'dumps'

@ZJY0516
Member

ZJY0516 commented Dec 11, 2025

> One last problem: (resource_tracker / shutdown warnings quoted above)

This is unrelated to the PR; it's a local environment issue on my end.

Collaborator

@SamitHuang SamitHuang left a comment

Thanks for the nice work. Some interfaces will be altered after merging to improve compatibility with cache-dit in #250 (e.g. a DiffusionCacheConfig dataclass, passing the pipeline instead of the transformer to the adapter, and removing cache_config.model_cls_name, which can be obtained from pipeline.__class__.__name__).

@ZJY0516 ZJY0516 changed the title Feature/teacache integration [Feature] teacache integration Dec 12, 2025
@ZJY0516 ZJY0516 enabled auto-merge (squash) December 12, 2025 02:52
@ZJY0516 ZJY0516 merged commit 65ca131 into vllm-project:main Dec 12, 2025
4 checks passed
@LawJarp-A
Contributor Author

@SamitHuang @ZJY0516 thanks for the patience and feedback!

congw729 pushed a commit to congw729/vllm-omni that referenced this pull request Dec 12, 2025
LawJarp-A added a commit to LawJarp-A/vllm-omni that referenced this pull request Dec 12, 2025
LawJarp-A added a commit to LawJarp-A/vllm-omni that referenced this pull request Dec 12, 2025
faaany pushed a commit to faaany/vllm-omni that referenced this pull request Dec 19, 2025
yenuo26 pushed a commit to yenuo26/vllm-omni that referenced this pull request Dec 29, 2025
princepride pushed a commit to princepride/vllm-omni that referenced this pull request Jan 10, 2026