
[group offloading] avoid unnecessary moving out to speed up inference #12910

Open

gameofdimension wants to merge 1 commit into huggingface:main from gameofdimension:gameofdimension-patch-2

Conversation

@gameofdimension (Contributor) commented Jan 4, 2026

Explicitly moving weights back to the CPU after computation is unnecessary—we can avoid it just like in the use_stream=True case. Since device-to-host copying is expensive, this change significantly improves inference speed when use_stream=False.
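
A minimal sketch of the mechanism, using hypothetical names rather than the actual diffusers hook code: since the CPU side keeps its copy of the weights, a group can be released on the GPU without any device-to-host transfer, which is what the use_stream=True path already does.

import torch
import torch.nn as nn


class GroupOffloadSketch:
    # Illustration only; not the diffusers implementation.
    def __init__(self, module: nn.Module, onload_device="cuda", copy_back=False):
        self.module = module
        self.onload_device = torch.device(onload_device)
        self.copy_back = copy_back  # True = baseline behaviour, False = what this PR proposes
        # Keep references to the original CPU tensors so "offloading" never needs a GPU->CPU copy.
        self.cpu_copies = {name: p.data for name, p in module.named_parameters()}

    def onload(self):
        # Host-to-device copy before the group's forward pass.
        for name, p in self.module.named_parameters():
            p.data = self.cpu_copies[name].to(self.onload_device, non_blocking=True)

    def offload(self):
        if self.copy_back:
            # Baseline: an expensive device-to-host copy per group, per denoising step.
            for name, p in self.module.named_parameters():
                self.cpu_copies[name] = p.data.to("cpu")
        # Point the parameters back at the CPU copies; the GPU buffers become
        # unreferenced and their memory can be reused for the next group.
        for name, p in self.module.named_parameters():
            p.data = self.cpu_copies[name]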


Improvement

device: A100 40G
model: Qwen/Qwen-Image

           step latency
baseline   54 s
this PR    4.8 s

test code

import time
from diffusers import QwenImagePipeline, QwenImageTransformer2DModel
import torch


def main():
    device = "cuda"
    model_name = "Qwen/Qwen-Image"
    torch_dtype = torch.bfloat16
    pipe: QwenImagePipeline = QwenImagePipeline.from_pretrained(model_name, torch_dtype=torch_dtype)
    # pipe.enable_model_cpu_offload(device=device)

    offload_type = "block_level"
    num_blocks_per_group = 1
    use_stream = False
    assert isinstance(pipe.transformer, QwenImageTransformer2DModel)
    pipe.transformer.enable_group_offload(
        onload_device=device,
        offload_device="cpu",
        offload_type=offload_type,
        num_blocks_per_group=num_blocks_per_group,
        use_stream=use_stream,
    )
    pipe.to(device=device)

    positive_magic = {
        "en": ", Ultra HD, 4K, cinematic composition.",  # for english prompt
        "zh": ", 超清,4K,电影级构图.",  # for chinese prompt
    }

    # Generate image
    prompt = """A coffee shop entrance features a chalkboard sign reading "Qwen Coffee 😊 $2 per cup," with a neon light beside it displaying "通义千问". Next to it hangs a poster showing a beautiful Chinese woman, and beneath the poster is written "π≈3.1415926-53589793-23846264-33832795-02384197" perfect Ultra HD"""

    negative_prompt = (
        "very bad quality"  # use an empty string if you do not have a specific concept to remove
    )

    # Generate with different aspect ratios
    aspect_ratios = {
        "1:1": (1328, 1328),
        "16:9": (1664, 928),
        "9:16": (928, 1664),
        "4:3": (1472, 1140),
        "3:4": (1140, 1472),
        "3:2": (1584, 1056),
        "2:3": (1056, 1584),
    }

    width, height = aspect_ratios["16:9"]
    generator = torch.Generator(device="cpu").manual_seed(42)

    image = pipe(
        prompt=prompt + positive_magic["en"],
        negative_prompt=negative_prompt,
        width=width,
        height=height,
        num_inference_steps=50,
        true_cfg_scale=4.0,
        generator=generator,
    ).images[0]

    image.save(f"example-{int(time.time())}.png")


if __name__ == "__main__":
    main()
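
The step latency above is a wall-clock figure; a minimal way to reproduce such a number (this timing wrapper is an illustration added here, not part of the original script) is to time the pipe(...) call and divide by the number of steps:

# Illustrative timing wrapper, reusing the variables defined in the script above.
import time

num_inference_steps = 50
torch.cuda.synchronize()
start = time.perf_counter()
image = pipe(
    prompt=prompt + positive_magic["en"],
    negative_prompt=negative_prompt,
    width=width,
    height=height,
    num_inference_steps=num_inference_steps,
    true_cfg_scale=4.0,
    generator=generator,
).images[0]
torch.cuda.synchronize()
elapsed = time.perf_counter() - start
print(f"average step latency: {elapsed / num_inference_steps:.2f}s")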

What does this PR do?

Fixes # (issue)

Before submitting

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

Refactor offloading logic to simplify memory management.
@gameofdimension (Contributor Author)

@DN6 @yiyixuxu Could you please take a look at this change?

github-actions bot commented Feb 3, 2026

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions bot added the "stale" label (Issues that haven't received updates) on Feb 3, 2026
@iwr-redmond

Pinging to remove stale.

DN6 removed the "stale" label (Issues that haven't received updates) on Feb 4, 2026
DN6 (Collaborator) commented Feb 4, 2026

Hi @gameofdimension, thanks for putting this together. I believe this change could lead to a big spike in CPU RAM usage, right? Would you mind benchmarking the change to get an idea of throughput (iterations/second), GPU VRAM, and CPU RAM usage?
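
A minimal way to collect those numbers, sketched here purely as an illustration (it assumes psutil is installed and reuses pipe, prompt, and generator from the test script above; it is not part of this PR):

import time
import psutil  # assumed dependency, used only for the CPU RSS reading
import torch

steps = 50
proc = psutil.Process()
torch.cuda.reset_peak_memory_stats()

start = time.perf_counter()
image = pipe(prompt=prompt, num_inference_steps=steps, generator=generator).images[0]
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

print(f"throughput:      {steps / elapsed:.3f} it/s")
print(f"peak GPU VRAM:   {torch.cuda.max_memory_allocated() / 1024**3:.2f} GiB")
print(f"process CPU RSS: {proc.memory_info().rss / 1024**3:.2f} GiB")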

@gameofdimension (Contributor Author)

> Hi @gameofdimension, thanks for putting this together. I believe this change could lead to a big spike in CPU RAM usage, right? Would you mind benchmarking the change to get an idea of throughput (iterations/second), GPU VRAM, and CPU RAM usage?

  1. GPU memory utilization remains unchanged
  2. CPU RAM usage will increase a lot
  3. Given the substantial latency improvements demonstrated above, perhaps we can make this an optional feature with runtime configuration

DN6 (Collaborator) commented Feb 5, 2026

@gameofdimension Just curious, why not just enable similar behaviour through CUDA streams? The expectation is already set there that there will be a trade-off between CPU memory and speed. Is there some specific case where you don't want to use streams?
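
For reference, the streams path referred to here is the same enable_group_offload call as in the test script above, just with use_stream=True:

# Same configuration as the test script, with the CUDA-streams path enabled instead.
pipe.transformer.enable_group_offload(
    onload_device=device,
    offload_device="cpu",
    offload_type="block_level",
    num_blocks_per_group=1,
    use_stream=True,
)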

@gameofdimension (Contributor Author)

@DN6 IMHO use_stream=True/False alone should not make such a huge difference. If I understood correctly, use_stream=True enables overlapping computation with weight transfers, while use_stream=False processes them sequentially. The maximum latency difference would be (computation_time + IO_time) vs max(computation_time, IO_time), i.e. at most a 2x difference (for example, 1 s of compute plus 1 s of transfers per step is 2 s sequentially versus 1 s fully overlapped).

Based on the observed latency differences, I proposed this adjustment for consideration.

@iwr-redmond

> why not just enable similar behaviour through CUDA streams?

Wouldn't that approach also disadvantage XPU users?
