[Core] fix QKV fusion for attention #8829
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
so this did not work at all before - can we make sure to test in the future, even for optimization PRs?
Yeah, Yiyi, I am sorry for the oversight on my part. Fusing the attention projection matrices becomes relevant when you are quantizing (especially at lower precisions like int8). For the individual projection matrices, the dimensions are too thin for quantization to show its magic, so we fuse them into a single, wider projection. We talk a bit about this here: https://pytorch.org/blog/accelerating-generative-ai-3/#dynamic-int8-quantization. Previously, quantization still worked when the dimensionality constraints were satisfied, but we didn't get any of the benefit that comes from thickening the dimensionality of the attention projection layers. With the changes in this PR, I am able to obtain the following numbers with quantization.

Without fusion + 8bit quant: Execution time: 6.470 sec

So, I would argue this is still significant for such a small change.

Code:
from diffusers import DiffusionPipeline
import argparse
import torch
import time
import bitsandbytes as bnb
import json
SHORT_NAME_MAPPER = {
    "stabilityai/stable-diffusion-3-medium-diffusers": "sd3",
    "PixArt-alpha/PixArt-Sigma-XL-2-1024-MS": "pixart",
}
def load_pipeline(args):
    pipeline = DiffusionPipeline.from_pretrained(args.ckpt_id, torch_dtype=torch.float16).to("cuda")

    def replace_regular_linears(module, mode="8bit"):
        for name, child in module.named_children():
            if isinstance(child, torch.nn.Linear):
                in_features = child.in_features
                out_features = child.out_features
                device = child.weight.data.device
                # Create and configure the Linear layer
                has_bias = True if child.bias is not None else False
                if mode == "8bit":
                    new_layer = bnb.nn.Linear8bitLt(in_features, out_features, bias=has_bias, has_fp16_weights=False)
                else:
                    # TODO: Make that configurable
                    # fp16 for compute dtype leads to faster inference
                    # and one should almost always use nf4 as a rule of thumb
                    bnb_4bit_compute_dtype = torch.float16
                    quant_type = "nf4"
                    new_layer = bnb.nn.Linear4bit(
                        in_features,
                        out_features,
                        bias=has_bias,
                        compute_dtype=bnb_4bit_compute_dtype,
                        quant_type=quant_type,
                    )
                new_layer.load_state_dict(child.state_dict())
                new_layer = new_layer.to(device)
                # Set the attribute
                setattr(module, name, new_layer)
            else:
                # Recursively apply to child modules
                replace_regular_linears(child, mode=mode)

    if args.fuse:
        pipeline.transformer.fuse_qkv_projections()
    if args.mode is not None:
        replace_regular_linears(pipeline.transformer, args.mode)

    pipeline.set_progress_bar_config(disable=True)
    return pipeline
if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--ckpt_id", default="stabilityai/stable-diffusion-3-medium-diffusers", type=str, choices=list(SHORT_NAME_MAPPER.keys()))
    parser.add_argument("--mode", default=None, type=str, choices=["8bit", "4bit"])
    parser.add_argument("--fuse", default=0, type=int, choices=[0, 1])
    parser.add_argument("--prompt", default="a golden vase with different flowers", type=str)
    args = parser.parse_args()

    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()

    pipeline = load_pipeline(args)

    for _ in range(5):
        _ = pipeline(args.prompt, generator=torch.manual_seed(2024))

    start = time.time()
    output = pipeline(args.prompt, generator=torch.manual_seed(2024))
    end = time.time()
    mem_bytes = torch.cuda.max_memory_allocated()

    image = output.images[0]
    filename_prefix = f"{SHORT_NAME_MAPPER[args.ckpt_id]}" + "_".join(args.prompt.split(" "))
    if args.mode is not None:
        filename_prefix += f"_{args.mode}"
    if args.fuse:
        filename_prefix += f"_fuse@{args.fuse}"
    image.save(f"{filename_prefix}.png")

    print(f"Memory: {mem_bytes/(10**6):.3f} MB")
    print(f"Execution time: {(end - start):.3f} sec")
    info = dict(memory=f"{mem_bytes/(10**6):.3f}", time=f"{(end - start):.3f}")
    with open(f"{filename_prefix}.json", "w") as f:
        json.dump(info, f)

LMK if something is still unclear. I have tried my best to modify the tests so that we can be more rigorous about these silent bugs. But let me know if you have further ideas.
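For readers less familiar with the trick being discussed: fusing QKV means concatenating the separate query, key, and value projection weights into one wider linear layer, so the quantized kernel runs over a larger matmul. A minimal, illustrative sketch of the idea (not the actual diffusers implementation; fuse_qkv is a made-up helper name):

```python
import torch
import torch.nn as nn

def fuse_qkv(q: nn.Linear, k: nn.Linear, v: nn.Linear) -> nn.Linear:
    # Stack the three thin projection matrices into one wide one so that the
    # single projection matmul (and its int8 kernel) sees a larger output dim.
    has_bias = q.bias is not None
    fused = nn.Linear(
        q.in_features,
        q.out_features + k.out_features + v.out_features,
        bias=has_bias,
    )
    with torch.no_grad():
        # nn.Linear weights are (out_features, in_features), so concatenate along dim 0.
        fused.weight.copy_(torch.cat([q.weight, k.weight, v.weight], dim=0))
        if has_bias:
            fused.bias.copy_(torch.cat([q.bias, k.bias, v.bias], dim=0))
    return fused

# At attention time, a fused processor runs one projection and splits the result:
# qkv = fused(hidden_states); query, key, value = qkv.chunk(3, dim=-1)
```

With the benchmark script above, the comparison then amounts to toggling --fuse 0/1 together with --mode 8bit or --mode 4bit.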
thanks!
I think it indeed makes sense to add this!
thanks! I left one comment about the test
looks very good to me otherwise
@yiyixuxu done!
thanks!
Thank you for bearing with my oversight. Appreciate the patience.
* start debugging the problem
* start
* fix
* fix
* fix imports
* handle hunyuan
* remove residuals
* add a check for making sure there's appropriate procs
* add more rigor to the tests
* fix test
* remove redundant check
* fix-copies
* move check_qkv_fusion_matches_attn_procs_length and check_qkv_fusion_processors_exist
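For reference, here is a rough guess at what the two check helpers mentioned in the commits above verify; this is a sketch only, assuming the model exposes an attn_processors mapping, and the actual bodies live in the diffusers test utilities:

```python
def check_qkv_fusion_processors_exist(model):
    # After fusion, every attention processor on the model should be one of
    # the "Fused..." processor classes.
    return all(
        proc.__class__.__name__.startswith("Fused")
        for proc in model.attn_processors.values()
    )

def check_qkv_fusion_matches_attn_procs_length(model, original_attn_processors):
    # Fusion should neither drop nor duplicate processors: the number of
    # attention processors must match what the model had before fusion.
    return len(model.attn_processors) == len(original_attn_processors)
```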
What does this PR do?
This PR fixes QKV fusion. Since Attention modules are nested in our modules, the QKV fusion processors should be applied recursively. Additionally it:
* updates FusedJointAttnProcessor2_0 to properly make use of the fused matrices.
* updates FusedHunyuanAttnProcessor2_0 to respect its use of rotary embeddings.
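To illustrate the "applied recursively" point: the fused processors have to be set on every Attention module in the tree, not just the top-level ones. A hedged sketch of the shape of the fix, where apply_fused_processors, attention_cls, fused_processor_cls, fuse_projections, and set_processor are illustrative names standing in for the real diffusers hooks:

```python
import torch.nn as nn

def apply_fused_processors(model: nn.Module, attention_cls, fused_processor_cls):
    # nn.Module.modules() yields the module itself and *all* nested submodules,
    # so attention blocks buried inside transformer blocks are reached too,
    # unlike an iteration over direct children only.
    for module in model.modules():
        if isinstance(module, attention_cls):
            module.fuse_projections()                    # assumed per-module fusion hook
            module.set_processor(fused_processor_cls())  # assumed processor setter
```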