[WIP] Add Kandinsky decoder #3330

ayushtues · 2023-05-04T10:29:21Z

Adds MOVQ based decoder for Kandinsky 2.1, part of #3308

to-do:

load pretrained weights from original repo
ensure forward passes result in same output
Integrate with the pipeline

HuggingFaceDocBuilderDev · 2023-05-04T10:35:14Z

The documentation is not available anymore as the PR was closed or merged.

yiyixuxu · 2023-05-04T19:42:20Z

looks super good! thanks!
will do a review in detail later today

yiyixuxu

@ayushtues
great job!! I left a few comments. I will ask @patrickvonplaten or @williamberman give a review too.
we can wait to make changes after getting their feedback :)

src/diffusers/models/vae.py

src/diffusers/models/unet_2d_blocks.py

patrickvonplaten · 2023-05-05T18:20:03Z

src/diffusers/models/attention.py

@@ -55,12 +55,18 @@ def __init__(
        norm_num_groups: int = 32,
        rescale_output_factor: float = 1.0,
        eps: float = 1e-5,
+        use_spatial_norm: bool = False,


This class is deprecated and should not be used anymore cc @williamberman big time we remove it ;-)

Let's please try to solve this in the other attention class

@yiyixuxu Can you help to instead make this work with the Attention class in attention_processor.py?

@patrickvonplaten

So instead of using existing UNetMidBlock2D, AttnUpDecoderBlock2D (which currently use the deprecated class), we should:

write new decoder blocks to use the Attention class instead, think it will be very easy, just add an option here https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py#L60

once UNetMidBlock2D, AttnUpDecoderBlock2D are refactored with Attention class, we consolidate them

is this plan ok?

We will need to use AttnAddedKVProcessor since to replicate the attention used in Kandinksy, we need residual connections and a norm on the hiddens states before passing for q-k-v calculations (both of which are only present in AttnAddedKVProcessor, but it has this concatenation thing going on in the k-v calculation here, which needs to be tackled.

This is the attention implementation in the original repo for reference:

h_ = x h_ = self.norm(h_, zq) q = self.q(h_) k = self.k(h_) v = self.v(h_) # compute attention b, c, h, w = q.shape q = q.reshape(b, c, h * w) q = q.permute(0, 2, 1) # b,hw,c k = k.reshape(b, c, h * w) # b,c,hw w_ = torch.bmm(q, k) # b,hw,hw w[b,i,j]=sum_c q[b,i,c]k[b,c,j] w_ = w_ * (int(c) ** (-0.5)) w_ = torch.nn.functional.softmax(w_, dim=2) # attend to values v = v.reshape(b, c, h * w) w_ = w_.permute(0, 2, 1) # b,hw,hw (first hw of k, second of q) h_ = torch.bmm(v, w_) # b, c,hw (hw of q) h_[b,c,j] = sum_i v[b,c,i] w_[b,i,j] h_ = h_.reshape(b, c, h, w) h_ = self.proj_out(h_) return x + h_

@ayushtues ohh I don't think we should use AttnAddedKVProcessor since we don't have these additional projection layers for k and v as the processor name indicated 😁

for attention processors, I think we can either

create a new attention processor, e.g. MOVQAttnProcessor

or make it work with AttnProcessor (https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py#L381), we need to:

add an argument to allow skip-connection

add the norm layer, maybe before this line, we can add https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py#L393

if self.group_norm: hidden_states = self.group_norm(hidden_states)

we also need slightly refactor theAttention class here (https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py#L92) so the group_norm layer can be spatial_norm

I would wait for @patrickvonplaten to clarify how we should approach this. It's a little bit tricky IMO because it's overlapping a larger effort here to completely replace the deprecated AttentionBlock class

To me, this sounds like a better approach:

create a new attention processor, e.g. MOVQAttnProcessor

(with the new Attention class)

I think this helps to eliminate any overlap between this PR and the larger effort for refactoring the attention blocks.

Okay makes sense, will add a new processor for this then, thanks!

@ayushtues great! let me know once you need another review:) we need to add tests too once everything looks good

I instead added a MOVQAttention in unet_2d_blocks here https://github.com/ayushtues/diffusers/blob/kandinsky_decoder/src/diffusers/models/unet_2d_blocks.py#L1957, which uses the basic AttentionProcessor, and handles all the other processing needed in it, so that we need to make no changes in either attention.py or attention_processor.py, something similar seemed to be done here in the past: https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/t5_film_transformer.py#L185

Let me know how this looks, and maybe we can move this block elsewhere if needed

src/diffusers/models/resnet.py

src/diffusers/models/vq_model.py

src/diffusers/models/attention.py

src/diffusers/models/unet_2d_blocks.py

src/diffusers/models/vae.py

src/diffusers/models/vq_model.py

ayushtues · 2023-05-06T17:15:08Z

Thanks @yiyixuxu @patrickvonplaten for the detailed review, will have a look and make the changes soon!

patrickvonplaten · 2023-05-10T18:29:12Z

Great ❤️ Let us know when you need another review @ayushtues

JincanDeng · 2023-05-12T07:00:58Z

Looking forward to this!

ayushtues · 2023-05-12T10:34:31Z

@yiyixuxu was able to use the decoder implementation in diffusers and generate images, colab here : https://colab.research.google.com/drive/1jhMcNi9k3xkkuDHZh6jgY0MvsSODWden?usp=sharing

Let me integrate this into the pipeline next

patrickvonplaten · 2023-05-12T11:56:18Z

Exciting!

ayushtues · 2023-05-12T12:54:14Z

Integrated decoder into text2img pipeline, the below works! Colab : https://colab.research.google.com/drive/1jhMcNi9k3xkkuDHZh6jgY0MvsSODWden#scrollTo=vRwq-F6Q-mjv

from diffusers import KandinskyPipeline, KandinskyPriorPipeline
from transformers import AutoTokenizer

import torch
import numpy as np

device = "cuda"

# # inputs
prompt= "red cat, 4k photo"
batch_size=1 


# # create prior 
pipe_prior = KandinskyPriorPipeline.from_pretrained("YiYiXu/Kandinsky-prior")
pipe_prior.to("cuda")

# use prior to generate image_emb based on our prompt
generator = torch.Generator(device=device).manual_seed(0)
image_emb = pipe_prior(prompt, generator=generator,)
zero_image_emb = pipe_prior("")

pipe = KandinskyPipeline.from_pretrained("ayushtues/test-kandinksy")
pipe.to(device)


generator = torch.Generator(device="cuda").manual_seed(0)
samples = pipe(
    prompt,
    image_embeds=image_emb,
    negative_image_embeds =zero_image_emb,
    height=768,
    width=768,
    num_inference_steps=100,
    generator=generator )

samples[0].save("k_image_d_test.png")

ayushtues · 2023-05-12T12:55:10Z

@yiyixuxu maybe you can give it a review now

yiyixuxu

Thanks for addressing our feedback and great job overall!

However, I think we need to do the attention differently -it's going to be pretty straightforward now as this PR just got merged #3387

I left some comments but I think the easiest way is for me to merge this into my PR and change it from there. We can review it together afterward :)

In the meantime, feel free to start on the img2img!

yiyixuxu · 2023-05-12T16:49:34Z

src/diffusers/models/vq_model.py

@@ -82,9 +82,11 @@ def __init__(
        norm_num_groups: int = 32,
        vq_embed_dim: Optional[int] = None,
        scaling_factor: float = 0.18215,
+        norm_type: str = "default"


Suggested change

norm_type: str = "default"

norm_type: str = "default" # "default", "spatial"

yiyixuxu · 2023-05-12T17:43:18Z

src/diffusers/models/unet_2d_blocks.py

@@ -426,15 +427,23 @@ def __init__(

        for _ in range(num_layers):
            if self.add_attention:
-                attentions.append(


I think we shouldn't change this class -

yiyixuxu · 2023-05-12T17:45:47Z

src/diffusers/pipelines/kandinsky/pipeline_kandinsky.py

@@ -44,6 +45,21 @@ def get_new_h_w(h, w):
        new_w += 1
    return new_h * 8, new_w * 8

+def process_images(batch):


we don't need this function, let's do post-processing similar to what's done here https://github.com/huggingface/diffusers/blob/kandinsky/src/diffusers/pipelines/kandinsky/pipeline_kandinsky_inpaint.py#L452

yiyixuxu · 2023-05-12T17:47:03Z

src/diffusers/pipelines/kandinsky/pipeline_kandinsky.py

@@ -94,6 +114,13 @@ def prepare_latents(self, shape, dtype, device, generator, latents, scheduler):
        latents = latents * scheduler.init_noise_sigma
        return latents

+    def get_image(self, latents):


same, don't need to create a separate method

yiyixuxu · 2023-05-12T17:47:23Z

src/diffusers/pipelines/kandinsky/pipeline_kandinsky.py

@@ -371,4 +398,5 @@ def __call__(

            _, latents = latents.chunk(2)

-        return latents
+        images = self.get_image(latents)


let's do post-processing similar to what's done here https://github.com/huggingface/diffusers/blob/kandinsky/src/diffusers/pipelines/kandinsky/pipeline_kandinsky_inpaint.py#L452

yiyixuxu · 2023-05-12T17:48:19Z

src/diffusers/models/unet_2d_blocks.py

@@ -1945,6 +1955,30 @@ def custom_forward(*inputs):
        return hidden_states


+class MOVQAttention(nn.Module):


I think we don't need to create a new class anymore now this PR is merged #3387

This reverts commit e74c173.

ayushmangal added 3 commits May 4, 2023 15:26

Add movq

1a16918

Merge branch 'kandinsky' into kandinsky_decoder

e905cb9

Merge decoder conversion script with others

e0582c1

ayushtues mentioned this pull request May 4, 2023

Add Kandinsky 2.1 #3308

Merged

11 tasks

yiyixuxu reviewed May 5, 2023

View reviewed changes

src/diffusers/models/vae.py Outdated Show resolved Hide resolved

src/diffusers/models/vae.py Outdated Show resolved Hide resolved

src/diffusers/models/unet_2d_blocks.py Outdated Show resolved Hide resolved

src/diffusers/models/unet_2d_blocks.py Outdated Show resolved Hide resolved

yiyixuxu requested review from patrickvonplaten, williamberman and sayakpaul May 5, 2023 01:26