
[LoRA] Discussions on ensuring robust LoRA support in Diffusers #3620


Closed · sayakpaul opened this issue May 31, 2023 · 17 comments

Comments

@sayakpaul
Member

For the last few months, we have been collaborating with our contributors to ensure we support LoRA effectively and efficiently from Diffusers:

1. Training support

DreamBooth (letting users perform LoRA fine-tuning of both UNet and text-encoder). There were some issues in the text encoder part which are now being fixed in #3437. Thanks to @takuma104.
Vanilla text-to-image fine-tuning. We purposefully support LoRA fine-tuning of only the UNet here, since we assume the number of image-caption pairs is higher than what is typically used for DreamBooth, and text encoder fine-tuning is therefore probably overkill.

2. Interoperability

With #3437, we're introducing limited support for loading A1111 CivitAI checkpoints with pipeline.load_lora_weights(). This has been a widely requested feature (see #3064 as an example).
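For illustration, loading one of those checkpoints could look roughly like the snippet below. The repo id and file name are placeholders, and the weight_name keyword is assumed to be forwarded by load_lora_weights for selecting a specific .safetensors file:

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Load an A1111/CivitAI-style LoRA checkpoint (placeholder repo id and file name).
pipe.load_lora_weights("some-user/some-civitai-lora", weight_name="lora.safetensors")

image = pipe("a prompt in the style of the LoRA", num_inference_steps=30).images[0]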

We also provide a convert_lora_safetensor_to_diffusers.py script that allows converting A1111 LoRA checkpoints (potentially non-exhaustively) and merging them into the text encoder and the UNet of a DiffusionPipeline. However, this doesn't allow switching the attention processors back to the default ones, unlike the current approach in Diffusers. Check out https://huggingface.co/docs/diffusers/main/en/training/lora for more details. For inference-only and definitive workflows (where one doesn't need to switch attention processors), it caters to many use cases.

3. xformers support for efficient inference

Once LoRA parameters are loaded into a pipeline, xformers should work seamlessly. There was apparently a problem with that and it's fixed in #3556.
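For completeness, enabling xformers after loading the LoRA is the usual one-liner (assuming xformers is installed; the LoRA path is a placeholder):

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
pipe.load_lora_weights("path/to/lora")  # placeholder

# Memory-efficient attention via xformers, working with the LoRA attention processors after #3556.
pipe.enable_xformers_memory_efficient_attention()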

4. PT 2.0 SDPA optimization

See: #3594

5. torch.compile() compatibility with LoRA

Once 4. is settled, we should be able to take advantage of torch.compile().
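Once that lands, the expectation is that the standard torch.compile() usage applies, along these lines (a sketch assuming PyTorch 2.0; the LoRA path is a placeholder):

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
pipe.load_lora_weights("path/to/lora")  # placeholder

# Compile the UNet, which dominates the sampling cost.
pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)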

6. Introduction of scale for controlling the contribution of the text encoder LoRA

See #3480. We already support passing scale as a part of cross_attention_kwargs for the UNet LoRA.
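For the UNet LoRA, that currently looks like the snippet below, where scale=1.0 applies the full LoRA contribution and 0.0 disables it (assumes pipe already has LoRA weights loaded, as in the earlier snippets):

# `pipe` is a pipeline with LoRA weights loaded via pipe.load_lora_weights(...).
image = pipe(
    "a portrait of a character",
    num_inference_steps=30,
    cross_attention_kwargs={"scale": 0.5},  # halve the LoRA contribution
).images[0]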

7. Supporting multiple LoRAs

@takuma104 proposed a hook-based design here: #3064 (comment)

I hope this helps to provide a consolidated view of where we're at regarding supporting LoRA from Diffusers.

Cc: @pcuenca @patrickvonplaten

@patrickvonplaten
Contributor

Thanks a lot for the great summary!

I agree with all the points except for 7., where I'd like to wait a bit since I don't (yet) see the importance of having multiple LoRAs loaded into the model at once. Let me open a quick draft PR for 6. and link it to #3480.

@sayakpaul
Member Author

I agree with all the points except for 7., where I'd like to wait a bit since I don't (yet) see the importance of having multiple LoRAs loaded into the model at once.

Same.

@bghira
Contributor

bghira commented Jun 1, 2023

I have been experimenting with SD 2.1 fine-tuning, and my results show that tuning the text encoder is pretty important but also dangerous: it is easily pushed over some kind of numeric cliff into "catastrophic forgetting" territory.

I have mostly started freezing all but the last 4 to 7 layers of OpenCLIP during fine-tuning. For this I supplied about 30,000 high-quality image-caption pairs, captioned by hand by human volunteers.

The learning rate is very important for the text encoder, far more than for the UNet, and the schedules in the current Diffusers get_scheduler code hardcode an lr_end of 1e-7, which is too high for the consumer systems this training is typically done on. Furthermore, the lr_scale option should likely be tuned and enabled by default.

We need to wrap the optimizer in a class that emulates a single target but uses two learning rate schedulers internally, so that we can train the text encoder more slowly than the UNet.
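One way to approximate that today is with per-parameter-group learning rates in a single optimizer; a minimal sketch (the parameter lists below are placeholders, not variables from the Diffusers training scripts):

import torch

# Placeholder parameter lists standing in for the trainable LoRA parameters
# collected from the UNet and the text encoder.
unet_lora_params = [torch.nn.Parameter(torch.zeros(4, 4))]
text_encoder_lora_params = [torch.nn.Parameter(torch.zeros(4, 4))]

# Give the text encoder its own, lower learning rate via a separate param group.
optimizer = torch.optim.AdamW(
    [
        {"params": unet_lora_params, "lr": 1e-4},
        {"params": text_encoder_lora_params, "lr": 1e-5},
    ]
)

# A single scheduler decays both groups proportionally, so the text encoder
# rate stays below the UNet rate for the whole run.
lr_scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100_000)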

My current workarounds are: freezing the TE, stopping its training about 25% of the way through the final run (25k steps out of 100k steps), and providing a large number of "balance" images, the highest-quality human photos from the dataset, fed at a 20% rate relative to the "real training data". This ensured the least forgetting at the end, with the highest-quality results.

Additionally, I've added a patched betas schedule to my training scripts at bghira/SimpleTuner, which I derived from these examples. This enables "enforced terminal SNR", which drastically improved the perceived quality of the outputs.
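For reference, "enforced terminal SNR" corresponds to rescaling the betas schedule so the final timestep has zero SNR. A minimal sketch of that rescaling, following the published zero-terminal-SNR recipe rather than the exact SimpleTuner patch:

import torch

def rescale_zero_terminal_snr(betas: torch.Tensor) -> torch.Tensor:
    # Rescale a betas schedule so the final timestep has exactly zero SNR.
    alphas = 1.0 - betas
    alphas_cumprod = torch.cumprod(alphas, dim=0)
    alphas_bar_sqrt = alphas_cumprod.sqrt()

    # Shift so the last value is zero, then rescale so the first value is unchanged.
    alphas_bar_sqrt_0 = alphas_bar_sqrt[0].clone()
    alphas_bar_sqrt_T = alphas_bar_sqrt[-1].clone()
    alphas_bar_sqrt -= alphas_bar_sqrt_T
    alphas_bar_sqrt *= alphas_bar_sqrt_0 / (alphas_bar_sqrt_0 - alphas_bar_sqrt_T)

    # Convert back to betas.
    alphas_bar = alphas_bar_sqrt**2
    alphas = alphas_bar[1:] / alphas_bar[:-1]
    alphas = torch.cat([alphas_bar[0:1], alphas])
    return 1.0 - alphas

# Example: rescale the scaled-linear betas schedule used by Stable Diffusion.
betas = torch.linspace(0.00085**0.5, 0.012**0.5, 1000) ** 2
betas = rescale_zero_terminal_snr(betas)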

@sayakpaul
Member Author

sayakpaul commented Jun 2, 2023

Thanks for sharing these insights! Feel free to also share a link to your repo and some visual results.

@jelling

jelling commented Jun 5, 2023

Thanks a lot for the great summary!

I agree with all the points except for 7., where I'd like to wait a bit since I don't (yet) see the importance of having multiple LoRAs loaded into the model at once. Let me open a quick draft PR for 6. and link it to #3480.

We are training custom LoRAs for different characters in a story. So while many LoRA users will just be applying a global style, this is super important for us.

@patrickvonplaten
Contributor

Thanks for the answer @jelling, so I guess the important part is to be able to quickly set/unset different LoRAs, no?
The load_lora_weights function already supports loading/setting dictionaries:

def load_lora_weights(self, pretrained_model_name_or_path_or_dict: Union[str, Dict[str, torch.Tensor]], **kwargs):

Would the following solution be ok for your case @jelling:

import torch

from diffusers import DiffusionPipeline
from diffusers.utils import _get_model_file

lora_repo_ids = []  # list all LoRA repo ids here

pipe = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16)

def load_state_dict(repo_id):
    # download (or fetch from the cache) the LoRA weights for a repo and load them on CPU
    model_file = _get_model_file(repo_id, "pytorch_lora_weights.bin")
    state_dict = torch.load(model_file, map_location="cpu")
    return state_dict

# make sure all LoRA weights are in RAM
lora_state_dicts = {k: load_state_dict(k) for k in lora_repo_ids}

pipe.load_lora_weights(lora_state_dicts["<first-character>"])

pipe(...)

pipe.load_lora_weights(lora_state_dicts["<second-character>"])

=> this way should be pretty efficient. Would this work for you?

@frankjoshua

I want to advocate for step 7. Many people are mixing LoRAs in A1111 with very interesting results. It would be nice to have many loaded at one time and to pass in weights during inference. You could set the weight to zero for LoRAs that you did not want activated.

@jelling

jelling commented Jun 8, 2023

I want to advocate for step 7. Many people are mixing LoRAs in A1111 with very interesting results. It would be nice to have many loaded at one time and to pass in weights during inference. You could set the weight to zero for LoRAs that you did not want activated.

@patrickvonplaten this is what I'm trying to accomplish but with diffusers. In your example above, it looks like you are showing how to quickly switch between LoRAs. This is a good feature - and one I was curious about - but we need to run multiple LoRAs at the same time on the same inference. I.e. two character LoRAs would be used to generate a single image.

@sayakpaul
Member Author

@patrickvonplaten this is what I'm trying to accomplish but with diffusers. In your example above, it looks like you are showing how to quickly switch between LoRAs. This is a good feature - and one I was curious about - but we need to run multiple LoRAs at the same time on the same inference. I.e. two character LoRAs would be used to generate a single image.

This is something we're actively watching. Upon sufficient request, we'll start brainstorming about it or might even rely on peft. Stay tuned :)

@lionel-alves

Congrats on supporting the A1111 LoRA format 👏
I also support the request to have multiple LoRAs in a given inference; if you look at CivitAI, you will see that this is very common. Some LoRAs, like add_detail, are broadly used in combination with a character LoRA, for example.

@jelling

jelling commented Jun 8, 2023

This is something we're actively watching. Upon sufficient request, we'll start brainstorming about it or might even rely on peft. Stay tuned :)

@sayakpaul could you tell me anything about what's entailed in adding support? It's important enough for us that we might try adding it ourselves, if it comes to that. I haven't gone through the multi-LoRA section of automatic1111 yet, but is there something about how they do it that is incompatible with the general Diffusers way of doing things?

@sayakpaul
Member Author

I think the main bottleneck is around the design, i.e., IIUC, they merge the LoRA weights into the UNet. This is not how we do it in Diffusers. We make use of specific attention processor classes so that we can unload a LoRA and carry on.

With the weight-merging design, things are relatively simpler, but with our attention processor design, we need to be careful.
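For context, merging amounts to baking the low-rank update into the base weights. A minimal sketch for a single linear layer, illustrating the general LoRA merge rather than A1111's or Diffusers' actual code:

import torch

def merge_lora_into_linear(linear: torch.nn.Linear, lora_down: torch.Tensor, lora_up: torch.Tensor, scale: float = 1.0):
    # LoRA stores a low-rank update (up @ down) next to the base weight W.
    # Merging adds that update into W in place; afterwards the LoRA can no
    # longer be toggled off without keeping a copy of the original weight.
    with torch.no_grad():
        linear.weight += scale * (lora_up @ lora_down)

# Toy example with rank-4 LoRA matrices.
layer = torch.nn.Linear(320, 320, bias=False)
down = torch.randn(4, 320) * 0.01  # (rank, in_features)
up = torch.randn(320, 4) * 0.01    # (out_features, rank)
merge_lora_into_linear(layer, down, up, scale=0.7)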

@frankjoshua

frankjoshua commented Jun 10, 2023

Here is some pseudo code representing the way I wish LoRAs worked. Just an idea; I thought code would be better for explaining my thoughts.

pipe_A = StableDiffusionPipeline.from_pretrained(
    pretrained_model_name_or_path="/path/to/cool/civitai/modelA.safetensors",
    torch_dtype=torch.float16,
)

pipe_B = StableDiffusionPipeline.from_pretrained(
    pretrained_model_name_or_path="/path/to/cool/civitai/modelB.safetensors",
    torch_dtype=torch.float16,
)

lora_A = LoRA.from_pretrained(
    pretrained_model_name_or_path="/path/to/cool/civitai/LoraModelA.safetensors"
)

lora_B = LoRA.from_pretrained(
    pretrained_model_name_or_path="/path/to/cool/civitai/LoraModelB.safetensors"
)

image_A = pipe_A(
    prompt="First idea of image",
    loras=[
        {"lora": lora_A, "weight": 0.7},
        {"lora": lora_B, "weight": 0.1},
    ],
)[0]

image_B = pipe_B(
    prompt="Second idea of image",
    loras=[
        {"lora": lora_A, "weight": 0.2},
    ],
)[0]

image_C = pipe_B(
    prompt="Third idea of image",
    loras=None,
)[0]

In my imagination they are also very fast, slowing down inference by at most 10%.

@sayakpaul
Member Author

This should be possible with peft. We are internally brainstorming about it. Will keep our community posted about that.

@sayakpaul
Member Author

Closing this since we introspected quite a bit.

The thread on supporting multiple LoRAs will be a separate one.

@frankjoshua

What issue should we watch? Is there a different thread currently regarding multiple LoRA support?

@sayakpaul
Member Author

Not yet. Will begin soon.
