[docs] Adds a doc on LoRA support for diffusers #2086

Merged · 6 commits · Jan 25, 2023
2 changes: 2 additions & 0 deletions docs/source/en/_toctree.yml
@@ -71,6 +71,8 @@
title: Dreambooth
- local: training/text2image
title: Text-to-image fine-tuning
- local: training/lora
title: LoRA Support in Diffusers
title: Training
- sections:
- local: conceptual/philosophy
128 changes: 128 additions & 0 deletions docs/source/en/training/lora.mdx
@@ -0,0 +1,128 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# LoRA Support in Diffusers

Diffusers supports LoRA for Stable Diffusion, enabling faster fine-tuning with greater memory efficiency and easier portability.

Low-Rank Adaptation of Large Language Models was first introduced by Microsoft in
[LoRA: Low-Rank Adaptation of Large Language Models](https://arxiv.org/abs/2106.09685) by *Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen*.

In a nutshell, LoRA allows adapting pretrained models by adding pairs of rank-decomposition weight matrices (called **update matrices**)
to existing weights and **only** training those newly added weights. This has a few advantages:

- Previous pretrained weights are kept frozen so that the model is not prone to [catastrophic forgetting](https://www.pnas.org/doi/10.1073/pnas.1611835114).
- Rank-decomposition matrices have significantly fewer parameters than the original model, which means that trained LoRA weights are easily portable.
- LoRA attention layers let you control the extent to which the model is adapted toward new training images via a `scale` parameter.
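
To make this concrete, here is a minimal sketch of a single LoRA-adapted weight (the dimensions, initialization, and `scale` usage below are illustrative assumptions rather than the exact implementation used by the training scripts):

```py
import torch

d, r = 768, 4  # layer width and LoRA rank; r is much smaller than d
W = torch.randn(d, d)         # frozen pretrained weight, never updated
A = torch.randn(r, d) * 0.01  # trainable rank-decomposition matrix
B = torch.zeros(d, r)         # trainable, zero-initialized so training starts exactly from W

scale = 1.0  # controls how strongly the adaptation is applied
W_adapted = W + scale * (B @ A)  # only A and B receive gradient updates
```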
> **Member:** nit: so far we've only mentioned "update matrices", but not how they work or whether they contain attention layers. Maybe we should very briefly introduce the concept? Something simple like "LoRA matrices are added to the model attention layers and they control ..." could work.
>
> **Member Author:** See if the current edits make sense.

[cloneofsimo](https://github.com/cloneofsimo) was the first to try out LoRA training for Stable Diffusion in the popular [lora](https://github.com/cloneofsimo/lora) GitHub repository.

<Tip>

LoRA also allows us to achieve greater memory efficiency: since the pretrained weights are kept frozen and only the LoRA weights are trained,
we can run fine-tuning on consumer GPUs like the Tesla T4.

</Tip>

## Getting started with LoRA for fine-tuning

Stable Diffusion can be fine-tuned in different ways:

* [Textual inversion](https://huggingface.co/docs/diffusers/main/en/training/text_inversion)
* [DreamBooth](https://huggingface.co/docs/diffusers/main/en/training/dreambooth)
* [Text2Image fine-tuning](https://huggingface.co/docs/diffusers/main/en/training/text2image)

We provide two end-to-end examples that show how to run fine-tuning with LoRA:

* [DreamBooth](https://github.com/huggingface/diffusers/tree/main/examples/dreambooth#training-with-low-rank-adaptation-of-large-language-models-lora)
* [Text2Image](https://github.com/huggingface/diffusers/tree/main/examples/text_to_image#training-with-lora)

If you want to perform DreamBooth training with LoRA, for instance, you would run:

```bash
export MODEL_NAME="runwayml/stable-diffusion-v1-5"
export INSTANCE_DIR="path-to-instance-images"
export OUTPUT_DIR="path-to-save-model"

accelerate launch train_dreambooth_lora.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --instance_data_dir=$INSTANCE_DIR \
  --output_dir=$OUTPUT_DIR \
  --instance_prompt="a photo of sks dog" \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=1 \
  --checkpointing_steps=100 \
  --learning_rate=1e-4 \
  --report_to="wandb" \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --max_train_steps=500 \
  --validation_prompt="A photo of sks dog in a bucket" \
  --validation_epochs=50 \
  --seed="0" \
  --push_to_hub
```

Refer to the respective examples linked above to learn more.

<Tip>

When using LoRA, we can use a much higher learning rate (typically 1e-4 as opposed to 1e-5) compared to non-LoRA fine-tuning.

</Tip>

But there is no free lunch. For a given dataset and expected generation quality, you'd still need to experiment with
different hyperparameters. Here are some important ones:

* Training time
  * Learning rate
  * Number of training steps
* Inference time
  * Number of steps
  * Scheduler type

Additionally, you can follow [this blog post](https://huggingface.co/blog/dreambooth) that documents some of our experimental
findings for performing DreamBooth training of Stable Diffusion.

When fine-tuning with LoRA, the update matrices are added only to the attention layers. To enable this, we added new weight-loading
functionality, detailed [here](https://huggingface.co/docs/diffusers/main/en/api/loaders).
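
As a rough sketch of that loader API (using the fine-tuned checkpoint discussed in the next section; the output directory in the last line is hypothetical):

```py
from diffusers import UNet2DConditionModel

# Load the base UNet from a pretrained Stable Diffusion checkpoint.
unet = UNet2DConditionModel.from_pretrained(
    "CompVis/stable-diffusion-v1-4", subfolder="unet"
)

# Add trained LoRA update matrices to the attention layers; this accepts a
# local directory or a Hub repository containing pytorch_lora_weights.bin.
unet.load_attn_procs("sayakpaul/sd-model-finetuned-lora-t4")

# The companion method serializes just the LoRA weights for sharing.
unet.save_attn_procs("my-lora-weights")  # hypothetical output directory
```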

## Inference

Assuming you used the `examples/text_to_image/train_text_to_image_lora.py` script to fine-tune Stable Diffusion on the [Pokémon
dataset](https://huggingface.co/lambdalabs/pokemon-blip-captions), you can perform inference like so:

```py
from diffusers import StableDiffusionPipeline
import torch

model_path = "sayakpaul/sd-model-finetuned-lora-t4"
pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16)
pipe.unet.load_attn_procs(model_path)
pipe.to("cuda")

prompt = "A pokemon with green eyes and red legs."
image = pipe(prompt, num_inference_steps=30, guidance_scale=7.5).images[0]
image.save("pokemon.png")
```

> **Contributor:** (nit) maybe we can show how to retrieve the `base_model` from the model card by loading the yaml code via `huggingface_hub`
>
> **Member Author:**
>
> ```py
> from huggingface_hub.repocard import RepoCard
>
> card = RepoCard.load("sayakpaul/sd-model-finetuned-lora-t4")
> card.data.to_dict()["base_model"]
> # 'CompVis/stable-diffusion-v1-4'
> ```
>
> I guess we would want to show it in a separate code snippet from the doc?
>
> **Member:** Nice! Maybe include it as a tip below the current snippet?
>
> **Contributor:** For me it's fine in the same code snippet
>
> **Member Author:** See if the current changes make sense.

> **Member:** Just wondering, maybe display the image here? We never do it in the docs, what's your opinion about starting to do it to make things more visual?
>
> **Member Author:** Diffusion for computer vision is definitely about visuals. I like the idea and I think we should definitely add it :)
>
> **Member Author:** Added an image.

[`sayakpaul/sd-model-finetuned-lora-t4`](https://huggingface.co/sayakpaul/sd-model-finetuned-lora-t4) contains the [LoRA fine-tuned update matrices](https://huggingface.co/sayakpaul/sd-model-finetuned-lora-t4/blob/main/pytorch_lora_weights.bin),
which are only 3 MB in size. During inference, the pretrained Stable Diffusion checkpoint is loaded alongside these update
matrices, and the two are combined to run inference.
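
The `scale` parameter mentioned earlier controls this blending at inference time. A hedged sketch, reusing `pipe` and `prompt` from the snippet above and assuming your installed version of diffusers accepts the parameter through `cross_attention_kwargs`:

```py
# scale=0.0 uses only the frozen base weights; scale=1.0 applies the full
# LoRA adaptation. Values in between interpolate the two behaviors.
image = pipe(
    prompt,
    num_inference_steps=30,
    guidance_scale=7.5,
    cross_attention_kwargs={"scale": 0.5},  # half-strength adaptation
).images[0]
```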

Inference for DreamBooth training remains the same. Check
[this section](https://github.com/huggingface/diffusers/tree/main/examples/dreambooth#inference-1) for more details.
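
As a brief sketch, DreamBooth-LoRA inference follows the same pattern (here `model_path` is the hypothetical `OUTPUT_DIR` from the training command shown earlier):

```py
from diffusers import StableDiffusionPipeline
import torch

model_path = "path-to-save-model"  # the directory the LoRA weights were saved to
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)
pipe.unet.load_attn_procs(model_path)
pipe.to("cuda")

image = pipe("A photo of sks dog in a bucket", num_inference_steps=25).images[0]
image.save("dog-bucket.png")
```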

## Known limitations

* Currently, we only support LoRA for the attention layers of [`UNet2DConditionModel`](https://huggingface.co/docs/diffusers/main/en/api/models#diffusers.UNet2DConditionModel).
1 change: 1 addition & 0 deletions docs/source/en/training/overview.mdx
@@ -37,6 +37,7 @@ Training examples show how to pretrain or fine-tune diffusion models for a varie
- [Text-to-Image Training](./text2image)
- [Text Inversion](./text_inversion)
- [Dreambooth](./dreambooth)
- [LoRA Support](./lora)

If possible, please [install xFormers](../optimization/xformers) for memory efficient attention. This could help make your training faster and less memory intensive.

6 changes: 3 additions & 3 deletions examples/text_to_image/README.md
@@ -162,9 +162,9 @@ accelerate --mixed_precision="fp16" launch train_text_to_image_lora.py \

The above command will also run inference as fine-tuning progresses and log the results to Weights and Biases.

**___Note: When using LoRA we can use a much higher learning rate compared to non-LoRA fine-tuning. Here we use *1e-4* instead of the usual *1e-5*. Also, by using LoRA, it's possible to run `train_text_to_image_lora.py` in consumer GPUs like T4 or V100.**
**___Note: When using LoRA we can use a much higher learning rate compared to non-LoRA fine-tuning. Here we use *1e-4* instead of the usual *1e-5*. Also, by using LoRA, it's possible to run `train_text_to_image_lora.py` on consumer GPUs like T4 or V100.___**

The final LoRA embedding weights have been uploaded to [sayakpaul/sd-model-finetuned-lora-t4](https://huggingface.co/sayakpaul/sd-model-finetuned-lora-t4). **___Note: [The final weights](https://huggingface.co/sayakpaul/sd-model-finetuned-lora-t4/blob/main/pytorch_lora_weights.bin) are only 3 MB in size, which is orders of magnitudes smaller than the original model.**
The final LoRA embedding weights have been uploaded to [sayakpaul/sd-model-finetuned-lora-t4](https://huggingface.co/sayakpaul/sd-model-finetuned-lora-t4). **___Note: [The final weights](https://huggingface.co/sayakpaul/sd-model-finetuned-lora-t4/blob/main/pytorch_lora_weights.bin) are only 3 MB in size, which is orders of magnitude smaller than the original model.___**

You can check some inference samples that were logged during the course of the fine-tuning process [here](https://wandb.ai/sayakpaul/text2image-fine-tune/runs/q4lc0xsw).

@@ -191,7 +191,7 @@ image.save("pokemon.png")

For faster training on TPUs and GPUs you can leverage the flax training example. Follow the instructions above to get the model and dataset before running the script.

____Note: The flax example don't yet support features like gradient checkpoint, gradient accumulation etc, so to use flax for faster training we will need >30GB cards.___
**___Note: The flax example doesn't yet support features like gradient checkpointing, gradient accumulation etc., so to use flax for faster training we will need >30GB cards.___**


Before running the scripts, make sure to install the library's training dependencies: