From 58a930f5606a6e0193449d2d92fee9bfcd0445f6 Mon Sep 17 00:00:00 2001 From: Sayak Paul Date: Tue, 24 Jan 2023 13:39:30 +0530 Subject: [PATCH 1/5] add: a doc on LoRA support in diffusers. --- docs/source/en/_toctree.yml | 2 + docs/source/en/training/lora.mdx | 128 +++++++++++++++++++++++++++ docs/source/en/training/overview.mdx | 1 + examples/text_to_image/README.md | 6 +- 4 files changed, 134 insertions(+), 3 deletions(-) create mode 100644 docs/source/en/training/lora.mdx diff --git a/docs/source/en/_toctree.yml b/docs/source/en/_toctree.yml index e580181251a9..c463fd843cec 100644 --- a/docs/source/en/_toctree.yml +++ b/docs/source/en/_toctree.yml @@ -71,6 +71,8 @@ title: Dreambooth - local: training/text2image title: Text-to-image fine-tuning + - local: training/lora + title: LoRA Support in Diffusers title: Training - sections: - local: conceptual/philosophy diff --git a/docs/source/en/training/lora.mdx b/docs/source/en/training/lora.mdx new file mode 100644 index 000000000000..81147d977d2e --- /dev/null +++ b/docs/source/en/training/lora.mdx @@ -0,0 +1,128 @@ + + +# LoRA Support in Diffusers + +Diffusers support LoRA for Stable Diffusion for faster fine-tuning allowing greater memory efficiency and easier portability. + +Low-Rank Adaption of Large Language Models was first introduced by Microsoft in +[LoRA: Low-Rank Adaptation of Large Language Models](https://arxiv.org/abs/2106.09685) by *Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen*. + +In a nutshell, LoRA allows adapting pretrained models by adding pairs of rank-decomposition weight matrices (called **update marrices**) +to existing weights and **only** training those newly added weights. This has a couple of advantages: + +- Previous pretrained weights are kept frozen so that model is not prone to [catastrophic forgetting](https://www.pnas.org/doi/10.1073/pnas.1611835114). +- Rank-decomposition matrices have significantly fewer parameters than original model, which means that trained LoRA weights are easily portable. +- LoRA attention layers allow to control to which extent the model is adapted toward new training images via a `scale` parameter. + +[cloneofsimo](https://github.com/cloneofsimo) was the first to try out LoRA training for Stable Diffusion in the popular [lora](https://github.com/cloneofsimo/lora) GitHub repository. + + + +LoRA also allows us to achieve greater memory efficiency since the pretrained weights are kept frozen, only the LoRA weights are trained, thereby +allowing us to run fine-tuning on consumer GPUs like Tesla T4. 
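To make the rank-decomposition idea concrete, here is a minimal sketch of what a LoRA-augmented linear layer could look like. This is an illustrative toy (the `LoRALinear` class, its initialization scheme, and the chosen rank are assumptions made for this example), not the implementation Diffusers actually uses:

```py
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Toy LoRA layer: y = W(x) + scale * B(A(x)), where W is frozen and only A, B train."""

    def __init__(self, base_layer: nn.Linear, rank: int = 4, scale: float = 1.0):
        super().__init__()
        self.base_layer = base_layer
        self.base_layer.requires_grad_(False)  # pretrained weights stay frozen

        # The pair of rank-decomposition update matrices: these are the only
        # trainable parameters, which is why the saved weights are so small.
        self.lora_down = nn.Linear(base_layer.in_features, rank, bias=False)
        self.lora_up = nn.Linear(rank, base_layer.out_features, bias=False)
        nn.init.normal_(self.lora_down.weight, std=1.0 / rank)
        nn.init.zeros_(self.lora_up.weight)  # the update starts out as a no-op

        self.scale = scale  # controls how strongly the adaptation is applied

    def forward(self, hidden_states):
        return self.base_layer(hidden_states) + self.scale * self.lora_up(
            self.lora_down(hidden_states)
        )


layer = LoRALinear(nn.Linear(768, 768), rank=4)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
frozen = sum(p.numel() for p in layer.base_layer.parameters())
print(trainable, frozen)  # 6144 trainable vs. 590592 frozen parameters
```

Setting `scale` to `0.0` recovers the frozen pretrained layer exactly, which is what makes it cheap to dial the strength of the adaptation up or down at inference time.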
+ + + +## Getting started with LoRA for fine-tuning + +Stable Diffusion can be fine-tuned in different ways: + +* [Textual inversion](https://huggingface.co/docs/diffusers/main/en/training/text_inversion) +* [DreamBooth](https://huggingface.co/docs/diffusers/main/en/training/dreambooth) +* [Text2Image fine-tuning](https://huggingface.co/docs/diffusers/main/en/training/text2image) + +We provide two end-to-end examples that show how to run fine-tuning with LoRA: + +* [DreamBooth](https://github.com/huggingface/diffusers/tree/main/examples/dreambooth#training-with-low-rank-adaptation-of-large-language-models-lora) +* [Text2Image](https://github.com/huggingface/diffusers/tree/main/examples/text_to_image#training-with-lora) + +If you want to perform DreamBooth training with LoRA, for instance, you would run: + +```bash +export MODEL_NAME="runwayml/stable-diffusion-v1-5" +export INSTANCE_DIR="path-to-instance-images" +export OUTPUT_DIR="path-to-save-model" + +accelerate launch train_dreambooth_lora.py \ + --pretrained_model_name_or_path=$MODEL_NAME \ + --instance_data_dir=$INSTANCE_DIR \ + --output_dir=$OUTPUT_DIR \ + --instance_prompt="a photo of sks dog" \ + --resolution=512 \ + --train_batch_size=1 \ + --gradient_accumulation_steps=1 \ + --checkpointing_steps=100 \ + --learning_rate=1e-4 \ + --report_to="wandb" \ + --lr_scheduler="constant" \ + --lr_warmup_steps=0 \ + --max_train_steps=500 \ + --validation_prompt="A photo of sks dog in a bucket" \ + --validation_epochs=50 \ + --seed="0" \ + --push_to_hub +``` + +Refer to the respective examples linked above to learn more. + + + +When using LoRA we can use a much higher learning rate (typically 1e-4 as opposed to 1e-5) compared to non-LoRA fine-tuning. + + + +But there is no free lunch. For the given dataset and expected generation quality, you'd still need to experiment with +different hyperparameters. Here are some important ones: + +* Training time + * Learning rate + * Number of training steps +* Inference time + * Number of steps + * Scheduler type + +Additionally, you can follow [this blog](https://huggingface.co/blog/dreambooth) that documents some of our experimental +findings for performing DreamBooth training Stable Diffusion. + +When fine-tuning, the LoRA update matrices are only added to the attention layers. To enable this, we added new weight +loading functionalities. Their details are available [here](https://huggingface.co/docs/diffusers/main/en/api/loaders). + +## Inference + +Assuming, you used the `examples/text_to_image/train_text_to_image_lora.py` to fine-tune Stable Diffusion on the [Pokemons +dataset](https://huggingface.co/lambdalabs/pokemon-blip-captions), you can perform inference like so: + +```py +from diffusers import StableDiffusionPipeline +import torch + +model_path = "sayakpaul/sd-model-finetuned-lora-t4" +pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16) +pipe.unet.load_attn_procs(model_path) +pipe.to("cuda") + +prompt = "A pokemon with green eyes and red legs." +image = pipe(prompt, num_inference_steps=30, guidance_scale=7.5).images[0] +image.save("pokemon.png") +``` + +[`sayakpaul/sd-model-finetuned-lora-t4`](https://huggingface.co/sayakpaul/sd-model-finetuned-lora-t4) contains [LoRA fine-tuned update matrices](https://huggingface.co/sayakpaul/sd-model-finetuned-lora-t4/blob/main/pytorch_lora_weights.bin) +which is only 3 MBs in size. 
During inference, the pre-trained Stable Diffusion checkpoints loaded alongside these update
matrices and then they are combined to run inference.

Inference for DreamBooth training remains the same. Check
[this section](https://github.com/huggingface/diffusers/tree/main/examples/dreambooth#inference-1) for more details.

## Known limitations

* Currently, we only support LoRA for the attention layers of [`UNet2DConditionModel`](https://huggingface.co/docs/diffusers/main/en/api/models#diffusers.UNet2DConditionModel).
diff --git a/docs/source/en/training/overview.mdx b/docs/source/en/training/overview.mdx
index fd6ec184d274..49aab9aa3647 100644
--- a/docs/source/en/training/overview.mdx
+++ b/docs/source/en/training/overview.mdx
@@ -37,6 +37,7 @@ Training examples show how to pretrain or fine-tune diffusion models for a varie
- [Text-to-Image Training](./text2image)
- [Text Inversion](./text_inversion)
- [Dreambooth](./dreambooth)
+- [LoRA Support](./lora)

If possible, please [install xFormers](../optimization/xformers) for memory efficient attention. This could help make your training faster and less memory intensive.
diff --git a/examples/text_to_image/README.md b/examples/text_to_image/README.md
index c9b10ea18a8c..9d7cbdf30d34 100644
--- a/examples/text_to_image/README.md
+++ b/examples/text_to_image/README.md
@@ -162,9 +162,9 @@ accelerate --mixed_precision="fp16" launch train_text_to_image_lora.py \

The above command will also run inference as fine-tuning progresses and log the results to Weights and Biases.

-**___Note: When using LoRA we can use a much higher learning rate compared to non-LoRA fine-tuning. Here we use *1e-4* instead of the usual *1e-5*. Also, by using LoRA, it's possible to run `train_text_to_image_lora.py` in consumer GPUs like T4 or V100.**
+**___Note: When using LoRA we can use a much higher learning rate compared to non-LoRA fine-tuning. Here we use *1e-4* instead of the usual *1e-5*. Also, by using LoRA, it's possible to run `train_text_to_image_lora.py` on consumer GPUs like T4 or V100.___**

-The final LoRA embedding weights have been uploaded to [sayakpaul/sd-model-finetuned-lora-t4](https://huggingface.co/sayakpaul/sd-model-finetuned-lora-t4). **___Note: [The final weights](https://huggingface.co/sayakpaul/sd-model-finetuned-lora-t4/blob/main/pytorch_lora_weights.bin) are only 3 MB in size, which is orders of magnitudes smaller than the original model.**
+The final LoRA embedding weights have been uploaded to [sayakpaul/sd-model-finetuned-lora-t4](https://huggingface.co/sayakpaul/sd-model-finetuned-lora-t4). **___Note: [The final weights](https://huggingface.co/sayakpaul/sd-model-finetuned-lora-t4/blob/main/pytorch_lora_weights.bin) are only 3 MB in size, which is orders of magnitude smaller than the original model.___**

You can check some inference samples that were logged during the course of the fine-tuning process [here](https://wandb.ai/sayakpaul/text2image-fine-tune/runs/q4lc0xsw).

@@ -191,7 +191,7 @@ image.save("pokemon.png")

For faster training on TPUs and GPUs you can leverage the flax training example. Follow the instructions above to get the model and dataset before running the script.
-____Note: The flax example don't yet support features like gradient checkpoint, gradient accumulation etc, so to use flax for faster training we will need >30GB cards.___ +**___Note: The flax example don't yet support features like gradient checkpoint, gradient accumulation etc, so to use flax for faster training we will need >30GB cards.___** Before running the scripts, make sure to install the library's training dependencies: From 233d64683283a4849231cc0d21db86eace2c3f35 Mon Sep 17 00:00:00 2001 From: Sayak Paul Date: Wed, 25 Jan 2023 08:54:34 +0530 Subject: [PATCH 2/5] Apply suggestions from code review Co-authored-by: Pedro Cuenca --- docs/source/en/training/lora.mdx | 22 +++++++++++----------- examples/text_to_image/README.md | 2 +- 2 files changed, 12 insertions(+), 12 deletions(-) diff --git a/docs/source/en/training/lora.mdx b/docs/source/en/training/lora.mdx index 81147d977d2e..bc11e7aa99ad 100644 --- a/docs/source/en/training/lora.mdx +++ b/docs/source/en/training/lora.mdx @@ -12,24 +12,24 @@ specific language governing permissions and limitations under the License. # LoRA Support in Diffusers -Diffusers support LoRA for Stable Diffusion for faster fine-tuning allowing greater memory efficiency and easier portability. +Diffusers supports LoRA for faster fine-tuning of Stable Diffusion, allowing greater memory efficiency and easier portability. Low-Rank Adaption of Large Language Models was first introduced by Microsoft in [LoRA: Low-Rank Adaptation of Large Language Models](https://arxiv.org/abs/2106.09685) by *Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen*. -In a nutshell, LoRA allows adapting pretrained models by adding pairs of rank-decomposition weight matrices (called **update marrices**) +In a nutshell, LoRA allows adapting pretrained models by adding pairs of rank-decomposition weight matrices (called **update matrices**) to existing weights and **only** training those newly added weights. This has a couple of advantages: -- Previous pretrained weights are kept frozen so that model is not prone to [catastrophic forgetting](https://www.pnas.org/doi/10.1073/pnas.1611835114). -- Rank-decomposition matrices have significantly fewer parameters than original model, which means that trained LoRA weights are easily portable. +- Previous pretrained weights are kept frozen so that the model is not so prone to [catastrophic forgetting](https://www.pnas.org/doi/10.1073/pnas.1611835114). +- Rank-decomposition matrices have significantly fewer parameters than the original model, which means that trained LoRA weights are easily portable. - LoRA attention layers allow to control to which extent the model is adapted toward new training images via a `scale` parameter. [cloneofsimo](https://github.com/cloneofsimo) was the first to try out LoRA training for Stable Diffusion in the popular [lora](https://github.com/cloneofsimo/lora) GitHub repository. -LoRA also allows us to achieve greater memory efficiency since the pretrained weights are kept frozen, only the LoRA weights are trained, thereby -allowing us to run fine-tuning on consumer GPUs like Tesla T4. +LoRA allows us to achieve greater memory efficiency since the pretrained weights are kept frozen and only the LoRA weights are trained, thereby +allowing us to run fine-tuning on consumer GPUs like Tesla T4, RTX 3080 or even RTX 2080 Ti! @@ -77,7 +77,7 @@ Refer to the respective examples linked above to learn more. 
-When using LoRA we can use a much higher learning rate (typically 1e-4 as opposed to 1e-5) compared to non-LoRA fine-tuning. +When using LoRA we can use a much higher learning rate (typically 1e-4 as opposed to ~1e-6) compared to non-LoRA Dreambooth fine-tuning. @@ -92,15 +92,15 @@ different hyperparameters. Here are some important ones: * Scheduler type Additionally, you can follow [this blog](https://huggingface.co/blog/dreambooth) that documents some of our experimental -findings for performing DreamBooth training Stable Diffusion. +findings for performing DreamBooth training of Stable Diffusion. When fine-tuning, the LoRA update matrices are only added to the attention layers. To enable this, we added new weight loading functionalities. Their details are available [here](https://huggingface.co/docs/diffusers/main/en/api/loaders). ## Inference -Assuming, you used the `examples/text_to_image/train_text_to_image_lora.py` to fine-tune Stable Diffusion on the [Pokemons -dataset](https://huggingface.co/lambdalabs/pokemon-blip-captions), you can perform inference like so: +Assuming you used the `examples/text_to_image/train_text_to_image_lora.py` to fine-tune Stable Diffusion on the [Pokemon +dataset](https://huggingface.co/datasets/lambdalabs/pokemon-blip-captions), you can perform inference like so: ```py from diffusers import StableDiffusionPipeline @@ -117,7 +117,7 @@ image.save("pokemon.png") ``` [`sayakpaul/sd-model-finetuned-lora-t4`](https://huggingface.co/sayakpaul/sd-model-finetuned-lora-t4) contains [LoRA fine-tuned update matrices](https://huggingface.co/sayakpaul/sd-model-finetuned-lora-t4/blob/main/pytorch_lora_weights.bin) -which is only 3 MBs in size. During inference, the pre-trained Stable Diffusion checkpoints loaded alongside these update +which is only 3 MBs in size. During inference, the pre-trained Stable Diffusion checkpoints are loaded alongside these update matrices and then they are combined to run inference. Inference for DreamBooth training remains the same. Check diff --git a/examples/text_to_image/README.md b/examples/text_to_image/README.md index 9d7cbdf30d34..31b00e943241 100644 --- a/examples/text_to_image/README.md +++ b/examples/text_to_image/README.md @@ -191,7 +191,7 @@ image.save("pokemon.png") For faster training on TPUs and GPUs you can leverage the flax training example. Follow the instructions above to get the model and dataset before running the script. -**___Note: The flax example don't yet support features like gradient checkpoint, gradient accumulation etc, so to use flax for faster training we will need >30GB cards.___** +**___Note: The flax example doesn't yet support features like gradient checkpoint, gradient accumulation etc, so to use flax for faster training we will need >30GB cards or TPU v3.___** Before running the scripts, make sure to install the library's training dependencies: From 72814aa9959ef57d1fda6a1c0180d5607482860d Mon Sep 17 00:00:00 2001 From: Sayak Paul Date: Wed, 25 Jan 2023 09:40:34 +0530 Subject: [PATCH 3/5] apply PR suggestions. --- docs/source/en/training/lora.mdx | 39 +++++++++++++++++++++++++++++--- 1 file changed, 36 insertions(+), 3 deletions(-) diff --git a/docs/source/en/training/lora.mdx b/docs/source/en/training/lora.mdx index bc11e7aa99ad..ca2536989b09 100644 --- a/docs/source/en/training/lora.mdx +++ b/docs/source/en/training/lora.mdx @@ -22,14 +22,19 @@ to existing weights and **only** training those newly added weights. 
This has a - Previous pretrained weights are kept frozen so that the model is not so prone to [catastrophic forgetting](https://www.pnas.org/doi/10.1073/pnas.1611835114). - Rank-decomposition matrices have significantly fewer parameters than the original model, which means that trained LoRA weights are easily portable. -- LoRA attention layers allow to control to which extent the model is adapted toward new training images via a `scale` parameter. +- LoRA matrices are generally added to the attention layers of the original model and they control to control to which extent the model is adapted toward new training images via a `scale` parameter. + +**__Note that the usage of LoRA is not limited to only attention layers. In the original LoRA work, the authors found out that just ammending +the attention layers of a language model is sufficient to obtain good downstream performance with great efficiency. This is why, it's common +to just add the LoRA weights to the attention layers of a model.__** [cloneofsimo](https://github.com/cloneofsimo) was the first to try out LoRA training for Stable Diffusion in the popular [lora](https://github.com/cloneofsimo/lora) GitHub repository. LoRA allows us to achieve greater memory efficiency since the pretrained weights are kept frozen and only the LoRA weights are trained, thereby -allowing us to run fine-tuning on consumer GPUs like Tesla T4, RTX 3080 or even RTX 2080 Ti! +allowing us to run fine-tuning on consumer GPUs like Tesla T4, RTX 3080 or even RTX 2080 Ti! One can get access to GPUs like T4 in the free +tiers of Kaggle Kernels and Google Colab Notebooks. @@ -73,6 +78,9 @@ accelerate launch train_dreambooth_lora.py \ --push_to_hub ``` +A similar process can be followed to fully fine-tune Stable Diffusion on a custom dataset using the +`examples/text_to_image/train_text_to_image_lora.py` script. + Refer to the respective examples linked above to learn more. @@ -111,15 +119,40 @@ pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4", pipe.unet.load_attn_procs(model_path) pipe.to("cuda") -prompt = "A pokemon with green eyes and red legs." +prompt = "A pokemon with blue eyes." image = pipe(prompt, num_inference_steps=30, guidance_scale=7.5).images[0] image.save("pokemon.png") ``` +Here are some example images you can expect: + +
+ +
+ [`sayakpaul/sd-model-finetuned-lora-t4`](https://huggingface.co/sayakpaul/sd-model-finetuned-lora-t4) contains [LoRA fine-tuned update matrices](https://huggingface.co/sayakpaul/sd-model-finetuned-lora-t4/blob/main/pytorch_lora_weights.bin) which is only 3 MBs in size. During inference, the pre-trained Stable Diffusion checkpoints are loaded alongside these update matrices and then they are combined to run inference. + + +You can use the [`huggingface_hub`](https://github.com/huggingface/huggingface_hub) library to retrieve the base model +from [`sayakpaul/sd-model-finetuned-lora-t4`](https://huggingface.co/sayakpaul/sd-model-finetuned-lora-t4) like so: + +```py +from huggingface_hub.repocard import RepoCard + +card = RepoCard.load("sayakpaul/sd-model-finetuned-lora-t4") +base_model = card.data.to_dict()["base_model"] +# 'CompVis/stable-diffusion-v1-4' +``` + +And then you can use `pipe = StableDiffusionPipeline.from_pretrained(base_model, torch_dtype=torch.float16)`. + +This is especially useful when you don't want to hardcode the base model identifier during initializing the `StableDiffusionPipeline`. + + + Inference for DreamBooth training remains the same. Check [this section](https://github.com/huggingface/diffusers/tree/main/examples/dreambooth#inference-1) for more details. From 7f23db648306493da83213e1cf3fedecd79a63de Mon Sep 17 00:00:00 2001 From: Sayak Paul Date: Wed, 25 Jan 2023 13:42:55 +0530 Subject: [PATCH 4/5] Apply suggestions from code review Co-authored-by: Pedro Cuenca --- docs/source/en/training/lora.mdx | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/source/en/training/lora.mdx b/docs/source/en/training/lora.mdx index ca2536989b09..c0bad7c7035d 100644 --- a/docs/source/en/training/lora.mdx +++ b/docs/source/en/training/lora.mdx @@ -22,9 +22,9 @@ to existing weights and **only** training those newly added weights. This has a - Previous pretrained weights are kept frozen so that the model is not so prone to [catastrophic forgetting](https://www.pnas.org/doi/10.1073/pnas.1611835114). - Rank-decomposition matrices have significantly fewer parameters than the original model, which means that trained LoRA weights are easily portable. -- LoRA matrices are generally added to the attention layers of the original model and they control to control to which extent the model is adapted toward new training images via a `scale` parameter. +- LoRA matrices are generally added to the attention layers of the original model and they control to which extent the model is adapted toward new training images via a `scale` parameter. -**__Note that the usage of LoRA is not limited to only attention layers. In the original LoRA work, the authors found out that just ammending +**__Note that the usage of LoRA is not just limited to attention layers. In the original LoRA work, the authors found out that just amending the attention layers of a language model is sufficient to obtain good downstream performance with great efficiency. This is why, it's common to just add the LoRA weights to the attention layers of a model.__** From 76562ea4763bb51e4b9c92b06a3806291d8433f6 Mon Sep 17 00:00:00 2001 From: Sayak Paul Date: Wed, 25 Jan 2023 13:45:19 +0530 Subject: [PATCH 5/5] remove visually incoherent elements. 
--- docs/source/en/training/lora.mdx | 6 ------ 1 file changed, 6 deletions(-) diff --git a/docs/source/en/training/lora.mdx b/docs/source/en/training/lora.mdx index c0bad7c7035d..e863e9d56d86 100644 --- a/docs/source/en/training/lora.mdx +++ b/docs/source/en/training/lora.mdx @@ -126,16 +126,12 @@ image.save("pokemon.png") Here are some example images you can expect: -
-
[`sayakpaul/sd-model-finetuned-lora-t4`](https://huggingface.co/sayakpaul/sd-model-finetuned-lora-t4) contains [LoRA fine-tuned update matrices](https://huggingface.co/sayakpaul/sd-model-finetuned-lora-t4/blob/main/pytorch_lora_weights.bin) which is only 3 MBs in size. During inference, the pre-trained Stable Diffusion checkpoints are loaded alongside these update matrices and then they are combined to run inference. - - You can use the [`huggingface_hub`](https://github.com/huggingface/huggingface_hub) library to retrieve the base model from [`sayakpaul/sd-model-finetuned-lora-t4`](https://huggingface.co/sayakpaul/sd-model-finetuned-lora-t4) like so: @@ -151,8 +147,6 @@ And then you can use `pipe = StableDiffusionPipeline.from_pretrained(base_model, This is especially useful when you don't want to hardcode the base model identifier during initializing the `StableDiffusionPipeline`. - - Inference for DreamBooth training remains the same. Check [this section](https://github.com/huggingface/diffusers/tree/main/examples/dreambooth#inference-1) for more details.
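Putting the snippets from this patch series together, end-to-end inference with the LoRA weights can be sketched as follows. This recap only combines calls already shown above (`RepoCard.load` and `load_attn_procs`), and it assumes the model card of the LoRA repository declares a `base_model` field:

```py
import torch
from diffusers import StableDiffusionPipeline
from huggingface_hub.repocard import RepoCard

lora_model_id = "sayakpaul/sd-model-finetuned-lora-t4"

# Resolve the base checkpoint from the LoRA repo's model card instead of
# hardcoding it (assumes the card declares a `base_model` field).
card = RepoCard.load(lora_model_id)
base_model_id = card.data.to_dict()["base_model"]  # 'CompVis/stable-diffusion-v1-4'

pipe = StableDiffusionPipeline.from_pretrained(base_model_id, torch_dtype=torch.float16)
pipe.unet.load_attn_procs(lora_model_id)  # loads the ~3 MB LoRA update matrices
pipe.to("cuda")

prompt = "A pokemon with blue eyes."
image = pipe(prompt, num_inference_steps=30, guidance_scale=7.5).images[0]
image.save("pokemon.png")
```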