[LoRA] Discussions on ensuring robust LoRA support in Diffusers #3620
Comments
Thanks a lot for the great summary! I agree with all the points except for 7., where I'd like to wait a bit since I don't (yet) see the importance of having multiple LoRAs loaded into the model at once. Let me open a quick draft PR for 6. and link it to #3480 |
Same. |
I have been experimenting with SD 2.1 fine-tuning, and my results show that tuning the text encoder is pretty important but also dangerous: it is easily pushed over a numeric cliff into "catastrophic forgetting" territory. I have mostly started freezing all but the last 4 to 7 layers of OpenCLIP during fine-tuning, and I've supplied about 30,000 image-caption pairs of high-quality images and captions, done by hand by human volunteers. The learning rate matters far more for the text encoder than for the unet, and with the current Diffusers training code we'd need to wrap the optimizer in a class that emulates a single target while using two learning rate schedulers internally, so that the text encoder can be trained more slowly than the unet. My current workarounds are: freezing the TE; stopping its training about 25% of the way through the final run (25k steps out of 100k); and providing a large amount of "balance" images (the highest-quality human photos in the dataset) fed at a 20% rate compared to the real training data. This ensured the least forgetting at the end, with the highest-quality results. Additionally, I've added the patched betas scheduler into my training scripts @ bghira/SimpleTuner, which I derived from these examples. This enabled "enforced terminal SNR", which drastically improved the perceived quality of the outputs. |
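A minimal sketch of the setup described above, assuming SD 2.1-style components loaded via Diffusers; the model id, the "last 4 layers" choice, and the learning rates are illustrative, not the commenter's actual configuration:

```python
import torch
from diffusers import UNet2DConditionModel
from transformers import CLIPTextModel

# Illustrative components; in a real training script these come from the pipeline being fine-tuned.
unet = UNet2DConditionModel.from_pretrained("stabilityai/stable-diffusion-2-1", subfolder="unet")
text_encoder = CLIPTextModel.from_pretrained("stabilityai/stable-diffusion-2-1", subfolder="text_encoder")

# Freeze the whole text encoder, then unfreeze only its last 4 transformer layers.
text_encoder.requires_grad_(False)
for layer in text_encoder.text_model.encoder.layers[-4:]:
    layer.requires_grad_(True)

# One optimizer with two parameter groups so the text encoder trains more slowly than the UNet;
# a single LR scheduler then scales both group learning rates proportionally.
optimizer = torch.optim.AdamW(
    [
        {"params": unet.parameters(), "lr": 1e-4},
        {"params": [p for p in text_encoder.parameters() if p.requires_grad], "lr": 1e-5},
    ]
)
```

Two fully independent schedules, as the comment suggests, would still need a wrapper or two optimizers, but per-group learning rates already get most of the way there.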
Thanks for sharing these insights! Feel free to also share a link to your repo and some visual results. |
We are training custom LoRAs for different characters in a story. So while many LoRA users will just be applying a global style, this is super important for us. |
Thanks for the answer @jelling, so I guess the important part is to be able to quickly set/unset different LoRAs, no? (See diffusers/src/diffusers/loaders.py, line 781 at cd9d091.)
Would the following solution be ok for your case @jelling?

```python
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import _get_model_file

lora_repo_ids = []  # list all LoRA repo ids here

pipe = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16)

def load_state_dict(repo_id):
    model_file = _get_model_file(repo_id, "pytorch_lora_weights.bin")
    state_dict = torch.load(model_file, map_location="cpu")
    return state_dict

# make sure all LoRA weights are in RAM
lora_state_dicts = {k: load_state_dict(k) for k in lora_repo_ids}

pipe.load_lora_weights(lora_state_dicts["<first-character>"])
pipe(...)
pipe.load_lora_weights(lora_state_dicts["<second-character>"])
```

This way should be pretty efficient. Would this work for you? |
I want to advocate for step 7. Many people are mixing LoRAs in A1111 with very interesting results. It would be nice to have many loaded at one time and pass in weights during inference. You could set the weights to zero for LoRAs that you did not want activated. |
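A rough illustration of the weighted mixing being asked for, sketched with plain tensors rather than any existing Diffusers API (all names here are hypothetical):

```python
import torch

def combine_lora_deltas(base_weight, loras, weights):
    """Add several low-rank LoRA updates to one base weight.

    loras:   list of (down, up) factor pairs, one per LoRA.
    weights: per-LoRA scalars; a weight of 0.0 effectively deactivates that LoRA.
    """
    merged = base_weight.clone()
    for (down, up), w in zip(loras, weights):
        merged += w * (up @ down)
    return merged

# Toy example: one 320x320 projection, two rank-4 LoRAs mixed 0.7 / 0.3.
base = torch.randn(320, 320)
character_a = (torch.randn(4, 320), torch.randn(320, 4))
character_b = (torch.randn(4, 320), torch.randn(320, 4))
mixed = combine_lora_deltas(base, [character_a, character_b], weights=[0.7, 0.3])
```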
@patrickvonplaten this is what I'm trying to accomplish but with diffusers. In your example above, it looks like you are showing how to quickly switch between LoRAs. This is a good feature - and one I was curious about - but we need to run multiple LoRAs at the same time on the same inference. I.e. two character LoRAs would be used to generate a single image. |
This is something we're actively watching. Upon sufficient request, we'll start brainstorming about it or might even rely on |
Congrats for supporting A1111 LoRA format 👏 |
@sayakpaul could you tell me anything about what's entailed in adding support? It's important enough for us that we might try adding support ourselves, if it comes to that. I haven't gone through the multi-LoRA section of automatic1111 yet, but is there something about how they do it that's incompatible with the general diffusers way of doing this? |
I think the main bottleneck is around the design, i.e., IIUC, they merge the LoRA weights into the UNet. This is not how we do it in Diffusers. We make use of specific attention processor classes so that we can unload a LoRA and carry on. With the weight-merging design, it's relatively simpler, but with our attention processor design, we need to be careful. |
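To make that distinction concrete, here is a small framework-agnostic sketch (hypothetical tensors, not Diffusers internals): merging bakes the LoRA delta into the layer weight, so unloading means subtracting it back or restoring a saved copy, whereas keeping the delta in a separate module, as the attention processors do, lets you simply stop applying it:

```python
import torch

dim, rank, scale = 320, 4, 1.0
base = torch.randn(dim, dim)                      # original layer weight
down, up = torch.randn(rank, dim), torch.randn(dim, rank)

# A1111-style: merge the delta into the weight itself.
merged = base + scale * (up @ down)
restored = merged - scale * (up @ down)           # unloading requires undoing the merge

# Processor-style: keep the delta separate and add it only while the LoRA is active.
def project(x, use_lora=True):
    weight = base + scale * (up @ down) if use_lora else base
    return x @ weight.T
```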
Here is some pseudo code representing the way I wish LoRAs would work. Just an idea. I thought code would be better for explaining my thoughts.
In my imagination they are also very fast, slowing down inference by at most 10%. |
This should be possible with |
Closing this since we introspected quite a bit. Thread on supporting multiple LoRAs will be a different one. |
What issue should we watch? Is there a different thread currently regarding multiple LoRA support? |
Not yet. Will begin soon. |
For the last few months, we have been collaborating with our contributors to ensure we support LoRA effectively and efficiently from Diffusers:
1. Training support
✅ DreamBooth (letting users perform LoRA fine-tuning of both UNet and text-encoder). There were some issues in the text encoder part which are now being fixed in #3437. Thanks to @takuma104.
✅ Vanilla text-to-image fine-tuning. We support only the fine-tuning of the UNet with LoRA purposefully, since here we'd assume that the number of image-caption pairs is higher than what is typically used for DreamBooth and therefore, text encoder fine-tuning is probably overkill.
2. Interoperability
With #3437, we're introducing limited support for loading A1111 CivitAI checkpoints with `pipeline.load_lora_weights()`. This has been a widely requested feature (see #3064 as an example). We do provide a `convert_lora_safetensor_to_diffusers.py` script as well that allows for converting A1111 LoRA checkpoints (potentially non-exhaustive) and merging them into the text encoder and the UNet of a `DiffusionPipeline`. However, this doesn't allow switching the attention processor back to the default one, unlike how it's currently done in Diffusers. Check out https://huggingface.co/docs/diffusers/main/en/training/lora for more details. For inference-only and definitive workflows (where one doesn't need to switch attention processors), it caters to many use cases.
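For instance, loading an A1111-format checkpoint from a local file could look roughly like this (the file name below is a placeholder):

```python
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Hypothetical A1111/CivitAI-style LoRA file sitting in the current directory.
pipe.load_lora_weights(".", weight_name="my_a1111_lora.safetensors")
image = pipe("a character in the LoRA's style").images[0]
```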
3. xformers support for efficient inference
Once LoRA parameters are loaded into a pipeline, xformers should work seamlessly. There was apparently a problem with that, and it's fixed in #3556.
4. PT 2.0 SDPA optimization
See: #3594
5. `torch.compile()` compatibility with LoRA
Once 4. is settled, we should be able to take advantage of `torch.compile()`.
6. Introduction of `scale` for controlling the contribution from the text encoder LoRA
See #3480. We already support passing `scale` as a part of `cross_attention_kwargs` for the UNet LoRA (see the short combined usage sketch after this list, which also covers items 3 and 5).
7. Supporting multiple LoRAs
@takuma104 proposed a hook-based design here: #3064 (comment)
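Pulling the inference-side items (3, 5, and 6) together, here is a short sketch; the LoRA repo id and the scale value are only examples, and in practice one would typically pick either xformers or the PT 2.0 SDPA path from item 4:

```python
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
pipe.load_lora_weights("sayakpaul/sd-model-finetuned-lora-t4")  # example LoRA repo

# 3. xformers memory-efficient attention
pipe.enable_xformers_memory_efficient_attention()

# 5. compile the UNet with PyTorch 2.0
pipe.unet = torch.compile(pipe.unet)

# 6. scale the LoRA contribution at inference time via cross_attention_kwargs
image = pipe(
    "a pokemon with blue eyes",
    cross_attention_kwargs={"scale": 0.5},
).images[0]
```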
I hope this helps to provide a consolidated view of where we're at regarding supporting LoRA from Diffusers.
Cc: @pcuenca @patrickvonplaten