run train_dreambooth_lora.py failed with accelerate #3284
Comments
I got the same error. It also happens with accelerate, and when I run with
|
I'm not sure why the error occurs with accelerate, but it can be fixed by modifying the following lines of `class AttnProcsLayers(torch.nn.Module)` here:

```python
class AttnProcsLayers(torch.nn.Module):
    def __init__(self, state_dict: Dict[str, torch.Tensor]):
        super().__init__()
        self.layers = torch.nn.ModuleList(state_dict.values())
        self.mapping = dict(enumerate(state_dict.keys()))
        self.rev_mapping = {v: k for k, v in enumerate(state_dict.keys())}

        # we add a hook to state_dict() and load_state_dict() so that the
        # naming fits with `unet.attn_processors`
        def map_to(module, state_dict, *args, **kwargs):
            new_state_dict = {}
            for key, value in state_dict.items():
                layer_index = 2 if 'module' in key else 1  # you should add this line
                num = int(key.split(".")[layer_index])  # 0 is always "layers"
                new_key = key.replace(f"layers.{num}", module.mapping[num])
                new_state_dict[new_key] = value

            return new_state_dict
```

This is because the function is called from here when we run the code with accelerate:

```python
if accelerator.is_main_process:
    save_path = os.path.join(args.output_dir, f"checkpoint-{global_step}")
    # We combine the text encoder and UNet LoRA parameters with a simple
    # custom logic. `accelerator.save_state()` won't know that. So,
    # use `LoraLoaderMixin.save_lora_weights()`.
    LoraLoaderMixin.save_lora_weights(
        save_directory=save_path,
        unet_lora_layers=unet_lora_layers,
        text_encoder_lora_layers=text_encoder_lora_layers,
    )
```

and `map_to` should be:

```python
def map_to(module, state_dict, *args, **kwargs):
    new_state_dict = {}
    for key, value in state_dict.items():
        # num = int(key.split(".")[layer_index])  # 0 is always "layers"
        # new_key = key.replace(f"layers.{num}", module.mapping[num])
        if 'module' in key:
            num = int(key.split(".")[2])
            replace_key = f"module.layers.{num}"
        else:
            num = int(key.split(".")[1])
            replace_key = f"layers.{num}"
        new_key = key.replace(replace_key, module.mapping[num])
        new_state_dict[new_key] = value

    return new_state_dict
```

so you can load
|
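For context on why that index shifts: when accelerate wraps the model, for example with DDP or a DeepSpeed engine, every `state_dict()` key gains a leading `module.` segment, so the layer number moves from position 1 to position 2. A minimal sketch illustrating this, not taken from the training script:

```python
# Minimal sketch (not from the training script): wrapping a model prefixes its
# state_dict keys with "module.", which shifts the position of the layer index
# that map_to() parses out of each key.
import torch

holder = torch.nn.Module()
holder.layers = torch.nn.ModuleList([torch.nn.Linear(4, 4)])
print(list(holder.state_dict().keys()))
# ['layers.0.weight', 'layers.0.bias']        -> key.split(".")[1] == "0"

wrapped = torch.nn.DataParallel(holder)  # DDP / DeepSpeed wrap the model similarly
print(list(wrapped.state_dict().keys()))
# ['module.layers.0.weight', 'module.layers.0.bias']  -> key.split(".")[2] == "0"
```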
These are the errors:

```
Traceback (most recent call last):
  File "/data/diffusers/examples/dreambooth/train_dreambooth_lora.py", line 1093, in <module>
    main(args)
  File "/data/diffusers/examples/dreambooth/train_dreambooth_lora.py", line 1048, in main
    pipeline.load_lora_weights(args.output_dir)
  File "/data/diffusers/src/diffusers/loaders.py", line 851, in load_lora_weights
    self.unet.load_attn_procs(unet_lora_state_dict)
  File "/data/diffusers/src/diffusers/loaders.py", line 305, in load_attn_procs
    self.set_attn_processor(attn_processors)
  File "/data/diffusers/src/diffusers/models/unet_2d_condition.py", line 533, in set_attn_processor
    fn_recursive_attn_processor(name, module, processor)
  File "/data/diffusers/src/diffusers/models/unet_2d_condition.py", line 530, in fn_recursive_attn_processor
    fn_recursive_attn_processor(f"{name}.{sub_name}", child, processor)
  File "/data/diffusers/src/diffusers/models/unet_2d_condition.py", line 530, in fn_recursive_attn_processor
    fn_recursive_attn_processor(f"{name}.{sub_name}", child, processor)
  File "/data/diffusers/src/diffusers/models/unet_2d_condition.py", line 530, in fn_recursive_attn_processor
    fn_recursive_attn_processor(f"{name}.{sub_name}", child, processor)
  [Previous line repeated 3 more times]
  File "/data/diffusers/src/diffusers/models/unet_2d_condition.py", line 527, in fn_recursive_attn_processor
    module.set_processor(processor.pop(f"{name}.processor"))
KeyError: 'down_blocks.0.attentions.0.transformer_blocks.0.attn1.processor'
Steps: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 500/500
```
|
I told you that you should also edit `replace_key`; check my last code block. |
@SeunghyunSEO I mean another error occurred. |
That's the error I encountered without this code:

```python
def map_to(module, state_dict, *args, **kwargs):
    new_state_dict = {}
    for key, value in state_dict.items():
        # num = int(key.split(".")[layer_index])  # 0 is always "layers"
        # new_key = key.replace(f"layers.{num}", module.mapping[num])
        if 'module' in key:
            num = int(key.split(".")[2])
            replace_key = f"module.layers.{num}"
        else:
            num = int(key.split(".")[1])
            replace_key = f"layers.{num}"
        new_key = key.replace(replace_key, module.mapping[num])
        new_state_dict[new_key] = value

    return new_state_dict
```

Please check the dictionary keys of the LoRA weights with
|
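For anyone who wants to run that key check, a minimal sketch follows; the checkpoint path and the weight file name are assumptions, so adjust them to your run:

```python
# Minimal sketch, assuming the default weight file written by
# LoraLoaderMixin.save_lora_weights(); point the path at your own checkpoint.
import torch

state_dict = torch.load(
    "output_dir/checkpoint-500/pytorch_lora_weights.bin", map_location="cpu"
)
for key in list(state_dict.keys())[:10]:
    print(key)
# Keys saved from a DDP/DeepSpeed-wrapped model may carry a leftover "module."
# segment, which is what later breaks the processor-name lookup on load.
```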
We need to be able to reproduce this error. I conducted some training runs with the following setups. First, I installed diffusers from source, and then launched the following runs:

```bash
accelerate launch train_dreambooth_lora.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --instance_data_dir=$INSTANCE_DIR \
  --output_dir=$OUTPUT_DIR \
  --instance_prompt="a photo of sks dog" \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=1 \
  --checkpointing_steps=100 \
  --learning_rate=1e-4 \
  --report_to="wandb" \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --max_train_steps=500 \
  --validation_prompt="A photo of sks dog in a bucket" \
  --validation_epochs=50 \
  --seed="0" \
  --push_to_hub
```

Final checkpoints: https://huggingface.co/sayakpaul/dog-test-lora.

```bash
accelerate launch train_dreambooth_lora.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --instance_data_dir=$INSTANCE_DIR \
  --output_dir=$OUTPUT_DIR \
  --instance_prompt="a photo of sks dog" \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=1 \
  --checkpointing_steps=100 \
  --learning_rate=1e-4 \
  --report_to="wandb" \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --max_train_steps=500 \
  --validation_prompt="A photo of sks dog in a bucket" \
  --validation_epochs=50 \
  --seed="0" \
  --train_text_encoder \
  --push_to_hub
```

Final checkpoints: https://huggingface.co/sayakpaul/dreambooth-text-encoder-test.

I was able to conduct the above training runs without any failures. What am I missing out on?
|
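As a side note, a quick way to sanity-check either of those checkpoints after training is to load them through the same call path that fails in the traceback above (`load_lora_weights` -> `load_attn_procs` -> `set_attn_processor`). A minimal sketch, where the base model id and output path are assumptions:

```python
# Minimal loading check; the base model id and local paths are assumptions,
# not taken from the thread.
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
# Either a Hub repo id (e.g. the checkpoints linked above) or a local
# output_dir / checkpoint-XXX folder works here.
pipe.load_lora_weights("sayakpaul/dog-test-lora")
image = pipe("A photo of sks dog in a bucket", num_inference_steps=25).images[0]
image.save("sks_dog.png")
```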
@sayakpaul |
I installed the main branch of diffusers from source. |
For |
I'm running into the same thing with a single GPU using DeepSpeed. The above fix worked for me in that it prevented the initial error, but I get the same error as @webliupeng in
If I put a print statement in
|
Could you try updating your local clone? As mentioned in #3284 (comment), I need to be able to reproduce the error minimally.
@SeunghyunSEO, could you also update your local clone? |
I did a git pull to efc48da and the issue persists. As for reproducing it, I downloaded the example dog dataset from the DreamBooth training example README here, enabled DeepSpeed per the same doc here, and ran the training script from the same section (using default_config.yaml.gz). The fix still allows the LoRAs to be saved, though still with the incorrect keys. I'm happy to provide system info if any of that is useful. |
What happens when you disable DeepSpeed? Also, which method are you using to load the LoRA parameters obtained after training? |
It completes without error with the correct keys.
The error on load I referenced is from the end of
|
Okay, so, probably narrowing it down -- it pops up after using DeepSpeed. Ccing @williamberman here. |
Note here that DeepSpeed is still very much experimental and we probably won't have time to look into this more specifically. More than happy to review a PR though |
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored. |
No DeepSpeed, and the issue persists. Pain in the behind, because every time I step away for a while, I end up losing the hacked-in fix and have to rediscover all this crap. |
Hi @markrmiller. I can understand the frustration. If you happen to know any fixes that have worked for you, happy to review any PR from you so that we can get it fixed. But currently, we stand here: #3284 (comment). |
Unfortunately, I don't know what's behind it. I'm not using DeepSpeed, though, and never have. I've also tried a large mix of accelerate and diffusers versions, and am currently on the latest source builds for both. I'm using the latest dreambooth-lora script - it fails as soon as it tries to save a checkpoint. I can add that the previous fix above didn't actually work for me this time, and I had to expand it to this:
So it's related to these key mappings, but why I see it and someone else doesn't, I have no clue. As mentioned, if I don't run with accelerate, it doesn't happen. |
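For readers hitting the same wall, a generic key normalization along the lines of the fixes earlier in the thread might look like the sketch below. This is an illustration only, not the expanded fix the previous comment refers to, and the helper name is hypothetical:

```python
from typing import Dict


def normalize_lora_key(key: str, mapping: Dict[int, str]) -> str:
    """Hypothetical helper: map an AttnProcsLayers state_dict key back to the
    attention-processor name, tolerating an optional wrapper prefix."""
    # Strip a leading "module." added by DDP/DeepSpeed-style wrapping, if present.
    plain = key[len("module."):] if key.startswith("module.") else key
    # plain now looks like "layers.<n>.<param path>"; <n> indexes into `mapping`.
    num = int(plain.split(".")[1])
    return plain.replace(f"layers.{num}", mapping[num])
```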
The checkpointing part works fine with that script with regular accelerate. This I can confirm because I ran it yesterday. |
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored. |
Describe the bug
Thanks for this awesome project!
When I run the script "train_dreambooth_lora.py" without accelerate, it works fine. But when I use accelerate launch, it fails when the number of steps reaches "checkpointing_steps".
I am running the script in Docker with 4 * 3090 vGPUs. I ran accelerate test, and it succeeded.
I am new to this and would appreciate any guidance or suggestions you can offer.
Reproduction
Logs
System Info
diffusers version: 0.17.0.dev0
Platform: Linux-5.4.0-146-generic-x86_64-with-glibc2.31
Python version: 3.10.9
PyTorch version (GPU?): 2.0.0+cu117 (True)
Huggingface_hub version: 0.14.0
Transformers version: 4.25.1
Accelerate version: 0.18.0
xFormers version: 0.0.19
Using GPU in script?:
Using distributed or parallel set-up in script?:
Accelerate default config:
- compute_environment: LOCAL_MACHINE
- distributed_type: MULTI_GPU
- mixed_precision: no
- use_cpu: False
- num_processes: 4
- machine_rank: 0
- num_machines: 1
- gpu_ids: all
- rdzv_backend: static
- same_network: True
- main_training_function: main
- downcast_bf16: no
- tpu_use_cluster: False
- tpu_use_sudo: False
- tpu_env: []