Training example of controlNet yield error #3101

Closed
svjack opened this issue Apr 14, 2023 · 8 comments
Labels: bug (Something isn't working), stale (Issues that haven't received updates)

Comments

svjack commented Apr 14, 2023

Describe the bug

I am trying to train a ControlNet on my dataset, https://huggingface.co/datasets/svjack/diffusiondb_100_canny_zh, with a low-GPU-memory configuration, as follows.

Reproduction

export MODEL_DIR="IDEA-CCNL/Taiyi-Stable-Diffusion-1B-Chinese-v0.1"
export OUTPUT_DIR="TSD_save"

accelerate launch train_controlnet.py \
 --pretrained_model_name_or_path=$MODEL_DIR \
 --output_dir=$OUTPUT_DIR \
 --dataset_name=svjack/diffusiondb_100_canny_zh \
 --resolution=512 \
 --learning_rate=1e-5 \
 --train_batch_size=1 \
 --gradient_accumulation_steps=1 \
 --gradient_checkpointing \
 --use_8bit_adam \
 --tracker_project_name canny  \
 --set_grads_to_none \
 --conditioning_image_column guide \
 --caption_column zh_text \
 --mixed_precision fp16

Logs

Traceback (most recent call last):
  File "train_controlnet.py", line 1051, in <module>
    main(args)
  File "train_controlnet.py", line 970, in main
    return_dict=False,
  File "/environment/miniconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/environment/miniconda3/lib/python3.7/site-packages/accelerate/utils/operations.py", line 495, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "/environment/miniconda3/lib/python3.7/site-packages/torch/amp/autocast_mode.py", line 14, in decorate_autocast
    return func(*args, **kwargs)
  File "/environment/miniconda3/lib/python3.7/site-packages/diffusers/models/controlnet.py", line 519, in forward
    sample += controlnet_cond
RuntimeError: The size of tensor a (85) must match the size of tensor b (86) at non-singleton dimension 2
Steps:  88%|██████████████████████████████████████████████████████████████████████████████████████████▋            | 88/100 [00:44<00:06,  2.00it/s, loss=0.013, lr=1e-5]
Traceback (most recent call last):
  File "/environment/miniconda3/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/environment/miniconda3/lib/python3.7/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
    args.func(args)
  File "/environment/miniconda3/lib/python3.7/site-packages/accelerate/commands/launch.py", line 923, in launch_command
    simple_launcher(args)
  File "/environment/miniconda3/lib/python3.7/site-packages/accelerate/commands/launch.py", line 579, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/environment/miniconda3/bin/python', 'train_controlnet.py', '--pretrained_model_name_or_path=Taiyi-Stable-Diffusion-1B-Chinese-v0.1', '--output_dir=TSD_save', '--dataset_name=svjack/diffusiondb_100_canny_zh', '--resolution=512', '--learning_rate=1e-5', '--train_batch_size=1', '--gradient_accumulation_steps=1', '--gradient_checkpointing', '--use_8bit_adam', '--tracker_project_name', 'canny', '--set_grads_to_none', '--mixed_precision', 'fp16']' returned non-zero exit status 1.

System Info

Latest version of diffusers, Python 3.7, on an A4000 GPU.

svjack added the bug label Apr 14, 2023
sayakpaul (Member) commented

If the same training script runs with runwayml/stable-diffusion-v1-5 as the base model, I suspect the model you are providing to pretrained_model_name_or_path is causing this issue.

Ccing @williamberman and @yiyixuxu.

williamberman self-assigned this Apr 17, 2023
williamberman (Contributor) commented

I believe the default resizing code in the training script is not resizing to a multiple of 8, which causes the encoded image to have different height/width dimensions than the encoded conditioning image (which uses a separate encoder that's part of the ControlNet model).
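
The 85 vs. 86 mismatch in the traceback is consistent with this explanation. The sketch below reproduces the arithmetic under assumptions that are not stated in the thread: that the VAE encoder downsamples three times using a one-sided pad of 1 followed by a 3x3 stride-2 convolution without padding, and that the ControlNet conditioning encoder downsamples three times using 3x3 stride-2 convolutions with padding 1. The function names and the 684-pixel example height are hypothetical; the actual image sizes in the dataset are not given.

```python
def conv_out(size, kernel=3, stride=2, padding=0):
    """Standard output-size formula for a convolution along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


def vae_latent_size(size, num_downsamples=3):
    # Assumption: each VAE downsample pads one extra pixel on one side only,
    # then applies a 3x3 stride-2 convolution with no padding (rounds down).
    for _ in range(num_downsamples):
        size = conv_out(size + 1, padding=0)
    return size


def cond_embedding_size(size, num_downsamples=3):
    # Assumption: each conditioning-encoder downsample is a 3x3 stride-2
    # convolution with symmetric padding of 1 (rounds up).
    for _ in range(num_downsamples):
        size = conv_out(size, padding=1)
    return size


for height in (512, 684):
    latent = vae_latent_size(height)
    cond = cond_embedding_size(height)
    status = "ok" if latent == cond else "mismatch"
    print(f"height={height}: latent={latent}, cond={cond} ({status})")
```

Under these assumptions the two encoders agree whenever the side length is a multiple of 8 and differ by one otherwise, which is why `sample += controlnet_cond` can fail only for some images in a dataset.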

svjack (Author) commented Apr 18, 2023

> I believe the default resizing code in the training script is not resizing to a multiple of 8 causing the encoded image to have different height/width dimensions than the encoded conditioning image (which uses a separate encoder that's part of the controlnet model)

I will check this.

williamberman (Contributor) commented

@svjack I think you took your dataset off the hub so I can't test 😁

svjack (Author) commented Apr 19, 2023

> @svjack I think you took your dataset off the hub so I can't test 😁

I tried the code from your fork of the diffusers main branch, and it failed. However, after I resized all my images to (512, 512), it works.
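
The workaround described here, making every image exactly 512x512 before training, can be computed ahead of time. The following is a minimal sketch, not the training script's own preprocessing: it resizes the shorter side to the target resolution and then center-crops the longer side. The helper name is hypothetical, and the returned box uses the (left, upper, right, lower) convention so it could be passed to an image library's crop call.

```python
def resize_then_center_crop(width, height, resolution=512):
    """Return (resized_size, crop_box) that yields a square image.

    resized_size is (width, height) after scaling the shorter side to
    `resolution`; crop_box is (left, upper, right, lower) within that
    resized image.
    """
    short = min(width, height)
    # Scale both sides by the same factor so the aspect ratio is preserved.
    new_w = max(resolution, round(width * resolution / short))
    new_h = max(resolution, round(height * resolution / short))
    # Center the square crop along the longer side.
    left = (new_w - resolution) // 2
    top = (new_h - resolution) // 2
    return (new_w, new_h), (left, top, left + resolution, top + resolution)


# Hypothetical portrait image of 768x1024 pixels.
size, box = resize_then_center_crop(768, 1024)
print(size, box)
```

Because the output side length is fixed at 512, a multiple of 8, both the image and the conditioning image end up with matching encoded dimensions, provided the same resize and crop are applied to each.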

williamberman (Contributor) commented

hey @svjack could you elaborate on what went wrong?

svjack (Author) commented Apr 19, 2023

> hey @svjack could you elaborate on what went wrong?

It may be the image_transforms part, as you said.

github-actions (Contributor) commented

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions bot added the stale label May 14, 2023