- This document provides instructions for fine-tuning the CogVideoX model.
- It supports both text-to-video and image-to-video generation.
- It supports both full fine-tuning and LoRA fine-tuning.
- Install the videotuna environment (see Installation).
- Download the CogvideoX checkpoints (see docs/checkpoints).
- Download the example training data.
You can download the data manually from this link, or via `wget`:

```shell
wget https://huggingface.co/datasets/Yingqing/VideoTuna-Datasets/resolve/main/apply_lipstick.zip
cd data
unzip apply_lipstick.zip -d apply_lipstick
```

Make sure the data is placed so that the metadata file is at `data/apply_lipstick/metadata.csv`.
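Before launching training, it can help to sanity-check the dataset layout. A minimal sketch, assuming the layout described above (the `check_dataset` helper is hypothetical, not part of VideoTuna):

```shell
# Hypothetical helper: verify the unpacked dataset contains the expected
# metadata.csv (layout assumed from the download steps above).
check_dataset() {
  local root="$1"
  if [ ! -f "${root}/metadata.csv" ]; then
    echo "missing ${root}/metadata.csv" >&2
    return 1
  fi
  echo "dataset ok: ${root}/metadata.csv"
}
```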
LoRA Fine-tuning of CogVideoX Text-to-Video:

- Run the commands in the terminal to launch training.

  ```shell
  bash shscripts/train_cogvideox_t2v_lora.sh
  ```

- After training, run the commands to run inference with your personalized model.

  ```shell
  bash shscripts/inference_cogvideo_t2v_lora.sh
  ```

  You need to provide the checkpoint path to the `ckpt` argument in the above shell script.

Note:
- The training and inference use the default model config from `configs/004_cogvideox/cogvideo5b.yaml`.
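After training finishes, the checkpoint path must be copied into the inference script by hand. A small sketch of a helper that prints the newest `.ckpt` file in a results directory (the helper name and directory layout are assumptions, not part of VideoTuna):

```shell
# Hypothetical helper: print the most recently modified .ckpt file under a
# directory, to paste into the ckpt argument of the inference script.
latest_ckpt() {
  ls -t "$1"/*.ckpt 2>/dev/null | head -n 1
}
```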
LoRA Fine-tuning of CogVideoX Image-to-Video:

- Run the commands in the terminal to launch training.

  ```shell
  bash shscripts/train_cogvideox_i2v_lora.sh
  ```

- After training, run the commands to run inference with your personalized model.

  ```shell
  bash shscripts/inference_cogvideo_i2v_lora.sh
  ```

  You need to provide the checkpoint path to the `ckpt` argument in the above shell script.

Note:
- The training and inference use the default model config from `configs/004_cogvideox/cogvideo5b-i2v.yaml`.
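Since the `ckpt` argument must point to an existing file, a quick pre-flight check can catch typos before a long job is launched. The path below is only a hypothetical example, not a path VideoTuna produces by default:

```shell
# Hypothetical example path; replace with your actual trained checkpoint.
ckpt="results/cogvideox_i2v_lora/checkpoints/last.ckpt"
if [ -f "${ckpt}" ]; then
  echo "ckpt found: ${ckpt}"
else
  echo "ckpt missing: ${ckpt}"
fi
```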
Full Fine-tuning of CogVideoX Text-to-Video:

- Run the commands in the terminal to launch training.

  ```shell
  bash shscripts/train_cogvideox_t2v_fullft.sh
  ```

  We tested on 4 H800 GPUs; the training requires 68GB of GPU memory.

- After training, run the commands to run inference with your personalized model.

  ```shell
  bash shscripts/inference_cogvideo_t2v_fullft.sh
  ```

  You need to provide the checkpoint path to the `ckpt` argument in the above shell script. Because the full fine-tuning uses DeepSpeed to reduce GPU memory, the checkpoint path looks like `${exp_save_dir}/checkpoints/trainstep_checkpoints/epoch=xxxxxx-step=xxxxxxxxx.ckpt/checkpoint/mp_rank_00_model_states.pt`.

Note:
- The training and inference use the default model config from `configs/004_cogvideox/cogvideo5b-i2v-fullft.yaml`.
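The DeepSpeed checkpoint path above can be assembled from the experiment save directory plus the epoch and step of your run. The values below are placeholders for illustration only; substitute your own run's numbers:

```shell
# Placeholder values; substitute your own run's save dir, epoch, and step.
exp_save_dir="results/train_cogvideox_t2v_fullft"
epoch="000004"
step="000001000"
ckpt="${exp_save_dir}/checkpoints/trainstep_checkpoints/epoch=${epoch}-step=${step}.ckpt/checkpoint/mp_rank_00_model_states.pt"
echo "${ckpt}"
```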
Full Fine-tuning of CogVideoX Image-to-Video:

Same as the full fine-tuning of text-to-video above.

- Training:

  ```shell
  bash shscripts/train_cogvideox_i2v_fullft.sh
  ```

- Inference:

  ```shell
  bash shscripts/inference_cogvideo_i2v_fullft.sh
  ```