This repo contains PyTorch model definitions, pre-trained weights and inference/sampling code for our experiments on HunyuanVideo Keyframe Control Lora.
- News!!
- Abstract
- Demo
- Recommended Settings
- Dependencies and Installation
- Inference
- Training
- 27 February 2025: We release the training code of HunyuanVideo Keyframe Control Lora along with a blog post.
- 24 February 2025: We release the inference code and model weights of HunyuanVideo Keyframe Control Lora. Download.
HunyuanVideo Keyframe Control Lora is an adapter for the HunyuanVideo T2V model for keyframe-based video generation. Our architecture builds upon existing models, introducing key enhancements to optimize keyframe-based video generation:
- We modify the input patch embedding projection layer to effectively incorporate keyframe information. By adjusting the convolutional input parameters, we enable the model to process image inputs within the Diffusion Transformer (DiT) framework (see the sketch after this list).
- We apply Low-Rank Adaptation (LoRA) across all linear layers and the convolutional input layer. This approach facilitates efficient fine-tuning by introducing low-rank matrices that approximate the weight updates, thereby preserving the base model's foundational capabilities while reducing the number of trainable parameters.
- The model is conditioned on user-defined keyframes, allowing precise control over the generated video's start and end frames. This conditioning ensures that the generated content aligns seamlessly with the specified keyframes, enhancing the coherence and narrative flow of the video.
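For illustration, here is a minimal sketch of how a patch-embed projection can be widened to accept extra keyframe-conditioning channels while keeping the base model's behavior at initialization. The function name and the zero-initialization choice are assumptions made for this example; this is not the repo's exact code.

```python
# Illustrative sketch only: widen a DiT patch-embed Conv3d so it accepts extra
# keyframe-conditioning channels. The new input channels are zero-initialized,
# so the expanded layer initially reproduces the base model's outputs.
import torch
import torch.nn as nn

def expand_patch_embed(conv: nn.Conv3d, extra_in_channels: int) -> nn.Conv3d:
    new_conv = nn.Conv3d(
        conv.in_channels + extra_in_channels,
        conv.out_channels,
        kernel_size=conv.kernel_size,
        stride=conv.stride,
        padding=conv.padding,
        bias=conv.bias is not None,
    )
    with torch.no_grad():
        new_conv.weight.zero_()
        # copy the pretrained weights for the original latent channels
        new_conv.weight[:, : conv.in_channels] = conv.weight
        if conv.bias is not None:
            new_conv.bias.copy_(conv.bias)
    return new_conv
```

The expanded layer, together with the LoRA adapters on the linear layers, is then the only part that needs fine-tuning.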
Click on the images in the first column to view the generated videos.
| Generated Video | Image 1 | Image 2 |
| --- | --- | --- |
- The model works best on human subjects. Single subject images work slightly better.
- It is recommended to use the following image generation resolutions: `720x1280`, `544x960`, `1280x720`, `960x544`.
- It is recommended to set the number of frames between 33 and 97. Up to 121 frames is also possible (but not tested much).
- Prompting helps a lot, but the model works even without a prompt. The prompt can be as simple as the name of the object you want to generate, or it can be detailed.
- `num_inference_steps` is recommended to be 50; for faster results you can use 30 as well. Anything less than 30 is not recommended.
Begin by cloning the repository:
```shell
git clone https://github.com/dashtoon/hunyuan-video-keyframe-control-lora.git
cd hunyuan-video-keyframe-control-lora
```
We recommend CUDA version 12.4.
Conda's installation instructions are available here.
```shell
bash setup_env.sh
```
The model weights can be downloaded from Hugging Face.
You can run inference using the provided script. The script uses `flash_attn` but can also be modified to use `sage_attn`. Running the below command will output a video that is saved in `output.mp4`.
- An NVIDIA GPU with CUDA support is required.
- The model is tested on a single 80G GPU.
- Minimum: The minimum GPU memory required is ~60GB for 720px1280px129f and ~45GB for 544px960px129f.
- Recommended: We recommend using a GPU with 80GB of memory for better generation quality.
- Tested operating system: Linux
```shell
export BASE_MODEL="hunyuanvideo-community/HunyuanVideo"
export LORA_PATH="<PATH TO DOWNLOADED CONTROL LORA>"
export IMAGE_1="<PATH TO THE FIRST FRAME>"
export IMAGE_2="<PATH TO THE LAST FRAME>"
export PROMPT="<A BEAUTIFUL PROMPT>"
export HEIGHT=960
export WIDTH=544
export n_FRAMES=33

python hv_control_lora_inference.py \
    --model $BASE_MODEL \
    --lora $LORA_PATH \
    --frame1 $IMAGE_1 --frame2 $IMAGE_2 --prompt "$PROMPT" --frames $n_FRAMES \
    --height $HEIGHT --width $WIDTH \
    --steps 50 \
    --guidance 6.0 \
    --seed 123143153 \
    --output output.mp4
```
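If you need to render several first/last-frame pairs, a small wrapper can drive the same CLI from Python. This is a hypothetical convenience script, not part of the repo; it only reuses the flags and environment variables shown above, and the file names are placeholders.

```python
# Hypothetical batch wrapper around the provided CLI; file names are placeholders.
import os
import subprocess

pairs = [
    ("shots/shot01_first.png", "shots/shot01_last.png", "shot01.mp4"),
    ("shots/shot02_first.png", "shots/shot02_last.png", "shot02.mp4"),
]

for frame1, frame2, out in pairs:
    subprocess.run(
        [
            "python", "hv_control_lora_inference.py",
            "--model", os.environ["BASE_MODEL"],
            "--lora", os.environ["LORA_PATH"],
            "--frame1", frame1, "--frame2", frame2,
            "--prompt", os.environ["PROMPT"],
            "--frames", os.environ["n_FRAMES"],
            "--height", os.environ["HEIGHT"],
            "--width", os.environ["WIDTH"],
            "--steps", "50",
            "--guidance", "6.0",
            "--seed", "123143153",
            "--output", out,
        ],
        check=True,
    )
```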
It is recommended to have at least 1 GPU with 80GB of VRAM. We use MosaicML Streaming for caching our data. We expect the original data in the following format; running the `tree` command, you should see:
```
dataset
├── metadata.csv
└── videos
    ├── 00000.mp4
    ├── 00001.mp4
    └── ...
```
The CSV can contain any number of columns, but due to limited support at the moment, we only make use of the caption and video columns. The CSV should look like this:
```
caption,video_file,other_column1,other_column2
A black and white animated sequence featuring a rabbit, named Rabbity Ribfried, and an anthropomorphic goat in a musical, playful environment, showcasing their evolving interaction.,videos/00000.mp4,...,...
```
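Before caching, it can be worth a quick sanity check that every video referenced in the CSV actually exists. The following is a hypothetical helper (not part of the repo), assuming the `caption`/`video_file` column names from the example above.

```python
# Hypothetical sanity check: verify every video referenced in metadata.csv exists.
import csv
from pathlib import Path

base_dir = Path("dataset")
with open(base_dir / "metadata.csv", newline="") as f:
    rows = list(csv.DictReader(f))

missing = [r["video_file"] for r in rows if not (base_dir / r["video_file"]).exists()]
print(f"{len(rows)} rows, {len(missing)} missing videos")
for path in missing:
    print("missing:", path)
```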
For the above format, you would run the following command to start caching the dataset:
```shell
python tools/hv_cache_dataset.py \
    --csv "dataset/metadata.csv" \
    --base_dir "dataset" \
    --video_column video_file \
    --caption_column "caption" \
    --output_dir "dataset/mds_cache" \
    --bucket_reso \
        "1280x720x33" "1280x720x65" "1280x720x97" "960x544x33" "960x544x65" "960x544x97" \
        "720x1280x33" "720x1280x65" "720x1280x97" "544x960x33" "544x960x65" "544x960x97" \
    --min_bucket_count 100 \
    --head_frame 0
```
- `bucket_reso`: specifies the bucket resolutions to train on, in the format `WxHxF` (see the sketch after this list for how a clip can map to a bucket).
- `head_frame`: the initial frame from which to start extracting from a video.
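To make the `WxHxF` bucket format concrete, here is an illustrative sketch of how a clip could be assigned to a bucket: pick the bucket aspect ratio closest to the clip's, then the largest frame count that the clip can still fill. This is an assumption for illustration, not the repo's exact bucketing logic.

```python
# Illustrative only (not the repo's exact logic): assign a clip to a "WxHxF" bucket.
def assign_bucket(width, height, num_frames, buckets):
    parsed = [tuple(map(int, b.split("x"))) for b in buckets]  # (W, H, F) triples
    aspect = width / height
    # choose the bucket resolution whose aspect ratio is closest to the clip's
    best_wh = min({(w, h) for w, h, _ in parsed}, key=lambda wh: abs(wh[0] / wh[1] - aspect))
    # then pick the largest frame count that the clip can still fill
    frame_options = [f for w, h, f in parsed if (w, h) == best_wh and f <= num_frames]
    if not frame_options:
        return None  # clip too short for any bucket at this aspect ratio
    return (*best_wh, max(frame_options))

# A 1920x1080 clip with 80 frames would land in the 1280x720x65 bucket:
print(assign_bucket(1920, 1080, 80, ["1280x720x33", "1280x720x65", "1280x720x97"]))
```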
NOTE: It is recommended to first convert your videos into separate scenes and ensure there is continuity between scenes. This is a good starting point for video dataset preparation.
The next command will start caching the LLM embeddings and the VAE states.
```shell
NUM_GPUS=8
MIXED_PRECISION="bf16"

accelerate launch --num_processes=$NUM_GPUS --mixed_precision=$MIXED_PRECISION --main_process_port=12345 \
    tools/hv_precompute_latents_dist.py \
    --pretrained_model_name_or_path="hunyuanvideo-community/HunyuanVideo" \
    --mds_data_path "dataset/mds_cache" \
    --output_dir "dataset/mds_cache_latents" \
    --recursive
```
Now you need to add the paths to all the MDS latent folders as a list under `data.local` in the `./configs/config_defaults.yaml` config file. The latent cache is stored under the `--output_dir` folder in subfolders named like `1280x720x33_00`, where `1280` is the width of the video, `720` is the height, `33` is the number of frames, and `00` is the GPU id.
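A small hypothetical helper for this step (not part of the repo) globs the per-GPU latent shard folders written by the precompute script and prints them as a YAML list that can be pasted under `data.local`; the folder-naming pattern is taken from the description above.

```python
# Hypothetical helper: list the latent shard folders as a YAML block for data.local.
from pathlib import Path

latent_root = Path("dataset/mds_cache_latents")
shards = sorted(str(p) for p in latent_root.glob("*x*x*_*") if p.is_dir())

print("data:")
print("  local:")
for shard in shards:
    print(f"    - {shard}")
```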
Now we are ready to start training!
```shell
NUM_GPUS=8
MIXED_PRECISION="bf16"
EXPERIMENT_NAME="my_first_run"
OUTPUT_DIR="outputs/"
CONFIG_PATH="./configs/config_defaults.yaml"
NUM_EPOCHS=1

accelerate launch --num_processes=$NUM_GPUS --mixed_precision=$MIXED_PRECISION --main_process_port=12345 \
    hv_train_control_lora.py \
    --config_path $CONFIG_PATH \
    --experiment.run_id=$EXPERIMENT_NAME \
    --experiment.output_dirpath=$OUTPUT_DIR \
    --network.train_norm_layers=False \
    --network.lora_dropout=0.05 \
    --hparams.ema.use_ema=False \
    --hparams.num_train_epochs=$NUM_EPOCHS
```
- We would like to thank the contributors to the SD3, FLUX, Llama, LLaVA, Xtuner, diffusers and HuggingFace repositories, for their open research and exploration.
- We build on top of a body of great open-source libraries: transformers, accelerate, peft, diffusers, bitsandbytes, torchao, deepspeed, mosaicml-streaming -- to name a few.