This guide provides instructions on running post-training with Cosmos-Predict2 Video2World models.
- Prerequisites
- Preparing Data
- Post-training
- Inference with the Post-trained checkpoint
- Inference for DreamGen Benchmark
Before running training:
- Environment setup:
- Follow the Setup guide for installation instructions.
- For users who want to run the commands in https://github.com/nvidia/GR00T-dreams: after setting up the environment, run the following command to install the extra dependencies:
```bash
# If you use Docker and see "ERROR: Cannot install httpcore==1.0.7 because these package versions have conflicting dependencies",
# the following commands may help resolve the package version conflict:
# grep -v "^h11==" /etc/pip/constraint.txt > /etc/pip/constraint_new.txt && mv /etc/pip/constraint_new.txt /etc/pip/constraint.txt
# grep -v "^httpcore==" /etc/pip/constraint.txt > /etc/pip/constraint_new.txt && mv /etc/pip/constraint_new.txt /etc/pip/constraint.txt
pip install openai tyro numpydantic albumentations tianshou git+https://github.com/facebookresearch/pytorch3d.git
```
- Model checkpoints: Download the required model weights following the Downloading Checkpoints section in the Setup guide.
- Hardware considerations: Review the Performance guide for GPU requirements and model selection recommendations.
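To confirm the extra dependencies from the pip install step above resolved cleanly, a quick check like the following can help. This is a minimal sketch; the package list simply mirrors the install command.

```python
# Minimal sanity check: verify the extra GR00T-Dreams dependencies are importable.
# The package list mirrors the pip install command above.
import importlib.util

for pkg in ("openai", "tyro", "numpydantic", "albumentations", "tianshou", "pytorch3d"):
    status = "OK" if importlib.util.find_spec(pkg) else "MISSING"
    print(f"{pkg}: {status}")
```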
Example of the training data for the GR1 and DROID models:
| Dataset | Model Weight | Text prompt | Training video |
|---|---|---|---|
| GR1 | 🤗 Huggingface | Use the left hand to pick up red milk carton from teal bowl to pink plate. | 453747141-b8ee46fa-8b65-4018-a968-ed6895e9063c.mp4 |
| DROID | 🤗 Huggingface | A multi-view video shows that a robot put the marker on the table The video is split into four views: The top-left view shows the robotic arm from the left side, the top-right view shows it from the right side, the bottom-left view shows a first-person perspective from the robot's end-effector (gripper), and the bottom-right view is a black screen (inactive view). The robot put the marker on the table | 453747409-18867f08-d5cc-43e6-a507-0dd01dd190a9.mp4 |
For training on the robotic training datasets from the DreamGen paper, use the following command to download the GR1 training dataset from https://huggingface.co/datasets/nvidia/GR1-100.
Under the cosmos-predict2/ folder, run:
```bash
# This command will download the videos for physical AI
huggingface-cli download nvidia/GR1-100 --repo-type dataset --local-dir datasets/benchmark_train/hf_gr1/ && \
mkdir -p datasets/benchmark_train/gr1/videos && \
mv datasets/benchmark_train/hf_gr1/gr1/*mp4 datasets/benchmark_train/gr1/videos && \
mv datasets/benchmark_train/hf_gr1/metadata.csv datasets/benchmark_train/gr1/
```
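As a quick sanity check after the download and move, you can confirm the videos and metadata landed in the expected locations. This is a minimal sketch using only the paths from the commands above.

```python
# Confirm the GR1 videos and metadata.csv ended up where the commands above put them.
from pathlib import Path

root = Path("datasets/benchmark_train/gr1")
videos = sorted((root / "videos").glob("*.mp4"))
print(f"Found {len(videos)} training videos")
assert videos, "No videos found; re-check the huggingface-cli download step"
assert (root / "metadata.csv").exists(), "metadata.csv was not moved"
```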
Run the following command to pre-compute T5-XXL embeddings for the video captions used for post-training:
```bash
# The script will use the provided prompts and save the T5-XXL embeddings in pickle format.
python -m scripts.get_t5_embeddings_from_groot_dataset --dataset_path datasets/benchmark_train/gr1
```
Dataset folder format:
```
datasets/benchmark_train/gr1/
├── metas/
│   ├── *.txt
├── videos/
│   ├── *.mp4
├── t5_xxl/
│   ├── *.pickle
```
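A minimal sketch to verify the layout above is complete, assuming one .txt caption and one .pickle embedding per video, keyed by the video's file stem:

```python
# Verify one caption and one T5-XXL embedding exist per video.
# Assumes files are keyed by the video's stem, e.g. videos/x.mp4 -> t5_xxl/x.pickle.
import pickle
from pathlib import Path

root = Path("datasets/benchmark_train/gr1")
for video in sorted((root / "videos").glob("*.mp4")):
    assert (root / "metas" / f"{video.stem}.txt").exists(), f"missing caption for {video.name}"
    assert (root / "t5_xxl" / f"{video.stem}.pickle").exists(), f"missing embedding for {video.name}"

# Peek at one embedding payload; its exact structure is an assumption here.
with open(next((root / "t5_xxl").glob("*.pickle")), "rb") as f:
    payload = pickle.load(f)
print(type(payload))
```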
Run the following command to execute an example post-training job with GR1 data.
```bash
EXP=predict2_video2world_training_2b_groot_gr1_480
torchrun --nproc_per_node=8 --master_port=12341 -m scripts.train --config=cosmos_predict2/configs/base/config.py -- experiment=${EXP}
```
The model will be post-trained using the GR1 dataset.
See the config `predict2_video2world_training_2b_groot_gr1_480` defined in `cosmos_predict2/configs/base/experiment/groot.py` to understand how the dataloader is defined.
```python
# GROOT example
example_video_dataset_gr1 = L(Dataset)(
    dataset_dir="datasets/benchmark_train/gr1",
    num_frames=93,
    video_size=(432, 768),
)

dataloader_train_gr1 = L(DataLoader)(
    dataset=example_video_dataset_gr1,
    sampler=L(get_sampler)(dataset=example_video_dataset_gr1),
    batch_size=1,
    drop_last=True,
    num_workers=8,
    pin_memory=True,
)
```
The checkpoints will be saved to `checkpoints/PROJECT/GROUP/NAME`.
In the above example, `PROJECT` is `posttraining`, `GROUP` is `video2world`, and `NAME` is `2b_groot_gr1_480`.
See the job config to understand how they are determined.
```python
predict2_video2world_training_2b_groot_gr1_480 = dict(
    ...
    job=dict(
        project="posttraining",
        group="video2world",
        name="2b_groot_gr1_480",
    ),
    ...
)
```
The checkpoints will be saved in the below structure.
```
checkpoints/posttraining/video2world/2b_groot_gr1_480/checkpoints/
├── model/
│   ├── iter_{NUMBER}.pt
├── optim/
├── scheduler/
├── trainer/
├── latest_checkpoint.txt
```
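For example, the newest model checkpoint can be located programmatically with a sketch like the one below, assuming `latest_checkpoint.txt` stores the filename of the most recent checkpoint (e.g. `iter_{NUMBER}.pt`):

```python
# Resolve the most recent model checkpoint from latest_checkpoint.txt.
# Assumption: the file contains the latest checkpoint filename, e.g. "iter_{NUMBER}.pt".
from pathlib import Path

ckpt_root = Path("checkpoints/posttraining/video2world/2b_groot_gr1_480/checkpoints")
latest = (ckpt_root / "latest_checkpoint.txt").read_text().strip()
model_ckpt = ckpt_root / "model" / latest
print(f"Latest model checkpoint: {model_ckpt}")
```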
Run the following command to execute an example post-training job with GR1 data on 4 nodes with 8 GPUs each.
```bash
EXP=predict2_video2world_training_14b_groot_gr1_480
NVTE_FUSED_ATTN=0 torchrun --nproc_per_node=8 --nnodes=4 --rdzv_id 123 --rdzv_backend c10d --rdzv_endpoint $MASTER_ADDR:1234 \
-m scripts.train --config=cosmos_predict2/configs/base/config.py -- experiment=${EXP}
```
- Optionally, you can initialize from the `Cosmos-Predict2-14B-Video2World-Sample-GR00T-Dreams-GR1` checkpoint by appending `model.config.model_manager_config.dit_path=checkpoints/nvidia/Cosmos-Predict2-14B-Video2World-Sample-GR00T-Dreams-GR1/model-480p-16fps.pt` to the above command.
Check out inference_video2world.md for more examples of how to use the inference script.
- Inference with GR1 checkpoint
```bash
torchrun --nproc_per_node=8 --master_port=12341 \
-m examples.video2world_gr00t \
--model_size 14B \
--gr00t_variant gr1 \
--prompt "Use the right hand to pick up rubik\'s cube from from the bottom of the three-tiered wooden shelf to to the top of the three-tiered wooden shelf." \
--input_path assets/sample_gr00t_dreams_gr1/8_Use_the_right_hand_to_pick_up_rubik\'s_cube_from_from_the_bottom_of_the_three-tiered_wooden_shelf_to_to_the_top_of_the_three-tiered_wooden_shelf..png \
--num_gpus 8 \
--prompt_prefix "" \
--save_path output/generated_video_gr1.mp4
```
- Inference with DROID checkpoint
```bash
torchrun --nproc_per_node=8 --master_port=12341 \
-m examples.video2world_gr00t \
--model_size 14B \
--gr00t_variant droid \
--prompt "A multi-view video shows that a robot pick the lid and put it on the pot The video is split into four views: The top-left view shows the robotic arm from the left side, the top-right view shows it from the right side, the bottom-left view shows a first-person perspective from the robot's end-effector (gripper), and the bottom-right view is a black screen (inactive view). The robot pick the lid and put it on the pot" \
--input_path assets/sample_gr00t_dreams_droid/episode_000408.png \
--prompt_prefix "" \
--num_gpus 8 \
--save_path output/generated_video_droid.mp4
```
Example of the inference output:
- The following command will download the DreamGen Benchmark dataset from https://huggingface.co/datasets/nvidia/EVAL-175
```bash
huggingface-cli download nvidia/EVAL-175 --repo-type dataset --local-dir dream_gen_benchmark
python -m scripts.prepare_batch_input_json \
--dataset_path dream_gen_benchmark/gr1_object/ \
--save_path output/dream_gen_benchmark/cosmos_predict2_14b_gr1_object/ \
--output_path dream_gen_benchmark/gr1_object/batch_input.json
```
Then run batch inference:
```bash
python -m examples.video2world_gr00t \
--model_size 14B \
--gr00t_variant gr1 \
--batch_input_json dream_gen_benchmark/gr1_object/batch_input.json \
--disable_guardrail
```
- Note: For full evaluation without missing videos, it's better to turn off the guardrail checks (add `--disable_guardrail` to the command) to make sure all the videos are generated.
- See documentations/inference_video2world.md for inference run details.
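Before launching batch inference, it can be useful to inspect the generated batch_input.json. The following is a minimal sketch, assuming the file is a JSON list of per-sample dicts; the exact schema is defined by scripts.prepare_batch_input_json.

```python
# Inspect the generated batch file: entry count and the keys of the first entry.
# Assumes a JSON list of per-sample dicts; the exact schema comes from
# scripts.prepare_batch_input_json.
import json

with open("dream_gen_benchmark/gr1_object/batch_input.json") as f:
    batch = json.load(f)
print(f"{len(batch)} entries; first entry keys: {sorted(batch[0].keys())}")
```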
Check out inference_video2world.md and the Cosmos-Reason1 video critic instructions for more examples of how to improve video quality using Cosmos-Reason1's video critic capability. Refer to the API Documentation for detailed usage of video2world_bestofn.py.
- Inference with GR1 checkpoint and rejection sampling
```bash
torchrun --nproc_per_node=8 --master_port=12341 \
-m examples.video2world_bestofn \
--model_size 14B \
--gr00t_variant gr1 \
--prompt "Use the right hand to pick up rubik\'s cube from from the bottom of the three-tiered wooden shelf to to the top of the three-tiered wooden shelf." \
--input_path assets/sample_gr00t_dreams_gr1/8_Use_the_right_hand_to_pick_up_rubik\'s_cube_from_from_the_bottom_of_the_three-tiered_wooden_shelf_to_to_the_top_of_the_three-tiered_wooden_shelf..png \
--num_gpus 8 \
--num_generations 4 \
--prompt_prefix "" \
--disable_guardrail \
--save_path output/best-of-n-gr00t-gr1
```
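After the run, the candidate videos can be listed from the save path. This is a minimal sketch, assuming video2world_bestofn writes one .mp4 per generation under `--save_path`.

```python
# List the candidate videos produced by rejection sampling.
# Assumption: video2world_bestofn writes one .mp4 per generation under --save_path.
from pathlib import Path

for video in sorted(Path("output/best-of-n-gr00t-gr1").glob("*.mp4")):
    print(video.name)
```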
