We provide instructions to train or finetune PLM on a custom dataset. We support both image and video conversation datasets via `jsonl` files. Each line of the `jsonl` file should follow one of the following formats:
```json
{
    "image": "<image path>",
    "conversations": [
        {
            "from": "human",
            "value": "human instruction"
        },
        {
            "from": "assistant",
            "value": "model response"
        }
    ]
}
```

```json
{
    "video": "<video path>",
    "conversations": [
        {
            "from": "human",
            "value": "human instruction"
        },
        {
            "from": "assistant",
            "value": "model response"
        }
    ]
}
```
Note that for images, we require the `image` key to be present in the `jsonl` line, while for videos we require the `video` key to be present. The `conversations` key is common to both types.
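For reference, the following minimal sketch (plain Python, independent of the repo) writes annotation lines in the format above; the media paths, questions, responses and output file name are purely illustrative:

```python
import json

# Hypothetical sample entries; replace the media paths and text with your own data.
samples = [
    {
        "image": "images/cat.jpg",
        "conversations": [
            {"from": "human", "value": "What animal is shown in the picture?"},
            {"from": "assistant", "value": "The picture shows a cat."},
        ],
    },
    {
        "video": "videos/cooking.mp4",
        "conversations": [
            {"from": "human", "value": "Describe what the person is doing."},
            {"from": "assistant", "value": "The person is chopping vegetables."},
        ],
    },
]

# Write one JSON object per line (jsonl).
with open("custom_dataset.jsonl", "w") as f:
    for sample in samples:
        f.write(json.dumps(sample) + "\n")
```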
Tip
The repo also supports text-only, multi-image, image-region, video-region-caption (RCap), video-region-temporal-localization (RTLoc) and video-region-dense-captioning (RDCap) tasks. Please download the provided dummy-datasets for an example of each of these datasets.
Given the dataset `jsonl` file, we can register a new dataset by adding an entry in `apps/plm/configs/datasets.yaml`:
```yaml
custom_dataset_name:
  annotation: path/to/the/jsonl/file.jsonl
  root_dir: path/to/the/image-or-video/root-dir
```
Please refer to `apps/plm/configs/datasets.yaml` for the dummy image, video and grounding datasets that are already registered.
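Before launching training, it can help to sanity-check a registered entry. The sketch below is not part of the repo; it assumes the `image`/`video` paths in the `jsonl` are relative to `root_dir`, and the dataset name is the hypothetical one registered above:

```python
import json
import os

import yaml  # PyYAML

# Hypothetical names; adjust to the entry you registered in datasets.yaml.
DATASETS_YAML = "apps/plm/configs/datasets.yaml"
DATASET_NAME = "custom_dataset_name"

with open(DATASETS_YAML) as f:
    entry = yaml.safe_load(f)[DATASET_NAME]

assert os.path.isfile(entry["annotation"]), "annotation jsonl not found"
assert os.path.isdir(entry["root_dir"]), "root_dir not found"

# Every line must carry an "image" or "video" key that resolves under root_dir.
with open(entry["annotation"]) as f:
    for i, line in enumerate(f):
        sample = json.loads(line)
        media = sample.get("image") or sample.get("video")
        assert media is not None, f"line {i}: missing 'image'/'video' key"
        media_path = os.path.join(entry["root_dir"], media)
        assert os.path.exists(media_path), f"line {i}: missing media file {media_path}"

print(f"'{DATASET_NAME}' looks consistent.")
```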
Training PLM involves creating a `.yaml` configuration file that defines all model- and training-related configurable parameters. Please refer to the provided sft configs for details.
Tip
To run the following code, download the dummy-datasets and extract them to `apps/plm/dummy_datasets`.
Given a `.yaml` configuration file, run the following command to launch training on a single node with 8 GPUs:
```shell
torchrun --nproc-per-node 8 -m apps.plm.train config=apps/plm/configs/finetune/plm_3b.yaml
```
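If your node has fewer GPUs, you can adjust `--nproc-per-node` accordingly (you may also need to revisit batch-size and parallelism settings in the config), for example on a 4-GPU node:

```shell
torchrun --nproc-per-node 4 -m apps.plm.train config=apps/plm/configs/finetune/plm_3b.yaml
```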
In order to run inference / evaluation, please consolidate the checkpoints using the following command:
```shell
python apps/plm/consolidate.py --ckpt <path to the saved checkpoints>
```
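For example, with a hypothetical checkpoint directory produced by the training run above:

```shell
python apps/plm/consolidate.py --ckpt path/to/dump_dir/checkpoints/0000002000  # hypothetical checkpoint path
```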
After consolidating the checkpoints, you can run inference using the following command:
```shell
# Use --media_type video to run inference on a video.
python apps/plm/generate.py \
    --ckpt facebook/Perception-LM-3B \
    --media_type image \
    --media_path <path to image or video> \
    --question <Question to be asked about the image or video>
```
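As a concrete example for video inference, with a placeholder media path and question that you would replace with your own:

```shell
python apps/plm/generate.py \
    --ckpt facebook/Perception-LM-3B \
    --media_type video \
    --media_path my_videos/cooking.mp4 \
    --question "What is the person doing in this video?"
```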
For evaluation, please refer to evaluation.md.
We also provide a script to launch distributed multi-node training on SLURM, using the provided utility named `stool.py`:
```shell
python -m core.stool script=apps.plm.train config=apps/plm/configs/finetune/plm_8b.yaml qos=<QoS> nodes=<num_of_nodes>
```
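For example, a hypothetical two-node launch (the QoS value depends entirely on your SLURM cluster):

```shell
python -m core.stool script=apps.plm.train config=apps/plm/configs/finetune/plm_8b.yaml qos=normal nodes=2
```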
We provide a step-by-step example of how to finetune PLM on a public dataset, which elaborates on each of the steps above in detail. Please see finetune_example.md.