
Training Perception Language Model (PLM)


We provide instructions to train or finetune PLM on a custom dataset.


Tip

We provide configurations to run the warm-up and SFT stages to facilitate reproducibility of PLM training.

Data Format 📂

We support both image and video conversation datasets in jsonl format. Each line of the jsonl file should follow the format below.

For Image Conversation Dataset

  {
    "image": "<image path>",
    "conversations": [
      {
        "from": "human",
        "value": "human instruction"
      },
      {
        "from": "assistant",
        "value": "model response"
      }
    ]
  }

For Video Conversation Dataset

  {
    "video": "<video path>",
    "conversations": [
      {
        "from": "human",
        "value": " human instruction"
      },
      {
        "from": "assistant",
        "value": "model response"
      }
    ]
  }

Note that image samples must include the image key in each jsonl line, while video samples must include the video key. The conversations key is common to both types.
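
For reference, the following minimal sketch (not part of the repo) writes an image-conversation dataset in this format; the output file name, image path, and conversation text are placeholders.

```python
import json

# Placeholder samples in the image-conversation format described above.
samples = [
    {
        "image": "images/0001.jpg",
        "conversations": [
            {"from": "human", "value": "What is shown in this image?"},
            {"from": "assistant", "value": "A dog playing in a park."},
        ],
    },
]

# Write one JSON object per line, as required by the jsonl format.
with open("custom_dataset.jsonl", "w") as f:
    for sample in samples:
        f.write(json.dumps(sample) + "\n")
```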

Tip

The repo also supports text-only, multi-image, image-region, video-region-caption (RCap), video-region-temporal-localization (RTLoc), and video-region-dense-captioning (RDCap) tasks. Please download the provided dummy-datasets for an example of each dataset.

Registration of New Dataset

Given the dataset jsonl file, we can register a new dataset by adding an entry in apps/plm/configs/datasets.yaml.

custom_dataset_name:
    annotation: path/to/the/jsonl/file.jsonl
    root_dir: path/to/the/image-or-video/root-dir

Please refer to apps/plm/configs/datasets.yaml for the dummy image, video, and grounding datasets that are already registered.
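
As an optional sanity check (a sketch, not a repo utility), a short script like the one below can verify that each registered entry points to an existing annotation file and root directory; it assumes the two-key schema shown above.

```python
import os
import yaml  # requires pyyaml

# Load the dataset registry and check that the referenced paths exist.
with open("apps/plm/configs/datasets.yaml") as f:
    datasets = yaml.safe_load(f)

for name, cfg in datasets.items():
    if not os.path.isfile(cfg["annotation"]):
        print(f"[{name}] annotation file not found: {cfg['annotation']}")
    root_dir = cfg.get("root_dir")
    if root_dir and not os.path.isdir(root_dir):
        print(f"[{name}] root_dir not found: {root_dir}")
```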


Training / Finetuning PLM 🚋

Training PLM involves creating a .yaml configuration file that defines all model- and training-related parameters. Please refer to the provided SFT configuration for details.

Tip

To run the following code, download the dummy-datasets and extract them to apps/plm/dummy_datasets.

Given a .yaml configuration file, please run the following command to launch the training on a single node with 8 GPUs.

torchrun --nproc-per-node 8 -m apps.plm.train config=apps/plm/configs/finetune/plm_3b.yaml

Consolidate Checkpoints

In order to run inference / evaluation, please consolidate the checkpoints using the following command:

python apps/plm/consolidate.py --ckpt <path to the saved checkpoints.>

Run Inference / Evaluation

After consolidating the checkpoints, you can run inference using the following command:

python apps/plm/generate.py \
--ckpt facebook/Perception-LM-3B \
--media_type image \  # Replace with "video" for running inference on video
--media_path <path to image or video> \
--question <Question to be asked about the image or video.>
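
If you need to run inference over several media files, a small wrapper like the sketch below simply re-invokes the documented command once per input; the media paths and questions are placeholders.

```python
import subprocess

# Placeholder batch of (media_type, media_path, question) jobs.
jobs = [
    ("image", "samples/cat.jpg", "What animal is in the picture?"),
    ("video", "samples/clip.mp4", "What is happening in this video?"),
]

for media_type, media_path, question in jobs:
    # Re-invoke the documented generate.py CLI for each media file.
    subprocess.run(
        [
            "python", "apps/plm/generate.py",
            "--ckpt", "facebook/Perception-LM-3B",
            "--media_type", media_type,
            "--media_path", media_path,
            "--question", question,
        ],
        check=True,
    )
```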

For evaluation, please refer to evaluation.md.


We also provide a script to launch distributed multi-node training on Slurm. Please use the provided utility stool.py.

python -m core.stool script=apps.plm.train config=apps/plm/configs/finetune/plm_8b.yaml qos=<QoS> nodes=<num_of_nodes>

We provide a step-by-step example of finetuning PLM on a public dataset, elaborating on each of the steps above in detail. Please see finetune_example.md.