|
| 1 | +### ScienceQA |
| 2 | + |
| 3 | +#### Prepare Data |
| 4 | +1. Please see ScienceQA [repo](https://github.com/lupantech/ScienceQA) for setting up the dataset. |
| 5 | +2. Generate ScienceQA dataset for LLaVA conversation-style format. |
| 6 | + |
| 7 | +```Shell |
| 8 | +python scripts/convert_sqa_to_llava \ |
| 9 | + convert_to_llava \ |
| 10 | + --base-dir /path/to/ScienceQA/data/scienceqa \ |
| 11 | + --split {train,val,minival,test,minitest} |
| 12 | +``` |
| 13 | + |
| 14 | +#### Training |
| 15 | +**NOTE**: Due to that ScienceQA experiments were done earlier, the current checkpoints are trained *without* `<im_start>` and `<im_end>` tokens. Here we provide our training scripts for the current checkpoints. |
| 16 | + |
| 17 | +<details> |
| 18 | +<summary>1. Pretraining</summary> |
| 19 | + |
| 20 | +```Shell |
| 21 | +torchrun --nnodes=1 --nproc_per_node=8 --master_port=25001 \ |
| 22 | + llava/train/train_mem.py \ |
| 23 | + --model_name_or_path ./checkpoints/llama-vicuna-13b \ |
| 24 | + --data_path /path/to/cc3m_595k.json \ |
| 25 | + --image_folder /path/to/cc3m_595k \ |
| 26 | + --vision_tower openai/clip-vit-large-patch14 \ |
| 27 | + --tune_mm_mlp_adapter True \ |
| 28 | + --mm_vision_select_layer -2 \ |
| 29 | + --bf16 True \ |
| 30 | + --output_dir ./checkpoints/llava-13b-pretrain-no_im_start_end_token \ |
| 31 | + --num_train_epochs 1 \ |
| 32 | + --per_device_train_batch_size 16 \ |
| 33 | + --per_device_eval_batch_size 4 \ |
| 34 | + --gradient_accumulation_steps 1 \ |
| 35 | + --evaluation_strategy "no" \ |
| 36 | + --save_strategy "steps" \ |
| 37 | + --save_steps 2400 \ |
| 38 | + --save_total_limit 1 \ |
| 39 | + --learning_rate 2e-3 \ |
| 40 | + --weight_decay 0. \ |
| 41 | + --warmup_ratio 0.03 \ |
| 42 | + --lr_scheduler_type "cosine" \ |
| 43 | + --logging_steps 1 \ |
| 44 | + --tf32 True \ |
| 45 | + --model_max_length 2048 \ |
| 46 | + --gradient_checkpointing True \ |
| 47 | + --lazy_preprocess True \ |
| 48 | + --report_to wandb |
| 49 | +``` |
| 50 | +</details> |
| 51 | + |
| 52 | +<details> |
| 53 | +<summary>2. Finetuning</summary> |
| 54 | + |
| 55 | +You may download our pretrained `llava-13b-v0-pretrain-no_im_start_end_token.bin` [here](https://huggingface.co/liuhaotian/LLaVA-13b-pretrain-projector-v0/blob/main/LLaVA-13b-pretrain-projector-v0-CC3M-595K-original_caption-no_im_token.bin). |
| 56 | + |
| 57 | +```Shell |
| 58 | +torchrun --nnodes=1 --nproc_per_node=8 --master_port=25001 \ |
| 59 | + llava/train/train_mem.py \ |
| 60 | + --model_name_or_path /path/to/llama-vicuna-13b \ |
| 61 | + --data_path /path/to/scienceqa/llava_train_QCM-LEPA.json \ |
| 62 | + --image_folder /path/to/scienceqa/images/train \ |
| 63 | + --vision_tower openai/clip-vit-large-patch14 \ |
| 64 | + --pretrain_mm_mlp_adapter ./checkpoints/llava-13b-pretrain-no_im_start_end_token/mm_projector.bin \ |
| 65 | + --mm_vision_select_layer -2 \ |
| 66 | + --bf16 True \ |
| 67 | + --output_dir ./checkpoints/llava-13b-pretrain-no_im_start_end_token-finetune_scienceqa \ |
| 68 | + --num_train_epochs 12 \ |
| 69 | + --per_device_train_batch_size 4 \ |
| 70 | + --per_device_eval_batch_size 4 \ |
| 71 | + --gradient_accumulation_steps 1 \ |
| 72 | + --evaluation_strategy "no" \ |
| 73 | + --save_strategy "steps" \ |
| 74 | + --save_steps 5000 \ |
| 75 | + --save_total_limit 3 \ |
| 76 | + --learning_rate 2e-5 \ |
| 77 | + --weight_decay 0. \ |
| 78 | + --warmup_ratio 0.03 \ |
| 79 | + --lr_scheduler_type "cosine" \ |
| 80 | + --logging_steps 1 \ |
| 81 | + --tf32 True \ |
| 82 | + --fsdp "full_shard auto_wrap" \ |
| 83 | + --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \ |
| 84 | + --model_max_length 2048 \ |
| 85 | + --gradient_checkpointing True \ |
| 86 | + --lazy_preprocess True \ |
| 87 | + --report_to wandb |
| 88 | +``` |
| 89 | +</details> |
| 90 | + |
| 91 | +#### Evaluation |
| 92 | + |
| 93 | +1. Download our pretrained LLaVA-13B (delta) weights for ScienceQA dataset [here](https://huggingface.co/liuhaotian/LLaVA-13b-delta-v0-science_qa). Convert the delta weights to actual weights. |
| 94 | + |
| 95 | +```Shell |
| 96 | +python -m llava.model.apply_delta \ |
| 97 | + --base /path/to/llama-13b \ |
| 98 | + --target /path/to/LLaVA-13b-v0-science_qa \ |
| 99 | + --delta liuhaotian/LLaVA-13b-delta-v0-science_qa |
| 100 | +``` |
| 101 | + |
| 102 | +2. [Option 1] Multiple-GPU inference |
| 103 | +You may evaluate this with multiple GPUs, and concatenate the generated jsonl files. Please refer to our script for [batch evaluation](https://github.com/haotian-liu/LLaVA/blob/main/scripts/sqa_eval_batch.sh) and [results gathering](https://github.com/haotian-liu/LLaVA/blob/main/scripts/sqa_eval_gather.sh). |
| 104 | + |
| 105 | +3. [Option 2] Single-GPU inference |
| 106 | + |
| 107 | +(a) Generate LLaVA responses on ScienceQA dataset |
| 108 | + |
| 109 | +```Shell |
| 110 | +python -m llava.eval.model_vqa_science \ |
| 111 | + --model-path /path/to/LLaVA-13b-v0-science_qa \ |
| 112 | + --question-file /path/to/ScienceQA/data/scienceqa/llava_test.json \ |
| 113 | + --image-folder /path/to/ScienceQA/data/scienceqa/images/test \ |
| 114 | + --answers-file vqa/results/ScienceQA/test_llava-13b.jsonl \ |
| 115 | + --answer-prompter \ |
| 116 | + --conv-mode llava_v0 |
| 117 | +``` |
| 118 | + |
| 119 | +(b) Evaluate the generated responses |
| 120 | + |
| 121 | +```Shell |
| 122 | +python eval_science_qa.py \ |
| 123 | + --base-dir /path/to/ScienceQA/data/scienceqa \ |
| 124 | + --result-file vqa/results/ScienceQA/test_llava-13b.jsonl \ |
| 125 | + --output-file vqa/results/ScienceQA/test_llava-13b_output.json \ |
| 126 | + --output-result vqa/results/ScienceQA/test_llava-13b_result.json \ |
| 127 | +``` |
| 128 | + |
| 129 | +For reference, we attach our prediction file [`test_llava-13b_result.json`](https://github.com/haotian-liu/LLaVA/blob/main/llava/eval/table/results/test_sqa_llava_13b_v0.json) for comparison when reproducing our results, as well as for further analysis in detail. |
0 commit comments