Skip to content

Commit eae9369

Browse files
committed
Update docs
1 parent 3337088 commit eae9369

7 files changed

Lines changed: 207 additions & 324 deletions

File tree

README.md

Lines changed: 20 additions & 281 deletions
Large diffs are not rendered by default.

docs/ScienceQA.md

Lines changed: 129 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,129 @@
1+
### ScienceQA
2+
3+
#### Prepare Data
4+
1. Please see ScienceQA [repo](https://github.com/lupantech/ScienceQA) for setting up the dataset.
5+
2. Generate ScienceQA dataset for LLaVA conversation-style format.
6+
7+
```Shell
8+
python scripts/convert_sqa_to_llava \
9+
convert_to_llava \
10+
--base-dir /path/to/ScienceQA/data/scienceqa \
11+
--split {train,val,minival,test,minitest}
12+
```
13+
14+
#### Training
15+
**NOTE**: Due to that ScienceQA experiments were done earlier, the current checkpoints are trained *without* `<im_start>` and `<im_end>` tokens. Here we provide our training scripts for the current checkpoints.
16+
17+
<details>
18+
<summary>1. Pretraining</summary>
19+
20+
```Shell
21+
torchrun --nnodes=1 --nproc_per_node=8 --master_port=25001 \
22+
llava/train/train_mem.py \
23+
--model_name_or_path ./checkpoints/llama-vicuna-13b \
24+
--data_path /path/to/cc3m_595k.json \
25+
--image_folder /path/to/cc3m_595k \
26+
--vision_tower openai/clip-vit-large-patch14 \
27+
--tune_mm_mlp_adapter True \
28+
--mm_vision_select_layer -2 \
29+
--bf16 True \
30+
--output_dir ./checkpoints/llava-13b-pretrain-no_im_start_end_token \
31+
--num_train_epochs 1 \
32+
--per_device_train_batch_size 16 \
33+
--per_device_eval_batch_size 4 \
34+
--gradient_accumulation_steps 1 \
35+
--evaluation_strategy "no" \
36+
--save_strategy "steps" \
37+
--save_steps 2400 \
38+
--save_total_limit 1 \
39+
--learning_rate 2e-3 \
40+
--weight_decay 0. \
41+
--warmup_ratio 0.03 \
42+
--lr_scheduler_type "cosine" \
43+
--logging_steps 1 \
44+
--tf32 True \
45+
--model_max_length 2048 \
46+
--gradient_checkpointing True \
47+
--lazy_preprocess True \
48+
--report_to wandb
49+
```
50+
</details>
51+
52+
<details>
53+
<summary>2. Finetuning</summary>
54+
55+
You may download our pretrained `llava-13b-v0-pretrain-no_im_start_end_token.bin` [here](https://huggingface.co/liuhaotian/LLaVA-13b-pretrain-projector-v0/blob/main/LLaVA-13b-pretrain-projector-v0-CC3M-595K-original_caption-no_im_token.bin).
56+
57+
```Shell
58+
torchrun --nnodes=1 --nproc_per_node=8 --master_port=25001 \
59+
llava/train/train_mem.py \
60+
--model_name_or_path /path/to/llama-vicuna-13b \
61+
--data_path /path/to/scienceqa/llava_train_QCM-LEPA.json \
62+
--image_folder /path/to/scienceqa/images/train \
63+
--vision_tower openai/clip-vit-large-patch14 \
64+
--pretrain_mm_mlp_adapter ./checkpoints/llava-13b-pretrain-no_im_start_end_token/mm_projector.bin \
65+
--mm_vision_select_layer -2 \
66+
--bf16 True \
67+
--output_dir ./checkpoints/llava-13b-pretrain-no_im_start_end_token-finetune_scienceqa \
68+
--num_train_epochs 12 \
69+
--per_device_train_batch_size 4 \
70+
--per_device_eval_batch_size 4 \
71+
--gradient_accumulation_steps 1 \
72+
--evaluation_strategy "no" \
73+
--save_strategy "steps" \
74+
--save_steps 5000 \
75+
--save_total_limit 3 \
76+
--learning_rate 2e-5 \
77+
--weight_decay 0. \
78+
--warmup_ratio 0.03 \
79+
--lr_scheduler_type "cosine" \
80+
--logging_steps 1 \
81+
--tf32 True \
82+
--fsdp "full_shard auto_wrap" \
83+
--fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
84+
--model_max_length 2048 \
85+
--gradient_checkpointing True \
86+
--lazy_preprocess True \
87+
--report_to wandb
88+
```
89+
</details>
90+
91+
#### Evaluation
92+
93+
1. Download our pretrained LLaVA-13B (delta) weights for ScienceQA dataset [here](https://huggingface.co/liuhaotian/LLaVA-13b-delta-v0-science_qa). Convert the delta weights to actual weights.
94+
95+
```Shell
96+
python -m llava.model.apply_delta \
97+
--base /path/to/llama-13b \
98+
--target /path/to/LLaVA-13b-v0-science_qa \
99+
--delta liuhaotian/LLaVA-13b-delta-v0-science_qa
100+
```
101+
102+
2. [Option 1] Multiple-GPU inference
103+
You may evaluate this with multiple GPUs, and concatenate the generated jsonl files. Please refer to our script for [batch evaluation](https://github.com/haotian-liu/LLaVA/blob/main/scripts/sqa_eval_batch.sh) and [results gathering](https://github.com/haotian-liu/LLaVA/blob/main/scripts/sqa_eval_gather.sh).
104+
105+
3. [Option 2] Single-GPU inference
106+
107+
(a) Generate LLaVA responses on ScienceQA dataset
108+
109+
```Shell
110+
python -m llava.eval.model_vqa_science \
111+
--model-path /path/to/LLaVA-13b-v0-science_qa \
112+
--question-file /path/to/ScienceQA/data/scienceqa/llava_test.json \
113+
--image-folder /path/to/ScienceQA/data/scienceqa/images/test \
114+
--answers-file vqa/results/ScienceQA/test_llava-13b.jsonl \
115+
--answer-prompter \
116+
--conv-mode llava_v0
117+
```
118+
119+
(b) Evaluate the generated responses
120+
121+
```Shell
122+
python eval_science_qa.py \
123+
--base-dir /path/to/ScienceQA/data/scienceqa \
124+
--result-file vqa/results/ScienceQA/test_llava-13b.jsonl \
125+
--output-file vqa/results/ScienceQA/test_llava-13b_output.json \
126+
--output-result vqa/results/ScienceQA/test_llava-13b_result.json \
127+
```
128+
129+
For reference, we attach our prediction file [`test_llava-13b_result.json`](https://github.com/haotian-liu/LLaVA/blob/main/llava/eval/table/results/test_sqa_llava_13b_v0.json) for comparison when reproducing our results, as well as for further analysis in detail.

images/demo_cli.gif

9.58 MB
Loading

llava/model/builder.py

Lines changed: 6 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -29,6 +29,12 @@ def load_pretrained_model(model_path, model_base, model_name, load_8bit=False, l
2929
kwargs['load_in_8bit'] = True
3030
elif load_4bit:
3131
kwargs['load_in_4bit'] = True
32+
kwargs['quantization_config'] = BitsAndBytesConfig(
33+
load_in_4bit=True,
34+
bnb_4bit_compute_dtype=torch.float16,
35+
bnb_4bit_use_double_quant=True,
36+
bnb_4bit_quant_type='nf4'
37+
)
3238
else:
3339
kwargs['torch_dtype'] = torch.float16
3440

llava/serve/cli.py

Lines changed: 6 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -29,9 +29,11 @@ def main(args):
2929
disable_torch_init()
3030

3131
model_name = get_model_name_from_path(args.model_path)
32-
tokenizer, model, image_processor, context_len = load_pretrained_model(args.model_path, args.model_base, model_name)
32+
tokenizer, model, image_processor, context_len = load_pretrained_model(args.model_path, args.model_base, model_name, args.load_8bit, args.load_4bit)
3333

34-
if "v1" in model_name.lower():
34+
if 'llama-2' in model_name.lower():
35+
conv_mode = "llava_llama_2"
36+
elif "v1" in model_name.lower():
3537
conv_mode = "llava_v1"
3638
elif "mpt" in model_name.lower():
3739
conv_mode = "mpt"
@@ -107,10 +109,11 @@ def main(args):
107109
parser.add_argument("--model-base", type=str, default=None)
108110
parser.add_argument("--image-file", type=str, required=True)
109111
parser.add_argument("--num-gpus", type=int, default=1)
110-
parser.add_argument("--device", type=str, choices=["cuda", "cpu"], default="cuda")
111112
parser.add_argument("--conv-mode", type=str, default=None)
112113
parser.add_argument("--temperature", type=float, default=0.2)
113114
parser.add_argument("--max-new-tokens", type=int, default=512)
115+
parser.add_argument("--load-8bit", action="store_true")
116+
parser.add_argument("--load-4bit", action="store_true")
114117
parser.add_argument("--debug", action="store_true")
115118
args = parser.parse_args()
116119
main(args)

scripts/extract_mm_projector.py

Lines changed: 0 additions & 40 deletions
This file was deleted.

scripts/finetune_full_schedule.sh

Lines changed: 46 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,46 @@
1+
#!/bin/bash
2+
3+
# Uncomment and set the following variables correspondingly to run this script:
4+
5+
################## VICUNA ##################
6+
# PROMPT_VERSION=v1
7+
# MODEL_VERSION="vicuna-v1-3-7b"
8+
################## VICUNA ##################
9+
10+
################## LLaMA-2 ##################
11+
# PROMPT_VERSION="llava_llama_2"
12+
# MODEL_VERSION="llama-2-7b-chat"
13+
################## LLaMA-2 ##################
14+
15+
deepspeed llava/train/train_mem.py \
16+
--deepspeed /path/to/deepspeed.json \
17+
--model_name_or_path ./checkpoints/$MODEL_VERSION \
18+
--version $PROMPT_VERSION \
19+
--data_path ./playground/data/llava_instruct_158k.json \
20+
--image_folder /path/to/coco/train2017 \
21+
--vision_tower openai/clip-vit-large-patch14 \
22+
--pretrain_mm_mlp_adapter ./checkpoints/llava-$MODEL_VERSION-pretrain/mm_projector.bin \
23+
--mm_vision_select_layer -2 \
24+
--mm_use_im_start_end False \
25+
--mm_use_im_patch_token False \
26+
--bf16 True \
27+
--output_dir ./checkpoints/llava-$MODEL_VERSION-finetune \
28+
--num_train_epochs 3 \
29+
--per_device_train_batch_size 16 \
30+
--per_device_eval_batch_size 4 \
31+
--gradient_accumulation_steps 1 \
32+
--evaluation_strategy "no" \
33+
--save_strategy "steps" \
34+
--save_steps 50000 \
35+
--save_total_limit 1 \
36+
--learning_rate 2e-5 \
37+
--weight_decay 0. \
38+
--warmup_ratio 0.03 \
39+
--lr_scheduler_type "cosine" \
40+
--logging_steps 1 \
41+
--tf32 True \
42+
--model_max_length 2048 \
43+
--gradient_checkpointing True \
44+
--dataloader_num_workers 4 \
45+
--lazy_preprocess True \
46+
--report_to wandb

0 commit comments

Comments
 (0)