Commit 0a16dea

Release LLaVA-v1.6
1 parent ac89962 commit 0a16dea

32 files changed

Lines changed: 850 additions & 2372 deletions

README.md

Lines changed: 2 additions & 1 deletion
@@ -2,7 +2,7 @@

 *Visual instruction tuning towards large language and vision models with GPT-4 level capabilities.*

-[[Project Page](https://llava-vl.github.io/)] [[Demo](https://llava.hliu.cc/)] [[Data](https://github.com/haotian-liu/LLaVA/blob/main/docs/Data.md)] [[Model Zoo](https://github.com/haotian-liu/LLaVA/blob/main/docs/MODEL_ZOO.md)]
+[📢 [LLaVA-1.6 Blog](https://llava-vl.github.io/blog/2024-01-30-llava-1-6/)] [[Project Page](https://llava-vl.github.io/)] [[Demo](https://llava.hliu.cc/)] [[Data](https://github.com/haotian-liu/LLaVA/blob/main/docs/Data.md)] [[Model Zoo](https://github.com/haotian-liu/LLaVA/blob/main/docs/MODEL_ZOO.md)]

 🤝Community Contributions: [[llama.cpp](https://github.com/ggerganov/llama.cpp/pull/3436)] [[Colab](https://github.com/camenduru/LLaVA-colab)] [[🤗Space](https://huggingface.co/spaces/badayvedat/LLaVA)] [[Replicate](https://replicate.com/yorickvp/llava-13b)] [[AutoGen](https://github.com/microsoft/autogen/blob/main/notebook/agentchat_lmm_llava.ipynb)] [[BakLLaVA](https://github.com/SkunkworksAI/BakLLaVA)]

@@ -19,6 +19,7 @@


 ## Release
+- [1/30] 🔥 LLaVA-1.6 is out! With additional scaling to LLaVA-1.5, LLaVA-1.6-34B outperforms Gemini Pro. It can now process 4x more pixels and perform more tasks/applications than before. Check out the [blog post](https://llava-vl.github.io/blog/2024-01-30-llava-1-6/), and explore the [demo](https://llava.hliu.cc/)! Models are available in [Model Zoo](https://github.com/haotian-liu/LLaVA/blob/main/docs/MODEL_ZOO.md). Training/eval data and scripts coming soon.
 - [11/10] [LLaVA-Plus](https://llava-vl.github.io/llava-plus/) is released: Learning to Use Tools for Creating Multimodal Agents, with LLaVA-Plus (LLaVA that Plug and Learn to Use Skills). [[Project Page](https://llava-vl.github.io/llava-plus/)] [[Demo](https://llavaplus.ngrok.io/)] [[Code](https://github.com/LLaVA-VL/LLaVA-Plus-Codebase)] [[Paper](https://arxiv.org/abs/2311.05437)]
 - [11/2] [LLaVA-Interactive](https://llava-vl.github.io/llava-interactive/) is released: Experience the future of human-AI multimodal interaction with an all-in-one demo for Image Chat, Segmentation, Generation and Editing. [[Project Page](https://llava-vl.github.io/llava-interactive/)] [[Demo](https://llavainteractive.ngrok.io/)] [[Code](https://github.com/LLaVA-VL/LLaVA-Interactive-Demo)] [[Paper](https://arxiv.org/abs/2311.00571)]
 - [10/26] 🔥 LLaVA-1.5 with LoRA achieves comparable performance as full-model finetuning, with a reduced GPU RAM requirement ([ckpts](https://github.com/haotian-liu/LLaVA/blob/main/docs/MODEL_ZOO.md#llava-v15), [script](https://github.com/haotian-liu/LLaVA#train)). We also provide a [doc](https://github.com/haotian-liu/LLaVA/blob/main/docs/Finetune_Custom_Data.md) on how to finetune LLaVA-1.5 on your own dataset with LoRA.

docs/MODEL_ZOO.md

Lines changed: 13 additions & 1 deletion
@@ -4,7 +4,19 @@

 If you are interested in including any other details in Model Zoo, please open an issue :)

-The model weights below are *merged* weights. You do not need to apply delta. The usage of LLaVA checkpoints should comply with the base LLM's model license: [Llama 2](https://github.com/facebookresearch/llama/blob/main/MODEL_CARD.md).
+The model weights below are *merged* weights. You do not need to apply delta. The usage of LLaVA checkpoints should comply with the base LLM's model license.
+
+## LLaVA-v1.6
+
+| Version | LLM | Schedule | Checkpoint | MMMU | MathVista | VQAv2 | GQA | VizWiz | SQA | TextVQA | POPE | MME | MM-Bench | MM-Bench-CN | SEED-IMG | LLaVA-Bench-Wild | MM-Vet |
+|----------|----------|-----------|-----------|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
+| LLaVA-1.6 | Vicuna-7B | full_ft-1e | [liuhaotian/llava-v1.6-vicuna-7b](https://huggingface.co/liuhaotian/llava-v1.6-vicuna-7b) | 35.8 | 34.6 | 81.8 | 64.2 | 57.6 | 70.1 | 64.9 | 86.5 | 1519/332 | 67.4 | 60.6 | 70.2 | 81.6 | 43.9 |
+| LLaVA-1.6 | Vicuna-13B | full_ft-1e | [liuhaotian/llava-v1.6-vicuna-13b](https://huggingface.co/liuhaotian/llava-v1.6-vicuna-13b) | 36.2 | 35.3 | 82.8 | 65.4 | 60.5 | 73.6 | 67.1 | 86.2 | 1575/326 | 70 | 64.4 | 71.9 | 87.3 | 48.4 |
+| LLaVA-1.6 | Mistral-7B | full_ft-1e | [liuhaotian/llava-v1.6-mistral-7b](https://huggingface.co/liuhaotian/llava-v1.6-mistral-7b) | 35.3 | 37.7 | 82.2 | 64.8 | 60.0 | 72.8 | 65.7 | 86.7 | 1498/321 | 68.7 | 61.2 | 72.2 | 83.2 | 47.3 |
+| LLaVA-1.6 | Hermes-Yi-34B | full_ft-1e | [liuhaotian/llava-v1.6-34b](https://huggingface.co/liuhaotian/llava-v1.6-34b) | 51.1 | 46.5 | 83.7 | 67.1 | 63.8 | 81.8 | 69.5 | 87.7 | 1631/397 | 79.3 | 79 | 75.9 | 89.6 | 57.4 |
+
+*LLaVA-1.6-34B outperforms Gemini Pro on benchmarks like MMMU and MathVista.*
+

 ## LLaVA-v1.5

llava/conversation.py

Lines changed: 26 additions & 1 deletion
@@ -68,7 +68,7 @@ def get_prompt(self):
                 else:
                     ret += role
         elif self.sep_style == SeparatorStyle.LLAMA_2:
-            wrap_sys = lambda msg: f"<<SYS>>\n{msg}\n<</SYS>>\n\n"
+            wrap_sys = lambda msg: f"<<SYS>>\n{msg}\n<</SYS>>\n\n" if len(msg) > 0 else msg
             wrap_inst = lambda msg: f"[INST] {msg} [/INST]"
             ret = ""
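Review note: the effect of the `wrap_sys` guard can be sketched in isolation. The snippet below is a simplified, standalone illustration and not the repository's `get_prompt`; the helper `build_llama2_prompt`, its argument layout, and the defaults for `sep` and `sep2` are assumptions made for this example. It shows that an empty system message no longer produces an empty `<<SYS>>` block.

```python
def wrap_sys(msg: str) -> str:
    # Guarded version from this hunk: an empty system message is passed through as-is.
    return f"<<SYS>>\n{msg}\n<</SYS>>\n\n" if len(msg) > 0 else msg


def wrap_sys_old(msg: str) -> str:
    # Pre-change behaviour: the system block is emitted unconditionally.
    return f"<<SYS>>\n{msg}\n<</SYS>>\n\n"


def wrap_inst(msg: str) -> str:
    return f"[INST] {msg} [/INST]"


def build_llama2_prompt(system, turns, sep="", sep2="</s>", sys_wrapper=wrap_sys):
    """Hypothetical helper: `turns` alternates user/assistant texts, None = pending reply."""
    ret = ""
    for i, message in enumerate(turns):
        if not message:
            continue  # nothing to render for a pending reply
        if i == 0:
            message = sys_wrapper(system) + message  # system text rides on the first user turn
        if i % 2 == 0:
            ret += sep + wrap_inst(message)  # user turn
        else:
            ret += " " + message + " " + sep2  # assistant turn
    return ret


with_guard = build_llama2_prompt("", ["hi", None])
without_guard = build_llama2_prompt("", ["hi", None], sys_wrapper=wrap_sys_old)
print(repr(with_guard))
print(repr(without_guard))
```

With the guard, a template whose system message is empty renders the first turn as a bare `[INST] ... [/INST]` block; without it, the same turn carries an empty `<<SYS>>` section.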

@@ -357,13 +357,38 @@ def dict(self):
     version="v1_mmtag",
 )

+conv_mistral_instruct = Conversation(
+    system="",
+    roles=("USER", "ASSISTANT"),
+    version="llama_v2",
+    messages=(),
+    offset=0,
+    sep_style=SeparatorStyle.LLAMA_2,
+    sep="",
+    sep2="</s>",
+)
+
+conv_chatml_direct = Conversation(
+    system="""<|im_start|>system
+Answer the questions.""",
+    roles=("<|im_start|>user\n", "<|im_start|>assistant\n"),
+    version="mpt",
+    messages=(),
+    offset=0,
+    sep_style=SeparatorStyle.MPT,
+    sep="<|im_end|>",
+)
+
 default_conversation = conv_vicuna_v1
 conv_templates = {
     "default": conv_vicuna_v0,
     "v0": conv_vicuna_v0,
     "v1": conv_vicuna_v1,
     "vicuna_v1": conv_vicuna_v1,
     "llama_2": conv_llama_2,
+    "mistral_instruct": conv_mistral_instruct,
+    "chatml_direct": conv_chatml_direct,
+    "mistral_direct": conv_chatml_direct,

     "plain": conv_llava_plain,
     "v0_plain": conv_llava_plain,

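Review note: a rough sketch of the prompt that the new `chatml_direct` template would produce. The rendering rule used here (system text plus separator, then role plus message plus separator per turn, and a bare role for the pending reply) is an assumption about how `SeparatorStyle.MPT` is handled; this diff does not show that branch. `render_mpt_style` is a hypothetical helper, and only the system text, roles, and separator are taken from the hunk above.

```python
def render_mpt_style(system, messages, sep):
    """Hypothetical renderer; assumes MPT style is plain concatenation with `sep`."""
    ret = system + sep
    for role, message in messages:
        if message:
            ret += role + message + sep
        else:
            ret += role  # pending assistant reply: emit the role header only
    return ret


# Values copied from conv_chatml_direct in this commit.
CHATML_SYSTEM = "<|im_start|>system\nAnswer the questions."
CHATML_ROLES = ("<|im_start|>user\n", "<|im_start|>assistant\n")
CHATML_SEP = "<|im_end|>"

prompt = render_mpt_style(
    CHATML_SYSTEM,
    [(CHATML_ROLES[0], "What is in the image?"), (CHATML_ROLES[1], None)],
    CHATML_SEP,
)
print(prompt)
```

Note that `"chatml_direct"` and `"mistral_direct"` are registered against the same `conv_chatml_direct` object, so both keys yield this layout.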
llava/eval/model_vqa.py

Lines changed: 5 additions & 16 deletions
@@ -9,7 +9,7 @@
 from llava.conversation import conv_templates, SeparatorStyle
 from llava.model.builder import load_pretrained_model
 from llava.utils import disable_torch_init
-from llava.mm_utils import tokenizer_image_token, get_model_name_from_path, KeywordsStoppingCriteria
+from llava.mm_utils import tokenizer_image_token, process_images, get_model_name_from_path

 from PIL import Image
 import math
@@ -55,17 +55,14 @@ def eval_model(args):

         input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors='pt').unsqueeze(0).cuda()

-        image = Image.open(os.path.join(args.image_folder, image_file))
-        image_tensor = image_processor.preprocess(image, return_tensors='pt')['pixel_values'][0]
-
-        stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
-        keywords = [stop_str]
-        stopping_criteria = KeywordsStoppingCriteria(keywords, tokenizer, input_ids)
+        image = Image.open(os.path.join(args.image_folder, image_file)).convert('RGB')
+        image_tensor = process_images([image], image_processor, model.config)[0]

         with torch.inference_mode():
             output_ids = model.generate(
                 input_ids,
                 images=image_tensor.unsqueeze(0).half().cuda(),
+                image_sizes=[image.size],
                 do_sample=True if args.temperature > 0 else False,
                 temperature=args.temperature,
                 top_p=args.top_p,
@@ -74,15 +71,7 @@ def eval_model(args):
                 max_new_tokens=1024,
                 use_cache=True)

-        input_token_len = input_ids.shape[1]
-        n_diff_input_output = (input_ids != output_ids[:, :input_token_len]).sum().item()
-        if n_diff_input_output > 0:
-            print(f'[Warning] {n_diff_input_output} output_ids are not the same as the input_ids')
-        outputs = tokenizer.batch_decode(output_ids[:, input_token_len:], skip_special_tokens=True)[0]
-        outputs = outputs.strip()
-        if outputs.endswith(stop_str):
-            outputs = outputs[:-len(stop_str)]
-        outputs = outputs.strip()
+        outputs = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()

         ans_id = shortuuid.uuid()
         ans_file.write(json.dumps({"question_id": idx,
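Review note: the one-line decode replaces the prompt-slicing and stop-string trimming. That only yields the same answer if `model.generate` now returns just the newly generated tokens, which is an inference from the removed slicing and not something this diff states. The toy comparison below uses a made-up six-token vocabulary and a stand-in `toy_decode`; none of these names come from the repository.

```python
# Made-up vocabulary for illustration only.
VOCAB = {0: "<s>", 1: "What", 2: " color", 3: "?", 4: " Red", 5: "</s>"}
SPECIAL = {0, 5}


def toy_decode(ids, skip_special_tokens=True):
    """Stand-in for tokenizer decoding over a single sequence of ids."""
    return "".join(VOCAB[i] for i in ids if not (skip_special_tokens and i in SPECIAL))


def old_postprocess(input_ids, output_ids, stop_str):
    # Old flow: output echoes the prompt, so verify the echo, slice it off, trim the stop string.
    n = len(input_ids)
    n_diff = sum(a != b for a, b in zip(input_ids, output_ids[:n]))
    text = toy_decode(output_ids[n:]).strip()
    if text.endswith(stop_str):
        text = text[: -len(stop_str)]
    return text.strip(), n_diff


def new_postprocess(output_ids):
    # New flow: output is assumed to hold only generated tokens, so decode it directly.
    return toy_decode(output_ids).strip()


prompt_ids = [0, 1, 2, 3]
generated_ids = [4, 5]

old_text, n_diff = old_postprocess(prompt_ids, prompt_ids + generated_ids, stop_str="</s>")
new_text = new_postprocess(generated_ids)
print(old_text, new_text, n_diff)
```

If a caller still receives prompt tokens in `output_ids`, the simplified decode would include the prompt text in `outputs`, so the assumption is worth checking against the installed generation code.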

llava/eval/model_vqa_loader.py

Lines changed: 12 additions & 9 deletions
@@ -55,17 +55,24 @@ def __getitem__(self, index):

         input_ids = tokenizer_image_token(prompt, self.tokenizer, IMAGE_TOKEN_INDEX, return_tensors='pt')

-        return input_ids, image_tensor
+        return input_ids, image_tensor, image.size

     def __len__(self):
         return len(self.questions)


+def collate_fn(batch):
+    input_ids, image_tensors, image_sizes = zip(*batch)
+    input_ids = torch.stack(input_ids, dim=0)
+    image_tensors = torch.stack(image_tensors, dim=0)
+    return input_ids, image_tensors, image_sizes
+
+
 # DataLoader
 def create_data_loader(questions, image_folder, tokenizer, image_processor, model_config, batch_size=1, num_workers=4):
     assert batch_size == 1, "batch_size must be 1"
     dataset = CustomDataset(questions, image_folder, tokenizer, image_processor, model_config)
-    data_loader = DataLoader(dataset, batch_size=batch_size, num_workers=num_workers, shuffle=False)
+    data_loader = DataLoader(dataset, batch_size=batch_size, num_workers=num_workers, shuffle=False, collate_fn=collate_fn)
     return data_loader

@@ -88,7 +95,7 @@ def eval_model(args):

     data_loader = create_data_loader(questions, args.image_folder, tokenizer, image_processor, model.config)

-    for (input_ids, image_tensor), line in tqdm(zip(data_loader, questions), total=len(questions)):
+    for (input_ids, image_tensor, image_sizes), line in tqdm(zip(data_loader, questions), total=len(questions)):
         idx = line["question_id"]
         cur_prompt = line["text"]

@@ -98,19 +105,15 @@ def eval_model(args):
             output_ids = model.generate(
                 input_ids,
                 images=image_tensor.to(dtype=torch.float16, device='cuda', non_blocking=True),
+                image_sizes=image_sizes,
                 do_sample=True if args.temperature > 0 else False,
                 temperature=args.temperature,
                 top_p=args.top_p,
                 num_beams=args.num_beams,
                 max_new_tokens=args.max_new_tokens,
                 use_cache=True)

-        input_token_len = input_ids.shape[1]
-        n_diff_input_output = (input_ids != output_ids[:, :input_token_len]).sum().item()
-        if n_diff_input_output > 0:
-            print(f'[Warning] {n_diff_input_output} output_ids are not the same as the input_ids')
-        outputs = tokenizer.batch_decode(output_ids[:, input_token_len:], skip_special_tokens=True)[0]
-        outputs = outputs.strip()
+        outputs = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()

         ans_id = shortuuid.uuid()
         ans_file.write(json.dumps({"question_id": idx,
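Review note: the new `collate_fn` transposes the batch with `zip(*batch)`, stacks the two tensor fields, and leaves `image_sizes` as a plain tuple of `(width, height)` pairs. Presumably that is the point of overriding the default collation, which would convert those size tuples into tensors. The sketch below mirrors the shape handling with nested lists in place of tensors; `stack` is a hypothetical stand-in for `torch.stack(..., dim=0)` and the sample values are invented.

```python
def stack(rows):
    """Stand-in for torch.stack(rows, dim=0): gathers items under a new leading batch axis."""
    return list(rows)


def collate_fn(batch):
    # Transpose a list of (input_ids, image_tensor, image_size) samples into three fields.
    input_ids, image_tensors, image_sizes = zip(*batch)
    # Only the tensor-like fields are stacked; sizes stay as a tuple of (width, height) pairs.
    return stack(input_ids), stack(image_tensors), image_sizes


# One invented sample, matching the batch_size == 1 constraint in create_data_loader.
sample = ([1, 2, 3], [[0.0, 0.5]], (640, 480))
input_ids, image_tensors, image_sizes = collate_fn([sample])
print(input_ids, image_tensors, image_sizes)
```

The resulting `image_sizes` can be handed to `model.generate(image_sizes=...)` unchanged, in the same per-image form that the non-loader scripts build with `[image.size]`.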

llava/eval/model_vqa_mmbench.py

Lines changed: 2 additions & 12 deletions
@@ -106,14 +106,12 @@ def eval_model(args):
             input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors='pt').unsqueeze(0).cuda()

             image_tensor = process_images([image], image_processor, model.config)[0]
-            # image_tensor = image_processor.preprocess(image, return_tensors='pt')['pixel_values'][0]
-
-            stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2

             with torch.inference_mode():
                 output_ids = model.generate(
                     input_ids,
                     images=image_tensor.unsqueeze(0).half().cuda(),
+                    image_sizes=[image.size],
                     do_sample=True if args.temperature > 0 else False,
                     temperature=args.temperature,
                     top_p=args.top_p,
@@ -122,15 +120,7 @@ def eval_model(args):
                     max_new_tokens=1024,
                     use_cache=True)

-            input_token_len = input_ids.shape[1]
-            n_diff_input_output = (input_ids != output_ids[:, :input_token_len]).sum().item()
-            if n_diff_input_output > 0:
-                print(f'[Warning] {n_diff_input_output} output_ids are not the same as the input_ids')
-            outputs = tokenizer.batch_decode(output_ids[:, input_token_len:], skip_special_tokens=True)[0]
-            outputs = outputs.strip()
-            if outputs.endswith(stop_str):
-                outputs = outputs[:-len(stop_str)]
-            outputs = outputs.strip()
+            outputs = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()

             ans_id = shortuuid.uuid()
             ans_file.write(json.dumps({"question_id": idx,

llava/eval/model_vqa_qbench.py

Lines changed: 0 additions & 122 deletions
This file was deleted.
