Support Kosmos-2.5 #31711

Open
wants to merge 441 commits into base: main

Conversation


@tic-top tic-top commented Jun 29, 2024

What does this PR do?

Implementation of Kosmos-2.5 in transformers (#30877).
https://huggingface.co/kirp/kosmos2_5/blob/main/README.md

Usage

from PIL import Image
import requests
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq, AutoConfig
import re

repo = "kirp/kosmos2_5"
device = "cuda:0"
config = AutoConfig.from_pretrained(repo)

NAME = {
    "f" : "flash_attention_2",
    "s" : "sdpa",
    "e" : "eager",
}

# all sdpa fp16
dtype = torch.float16
config._attn_implementation = NAME["s"]
config.vision_config._attn_implementation = NAME["s"]
config.text_config._attn_implementation = NAME["s"]

# # all eager bf16
# dtype = torch.bfloat16
# config._attn_implementation = NAME["e"]
# config.text_config._attn_implementation = NAME["e"]
# config.vision_config._attn_implementation = NAME["e"]


model = AutoModelForVision2Seq.from_pretrained(repo, device_map=device, torch_dtype=dtype, config=config)
processor = AutoProcessor.from_pretrained(repo)

url = "https://huggingface.co/kirp/kosmos2_5/resolve/main/receipt_00008.png"
image = Image.open(requests.get(url, stream=True).raw)
prompt = "<ocr>" # <md>

inputs = processor(text=prompt, images=image, return_tensors="pt")
height, width = inputs.pop("height"), inputs.pop("width")
raw_width, raw_height = image.size
# the processor resizes the image, so compute factors to map predicted boxes back to the original resolution
scale_height = raw_height / height
scale_width = raw_width / width

inputs = {k: v.to(device) if v is not None else None for k, v in inputs.items()}
inputs["flattened_patches"] = inputs["flattened_patches"].to(dtype)

generated_ids = model.generate(
    **inputs,
    max_new_tokens=1024,
)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)

# Convert the generated <bbox> tokens back to coordinates in the original image (OCR prompt only).
def postprocess(y, scale_height, scale_width):
    y = y.replace(prompt, "")
    if "<md>" in prompt:
        return y
    pattern = r"<bbox><x_\d+><y_\d+><x_\d+><y_\d+></bbox>"
    bboxs_raw = re.findall(pattern, y)
    lines = re.split(pattern, y)[1:]
    bboxs = [re.findall(r"\d+", i) for i in bboxs_raw]
    bboxs = [[int(j) for j in i] for i in bboxs]
    info = ""
    for i in range(len(lines)):
        box = bboxs[i]
        x0, y0, x1, y1 = box
        if not (x0 >= x1 or y0 >= y1):
            x0 = int(x0 * scale_width)
            y0 = int(y0 * scale_height)
            x1 = int(x1 * scale_width)
            y1 = int(y1 * scale_height)
            info += f"{x0},{y0},{x1},{y0},{x1},{y1},{x0},{y1},{lines[i]}"
    return info

output_text = postprocess(generated_text[0], scale_height, scale_width)
print(output_text)

@amyeroberts
Collaborator

cc @ydshieh

@ydshieh ydshieh self-assigned this Jul 1, 2024
@ydshieh
Collaborator

ydshieh commented Jul 9, 2024

Thanks a lot for this hard work @tic-top. This is going to benefit the community 🤗! I will check this tomorrow!

@ydshieh
Collaborator

ydshieh commented Jul 10, 2024

I will push an empty commit to trigger a CI running on GPU 🙏

Collaborator

@ydshieh ydshieh left a comment

Hi. Apologies for being late.

I left one quick question, and I will focus on reviewing this PR tomorrow.

@ydshieh
Collaborator

ydshieh commented Jul 16, 2024

No worries about the failing CI above. It's fine; I checked the tests running on an A10 and they pass ✅

Collaborator

(for me): I will revert this

<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/kosmos_2_overview.jpg"
alt="drawing" width="600"/>

<small> Overview of tasks that KOSMOS-2 can handle. Taken from the <a href="https://arxiv.org/abs/2306.14824">original paper</a>. </small>
Collaborator

@ydshieh ydshieh Jul 17, 2024

If we don't keep the image above, we should remove this too. Otherwise, change KOSMOS-2 -> KOSMOS-2.5.

Comment on lines 35 to 81
from PIL import Image
import requests
import torch
from transformers import AutoProcessor, Kosmos2_5ForConditionalGeneration
import re
repo = "microsoft/kosmos-2.5"
device = "cuda:0"
dtype = torch.bfloat16
model = Kosmos2_5ForConditionalGeneration.from_pretrained(repo, device_map=device, torch_dtype=dtype)
processor = AutoProcessor.from_pretrained(repo)
url = "https://huggingface.co/kirp/kosmos2_5/resolve/main/receipt_00008.png"
image = Image.open(requests.get(url, stream=True).raw)
prompt = "<ocr>" # <md>
inputs = processor(text=prompt, images=image, return_tensors="pt")
height, width = inputs.pop("height"), inputs.pop("width")
raw_width, raw_height = image.size
scale_height = raw_height / height
scale_width = raw_width / width
inputs = {k: v.to(device) if v is not None else None for k, v in inputs.items()}
inputs["flattened_patches"] = inputs["flattened_patches"].to(dtype)
generated_ids = model.generate(
    **inputs,
    max_new_tokens=1024,
)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
def postprocess(y, scale_height, scale_width):
    y = y.replace(prompt, "")
    if "<md>" in prompt:
        return y
    pattern = r"<bbox><x_\d+><y_\d+><x_\d+><y_\d+></bbox>"
    bboxs_raw = re.findall(pattern, y)
    lines = re.split(pattern, y)[1:]
    bboxs = [re.findall(r"\d+", i) for i in bboxs_raw]
    bboxs = [[int(j) for j in i] for i in bboxs]
    info = ""
    for i in range(len(lines)):
        box = bboxs[i]
        x0, y0, x1, y1 = box
        if not (x0 >= x1 or y0 >= y1):
            x0 = int(x0 * scale_width)
            y0 = int(y0 * scale_height)
            x1 = int(x1 * scale_width)
            y1 = int(y1 * scale_height)
            info += f"{x0},{y0},{x1},{y0},{x1},{y1},{x0},{y1},{lines[i]}"
    return info
output_text = postprocess(generated_text[0], scale_height, scale_width)
print(output_text)
Collaborator

nit: (not necessary)

Might be nice / interesting to refer to

https://github.com/microsoft/unilm/blob/master/kosmos-2.5/draw_bbox.py

and attach a screenshot of the output images.

Author

@tic-top tic-top Jul 22, 2024

This one is easier to use.
The Python file above needs to convert the string to JSON first, then draw.

Collaborator

Yes. But if I understand correctly, this only gives the string, while people are more interested in seeing the final images with bounding boxes or the structured MD layout.

I am not saying to use draw_bbox.py in this documentation. Just mention that there is such a file to draw things and give the link as a reference.

If you have any reason not to mention it, I am OK not having it here.

Author

Do you mean something like How to use?

Comment on lines +287 to +292
seqlen = self.model_tester.text_model_tester.seq_length
inputs_dict["input_ids"] = torch.arange(seqlen, device=torch_device).unsqueeze(0).expand(bs, seqlen)
inputs_dict["input_ids"] = inputs_dict["input_ids"] % self.model_tester.text_model_tester.vocab_size
inputs_dict["attention_mask"] = torch.ones((bs, seqlen), device=torch_device)
inputs_dict["image_embeds_position_mask"] = torch.zeros((bs, seqlen), device=torch_device)
inputs_dict["image_embeds_position_mask"][:, : self.model_tester.latent_query_num] = 1
Collaborator

Is this necessary only to adjust the batch size? I think the default value will give the same batch size.

(For Kosmos-2, I don't need this extra block to adjust anything)

Author

When I test it, it returns an unexpected result.

Comment on lines 561 to 611
class Kosmos2_5ModelIntegrationTest(unittest.TestCase):
    def run_example(self, prompt, image, model, processor):
        print("Prompt:", prompt)
        inputs = processor(text=prompt, images=image, return_tensors="pt")
        _, _ = inputs.pop("height"), inputs.pop("width")
        inputs = {k: v.to(torch_device) if v is not None else None for k, v in inputs.items()}
        inputs["flattened_patches"] = inputs["flattened_patches"].to(model.dtype)

        generation_outputs = model.generate(
            **inputs,
            max_new_tokens=1024,
        )
        generated_ids = generation_outputs
        generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)

        return generated_ids, generated_text

    def test_receipt_image_ocr(self):
        url = "https://huggingface.co/microsoft/kosmos-2.5/resolve/main/receipt_00008.png"
        url = "https://huggingface.co/kirp/kosmos2_5/resolve/main/receipt_00008.png"
        image = Image.open(requests.get(url, stream=True).raw)

        dtype = torch.bfloat16
        repo = "microsoft/kosmos-2.5"
        model = Kosmos2_5ForConditionalGeneration.from_pretrained(repo, device_map=torch_device, torch_dtype=dtype)
        processor = AutoProcessor.from_pretrained(repo)
        prompt = "<ocr>"
        generated_ids, generated_text = self.run_example(prompt, image, model, processor)

        EXPECTED_TEXT = [
            "<ocr><bbox><x_53><y_573><x_69><y_606></bbox>1\n<bbox><x_79><y_573><x_464><y_611></bbox>[REG] BLACK SAKURA\n<bbox><x_690><y_569><x_810><y_606></bbox>45,455\n<bbox><x_53><y_614><x_69><y_648></bbox>1\n<bbox><x_79><y_614><x_468><y_650></bbox>COOKIE DOH SAUCES\n<bbox><x_788><y_609><x_812><y_644></bbox>0\n<bbox><x_50><y_658><x_69><y_693></bbox>1\n<bbox><x_79><y_658><x_358><y_693></bbox>NATA DE COCO\n<bbox><x_790><y_652><x_814><y_687></bbox>0\n<bbox><x_31><y_742><x_820><y_781></bbox>Sub Total 45,455\n<bbox><x_27><y_781><x_822><y_827></bbox>PB1 (10%) 4,545\n<bbox><x_27><y_826><x_824><y_872></bbox>Rounding 0\n<bbox><x_24><y_872><x_827><y_921></bbox>Total 50,000\n<bbox><x_17><y_1056><x_836><y_1108></bbox>Card Payment 50,000\n"
        ]

        self.assertListEqual(generated_text, EXPECTED_TEXT)

    def test_receipt_image_md(self):
        url = "https://huggingface.co/microsoft/kosmos-2.5/resolve/main/receipt_00008.png"
        url = "https://huggingface.co/kirp/kosmos2_5/resolve/main/receipt_00008.png"
        image = Image.open(requests.get(url, stream=True).raw)

        dtype = torch.bfloat16
        repo = "microsoft/kosmos-2.5"
        model = Kosmos2_5ForConditionalGeneration.from_pretrained(repo, device_map=torch_device, torch_dtype=dtype)
        processor = AutoProcessor.from_pretrained(repo)
        prompt = "<md>"
        generated_ids, generated_text = self.run_example(prompt, image, model, processor)
        print(generated_text)
        EXPECTED_TEXT = [
            "<md>- **1 \\[REG\\] BLACK SAKURA** 45,455\n- **1 COOKIE DOH SAUCES** 0\n- **1 NATA DE COCO** 0\n- **Sub Total** 45,455\n- **PB1 (10%)** 4,545\n- **Rounding** 0\n- **Total** **50,000**\n\nCard Payment 50,000"
        ]
        self.assertListEqual(generated_text, EXPECTED_TEXT)
Collaborator

Since we skip some SDPA/Flash attention tests above 🙏, could you add them to the integration tests? The tests are likely identical, just using pure/SDPA/Flash attention.

Author

added

Collaborator

Thank you. Since our CI is still using a T4 runner and none of these 3 tests pass (GPU OOM), I am thinking of reducing max_new_tokens=1024 to something smaller.

Do you have any comment about this?

Collaborator

If you think that makes sense too, I can run on a T4 and update the expected output values on my side.

Collaborator

Well, it turns out that the GPU OOM already happens at vision_model_output = self.vision_model(...), so reducing max_new_tokens won't help here. Forget about my comment above.

Comment on lines 31 to 38
@require_vision
class LlavaProcessorTest(unittest.TestCase):
    def test_can_load_various_tokenizers(self):
        # for checkpoint in ["microsoft/kosmos-2.5", "microsoft/kosmos-2.5"]:
        for checkpoint in ["kirp/kosmos2_5"]:
            processor = AutoProcessor.from_pretrained(checkpoint)
            tokenizer = AutoTokenizer.from_pretrained(checkpoint)
            self.assertEqual(processor.tokenizer.__class__, tokenizer.__class__)
Collaborator

This should be similar to Kosmos2ProcessorTest.

The most important one is test_full_processor.

I did very extensive tests there, but you can keep it simple.
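
For reference, a minimal sketch of what such a test could look like (the checkpoint, prompt and image URL come from this PR; the class name and the checked keys are assumptions based on the usage snippet in the PR description):

import unittest

import requests
from PIL import Image

from transformers import AutoProcessor
from transformers.testing_utils import require_vision


@require_vision
class Kosmos2_5ProcessorTest(unittest.TestCase):
    def test_full_processor(self):
        processor = AutoProcessor.from_pretrained("kirp/kosmos2_5")
        url = "https://huggingface.co/kirp/kosmos2_5/resolve/main/receipt_00008.png"
        image = Image.open(requests.get(url, stream=True).raw)
        inputs = processor(text="<ocr>", images=image, return_tensors="pt")
        # keys assumed from the usage snippet in the PR description
        for key in ("input_ids", "flattened_patches", "height", "width"):
            self.assertIn(key, inputs)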

Author

added

Collaborator

@ydshieh ydshieh left a comment

Hi @tic-top, it looks very good! A few nit comments + a few things for the tests.

I will have to continue and finalize the review tomorrow, but I am submitting what I have so far.

One important point for me is the comment I left about the model integration tests.

@ydshieh
Collaborator

ydshieh commented Jul 17, 2024

Ah, I forgot. We should also have a test file (class Kosmos2_5ImageProcessorTest) for class Kosmos2_5ImageProcessor
(similar to Pix2StructImageProcessingTest for Pix2StructImageProcessor).

It should be easy: just copy Pix2StructImageProcessingTest and make only a few changes.

(But for this, you can wait - I still have a few things to review tomorrow)
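
A rough sketch, under the assumption that Kosmos2_5ImageProcessor mirrors Pix2StructImageProcessor, is exposed as in the import-structure change in this PR, and returns the flattened_patches consumed in the usage snippet (constructor defaults are assumed):

import unittest

import numpy as np
from PIL import Image

from transformers import Kosmos2_5ImageProcessor


class Kosmos2_5ImageProcessorTest(unittest.TestCase):
    def test_call_returns_flattened_patches(self):
        image_processor = Kosmos2_5ImageProcessor()
        image = Image.fromarray(np.random.randint(0, 256, (512, 512, 3), dtype=np.uint8))
        outputs = image_processor(images=image, return_tensors="pt")
        self.assertIn("flattened_patches", outputs)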

@@ -1149,6 +1154,7 @@
_import_structure["models.idefics2"].extend(["Idefics2ImageProcessor"])
_import_structure["models.imagegpt"].extend(["ImageGPTFeatureExtractor", "ImageGPTImageProcessor"])
_import_structure["models.instructblipvideo"].extend(["InstructBlipVideoImageProcessor"])
_import_structure["models.kosmos2_5"].extend(["Kosmos2_5ImageProcessor", "Kosmos2_5Processor"])
Author

Where should I add it?

@@ -5821,6 +5839,7 @@
from .models.idefics2 import Idefics2ImageProcessor
from .models.imagegpt import ImageGPTFeatureExtractor, ImageGPTImageProcessor
from .models.instructblipvideo import InstructBlipVideoImageProcessor
from .models.kosmos2_5 import Kosmos2_5ImageProcessor, Kosmos2_5Processor
Author

Where?

Author

@tic-top tic-top left a comment

Where should I add the processor?

repo = "microsoft/kosmos-2.5"
model = Kosmos2_5ForConditionalGeneration.from_pretrained(
    repo, device_map=torch_device, torch_dtype=dtype
)  # , attn_implementation="eager")
Collaborator

When not specified and SDPA is available, it will actually use SDPA. Hence we have to specify "eager" explicitly in order to test "eager".
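
For example (a minimal sketch, reusing the names from the quoted test above):

model = Kosmos2_5ForConditionalGeneration.from_pretrained(
    repo, device_map=torch_device, torch_dtype=dtype, attn_implementation="eager"
)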

Collaborator

I pushed a commit for this.

        ]
        self.assertListEqual(generated_text, EXPECTED_TEXT)

    def test_FA2(self):
Collaborator

Need

    @require_flash_attn
    @require_torch_gpu
    @pytest.mark.flash_attn_test
    @slow

Collaborator

(before def test_FA2(self):)
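
i.e. something like the following sketch (the decorators are the standard ones from transformers.testing_utils plus the pytest marker; the test body is elided):

import unittest

import pytest

from transformers.testing_utils import require_flash_attn, require_torch_gpu, slow


class Kosmos2_5ModelIntegrationTest(unittest.TestCase):
    @require_flash_attn
    @require_torch_gpu
    @pytest.mark.flash_attn_test
    @slow
    def test_FA2(self):
        ...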

Collaborator

I pushed the change for this.

Collaborator

@ydshieh ydshieh left a comment

Here are some more comments. I have to stop in the middle as there is one thing I want to mention in particular.

)

batch_size, seq_len = input.input_ids.shape
additional_tokens = [0, 100283] + [0] * 2048 + [100284]
Collaborator

Better to have names for such special tokens, i.e. what 100283 and 100284 mean.
And the values should not be hardcoded. Something like

boi_token_id = tokenizer(boi_token)
eoi_token_id = tokenizer(eoi_token)
additional_tokens = [0, boi_token_id] + [0] * 2048 + [eoi_token_id]

would be fine (adjust the names of course).
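
A hedged sketch of that suggestion (the image-boundary token strings below are placeholders, not necessarily the ones the Kosmos-2.5 tokenizer defines; convert_tokens_to_ids is the usual way to resolve a single token to its id):

# Placeholders: use whatever image-boundary tokens the Kosmos-2.5 tokenizer actually defines.
boi_token_id = tokenizer.convert_tokens_to_ids("<image>")
eoi_token_id = tokenizer.convert_tokens_to_ids("</image>")
num_image_tokens = 2048  # length of the image placeholder span (hardcoded in the quoted test)

additional_tokens = [0, boi_token_id] + [0] * num_image_tokens + [eoi_token_id]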

Collaborator

Also is it possible to replace 2048 with something not hardcoded?

Author

2048 is fixed

Collaborator

@ydshieh ydshieh left a comment

Thank you for this work @tic-top ! I am sure the community would be happy to have this available!

@ydshieh
Collaborator

ydshieh commented May 7, 2025

run-slow: kosmos2_5

Contributor

github-actions bot commented May 7, 2025

This comment contains run-slow, running the specified jobs:

models: ['models/kosmos2_5']
quantizations: [] ...

@ydshieh
Collaborator

ydshieh commented May 7, 2025

run-slow: kosmos2_5

Contributor

github-actions bot commented May 7, 2025

This comment contains run-slow, running the specified jobs:

models: ['models/kosmos2_5']
quantizations: [] ...

Collaborator

@ArthurZucker ArthurZucker left a comment

Thanks! We need to use an XXCrossAttention layer and a base model (at the cost of not exposing everything); otherwise good to go.


# use encoder_hidden_states if cross attention
is_cross_attention = encoder_hidden_states is not None
current_states = encoder_hidden_states if is_cross_attention else hidden_states
Collaborator

The weird part about this is that you end up always computing the cross k and v, never using the cache, which is not standard!
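
To illustrate the standard pattern being referred to (generic names, not this PR's actual API): the cross-attention key/value states depend only on the encoder states, so they can be computed once on the first decoding step and then reused from the cache.

def cross_attention_kv(k_proj, v_proj, encoder_hidden_states, past_key_value=None):
    # Standard pattern: compute the cross-attention k/v once, then reuse the cached pair.
    if past_key_value is not None:
        return past_key_value
    key_states = k_proj(encoder_hidden_states)
    value_states = v_proj(encoder_hidden_states)
    return (key_states, value_states)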

Comment on lines +1259 to +1261
past_key_value=None,
attention_mask=None,
output_attentions=None,
Collaborator

Given that the past is not used, I would be in favor of having a Kosmos2_5CrossAttention class that shows this.


**OCR Task:** For usage instructions, please refer to [ocr.py](https://huggingface.co/microsoft/kosmos-2.5/blob/main/ocr.py).


Collaborator

It is missing some snippets about how to use, for example, the extra bboxes and a post-processor to plot the boxes on the image.
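
A hedged sketch of such a snippet (image and output_text are the variables from the usage example in the PR description; each postprocessed line has the form x0,y0,x1,y0,x1,y1,x0,y1,text):

from PIL import ImageDraw

draw = ImageDraw.Draw(image)
for line in output_text.splitlines():
    parts = line.split(",")
    if len(parts) < 9:
        continue
    coords = list(map(int, parts[:8]))
    # the 8 numbers are the 4 corners of the box: (x0, y0), (x1, y0), (x1, y1), (x0, y1)
    draw.polygon(list(zip(coords[0::2], coords[1::2])), outline="red")
image.save("receipt_with_boxes.png")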

@ydshieh
Collaborator

ydshieh commented May 8, 2025

Thank you @ArthurZucker for the review.

> We need to use a XXCrossAttention layer, a Basemodel

At this stage, I tried not to change the modeling code in any way that would require the model weights to be changed and re-created/re-uploaded.

Currently, the cross attention in Kosmos-2.5 just reuses the TextAttention with is_causal=False.

> the weird part about this is that you end up always computing the cross k and v, never using the cache, which is not standard!

During generate, the image_to_text_projection (and vision_model) is computed only once, under if image_embeds is None, so it is OK.

I agree that using TextAttention here is a bit strange. I will try to use XXCrossAttention (assuming no model weight changes) 🙏

@ydshieh
Collaborator

ydshieh commented May 8, 2025

Regarding

#31711 (comment)

@zucchini-nlp is this something we discussed before? Could one of you share some more details on what I could do? There is already a base model Kosmos2_5Model, but Kosmos2_5ForConditionalGeneration is not in the standard form that @zucchini-nlp mentioned to me before, and changing that would require model weight changes 😢

@zucchini-nlp
Member

Yes, I would love to have it in standard form if possible. For vLLM it is not a blocker, because we are anyway messing with state dict keys by replacing prefixes, and Kosmos-2.5 will be one of the replaced models.

Internally we talked with Yih-Dar; converting the model weights again might be painful, and I am not sure the authors would accept it (related to the internal thread on new model addition with weight conversion). Therefore, I won't push much and am happy with it as is, given that the PR has been open since last year 💀
