Support Kosmos-2.5 #31711

Open
wants to merge 441 commits into
base: main
Changes from 250 commits
eb116ab
[kirp] use string
Sep 2, 2024
e1ab413
[kirp] remove creating mask in the layer
Sep 2, 2024
fe418d0
[kirp] remove cache
Sep 2, 2024
cc7d28f
Revert "[kirp] remove creating mask in the layer"
Sep 2, 2024
e5ffaee
[kirp] fix typo in processor
Sep 3, 2024
b5ebf09
[kirp] remove head mask
Sep 3, 2024
dd12798
[kirp] remove test file
Sep 3, 2024
15feaea
[kirp] cache for eager
Sep 30, 2024
ab687f5
[kirp] sdpa cache
Sep 30, 2024
87ab935
[kirp] move attention_mask maker to vision encoder
Sep 30, 2024
54b1984
[kirp] cache sdpa and format
Sep 30, 2024
5e5a9e9
[kirp] fix format
Sep 30, 2024
0ed8541
[kirp] fix format
Sep 30, 2024
df9d3ad
[kirp] use update_causal_mask
Sep 30, 2024
55cb12d
[kirp] check copies
Sep 30, 2024
d99934d
[kirp] regroup the init
Sep 30, 2024
c705049
[kirp] make style
Sep 30, 2024
806ca1b
[run-slow] kosmos2_5
Sep 30, 2024
9e620b6
[run-slow] fix checkpoint bug
Sep 30, 2024
65490b4
[run-slow] fix checkpoint bug
Sep 30, 2024
d0bf57e
Merge remote-tracking branch 'upstream/main' into main
Oct 2, 2024
f5d4439
[run-slow] kosmos2_5
Oct 2, 2024
40ff015
[run-slow] kosmos2_5
Oct 2, 2024
63603d6
[kirp] remove cross_attn in textblock
tic-top Oct 10, 2024
f8497ce
[run-slow] kosmos2_5
tic-top Oct 10, 2024
eab8e69
[run-slow] kosmos2_5
tic-top Oct 10, 2024
a6154db
[run-slow] kosmos2_5
tic-top Oct 11, 2024
94cc6d2
[ydshieh] update loop
ydshieh Oct 22, 2024
968b033
[ydshieh] remove duplication in init file
ydshieh Oct 25, 2024
142604d
[ydshieh] tokenizer class
ydshieh Oct 25, 2024
4b7bc95
[ydshieh] remove copied from
ydshieh Oct 25, 2024
6f2bd73
[ydshieh] skip
ydshieh Oct 25, 2024
08e1cb0
[ydshieh] move
ydshieh Oct 29, 2024
f2dae0d
Merge branch 'main' into kosmos25
ydshieh Oct 29, 2024
fcc095f
[ydshieh] fix copie
ydshieh Oct 29, 2024
f66c6ee
[ydshieh] remove
ydshieh Oct 29, 2024
9a8479d
[ydshieh] Add to MODEL_FOR_IMAGE_TEXT_TO_TEXT_MAPPING_NAMES
ydshieh Oct 29, 2024
830671b
[ydshieh] new init
ydshieh Oct 29, 2024
1c58c8f
[ydshieh] fix
ydshieh Oct 29, 2024
0153a08
[ydshieh] remove
ydshieh Oct 31, 2024
ac94b57
[ydshieh] add ProcessorTesterMixin
ydshieh Oct 31, 2024
52788cc
[ydshieh] add GenerationTesterMixin
ydshieh Oct 31, 2024
0b9e5ad
Merge branch 'main' into kosmos25
ydshieh Dec 6, 2024
925e14a
Merge branch 'main' into main
ydshieh Dec 6, 2024
6ed504d
fix
ydshieh Dec 6, 2024
9a841ad
fix
ydshieh Dec 6, 2024
dcced48
fix
ydshieh Dec 13, 2024
91fa383
fix
ydshieh Dec 13, 2024
e3802f4
fix
ydshieh Dec 13, 2024
85da449
fix
ydshieh Dec 13, 2024
b1db4f2
fix
ydshieh Dec 13, 2024
f8c98d6
it's Friday night, let cross finger
ydshieh Dec 13, 2024
fbb3e59
it's Friday night, let cross finger
ydshieh Dec 13, 2024
ce3a6b0
it's Friday night, let cross finger
ydshieh Dec 13, 2024
90c4fcc
it's Friday night, let cross finger
ydshieh Dec 13, 2024
00e324d
it's Friday night, let cross finger
ydshieh Dec 13, 2024
9c8aff7
it's Friday night, let cross finger
ydshieh Dec 13, 2024
2c47915
it's Friday night, let cross finger
ydshieh Dec 13, 2024
395a636
it's Monday let's go
ydshieh Dec 16, 2024
8a058d9
it's Monday let's go
ydshieh Dec 16, 2024
c639eeb
it's Monday let's go
ydshieh Dec 16, 2024
b688c4f
Merge branch 'ca03842c' into kosmos25
ydshieh Dec 16, 2024
d1c52f4
temp
ydshieh Dec 17, 2024
3a58742
temp
ydshieh Dec 17, 2024
d5b8349
temp
ydshieh Dec 17, 2024
9ddc86b
temp
ydshieh Dec 17, 2024
39dc6ef
temp
ydshieh Dec 17, 2024
b2c3db2
temp
ydshieh Dec 17, 2024
c356a36
temp
ydshieh Dec 17, 2024
55944fc
temp
ydshieh Dec 17, 2024
83d600e
temp
ydshieh Dec 17, 2024
2d4cbba
temp
ydshieh Dec 17, 2024
6b2f7d7
temp
ydshieh Dec 17, 2024
5f731a9
temp
ydshieh Dec 17, 2024
0ec499a
temp
ydshieh Dec 17, 2024
7f0d26c
temp
ydshieh Dec 17, 2024
db865db
temp
ydshieh Dec 17, 2024
bf14c4b
temp
ydshieh Dec 17, 2024
9b29aac
temp
ydshieh Dec 17, 2024
ce222a6
temp
ydshieh Dec 17, 2024
876cb6b
temp
ydshieh Dec 17, 2024
a3638ea
temp
ydshieh Dec 17, 2024
30f927a
temp
ydshieh Dec 17, 2024
a65a9b1
temp
ydshieh Dec 17, 2024
7c99fd0
temp
ydshieh Dec 17, 2024
ec9ea0c
fix
ydshieh Dec 17, 2024
8fc9699
fix
ydshieh Dec 17, 2024
22cb70d
fix
ydshieh Dec 18, 2024
001fd70
fix
ydshieh Dec 18, 2024
d1116f5
fix
ydshieh Dec 18, 2024
6f09a51
fix
ydshieh Dec 18, 2024
7d0b827
Merge branch 'main' into main
ydshieh Dec 18, 2024
cd018b0
Merge branch 'main' into kosmos25
ydshieh Jan 10, 2025
a5b23f8
Merge branch 'temp' into kosmos25
ydshieh Jan 21, 2025
d1debcc
no more copied
ydshieh Jan 21, 2025
1279316
fix
ydshieh Jan 21, 2025
69aec2e
Apply suggestions from code review
ydshieh Jan 21, 2025
8c579a9
fix default values in docstrings
ydshieh Jan 21, 2025
af813ce
update doc
ydshieh Jan 21, 2025
ca60142
Merge branch 'main_b5aaf875' into kosmos25
ydshieh Jan 24, 2025
19da4a2
[update] Kosmos2_5TextTransformer.forward
ydshieh Jan 24, 2025
777a3e2
Update Kosmos2_5TextBlock.forward # Need to update `self.self_attn` i…
ydshieh Jan 24, 2025
1ace3d1
Don't return past_key_value # need further changes
ydshieh Jan 24, 2025
8d2e51f
fix import issues
ydshieh Jan 24, 2025
7b65626
Fix Kosmos2_5ImageToTextProjection.forward: remove `_` when calling `…
ydshieh Jan 24, 2025
89c6901
Add eager_attention_forward
ydshieh Jan 24, 2025
59700f9
temp. update KOSMOS2_5_TEXT_ATTENTION_CLASSES # Need to remove this v…
ydshieh Jan 24, 2025
4988c47
Add `self.config = config` to `Kosmos2_5TextAttention.__init__`
ydshieh Jan 24, 2025
3a411a3
fix: change self.attention_dropout to self.dropout
ydshieh Jan 24, 2025
4036920
fix: remove the redudant ` * self.scaling`
ydshieh Jan 24, 2025
59c21c9
debug: partial revert
ydshieh Jan 24, 2025
48c6965
ugly fix for numerical issue
ydshieh Jan 27, 2025
f36ef6f
back to the clean version with the scaling issue fixed
ydshieh Jan 27, 2025
d812476
fix missing comma
ydshieh Jan 27, 2025
f339e50
add comment about sdpa: currently some tests failing because we use e…
ydshieh Jan 27, 2025
f81256a
Merge branch 'main' into kosmos25
ydshieh Jan 31, 2025
beb281c
comment
ydshieh Jan 31, 2025
a6ff4d2
Use ALL_ATTENTION_FUNCTIONS in `Kosmos2_5TextAttention`
ydshieh Jan 31, 2025
fb62fd6
Remove other attn impl. and KOSMOS2_5_TEXT_ATTENTION_CLASSES
ydshieh Jan 31, 2025
9a90c54
Update Kosmos2_5ImageToTextProjection
ydshieh Jan 31, 2025
cf804ac
Remove output_attentions: bool = False
ydshieh Jan 31, 2025
8fc5a31
remove if not output_attentions:
ydshieh Jan 31, 2025
0593308
Deal with vision part
ydshieh Jan 31, 2025
6a55353
Fix scaling
ydshieh Jan 31, 2025
2050bc3
Merge branch 'main' into kosmos25
ydshieh Feb 3, 2025
9cdfdf3
clean up
ydshieh Feb 3, 2025
c49d565
remove test_torchscript
ydshieh Feb 3, 2025
a809295
remove test_torchscript = False
ydshieh Feb 3, 2025
95dc35c
✅✅✅ finally green CI
ydshieh Feb 3, 2025
bcaf808
ruff fix
ydshieh Feb 3, 2025
917dcc8
ruff format
ydshieh Feb 3, 2025
42c3216
remove copied
ydshieh Feb 3, 2025
cd35c34
ruff format
ydshieh Feb 3, 2025
185f370
Merge branch 'main' into kosmos25
ydshieh Feb 4, 2025
0c0c485
update
ydshieh Feb 5, 2025
0216ac8
Merge branch 'main' into main
ydshieh Feb 5, 2025
6b288c3
lm loss update
ydshieh Feb 7, 2025
94e563a
Merge branch 'main' into kosmos25
ydshieh Feb 11, 2025
ad6ded5
update
ydshieh Feb 11, 2025
17b78dd
add width and height
ydshieh Feb 12, 2025
b9fc031
remove pop in the test
ydshieh Feb 13, 2025
0ece5c7
remove from prepare_inputs_for_generation
ydshieh Feb 13, 2025
fba70ba
width and height for base model
ydshieh Feb 13, 2025
4281ff3
update docstring
ydshieh Feb 13, 2025
6e071c7
update docstring
ydshieh Feb 13, 2025
a5318da
update docstring
ydshieh Feb 13, 2025
ead54fd
style
ydshieh Feb 13, 2025
6c320a6
fix copie
ydshieh Feb 13, 2025
783877d
rename doc
ydshieh Feb 13, 2025
7399d8a
add fast image processor
ydshieh Feb 13, 2025
c01c60d
Merge branch 'main' into kosmos25
ydshieh Mar 24, 2025
902a030
Merge branch 'main' into main
ydshieh Mar 25, 2025
0323624
temp
ydshieh Mar 26, 2025
740386c
temp
ydshieh Mar 26, 2025
9744cb6
Merge branch 'main' into main
ydshieh Mar 26, 2025
19330d2
temp
ydshieh Mar 27, 2025
d5df504
temp
ydshieh Mar 27, 2025
5e3a2e6
Merge branch 'main' into main
ydshieh Mar 31, 2025
0798236
temp
ydshieh Mar 31, 2025
e835f82
temp
ydshieh Mar 31, 2025
b99e679
temp
ydshieh Mar 31, 2025
dd51797
temp
ydshieh Apr 2, 2025
958adb7
build images
ydshieh Apr 2, 2025
bd2083e
try
ydshieh Apr 2, 2025
9848655
Merge branch 'main' into kosmos25_2025_04_14_try_rebase
ydshieh Apr 14, 2025
d82f6af
exp
ydshieh Apr 14, 2025
77d0ea2
exp
ydshieh Apr 14, 2025
b9062f5
exp
ydshieh Apr 14, 2025
28a7c36
exp
ydshieh Apr 14, 2025
cc2b7bd
exp
ydshieh Apr 14, 2025
f2c2752
try
ydshieh Apr 14, 2025
5b72c43
batch feat
ydshieh Apr 15, 2025
42229b6
rescale_factor = None
ydshieh Apr 15, 2025
ac9c77e
overwrite tests
ydshieh Apr 15, 2025
d838cdc
overwrite tests
ydshieh Apr 15, 2025
fd1cad2
skip
ydshieh Apr 15, 2025
3e5b033
skip
ydshieh Apr 15, 2025
67a0b79
skip
ydshieh Apr 15, 2025
a7961cb
device
ydshieh Apr 15, 2025
2c44bd2
device
ydshieh Apr 15, 2025
bd4d18b
device
ydshieh Apr 15, 2025
5450db1
device
ydshieh Apr 15, 2025
e36ea15
device
ydshieh Apr 15, 2025
9831c1b
ruff format
ydshieh Apr 15, 2025
e2e9ed9
ruff check
ydshieh Apr 15, 2025
464d93e
ruff check unsafe
ydshieh Apr 15, 2025
c5981d0
trigger CI
ydshieh Apr 15, 2025
dd3f5ae
copy 1
ydshieh Apr 15, 2025
5df3ade
copy 2
ydshieh Apr 15, 2025
625d473
remove one Q
ydshieh Apr 16, 2025
98aa5a5
imports
ydshieh Apr 16, 2025
41f1f4c
Merge branch 'main' into kosmos25_2025_04_14_try_rebase
ydshieh Apr 22, 2025
44bc32e
fix init
ydshieh Apr 22, 2025
7f9ab92
can_return_tuple part 1
ydshieh Apr 22, 2025
d8e3d52
can_return_tuple part 2
ydshieh Apr 22, 2025
4e28b99
can_return_tuple part 3
ydshieh Apr 22, 2025
49b5d8d
can_return_tuple part 4
ydshieh Apr 22, 2025
e6b6969
can_return_tuple part 5
ydshieh Apr 22, 2025
439c1d7
can_return_tuple part 6
ydshieh Apr 22, 2025
61b3a2a
can_return_tuple part 7
ydshieh Apr 22, 2025
0ad75d7
update
ydshieh Apr 22, 2025
4baab28
update
ydshieh Apr 22, 2025
05a64fd
update
ydshieh Apr 22, 2025
afbe1d7
update
ydshieh Apr 22, 2025
05d65f7
skip test_prompt_lookup_decoding_matches_greedy_search
ydshieh Apr 22, 2025
3703cc4
Merge branch 'main' into kosmos25_2025_04_28
ydshieh Apr 28, 2025
00903d7
address comment 001: remove from_text_vision_configs
ydshieh Apr 28, 2025
2e6222c
address comment 002: fix indent in for loop of Kosmos2_5ImageProcesso…
ydshieh Apr 28, 2025
a54d828
address comment 003: avoid single letter variable names f, w, h, r, c
ydshieh Apr 28, 2025
27e7171
address comment 004: Correct `torch_extract_patches` docstring + Use …
ydshieh Apr 28, 2025
5e5657b
address comment 005: Remove default None values in `Kosmos2_5Processo…
ydshieh Apr 28, 2025
4ea8d5b
address comment 006: move "return_token_type_ids": False to Kosmos2_5…
ydshieh Apr 28, 2025
1d5860a
address comment 007: update in code assuming we have updated the hub …
ydshieh Apr 28, 2025
2c396ae
[**TEMP**] address comment 008: change target hub repo in tests. Need…
ydshieh Apr 28, 2025
e297649
address comment 009: Remove 2 attributes from Kosmos2_5VisionLayer be…
ydshieh Apr 28, 2025
1f5ae29
address comment 010: move `eager_attention_forward`
ydshieh Apr 28, 2025
61a8fa2
update `torch_extract_patches` docstring in `image_processing_pix2str…
ydshieh Apr 28, 2025
5c51c21
address comment 011: mark copied for `_prepare_4d_causal_attention_ma…
ydshieh Apr 28, 2025
b2bc89b
address comment 012: Add **flash_attn_kwargs and **kwargs
ydshieh Apr 28, 2025
a8e7da8
address comment 013: Add _supports_attention_backend = True
ydshieh Apr 28, 2025
f07be7e
address comment 014: Use ._from_config
ydshieh Apr 28, 2025
8eba9a7
address comment 015: remove _reorder_cache
ydshieh Apr 28, 2025
2f8daca
Remove unused position_ids
ydshieh Apr 28, 2025
1598dea
gradient_checkpointing
ydshieh Apr 29, 2025
41ace2a
fix
ydshieh Apr 29, 2025
9360be5
fix
ydshieh Apr 29, 2025
bc7c331
fix
ydshieh Apr 29, 2025
beaf91c
trigger CI
ydshieh Apr 29, 2025
30fd911
Merge branch 'main' into kosmos25_2025_05_07
ydshieh May 7, 2025
0044442
fix copied
ydshieh May 7, 2025
1f8228a
fix ruff check
ydshieh May 7, 2025
9d3f1c9
address arthur comment 00001: remove FusedRMSNorm
ydshieh May 14, 2025
531165e
address arthur comment 00002: remove "text_model_output"
ydshieh May 14, 2025
9afadc5
address arthur comment 00003: remove `# past_key_value[0] is (batch_s…
ydshieh May 14, 2025
942c6ad
address arthur comment 00004: remove # (batch_size, n_heads, seq_leng…
ydshieh May 14, 2025
ca63d57
address arthur comment 00005: add licence
ydshieh May 14, 2025
cb2e311
address arthur comment 00006: remove `add_inner_attn_layernorm` and `…
ydshieh May 14, 2025
92862d2
address arthur comment 00007: add missing copied
ydshieh May 15, 2025
fb9443b
address arthur comment 00008: vision: config.seq_len --> config.max_n…
ydshieh May 15, 2025
8935c26
address arthur comment 00009: vision: d_ff and d_kv ⚠️⚠️⚠️ need hub r…
ydshieh May 15, 2025
911daf2
address arthur comment 00010: change to input_shape and hidden_shape
ydshieh May 15, 2025
3e65e24
address arthur comment 00011: fix issues
ydshieh May 15, 2025
8afbcd0
address arthur comment 00012: fix issues
ydshieh May 15, 2025
1602f48
address arthur comment 00013: fix issues
ydshieh May 15, 2025
2bd84a3
address arthur comment 00014: fix issues
ydshieh May 15, 2025
e1cd76f
address arthur comment 00015: fix issues
ydshieh May 15, 2025
c3e3286
address arthur comment 00016: fix issues
ydshieh May 15, 2025
4b0a5ee
address arthur comment 00017: fix issues
ydshieh May 15, 2025
5a69625
address arthur comment 00018: fix issues
ydshieh May 15, 2025
cab04e4
address arthur comment 00019: (remove unused arguments)
ydshieh May 16, 2025
2 changes: 2 additions & 0 deletions docs/source/en/_toctree.yml
@@ -886,6 +886,8 @@
title: InstructBlipVideo
- local: model_doc/kosmos-2
title: KOSMOS-2
- local: model_doc/kosmos-2.5
title: KOSMOS-2.5
- local: model_doc/layoutlm
title: LayoutLM
- local: model_doc/layoutlmv2
1 change: 1 addition & 0 deletions docs/source/en/index.md
@@ -191,6 +191,7 @@ Flax), PyTorch, and/or TensorFlow.
| [JetMoe](model_doc/jetmoe) | ✅ | ❌ | ❌ |
| [Jukebox](model_doc/jukebox) | ✅ | ❌ | ❌ |
| [KOSMOS-2](model_doc/kosmos-2) | ✅ | ❌ | ❌ |
| [KOSMOS-2.5](model_doc/kosmos-2.5) | ✅ | ❌ | ❌ |
| [LayoutLM](model_doc/layoutlm) | ✅ | ✅ | ❌ |
| [LayoutLMv2](model_doc/layoutlmv2) | ✅ | ❌ | ❌ |
| [LayoutLMv3](model_doc/layoutlmv3) | ✅ | ✅ | ❌ |
63 changes: 63 additions & 0 deletions docs/source/en/model_doc/kosmos-2.5.md
@@ -0,0 +1,63 @@
<!--Copyright 2024 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->

# KOSMOS-2.5

## Overview

Kosmos-2.5 is a multimodal literate model for machine reading of text-intensive images, introduced in [Kosmos-2.5: A Multimodal Literate Model](https://arxiv.org/abs/2309.11419). A single decoder-only auto-regressive Transformer, steered by task-specific prompts, handles two transcription tasks: (1) generating spatially-aware text blocks, where each block of text is assigned its spatial coordinates within the image, and (2) producing structured text output that captures styles and structures in markdown format. With supervised fine-tuning on different prompts, the model can be adapted to other text-intensive image understanding tasks.

The abstract from the paper is the following:

*We present Kosmos-2.5, a multimodal literate model for machine reading of text-intensive images. Pre-trained on large-scale text-intensive images, Kosmos-2.5 excels in two distinct yet cooperative transcription tasks: (1) generating spatially-aware text blocks, where each block of text is assigned its spatial coordinates within the image, and (2) producing structured text output that captures styles and structures into the markdown format. This unified multimodal literate capability is achieved through a shared Transformer architecture, task-specific prompts, and flexible text representations. We evaluate Kosmos-2.5 on end-to-end document-level text recognition and image-to-markdown text generation. Furthermore, the model can be readily adapted for any text-intensive image understanding task with different prompts through supervised fine-tuning, making it a general-purpose tool for real-world applications involving text-rich images. This work also paves the way for the future scaling of multimodal large language models.*

<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/kosmos2_5_ocr.png"
alt="drawing" width="600"/>

<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/kosmos2_5_md.png"
alt="drawing" width="600"/>

<small> Overview of tasks that KOSMOS-2.5 can handle. Taken from the <a href="https://arxiv.org/abs/2309.11419">original paper</a>. </small>

## Example
**Markdown Task:** For usage instructions, please refer to [md.py](https://huggingface.co/microsoft/kosmos-2.5/blob/main/md.py).

**OCR Task:** For usage instructions, please refer to [ocr.py](https://huggingface.co/microsoft/kosmos-2.5/blob/main/ocr.py).
Review comment (Member):

Would be nice to include the code snippets here so users don't have to click on another link.




## Kosmos2_5Config

[[autodoc]] Kosmos2_5Config

## Kosmos2_5ImageProcessor

[[autodoc]] Kosmos2_5ImageProcessor

## Kosmos2_5Processor

[[autodoc]] Kosmos2_5Processor
- __call__

## Kosmos2_5Model

[[autodoc]] Kosmos2_5Model
- forward

## Kosmos2_5ForConditionalGeneration

[[autodoc]] Kosmos2_5ForConditionalGeneration
- forward
2 changes: 2 additions & 0 deletions docs/source/en/perf_infer_gpu_one.md
@@ -64,6 +64,7 @@ FlashAttention-2 is currently supported for the following architectures:
* [Falcon](https://huggingface.co/docs/transformers/model_doc/falcon#transformers.FalconModel)
* [JetMoe](https://huggingface.co/docs/transformers/model_doc/jetmoe#transformers.JetMoeModel)
* [Jamba](https://huggingface.co/docs/transformers/model_doc/jamba#transformers.JambaModel)
* [Kosmos-2.5](https://huggingface.co/docs/transformers/model_doc/kosmos-2.5#transformers.Kosmos2_5Model)
* [Llama](https://huggingface.co/docs/transformers/model_doc/llama#transformers.LlamaModel)
* [Llava](https://huggingface.co/docs/transformers/model_doc/llava)
* [Llava-NeXT](https://huggingface.co/docs/transformers/model_doc/llava_next)
@@ -263,6 +264,7 @@ For now, Transformers supports SDPA inference and training for the following arc
* [GraniteMoe](https://huggingface.co/docs/transformers/model_doc/granitemoe#transformers.GraniteMoeModel)
* [JetMoe](https://huggingface.co/docs/transformers/model_doc/jetmoe#transformers.JetMoeModel)
* [Jamba](https://huggingface.co/docs/transformers/model_doc/jamba#transformers.JambaModel)
* [Kosmos-2.5](https://huggingface.co/docs/transformers/model_doc/kosmos-2.5#transformers.Kosmos2_5Model)
* [Llama](https://huggingface.co/docs/transformers/model_doc/llama#transformers.LlamaModel)
* [Llava](https://huggingface.co/docs/transformers/model_doc/llava)
* [Llava-NeXT](https://huggingface.co/docs/transformers/model_doc/llava_next)
22 changes: 22 additions & 0 deletions src/transformers/__init__.py
@@ -527,6 +527,10 @@
"Kosmos2Config",
"Kosmos2Processor",
],
"models.kosmos2_5": [
"Kosmos2_5Config",
"Kosmos2_5Processor",
],
"models.layoutlm": [
"LayoutLMConfig",
"LayoutLMTokenizer",
@@ -1240,6 +1244,7 @@
_import_structure["models.idefics3"].extend(["Idefics3ImageProcessor"])
_import_structure["models.imagegpt"].extend(["ImageGPTFeatureExtractor", "ImageGPTImageProcessor"])
_import_structure["models.instructblipvideo"].extend(["InstructBlipVideoImageProcessor"])
_import_structure["models.kosmos2_5"].extend(["Kosmos2_5ImageProcessor"])
_import_structure["models.layoutlmv2"].extend(["LayoutLMv2FeatureExtractor", "LayoutLMv2ImageProcessor"])
_import_structure["models.layoutlmv3"].extend(["LayoutLMv3FeatureExtractor", "LayoutLMv3ImageProcessor"])
_import_structure["models.levit"].extend(["LevitFeatureExtractor", "LevitImageProcessor"])
@@ -2634,6 +2639,13 @@
"Kosmos2PreTrainedModel",
]
)
_import_structure["models.kosmos2_5"].extend(
[
"Kosmos2_5ForConditionalGeneration",
"Kosmos2_5Model",
"Kosmos2_5PreTrainedModel",
]
)
_import_structure["models.layoutlm"].extend(
[
"LayoutLMForMaskedLM",
@@ -5578,6 +5590,10 @@
Kosmos2Config,
Kosmos2Processor,
)
from .models.kosmos2_5 import (
Kosmos2_5Config,
Kosmos2_5Processor,
)
from .models.layoutlm import (
LayoutLMConfig,
LayoutLMTokenizer,
@@ -6326,6 +6342,7 @@
from .models.idefics3 import Idefics3ImageProcessor
from .models.imagegpt import ImageGPTFeatureExtractor, ImageGPTImageProcessor
from .models.instructblipvideo import InstructBlipVideoImageProcessor
from .models.kosmos2_5 import Kosmos2_5ImageProcessor
from .models.layoutlmv2 import (
LayoutLMv2FeatureExtractor,
LayoutLMv2ImageProcessor,
@@ -7487,6 +7504,11 @@
Kosmos2Model,
Kosmos2PreTrainedModel,
)
from .models.kosmos2_5 import (
Kosmos2_5ForConditionalGeneration,
Kosmos2_5Model,
Kosmos2_5PreTrainedModel,
)
from .models.layoutlm import (
LayoutLMForMaskedLM,
LayoutLMForQuestionAnswering,
1 change: 1 addition & 0 deletions src/transformers/models/__init__.py
@@ -133,6 +133,7 @@
jamba,
jetmoe,
kosmos2,
kosmos2_5,
layoutlm,
layoutlmv2,
layoutlmv3,
3 changes: 3 additions & 0 deletions src/transformers/models/auto/configuration_auto.py
@@ -154,6 +154,7 @@
("jetmoe", "JetMoeConfig"),
("jukebox", "JukeboxConfig"),
("kosmos-2", "Kosmos2Config"),
("kosmos-2.5", "Kosmos2_5Config"),
("layoutlm", "LayoutLMConfig"),
("layoutlmv2", "LayoutLMv2Config"),
("layoutlmv3", "LayoutLMv3Config"),
@@ -478,6 +479,7 @@
("jetmoe", "JetMoe"),
("jukebox", "Jukebox"),
("kosmos-2", "KOSMOS-2"),
("kosmos-2.5", "KOSMOS-2.5"),
("layoutlm", "LayoutLM"),
("layoutlmv2", "LayoutLMv2"),
("layoutlmv3", "LayoutLMv3"),
@@ -717,6 +719,7 @@
("data2vec-vision", "data2vec"),
("donut-swin", "donut"),
("kosmos-2", "kosmos2"),
("kosmos-2.5", "kosmos2_5"),
("maskformer-swin", "maskformer"),
("xclip", "x_clip"),
("clip_vision_model", "clip"),
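
Note on the hunk above: the `model_type` string `kosmos-2.5` cannot be turned into the package name `kosmos2_5` by a plain dash-to-underscore substitution, which is presumably why it gets an explicit entry next to `kosmos-2`. The sketch below illustrates that lookup under stated assumptions: the table holds only the entries visible in this hunk, and the function name and the dash-to-underscore fallback are illustrative choices, not a copy of the library code.

```python
# Illustrative table: only the entries visible in the hunk above.
SPECIAL_MODEL_TYPE_TO_MODULE_NAME = {
    "data2vec-vision": "data2vec",
    "donut-swin": "donut",
    "kosmos-2": "kosmos2",
    "kosmos-2.5": "kosmos2_5",
    "maskformer-swin": "maskformer",
    "xclip": "x_clip",
    "clip_vision_model": "clip",
}


def model_type_to_module_name(model_type: str) -> str:
    """Map a config `model_type` to the name of the package that implements it."""
    if model_type in SPECIAL_MODEL_TYPE_TO_MODULE_NAME:
        return SPECIAL_MODEL_TYPE_TO_MODULE_NAME[model_type]
    # Assumed fallback for this sketch: dashes become underscores.
    return model_type.replace("-", "_")


# Without the special case, the fallback alone would yield a name containing
# a ".", which is not a valid Python package name.
naive_name = "kosmos-2.5".replace("-", "_")
```

Here `naive_name` still contains a dot, so it could never be imported as a single package, while the table entry resolves to `kosmos2_5`, matching the directory added in this PR.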
1 change: 1 addition & 0 deletions src/transformers/models/auto/image_processing_auto.py
@@ -98,6 +98,7 @@
("instructblip", ("BlipImageProcessor",)),
("instructblipvideo", ("InstructBlipVideoImageProcessor",)),
("kosmos-2", ("CLIPImageProcessor",)),
("kosmos-2.5", ("Kosmos2_5ImageProcessor",)),
("layoutlmv2", ("LayoutLMv2ImageProcessor",)),
("layoutlmv3", ("LayoutLMv3ImageProcessor",)),
("levit", ("LevitImageProcessor",)),
3 changes: 3 additions & 0 deletions src/transformers/models/auto/modeling_auto.py
@@ -147,6 +147,7 @@
("jetmoe", "JetMoeModel"),
("jukebox", "JukeboxModel"),
("kosmos-2", "Kosmos2Model"),
("kosmos-2.5", "Kosmos2_5Model"),
Review comment (Member):

Is it OK to have a "." in the model name? For other models we have "_", e.g. "qwen2_5_vl".

Reply (Collaborator):

`kosmos-2.5` is not the model name or the model file name; it is the `model_type`. I believe it's OK, although it seems to be the only one with a ".".

("layoutlm", "LayoutLMModel"),
("layoutlmv2", "LayoutLMv2Model"),
("layoutlmv3", "LayoutLMv3Model"),
@@ -778,6 +779,7 @@
("instructblip", "InstructBlipForConditionalGeneration"),
("instructblipvideo", "InstructBlipVideoForConditionalGeneration"),
("kosmos-2", "Kosmos2ForConditionalGeneration"),
("kosmos-2.5", "Kosmos2_5ForConditionalGeneration"),
("llava", "LlavaForConditionalGeneration"),
("llava_next", "LlavaNextForConditionalGeneration"),
("llava_next_video", "LlavaNextVideoForConditionalGeneration"),
@@ -812,6 +814,7 @@
("idefics3", "Idefics3ForConditionalGeneration"),
("instructblip", "InstructBlipForConditionalGeneration"),
("kosmos-2", "Kosmos2ForConditionalGeneration"),
("kosmos-2.5", "Kosmos2_5ForConditionalGeneration"),
("llava", "LlavaForConditionalGeneration"),
("llava_next", "LlavaNextForConditionalGeneration"),
("llava_onevision", "LlavaOnevisionForConditionalGeneration"),
1 change: 1 addition & 0 deletions src/transformers/models/auto/processing_auto.py
@@ -72,6 +72,7 @@
("instructblip", "InstructBlipProcessor"),
("instructblipvideo", "InstructBlipVideoProcessor"),
("kosmos-2", "Kosmos2Processor"),
("kosmos-2.5", "Kosmos2_5Processor"),
("layoutlmv2", "LayoutLMv2Processor"),
("layoutlmv3", "LayoutLMv3Processor"),
("llava", "LlavaProcessor"),
1 change: 1 addition & 0 deletions src/transformers/models/auto/tokenization_auto.py
@@ -257,6 +257,7 @@
"XLMRobertaTokenizerFast" if is_tokenizers_available() else None,
),
),
("kosmos-2.5", (None, "PreTrainedTokenizerFast" if is_tokenizers_available() else None)),
("layoutlm", ("LayoutLMTokenizer", "LayoutLMTokenizerFast" if is_tokenizers_available() else None)),
("layoutlmv2", ("LayoutLMv2Tokenizer", "LayoutLMv2TokenizerFast" if is_tokenizers_available() else None)),
("layoutlmv3", ("LayoutLMv3Tokenizer", "LayoutLMv3TokenizerFast" if is_tokenizers_available() else None)),
1 change: 1 addition & 0 deletions src/transformers/models/kosmos2/modeling_kosmos2.py
@@ -2073,6 +2073,7 @@ def forward(
vision_model_output=vision_model_output,
)

@torch.no_grad()
def generate(
self,
pixel_values: Optional[torch.Tensor] = None,
30 changes: 30 additions & 0 deletions src/transformers/models/kosmos2_5/__init__.py
@@ -0,0 +1,30 @@
# coding=utf-8
# Copyright 2024 Microsoft Research and The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import TYPE_CHECKING

from ...utils import _LazyModule
from ...utils.import_utils import define_import_structure


if TYPE_CHECKING:
    from .configuration_kosmos2_5 import *
    from .image_processing_kosmos2_5 import *
    from .modeling_kosmos2_5 import *
    from .processing_kosmos2_5 import *
else:
    import sys

    _file = globals()["__file__"]
    sys.modules[__name__] = _LazyModule(__name__, _file, define_import_structure(_file), module_spec=__spec__)
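
Note on the `__init__.py` above: it replaces the package's entry in `sys.modules` with a lazy proxy, so the submodules are imported only when one of their names is first used. The sketch below shows that general idea with the standard library only. It is not the `_LazyModule` implementation: the class `LazyModule`, the module name `lazy_demo`, and the attribute-to-module mapping are all hypothetical stand-ins chosen so the example runs anywhere.

```python
import importlib
import sys
import types


class LazyModule(types.ModuleType):
    """Module proxy that resolves a name by importing its defining module on first access."""

    def __init__(self, name, attr_to_module):
        super().__init__(name)
        # Maps a public attribute name to the module that defines it.
        self._attr_to_module = attr_to_module

    def __getattr__(self, attr):
        # Called only when normal lookup fails, i.e. before the value is cached.
        if attr not in self._attr_to_module:
            raise AttributeError(f"module {self.__name__!r} has no attribute {attr!r}")
        module = importlib.import_module(self._attr_to_module[attr])
        value = getattr(module, attr)
        setattr(self, attr, value)  # cache so later lookups bypass __getattr__
        return value


# Hypothetical mapping onto stdlib modules, standing in for the model's submodules.
lazy = LazyModule("lazy_demo", {"dumps": "json", "sqrt": "math"})
sys.modules["lazy_demo"] = lazy
```

After the registration, `import lazy_demo` returns the proxy, and an expression such as `lazy_demo.sqrt(9.0)` triggers the lookup in `math` the first time it is evaluated; subsequent accesses read the cached attribute directly.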