Support Kosmos-2.5 #31711
Changes from 250 commits
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,63 @@ | ||
<!--Copyright 2024 The HuggingFace Team. All rights reserved. | ||
|
||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with | ||
the License. You may obtain a copy of the License at | ||
|
||
http://www.apache.org/licenses/LICENSE-2.0 | ||
|
||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on | ||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the | ||
specific language governing permissions and limitations under the License. | ||
|
||
⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be | ||
rendered properly in your Markdown viewer. | ||
|
||
--> | ||
|
||
# KOSMOS-2.5 | ||
|
||
## Overview | ||
|
||
Kosmos-2.5 is a multimodal literate model for machine reading of text-intensive images. Pre-trained on large-scale text-intensive images, Kosmos-2.5 excels in two distinct yet cooperative transcription tasks: (1) generating spatially-aware text blocks, where each block of text is assigned its spatial coordinates within the image, and (2) producing structured text output that captures styles and structures into the markdown format. This unified multimodal literate capability is achieved through a shared decoder-only auto-regressive Transformer architecture, task-specific prompts, and flexible text representations. We evaluate Kosmos-2.5 on end-to-end document-level text recognition and image-to-markdown text generation. Furthermore, the model can be readily adapted for any text-intensive image understanding task with different prompts through supervised fine-tuning, making it a general-purpose tool for real-world applications involving text-rich images. This work also paves the way for the future scaling of multimodal large language models. | ||
|
||
The abstract from the paper is the following: | ||
|
||
*We present Kosmos-2.5, a multimodal literate model for machine reading of text-intensive images. Pre-trained on large-scale text-intensive images, Kosmos-2.5 excels in two distinct yet cooperative transcription tasks: (1) generating spatially-aware text blocks, where each block of text is assigned its spatial coordinates within the image, and (2) producing structured text output that captures styles and structures into the markdown format. This unified multimodal literate capability is achieved through a shared Transformer architecture, task-specific prompts, and flexible text representations. We evaluate Kosmos-2.5 on end-to-end document-level text recognition and image-to-markdown text generation. Furthermore, the model can be readily adapted for any text-intensive image understanding task with different prompts through supervised fine-tuning, making it a general-purpose tool for real-world applications involving text-rich images. This work also paves the way for the future scaling of multimodal large language models.* | ||
|
||
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/kosmos2_5_ocr.png" | ||
alt="drawing" width="600"/> | ||
|
||
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/kosmos2_5_md.png" | ||
alt="drawing" width="600"/> | ||
|
||
<small> Overview of tasks that KOSMOS-2.5 can handle. Taken from the <a href="https://arxiv.org/abs/2309.11419">original paper</a>. </small> | ||
|
||
## Example | ||
**Markdown Task:** For usage instructions, please refer to [md.py](https://huggingface.co/microsoft/kosmos-2.5/blob/main/md.py). | ||
|
||
**OCR Task:** For usage instructions, please refer to [ocr.py](https://huggingface.co/microsoft/kosmos-2.5/blob/main/ocr.py). | ||
Review comment: Would be nice to include the code snippets here so users don't have to click on another link. |
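As a partial illustration of what an inline OCR example could cover, here is a minimal, standard-library-only sketch of the post-processing step: splitting a generated string into text blocks with their coordinates. It assumes each block is preceded by a tag of the form `<bbox><x_A><y_B><x_C><y_D></bbox>`; that tag layout, the `TextBlock` type, and `parse_ocr_output` are assumptions made for this sketch and should be checked against the linked `ocr.py` before being relied on.

```python
import re
from typing import List, NamedTuple


class TextBlock(NamedTuple):
    """One recognized block: corner coordinates plus its text (hypothetical type)."""
    x0: int
    y0: int
    x1: int
    y1: int
    text: str


# Assumed tag layout: <bbox><x_A><y_B><x_C><y_D></bbox> followed by the block's text.
BBOX_PATTERN = re.compile(r"<bbox><x_(\d+)><y_(\d+)><x_(\d+)><y_(\d+)></bbox>")


def parse_ocr_output(generated: str) -> List[TextBlock]:
    """Split a generated string into blocks, one per bounding-box tag."""
    matches = list(BBOX_PATTERN.finditer(generated))
    blocks = []
    for i, match in enumerate(matches):
        # The text of a block runs until the next tag (or the end of the string).
        end = matches[i + 1].start() if i + 1 < len(matches) else len(generated)
        text = generated[match.end():end].strip()
        x0, y0, x1, y1 = (int(value) for value in match.groups())
        blocks.append(TextBlock(x0, y0, x1, y1, text))
    return blocks


# Made-up sample string, not real model output.
sample = (
    "<bbox><x_10><y_20><x_110><y_40></bbox>Invoice\n"
    "<bbox><x_10><y_50><x_90><y_70></bbox>Total: 42"
)
blocks = parse_ocr_output(sample)
```

The coordinates are returned exactly as they appear in the tags; rescaling them to the original image size is left out of this sketch.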
||
|
||
|
||
|
||
## Kosmos2_5Config | ||
|
||
[[autodoc]] Kosmos2_5Config | ||
|
||
## Kosmos2_5ImageProcessor | ||
|
||
[[autodoc]] Kosmos2_5ImageProcessor | ||
|
||
## Kosmos2_5Processor | ||
|
||
[[autodoc]] Kosmos2_5Processor | ||
- __call__ | ||
|
||
## Kosmos2_5Model | ||
|
||
[[autodoc]] Kosmos2_5Model | ||
- forward | ||
|
||
## Kosmos2_5ForConditionalGeneration | ||
|
||
[[autodoc]] Kosmos2_5ForConditionalGeneration | ||
- forward |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -133,6 +133,7 @@ | |
jamba, | ||
jetmoe, | ||
kosmos2, | ||
kosmos2_5, | ||
layoutlm, | ||
layoutlmv2, | ||
layoutlmv3, | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -147,6 +147,7 @@ | |
("jetmoe", "JetMoeModel"), | ||
("jukebox", "JukeboxModel"), | ||
("kosmos-2", "Kosmos2Model"), | ||
("kosmos-2.5", "Kosmos2_5Model"), | ||
Review comment: Is it OK to have a "." in the model name? For other models we use "_", e.g. "qwen2_5_vl".
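One way to look at the naming question: the mapping key `"kosmos-2.5"` is only a string, whereas the package directory has to be a valid Python identifier, which is why the module in this diff is named `kosmos2_5`. The sketch below illustrates that distinction with a hypothetical lookup table and helper names; it is not the library's actual resolution code.

```python
import keyword

# Hypothetical special-case table; the entries mirror names that appear in this diff.
SPECIAL_MODULE_NAMES = {
    "kosmos-2": "kosmos2",
    "kosmos-2.5": "kosmos2_5",
}


def module_name_for(model_type: str) -> str:
    """Map a model-type key to an importable module name (illustrative only)."""
    if model_type in SPECIAL_MODULE_NAMES:
        return SPECIAL_MODULE_NAMES[model_type]
    # Fallback for this sketch: dashes become underscores.
    return model_type.replace("-", "_")


def is_importable_name(name: str) -> bool:
    """True if the name could be used as a Python module or package name."""
    return name.isidentifier() and not keyword.iskeyword(name)
```

Under this sketch, `is_importable_name("kosmos-2.5")` is false while `is_importable_name(module_name_for("kosmos-2.5"))` is true, so a "." in the key works only as long as an explicit key-to-module mapping exists.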
|
||
("layoutlm", "LayoutLMModel"), | ||
("layoutlmv2", "LayoutLMv2Model"), | ||
("layoutlmv3", "LayoutLMv3Model"), | ||
|
@@ -778,6 +779,7 @@ | |
("instructblip", "InstructBlipForConditionalGeneration"), | ||
("instructblipvideo", "InstructBlipVideoForConditionalGeneration"), | ||
("kosmos-2", "Kosmos2ForConditionalGeneration"), | ||
("kosmos-2.5", "Kosmos2_5ForConditionalGeneration"), | ||
("llava", "LlavaForConditionalGeneration"), | ||
("llava_next", "LlavaNextForConditionalGeneration"), | ||
("llava_next_video", "LlavaNextVideoForConditionalGeneration"), | ||
|
@@ -812,6 +814,7 @@ | |
("idefics3", "Idefics3ForConditionalGeneration"), | ||
("instructblip", "InstructBlipForConditionalGeneration"), | ||
("kosmos-2", "Kosmos2ForConditionalGeneration"), | ||
("kosmos-2.5", "Kosmos2_5ForConditionalGeneration"), | ||
("llava", "LlavaForConditionalGeneration"), | ||
("llava_next", "LlavaNextForConditionalGeneration"), | ||
("llava_onevision", "LlavaOnevisionForConditionalGeneration"), | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,30 @@ | ||
# coding=utf-8 | ||
# Copyright 2024 Microsoft Research and The HuggingFace Inc. team. All rights reserved. | ||
# | ||
# Licensed under the Apache License, Version 2.0 (the "License"); | ||
# you may not use this file except in compliance with the License. | ||
# You may obtain a copy of the License at | ||
# | ||
# http://www.apache.org/licenses/LICENSE-2.0 | ||
# | ||
# Unless required by applicable law or agreed to in writing, software | ||
# distributed under the License is distributed on an "AS IS" BASIS, | ||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
# See the License for the specific language governing permissions and | ||
# limitations under the License. | ||
from typing import TYPE_CHECKING | ||
|
||
from ...utils import _LazyModule | ||
from ...utils.import_utils import define_import_structure | ||
|
||
|
||
if TYPE_CHECKING: | ||
from .configuration_kosmos2_5 import * | ||
from .image_processing_kosmos2_5 import * | ||
from .modeling_kosmos2_5 import * | ||
from .processing_kosmos2_5 import * | ||
else: | ||
import sys | ||
|
||
_file = globals()["__file__"] | ||
sys.modules[__name__] = _LazyModule(__name__, _file, define_import_structure(_file), module_spec=__spec__) |
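The `else` branch above replaces the package's entry in `sys.modules` with a lazy proxy, so submodules are imported only when one of their names is first accessed. The following standard-library-only sketch shows the general idea; `LazyNamespace` and its `loaded` list are hypothetical and far simpler than `_LazyModule`, whose real behavior is not reproduced here.

```python
import importlib
import types


class LazyNamespace(types.ModuleType):
    """Hypothetical stand-in: imports an attribute's module on first access."""

    def __init__(self, name, attr_to_module):
        super().__init__(name)
        self._attr_to_module = dict(attr_to_module)
        self.loaded = []  # modules imported so far, kept only for illustration

    def __getattr__(self, attr):
        # Called only when normal lookup fails, i.e. before the value is cached.
        try:
            module_name = self._attr_to_module[attr]
        except KeyError:
            raise AttributeError(
                f"module {self.__name__!r} has no attribute {attr!r}"
            ) from None
        value = getattr(importlib.import_module(module_name), attr)
        self.loaded.append(module_name)
        setattr(self, attr, value)  # cache so later lookups skip __getattr__
        return value


# Nothing is imported until an attribute is requested.
lazy = LazyNamespace("demo", {"sqrt": "math", "dumps": "json"})
```

Accessing `lazy.sqrt` triggers the import of `math` and caches the function on the proxy; `json` stays untouched until `lazy.dumps` is requested.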