Kosmos-2.5 implementation in transformers

### Model description

Hello everyone,

The Kosmos-2.5 is a multimodal literate model that can be used for tasks such as OCR and text-rich image comprehension. It includes a ViT encoder, a Resampler, and a shared decoder module. To the best of my knowledge, the architecture of this model is similar to Kosmos-2 but has some differences. Due to these differences, using this model in Transformers requires a standalone implementation.

### Open source status

- [X] The model implementation is available
- [X] The model weights are available

### Provide useful links for the implementation

Paper: https://arxiv.org/pdf/2309.11419
Code: https://github.com/microsoft/unilm/tree/master/kosmos-2.5
Authors: @Dod-o @wolfshow

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Kosmos-2.5 implementation in transformers #30877

Model description

Open source status

Provide useful links for the implementation

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Kosmos-2.5 implementation in transformers #30877

Description

Model description

Open source status

Provide useful links for the implementation

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions