Skip to content

Kosmos-2.5 implementation in transformers #30877

@Natyren

Description

@Natyren

Model description

Hello everyone,

The Kosmos-2.5 is a multimodal literate model that can be used for tasks such as OCR and text-rich image comprehension. It includes a ViT encoder, a Resampler, and a shared decoder module. To the best of my knowledge, the architecture of this model is similar to Kosmos-2 but has some differences. Due to these differences, using this model in Transformers requires a standalone implementation.

Open source status

  • The model implementation is available
  • The model weights are available

Provide useful links for the implementation

Paper: https://arxiv.org/pdf/2309.11419
Code: https://github.com/microsoft/unilm/tree/master/kosmos-2.5
Authors: @Dod-o @wolfshow

Metadata

Metadata

Assignees

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions