Model description
Hello everyone,
Kosmos-2.5 is a multimodal literate model for tasks such as OCR and text-rich image comprehension. It consists of a ViT encoder, a Resampler, and a shared decoder module. To the best of my knowledge, its architecture is similar to Kosmos-2 but differs enough that supporting it in Transformers requires a standalone implementation rather than a reuse of the existing Kosmos-2 code.
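To make the three components concrete, here is a rough, dependency-free sketch of how sequence lengths might flow through them. It is only shape bookkeeping, not the actual implementation: the names `FlowShapes` and `trace_shapes`, every numeric value, and the assumptions that the Resampler reduces the patch tokens to a fixed number of embeddings and that the decoder sees image embeddings followed by text tokens are all mine, not taken from the Kosmos-2.5 code.

```python
from dataclasses import dataclass


@dataclass
class FlowShapes:
    """Sequence lengths at each stage (hypothetical names, bookkeeping only)."""
    encoder_tokens: int    # patch embeddings produced by the ViT encoder
    resampled_tokens: int  # image embeddings after the Resampler
    decoder_tokens: int    # image embeddings plus text tokens seen by the decoder


def trace_shapes(image_hw, patch_size, num_latents, num_text_tokens):
    """Trace sequence lengths through encoder -> resampler -> decoder.

    All arguments are illustrative assumptions, not Kosmos-2.5's real settings.
    """
    height, width = image_hw
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be multiples of patch_size")
    # ViT encoder: one token per non-overlapping patch.
    encoder_tokens = (height // patch_size) * (width // patch_size)
    # Resampler (assumed): maps the patch tokens to at most num_latents embeddings.
    resampled_tokens = min(encoder_tokens, num_latents)
    # Shared decoder (assumed): consumes image embeddings followed by text tokens.
    decoder_tokens = resampled_tokens + num_text_tokens
    return FlowShapes(encoder_tokens, resampled_tokens, decoder_tokens)


shapes = trace_shapes(image_hw=(224, 224), patch_size=16, num_latents=64, num_text_tokens=32)
print(shapes)
```

If the real model follows this pattern, the Resampler is the piece that keeps the decoder's input length bounded for large, text-dense images; the authors' code linked below is the reference for the actual behavior.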
Open source status
- The model implementation is available
- The model weights are available
Provide useful links for the implementation
Paper: https://arxiv.org/pdf/2309.11419
Code: https://github.com/microsoft/unilm/tree/master/kosmos-2.5
Authors: @Dod-o @wolfshow