Kosmos-2.5

Introduction

Kosmos-2.5 is a multimodal literate model developed by Microsoft for machine reading of text-intensive images. It handles two transcription tasks: generating spatially-aware text blocks, in which each block of text is paired with its coordinates in the image, and producing structured text in Markdown format.

Architecture

Both transcription tasks run on a single shared decoder-only autoregressive Transformer. A task-specific prompt selects which task to perform, and flexible text representations let the same decoder emit either spatially-aware text blocks or Markdown.
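
As a rough illustration of how a task prompt and a text representation fit together, the Python sketch below maps each task to a prompt token and parses spatially-aware output into (box, text) pairs. The prompt strings "<ocr>" and "<md>", the "<bbox>" tag layout, and the names TASK_PROMPTS, TextBlock, and parse_blocks are assumptions made for this example; the description above does not specify them, so check them against the released code before relying on them.

```python
import re
from typing import List, NamedTuple, Tuple

# Assumed prompt tokens, one per task (not specified in the text above).
TASK_PROMPTS = {"ocr": "<ocr>", "markdown": "<md>"}


class TextBlock(NamedTuple):
    box: Tuple[int, int, int, int]  # (x0, y0, x1, y1) as emitted by the model
    text: str


# Assumed serialization of one block: "<bbox><x_A><y_B><x_C><y_D></bbox>text"
_BBOX = re.compile(r"<bbox><x_(\d+)><y_(\d+)><x_(\d+)><y_(\d+)></bbox>")


def parse_blocks(generated: str) -> List[TextBlock]:
    """Split generated text into blocks, each paired with its bounding box."""
    matches = list(_BBOX.finditer(generated))
    blocks = []
    for i, match in enumerate(matches):
        # A block's text runs from the end of its tag to the start of the next tag.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(generated)
        box = tuple(int(value) for value in match.groups())
        blocks.append(TextBlock(box, generated[match.end():end].strip()))
    return blocks
```

Under these assumptions, output produced with the OCR prompt would go through parse_blocks, while output produced with the Markdown prompt is already plain Markdown and needs no parsing.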

Training

Kosmos-2.5 is pre-trained on large-scale datasets of text-intensive images, which prepares it for both transcription tasks. It can also be adapted to other tasks through supervised fine-tuning, which makes it versatile for real-world applications involving text-rich images.

Guide: Running Locally

To run Kosmos-2.5 locally:

1. Clone the repository from GitHub.
2. Install the required dependencies.
3. Run md.py for Markdown tasks or ocr.py for OCR tasks.
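
The last step can be sketched as a small Python helper that picks the script for a task and builds the command to launch it. The script names md.py and ocr.py come from the steps above; the task labels, the build_command name, and the extra_args parameter are hypothetical. The scripts' command-line arguments are not described here, so any arguments are passed through unchanged rather than guessed.

```python
import sys
from typing import List, Sequence

# Script names as given in the steps above; the task labels are chosen for this sketch.
SCRIPTS = {"markdown": "md.py", "ocr": "ocr.py"}


def build_command(task: str, extra_args: Sequence[str] = ()) -> List[str]:
    """Return the command that runs the script for `task` with the current interpreter."""
    if task not in SCRIPTS:
        raise ValueError(f"unknown task {task!r}; expected one of {sorted(SCRIPTS)}")
    # extra_args is forwarded untouched because the scripts' options are not documented here.
    return [sys.executable, SCRIPTS[task], *extra_args]
```

From the root of the cloned repository, the resulting list could be handed to subprocess.run(build_command("ocr"), check=True) once the dependencies are installed.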

For best performance, run the model on a GPU. If no local GPU is available, cloud GPU instances from providers such as AWS, Google Cloud, or Azure are an option.

License

The Kosmos-2.5 model is licensed under the MIT License. More details are available in the license file. The project also adheres to the Microsoft Open Source Code of Conduct.