Kosmos-2.5
microsoft/Kosmos-2.5
Introduction
Kosmos-2.5 is a multimodal literate model developed by Microsoft for machine reading of text-intensive images. It handles two transcription tasks: generating spatially-aware text blocks, where each block of text is paired with its position in the image, and producing Markdown-formatted structured text from images.
Architecture
The model uses a shared decoder-only autoregressive Transformer architecture for both tasks. It is pre-trained on large-scale text-intensive images and relies on task-specific prompts and flexible text representations to switch between them.
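As an illustration of what a spatially-aware text block could look like once decoded, here is a minimal parsing sketch. The token layout `<bbox><x_N><y_N><x_N><y_N></bbox>text` is an assumption made for this example and is not confirmed by this document; `TextBlock` and `parse_blocks` are hypothetical names, not part of the Kosmos-2.5 code.

```python
import re
from typing import List, NamedTuple


class TextBlock(NamedTuple):
    """One transcribed text block with its bounding box (hypothetical type)."""
    x0: int
    y0: int
    x1: int
    y1: int
    text: str


# ASSUMPTION: each block is emitted as four coordinate tokens wrapped in
# <bbox>...</bbox>, followed by the text belonging to that box.
BBOX = re.compile(r"<bbox><x_(\d+)><y_(\d+)><x_(\d+)><y_(\d+)></bbox>")


def parse_blocks(raw: str) -> List[TextBlock]:
    """Split a decoded string into (bounding box, text) blocks."""
    matches = list(BBOX.finditer(raw))
    blocks = []
    for i, m in enumerate(matches):
        # The text of a block runs until the next bbox marker or end of string.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(raw)
        x0, y0, x1, y1 = (int(g) for g in m.groups())
        blocks.append(TextBlock(x0, y0, x1, y1, raw[m.end():end].strip()))
    return blocks


if __name__ == "__main__":
    sample = (
        "<bbox><x_10><y_20><x_110><y_40></bbox>Invoice\n"
        "<bbox><x_10><y_50><x_200><y_70></bbox>Total: 42"
    )
    for block in parse_blocks(sample):
        print(block)
```

Markdown output would not need this step, since under the same assumption it carries no coordinate tokens and can be used as plain text.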
Training
Kosmos-2.5 is pre-trained on large-scale datasets of text-intensive images, which prepares it for the transcription tasks described above. It can also be adapted to other text-intensive image understanding tasks through supervised fine-tuning.
Guide: Running Locally
To run Kosmos-2.5 locally:
- Clone the repository from GitHub.
- Install the required dependencies.
- Run md.py for Markdown tasks or ocr.py for OCR tasks to start using the model.
For best performance, run the model on a GPU. Cloud GPU instances from AWS, Google Cloud, or Azure are one option if no local GPU is available.
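The last step of the list above can be wrapped in a small dispatcher, sketched below. Only the script names md.py and ocr.py come from this guide; `SCRIPTS` and `build_command` are hypothetical helpers, and the scripts' command-line arguments are not documented here, so they are passed through untouched rather than guessed.

```python
import sys
from typing import List, Sequence

# Script names as given in the guide; the task keys are this sketch's own labels.
SCRIPTS = {"markdown": "md.py", "ocr": "ocr.py"}


def build_command(task: str, extra_args: Sequence[str] = ()) -> List[str]:
    """Return the argv list that would launch the script for the given task.

    extra_args is forwarded as-is. Which arguments the scripts accept is
    NOT specified in this guide, so none are assumed here.
    """
    try:
        script = SCRIPTS[task]
    except KeyError:
        raise ValueError(
            f"unknown task {task!r}; expected one of {sorted(SCRIPTS)}"
        ) from None
    # Use the current interpreter so the script runs in the same environment
    # where the dependencies were installed.
    return [sys.executable, script, *extra_args]


if __name__ == "__main__":
    print(build_command("markdown"))
    print(build_command("ocr"))
```

The returned list can be handed to `subprocess.run(..., check=True)` from inside the cloned repository directory, with whatever arguments the repository's own documentation specifies.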
License
The Kosmos-2.5 model is licensed under the MIT License. More details are available in the license file. The project also adheres to the Microsoft Open Source Code of Conduct.