Introduction

The DONUT (Document Understanding Transformer) model is a pre-trained image-to-text model developed by Geewook Kim et al. at NAVER CLOVA. It pairs a Swin Transformer vision encoder with a BART text decoder to perform document image understanding without the need for OCR (Optical Character Recognition). The model is intended for tasks such as document image classification and parsing, and typically requires fine-tuning for specific applications.

Architecture

The architecture of DONUT includes:

  • Vision Encoder: Utilizes a Swin Transformer to convert input images into a tensor of embeddings with dimensions (batch_size, seq_len, hidden_size).
  • Text Decoder: Employs BART to generate text autoregressively, conditioned on the encoded image embeddings (a minimal sketch of this encoder output follows the list).
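
Below is a minimal sketch of how the encoder produces the embeddings that condition the decoder, using the transformers library. The checkpoint name (naver-clova-ix/donut-base) and the image path are illustrative assumptions, not part of the original description.

```python
# Minimal sketch: inspect the Swin encoder embeddings that condition the BART decoder.
# Assumes the "naver-clova-ix/donut-base" checkpoint and a local document image.
import torch
from PIL import Image
from transformers import DonutProcessor, VisionEncoderDecoderModel

processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base")
model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base")

image = Image.open("sample_document.png").convert("RGB")  # hypothetical input image
pixel_values = processor(image, return_tensors="pt").pixel_values

with torch.no_grad():
    encoder_outputs = model.encoder(pixel_values=pixel_values)

# Embeddings of shape (batch_size, seq_len, hidden_size), consumed by the BART decoder
print(encoder_outputs.last_hidden_state.shape)
```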

Training

The DONUT model is pre-trained and intended to be fine-tuned on downstream tasks. Although the fine-tuning procedure is not detailed in Hugging Face's documentation, users are encouraged to look for fine-tuned versions suited to their specific tasks on the Hugging Face Hub.

Guide: Running Locally

  1. Install Dependencies: Make sure you have Python and PyTorch installed. You will also need the transformers library from Hugging Face.
  2. Clone Repository: Use the repository link to clone the model's GitHub repository for access to the pre-trained model and scripts.
  3. Load Model: Use the transformers library to load the DONUT model and its processor (a minimal loading and inference sketch follows this list).
  4. Fine-Tuning: Consider fine-tuning the model on your local dataset for specific tasks like document parsing.
  5. Cloud GPUs: For better performance, especially during fine-tuning, consider using cloud GPUs from providers like AWS, GCP, or Azure.
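
The sketch below covers steps 1–3 plus a simple inference call. The fine-tuned checkpoint (naver-clova-ix/donut-base-finetuned-cord-v2, for receipt parsing), its task prompt, and the image path are assumptions for illustration; substitute the checkpoint and prompt appropriate to your task.

```python
# Step 1: pip install torch transformers sentencepiece pillow
import re

import torch
from PIL import Image
from transformers import DonutProcessor, VisionEncoderDecoderModel

# Hypothetical choice of fine-tuned checkpoint (receipt parsing); swap in one suited to your task.
checkpoint = "naver-clova-ix/donut-base-finetuned-cord-v2"
processor = DonutProcessor.from_pretrained(checkpoint)
model = VisionEncoderDecoderModel.from_pretrained(checkpoint)

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

# Load a document image (path is illustrative) and convert it to pixel values
image = Image.open("receipt.png").convert("RGB")
pixel_values = processor(image, return_tensors="pt").pixel_values.to(device)

# The CORD fine-tuned checkpoint expects this task prompt as the decoder start sequence
task_prompt = "<s_cord-v2>"
decoder_input_ids = processor.tokenizer(
    task_prompt, add_special_tokens=False, return_tensors="pt"
).input_ids.to(device)

# Autoregressive generation conditioned on the encoded image
outputs = model.generate(
    pixel_values,
    decoder_input_ids=decoder_input_ids,
    max_length=model.decoder.config.max_position_embeddings,
    pad_token_id=processor.tokenizer.pad_token_id,
    eos_token_id=processor.tokenizer.eos_token_id,
    use_cache=True,
    bad_words_ids=[[processor.tokenizer.unk_token_id]],
)

# Strip special tokens and the task prompt, then convert the structured output to JSON
sequence = processor.batch_decode(outputs)[0]
sequence = sequence.replace(processor.tokenizer.eos_token, "").replace(
    processor.tokenizer.pad_token, ""
)
sequence = re.sub(r"<.*?>", "", sequence, count=1).strip()
print(processor.token2json(sequence))
```

For step 4, the same model and processor can be fine-tuned with a standard sequence-to-sequence training loop (for example, the Hugging Face Trainer) on task-specific prompt and target sequences.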

License

The DONUT model is released under the MIT License, permitting reuse with minimal restrictions.
