donut-base
naver-clova-ix
Introduction
The DONUT (Document Understanding Transformer) model is a pre-trained image-to-text model developed by Geewook Kim et al. at NAVER Clova. It pairs a vision encoder (Swin Transformer) with a text decoder (BART) to understand document images without relying on OCR (Optical Character Recognition). The model targets tasks such as document image classification and document parsing, and requires fine-tuning for specific applications.
Architecture
The architecture of DONUT consists of two components:
- Vision Encoder: a Swin Transformer that converts the input image into a tensor of embeddings of shape (batch_size, seq_len, hidden_size); see the sketch after this list.
- Text Decoder: a BART decoder that generates text autoregressively, conditioned on the encoded image embeddings.
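To make the encoder/decoder split concrete, here is a minimal sketch, assuming the transformers library and the public naver-clova-ix/donut-base checkpoint; the blank placeholder image stands in for a real scanned document:

```python
from PIL import Image
from transformers import DonutProcessor, VisionEncoderDecoderModel

processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base")
model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base")

# Placeholder document image (a real use case would load a scanned page).
image = Image.new("RGB", (1200, 1600), "white")
pixel_values = processor(image, return_tensors="pt").pixel_values

# The Swin Transformer encoder maps the image to a sequence of embeddings.
encoder_outputs = model.encoder(pixel_values=pixel_values)
print(encoder_outputs.last_hidden_state.shape)  # (batch_size, seq_len, hidden_size)
```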
Training
donut-base is a pre-trained checkpoint intended to be fine-tuned on downstream tasks. Fine-tuning itself is not detailed in Hugging Face's documentation for this checkpoint; instead, users are encouraged to look on the model hub for fine-tuned versions suited to their specific task.
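A fine-tuned checkpoint loads exactly like the base model. A minimal sketch, assuming the naver-clova-ix/donut-base-finetuned-cord-v2 receipt-parsing checkpoint listed on the hub:

```python
from transformers import DonutProcessor, VisionEncoderDecoderModel

# Assumption: this fine-tuned receipt-parsing checkpoint is available on the hub.
ckpt = "naver-clova-ix/donut-base-finetuned-cord-v2"
processor = DonutProcessor.from_pretrained(ckpt)
model = VisionEncoderDecoderModel.from_pretrained(ckpt)
```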
Guide: Running Locally
- Install Dependencies: Make sure you have Python and PyTorch installed, along with the transformers library from Hugging Face (e.g., pip install torch transformers).
- Clone Repository: Clone the model's GitHub repository for access to the pre-trained model and training scripts.
- Load Model: Use the transformers library to load the DONUT model and processor (see the sketch after this list).
- Fine-Tuning: Consider fine-tuning the model on your own dataset for specific tasks such as document parsing.
- Cloud GPUs: For better performance, especially during fine-tuning, consider using cloud GPUs from providers such as AWS, GCP, or Azure.
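Putting the steps together, here is a hedged end-to-end sketch, assuming the dependencies above are installed and that document.png is a local document image; the <s> task prompt is a placeholder, since fine-tuned checkpoints define their own prompt tokens:

```python
import torch
from PIL import Image
from transformers import DonutProcessor, VisionEncoderDecoderModel

processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base")
model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base")

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

image = Image.open("document.png").convert("RGB")  # hypothetical local file
pixel_values = processor(image, return_tensors="pt").pixel_values.to(device)

# Donut conditions generation on a task prompt; "<s>" is a stand-in here.
task_prompt = "<s>"
decoder_input_ids = processor.tokenizer(
    task_prompt, add_special_tokens=False, return_tensors="pt"
).input_ids.to(device)

outputs = model.generate(
    pixel_values,
    decoder_input_ids=decoder_input_ids,
    max_length=model.decoder.config.max_position_embeddings,
)
print(processor.batch_decode(outputs, skip_special_tokens=True)[0])
```

Note that the base model's raw output is unlikely to be meaningful before fine-tuning; fine-tuned checkpoints typically emit XML-like markup that the processor's token2json helper can convert into a Python dictionary.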
License
The DONUT model is released under the MIT License, permitting reuse with minimal restrictions.