donut-base-finetuned-docvqa Model

Introduction

The Donut model, fine-tuned on the DocVQA dataset, is designed for Document Visual Question Answering. It was introduced in the paper "OCR-free Document Understanding Transformer" by Geewook Kim et al. The model combines a vision encoder and a text decoder to process document images and generate text answers.

Architecture

Donut uses a Swin Transformer as its vision encoder and BART as its text decoder. The encoder maps an input image to a tensor of embeddings, and the decoder generates the answer text autoregressively, conditioned on those embeddings. Because the model reads the image directly, it can answer questions about documents without a separate Optical Character Recognition (OCR) step.
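To make "autoregressively" concrete, here is a minimal toy sketch of greedy decoding in plain Python. It is not Donut's implementation: `greedy_decode`, `toy_next_token`, and the scripted answer are hypothetical stand-ins, and the list of floats only plays the role of the encoder's image embeddings.

```python
from typing import Callable, List, Sequence


def greedy_decode(
    image_embeddings: Sequence[float],
    next_token: Callable[[Sequence[float], List[str]], str],
    prompt: List[str],
    eos: str = "</s>",
    max_steps: int = 16,
) -> List[str]:
    """Append one token at a time until EOS or the step limit is reached."""
    tokens = list(prompt)
    for _ in range(max_steps):
        # Each step sees the image embeddings plus every token produced so far.
        token = next_token(image_embeddings, tokens)
        tokens.append(token)
        if token == eos:
            break
    return tokens


def toy_next_token(image_embeddings: Sequence[float], tokens: List[str]) -> str:
    """Hypothetical stand-in for the decoder: replays a fixed answer."""
    scripted = ["10", ":", "30", "</s>"]
    generated = len(tokens) - 1  # tokens produced after the one-token prompt
    return scripted[min(generated, len(scripted) - 1)]


result = greedy_decode([0.1, 0.2], toy_next_token, ["<s_answer>"])
print(result)
```

In the real model the `next_token` step is a forward pass through the decoder with cross-attention over the encoder output. The loop structure is the part this sketch is meant to show.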

Figure: Donut architecture.

Training

The model is fine-tuned on the DocVQA dataset, which is specifically curated for Document Visual Question Answering tasks. This fine-tuning process allows the model to better understand and extract information from document images to answer queries.

Guide: Running Locally

Install Dependencies: Ensure that you have Python and PyTorch installed. You can install necessary libraries using pip:
```
pip install transformers torch
```

Load the Model: Use the Hugging Face transformers library to load the model.

```
from transformers import DonutProcessor, VisionEncoderDecoderModel

processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base-finetuned-docvqa")
model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base-finetuned-docvqa")
```

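Prepare Data: Load each document image as an RGB image (for example with Pillow) and let the processor convert it to pixel values.
Inference: Pass the pixel values and a question prompt to the model's generate method, then decode the generated tokens to get the answer.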
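The data preparation and inference steps can be sketched as follows. This is a minimal example that follows the usage pattern in the Hugging Face Transformers documentation for Donut, not a definitive implementation. The helper names (`build_prompt`, `clean_sequence`, `answer_question`), the file name `document.png`, and the sample question are assumptions for illustration. It also assumes Pillow is installed in addition to the packages above.

```python
import re

MODEL_ID = "naver-clova-ix/donut-base-finetuned-docvqa"


def build_prompt(question: str) -> str:
    """Wrap a question in the DocVQA task prompt that primes the decoder."""
    return f"<s_docvqa><s_question>{question}</s_question><s_answer>"


def clean_sequence(sequence: str, eos_token: str, pad_token: str) -> str:
    """Drop EOS/PAD tokens and the leading task-start token from decoded text."""
    sequence = sequence.replace(eos_token, "").replace(pad_token, "")
    return re.sub(r"<.*?>", "", sequence, count=1).strip()


def answer_question(image_path: str, question: str) -> dict:
    """Run one question against one document image (hypothetical helper)."""
    # Heavy imports live here so the helpers above work without them.
    import torch
    from PIL import Image
    from transformers import DonutProcessor, VisionEncoderDecoderModel

    processor = DonutProcessor.from_pretrained(MODEL_ID)
    model = VisionEncoderDecoderModel.from_pretrained(MODEL_ID)
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model.to(device)
    model.eval()

    # Prepare data: RGB image -> pixel values, question -> decoder prompt ids.
    image = Image.open(image_path).convert("RGB")
    pixel_values = processor(image, return_tensors="pt").pixel_values
    decoder_input_ids = processor.tokenizer(
        build_prompt(question), add_special_tokens=False, return_tensors="pt"
    ).input_ids

    # Inference: generate answer tokens conditioned on the image and prompt.
    with torch.no_grad():
        outputs = model.generate(
            pixel_values.to(device),
            decoder_input_ids=decoder_input_ids.to(device),
            max_length=model.decoder.config.max_position_embeddings,
            pad_token_id=processor.tokenizer.pad_token_id,
            eos_token_id=processor.tokenizer.eos_token_id,
            use_cache=True,
            bad_words_ids=[[processor.tokenizer.unk_token_id]],
            return_dict_in_generate=True,
        )

    tokenizer = processor.tokenizer
    decoded = processor.batch_decode(outputs.sequences)[0]
    text = clean_sequence(decoded, tokenizer.eos_token, tokenizer.pad_token)
    # token2json turns the tagged string into a dict with question and answer.
    return processor.token2json(text)


# Example call (placeholder file name and question):
# print(answer_question("document.png", "What is the invoice total?"))
```

The first call downloads the model weights, so it needs network access. On a machine without a GPU the same code runs on the CPU, only more slowly.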

For intensive computations, consider using cloud GPU services such as AWS EC2, Google Cloud Compute Engine, or Azure Virtual Machines to speed up the processing.

License

This model is released under the MIT License, allowing for commercial use, modification, distribution, and private use.
