naver-clova-ix/donut-base-finetuned-cord-v2
Introduction
DONUT (Document Understanding Transformer) is a base-sized model fine-tuned on the CORD-v2 receipt dataset, designed for OCR-free document understanding. It combines vision and text processing capabilities to convert document images directly into structured text output.
Architecture
The DONUT model integrates a vision encoder, specifically the Swin Transformer, with a text decoder, BART. The encoder processes the input image and transforms it into a tensor of embeddings. Subsequently, the text decoder generates text autoregressively, conditioned on these embeddings.
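As a quick check, both components are visible once the checkpoint is loaded. A minimal sketch, assuming the Hugging Face Transformers implementation (the printed class names may vary with the library version):

```python
from transformers import VisionEncoderDecoderModel

model = VisionEncoderDecoderModel.from_pretrained(
    "naver-clova-ix/donut-base-finetuned-cord-v2"
)

# Encoder: Swin Transformer over image patches;
# decoder: BART-style autoregressive text model.
print(type(model.encoder).__name__)  # e.g. DonutSwinModel
print(type(model.decoder).__name__)  # e.g. MBartForCausalLM
```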
Training
The DONUT model is fine-tuned on the CORD dataset, a receipt-parsing benchmark: given a receipt image, the model must produce a structured sequence of fields such as item names and prices. Because the decoder generates this sequence directly from the image embeddings, the approach eliminates the need for a separate OCR stage; the serialization format is sketched below.
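To make the parsing target concrete, here is a minimal sketch of how nested annotations can be linearized into the tag sequence the decoder learns to emit. This mirrors the spirit of Donut's serialization scheme, not the authors' exact code, and the receipt fields below are hypothetical:

```python
# Illustrative only: flatten nested JSON annotations into the XML-like
# token string that the decoder is trained to generate.
def json2token(obj):
    if isinstance(obj, dict):
        return "".join(f"<s_{k}>{json2token(v)}</s_{k}>" for k, v in obj.items())
    if isinstance(obj, list):
        return "<sep/>".join(json2token(item) for item in obj)
    return str(obj)

# Hypothetical receipt annotation in the style of CORD
print(json2token({"menu": [{"nm": "Latte", "price": "4.50"}]}))
# -> <s_menu><s_nm>Latte</s_nm><s_price>4.50</s_price></s_menu>
```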
Guide: Running Locally
To run the DONUT model locally, follow these steps:
- Install Required Libraries: Ensure you have Python and PyTorch installed, then use pip to install the Hugging Face Transformers library:

  ```
  pip install transformers
  ```
- Load the Model: Use the Transformers library to load the pre-trained model together with its processor:

  ```python
  from transformers import DonutProcessor, VisionEncoderDecoderModel

  model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base-finetuned-cord-v2")
  processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base-finetuned-cord-v2")
  ```
- Inference: Prepare your image and run generation. DONUT is conditioned on a task prompt; this checkpoint expects `<s_cord-v2>`. The decoded output is an XML-like tag sequence, which `processor.token2json` can convert into a JSON dictionary:

  ```python
  from PIL import Image

  image = Image.open("path_to_your_image.jpg").convert("RGB")
  pixel_values = processor(image, return_tensors="pt").pixel_values

  # Tokenize the task prompt that tells the decoder which task to perform
  decoder_input_ids = processor.tokenizer(
      "<s_cord-v2>", add_special_tokens=False, return_tensors="pt"
  ).input_ids

  outputs = model.generate(
      pixel_values,
      decoder_input_ids=decoder_input_ids,
      max_length=model.decoder.config.max_position_embeddings,
  )
  print(processor.batch_decode(outputs)[0])
  ```
- Consider Cloud GPUs: For efficient processing, especially with large datasets or high-resolution images, consider cloud GPU services such as AWS, Google Cloud, or Azure; a device-placement sketch follows this list.
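For reference, a minimal sketch of moving the same pipeline onto a GPU, reusing `model`, `processor`, `pixel_values`, and `decoder_input_ids` from the steps above:

```python
import torch

# Use a GPU when available; fall back to CPU otherwise.
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

outputs = model.generate(
    pixel_values.to(device),
    decoder_input_ids=decoder_input_ids.to(device),
    max_length=model.decoder.config.max_position_embeddings,
)
print(processor.batch_decode(outputs)[0])
```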
License
The DONUT model is released under the MIT License, allowing for wide usage and modification.