donut-base-finetuned-cord-v2

naver-clova-ix

Introduction

DONUT (Document Understanding Transformer) is an OCR-free document understanding model. This checkpoint is the base-sized model fine-tuned on the CORD dataset. It takes a document image as input and generates text describing the document's contents directly, without a separate OCR stage.

Architecture

The DONUT model integrates a vision encoder, specifically the Swin Transformer, with a text decoder, BART. The encoder processes the input image and transforms it into a tensor of embeddings. Subsequently, the text decoder generates text autoregressively, conditioned on these embeddings.
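The autoregressive step described above can be illustrated with a toy greedy decoding loop. This is a minimal sketch, not DONUT's actual decoder: `greedy_decode` and `toy_scores` are hypothetical names, and the scoring function is a stand-in for a real BART decoder attending over Swin embeddings. It only shows the control flow, where each new token depends on the encoder output and on every token generated so far.

```python
from typing import Callable, List, Sequence


def greedy_decode(
    encoder_states: Sequence[float],
    next_token_scores: Callable[[Sequence[float], List[int]], List[float]],
    start_id: int,
    eos_id: int,
    max_length: int = 16,
) -> List[int]:
    """Generate token ids one at a time, conditioned on the encoder output
    and on all previously generated tokens."""
    tokens = [start_id]
    while len(tokens) < max_length:
        scores = next_token_scores(encoder_states, tokens)
        # Greedy choice: take the highest-scoring token id.
        next_id = max(range(len(scores)), key=scores.__getitem__)
        tokens.append(next_id)
        if next_id == eos_id:
            break
    return tokens


def toy_scores(encoder_states: Sequence[float], tokens: List[int]) -> List[float]:
    # Hypothetical stand-in for the text decoder: it "reads" one encoder
    # value per step and scores that value as the next token id, then
    # scores end-of-sequence (id 0) once the encoder values run out.
    vocab_size = 5
    position = len(tokens) - 1
    target = int(encoder_states[position]) if position < len(encoder_states) else 0
    return [1.0 if i == target else 0.0 for i in range(vocab_size)]


result = greedy_decode([3, 1, 4], toy_scores, start_id=2, eos_id=0)
print(result)
```

In the real model, the call to `generate` performs this loop internally, and the scores come from the decoder's output logits rather than a hand-written function.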


Training

This checkpoint is fine-tuned on CORD (Consolidated Receipt Dataset), a dataset for parsing receipts into structured fields. Because the model generates text directly from the image, no separate OCR step is needed.

Guide: Running Locally

To run the DONUT model locally, follow these steps:

  1. Install Required Libraries: Ensure you have Python and PyTorch installed. Use pip to install the Hugging Face Transformers library, plus Pillow for loading images.

    pip install transformers pillow
    
  2. Load the Model: Use the Transformers library to load the pre-trained model.

    from transformers import DonutProcessor, VisionEncoderDecoderModel
    
    model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base-finetuned-cord-v2")
    processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base-finetuned-cord-v2")
    
  3. Inference: Prepare your image and use the model for inference.

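    from PIL import Image

    # Donut expects a 3-channel RGB image.
    image = Image.open("path_to_your_image.jpg").convert("RGB")
    pixel_values = processor(image, return_tensors="pt").pixel_values

    # Prompt the decoder with this checkpoint's task start token.
    prompt_ids = processor.tokenizer("<s_cord-v2>", add_special_tokens=False, return_tensors="pt").input_ids
    # Allow output up to the decoder's maximum length instead of the short default.
    outputs = model.generate(pixel_values, decoder_input_ids=prompt_ids, max_length=model.decoder.config.max_position_embeddings)

    print(processor.decode(outputs[0]))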
    
  4. Consider Cloud GPUs: For efficient processing, especially with large datasets or high-resolution images, consider using cloud GPU services like AWS, Google Cloud, or Azure.
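The text decoded in step 3 is a tagged sequence in which each field is wrapped in `<s_key>` and `</s_key>` markers. The Transformers `DonutProcessor` provides a `token2json` method for converting such a sequence into a nested structure. The standalone sketch below illustrates the idea only: `tags_to_dict` is a hypothetical helper, the sample string is made up for illustration rather than real model output, and list separators and malformed tags are not handled.

```python
import re
from typing import Union

# Matches one '<s_key>body</s_key>' field; the closing tag must repeat the key.
FIELD = re.compile(r"<s_(?P<key>[^>]+)>(?P<body>.*?)</s_(?P=key)>", re.DOTALL)


def tags_to_dict(sequence: str) -> Union[str, dict]:
    """Convert '<s_key>value</s_key>' markup into nested dicts.

    Simplified illustration: text with no field tags is returned as a
    stripped string; a field nested inside a field of the same name is
    not supported.
    """
    fields = list(FIELD.finditer(sequence))
    if not fields:
        return sequence.strip()  # leaf value
    return {m.group("key"): tags_to_dict(m.group("body")) for m in fields}


# Made-up sample in the tagged style, not actual model output.
sample = (
    "<s_menu><s_nm>Latte</s_nm><s_price>4.50</s_price></s_menu>"
    "<s_total><s_total_price>4.50</s_total_price></s_total>"
)
parsed = tags_to_dict(sample)
print(parsed)
```

For real outputs, prefer `processor.token2json`, which is written for the model's actual vocabulary and output conventions.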

License

The DONUT model is released under the MIT License, allowing for wide usage and modification.
