Xenova/donut-base-finetuned-docvqa

Introduction
Xenova/donut-base-finetuned-docvqa is a model designed for document question answering. It builds on the Donut architecture, part of the vision-encoder-decoder family for image-to-text tasks, and is fine-tuned specifically for the Document Visual Question Answering (DocVQA) challenge.
Architecture
This model uses the Donut architecture, pairing a donut-swin image encoder with a text decoder, and ships with ONNX weights for compatibility with Transformers.js. This setup allows for efficient image-to-text processing, making it suitable for extracting answers from document images.
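To make the encoder-decoder structure concrete, the sketch below loads the processor, tokenizer, and model separately instead of using the high-level pipeline shown in the guide further down. It is a minimal sketch only: the class names are the standard Transformers.js auto classes, the <s_docvqa><s_question>...</s_question><s_answer> prompt follows the upstream Donut DocVQA convention, and the generate() call mirrors the library's documented Donut examples, so exact signatures may need adjusting to your library version.

import { AutoProcessor, AutoTokenizer, AutoModelForVision2Seq, RawImage } from '@huggingface/transformers';

const model_id = 'Xenova/donut-base-finetuned-docvqa';

// Load the image processor, tokenizer, and vision-encoder-decoder model (ONNX weights).
const processor = await AutoProcessor.from_pretrained(model_id);
const tokenizer = await AutoTokenizer.from_pretrained(model_id);
const model = await AutoModelForVision2Seq.from_pretrained(model_id);

// Preprocess the document image for the donut-swin encoder.
const image = await RawImage.read('https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/invoice.png');
const { pixel_values } = await processor(image);

// Donut conditions its decoder on a task prompt that embeds the question.
const question = 'What is the invoice number?';
const prompt = `<s_docvqa><s_question>${question}</s_question><s_answer>`;
const decoder_input_ids = tokenizer(prompt, { add_special_tokens: false }).input_ids;

// Generate and decode the answer (call signature assumed from the library's Donut examples).
const output = await model.generate(pixel_values, {
  decoder_input_ids,
  max_length: model.config.decoder.max_position_embeddings,
});
console.log(tokenizer.batch_decode(output, { skip_special_tokens: true })[0]);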
Training
The model is a fine-tuned version of naver-clova-ix/donut-base on the DocVQA dataset. It is optimized for answering questions about visual document inputs, using its transformer encoder-decoder to read the document image and generate the answer as text.
Guide: Running Locally
To run the model locally using Transformers.js, follow these steps:
- Install the Transformers.js library via NPM:

npm i @huggingface/transformers
- Set up the pipeline: use the following example to create a document question answering pipeline:

import { pipeline } from '@huggingface/transformers';

const qa_pipeline = await pipeline('document-question-answering', 'Xenova/donut-base-finetuned-docvqa');
- Provide input: pass an image URL and a question to receive an answer:

const image = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/invoice.png';
const question = 'What is the invoice number?';
const output = await qa_pipeline(image, question);
// [{ answer: 'us-001' }]
- Cloud GPUs: for enhanced performance, consider cloud GPU services such as AWS, GCP, or Azure; a sketch of in-library runtime options follows this list.
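The cloud GPU note above concerns where you host the model; within Transformers.js itself, recent releases also accept device and dtype options when constructing a pipeline (for example to target WebGPU in the browser). The snippet below is a sketch under that assumption; the available option values depend on your library version and on which ONNX weight variants the repository provides.

// Sketch: requesting a specific device / precision for the ONNX weights.
// The device and dtype options are assumptions based on recent Transformers.js releases;
// check the library documentation for the options available in your version.
import { pipeline } from '@huggingface/transformers';

const qa_pipeline = await pipeline('document-question-answering', 'Xenova/donut-base-finetuned-docvqa', {
  device: 'webgpu', // e.g. run in the browser on GPU; fall back to 'cpu' if unavailable
  dtype: 'fp16',    // smaller weights; 'q8' trades some accuracy for size and speed
});

// Reuse the same pipeline instance for several questions about one document,
// so the weights are only loaded once.
const image = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/invoice.png';
for (const question of ['What is the invoice number?', 'What is the total amount?']) {
  const [result] = await qa_pipeline(image, question);
  console.log(question, '->', result.answer);
}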
License
The model and associated tools are available under the terms specified in the model's repository on Hugging Face. Ensure compliance with these terms when utilizing the model in your projects.