Xenova/donut-base-finetuned-docvqa

Introduction
Xenova/donut-base-finetuned-docvqa is a model designed for document question answering. It builds on the Donut architecture, part of the vision-encoder-decoder family for image-to-text tasks, and is fine-tuned specifically for the Document Visual Question Answering (DocVQA) challenge.
Architecture
This model uses the Donut architecture, pairing a donut-swin image encoder with a text decoder, and ships with ONNX weights for compatibility with Transformers.js. This setup allows for efficient image-to-text processing, making it suitable for extracting answers from document images.
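To make the encoder-decoder structure concrete, the sketch below loads the processor, tokenizer, and model separately instead of using the high-level pipeline shown in the guide further down. It is a minimal sketch only: the class names are the standard Transformers.js auto classes, the <s_docvqa><s_question>...</s_question><s_answer> prompt follows the upstream Donut DocVQA convention, and the generate() call mirrors the library's documented Donut examples, so exact signatures may need adjusting to your library version.

import { AutoProcessor, AutoTokenizer, AutoModelForVision2Seq, RawImage } from '@huggingface/transformers';

const model_id = 'Xenova/donut-base-finetuned-docvqa';

// Load the image processor, tokenizer, and vision-encoder-decoder model (ONNX weights).
const processor = await AutoProcessor.from_pretrained(model_id);
const tokenizer = await AutoTokenizer.from_pretrained(model_id);
const model = await AutoModelForVision2Seq.from_pretrained(model_id);

// Preprocess the document image for the donut-swin encoder.
const image = await RawImage.read('https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/invoice.png');
const { pixel_values } = await processor(image);

// Donut conditions its decoder on a task prompt that embeds the question.
const question = 'What is the invoice number?';
const prompt = `<s_docvqa><s_question>${question}</s_question><s_answer>`;
const decoder_input_ids = tokenizer(prompt, { add_special_tokens: false }).input_ids;

// Generate and decode the answer (call signature assumed from the library's Donut examples).
const output = await model.generate(pixel_values, {
  decoder_input_ids,
  max_length: model.config.decoder.max_position_embeddings,
});
console.log(tokenizer.batch_decode(output, { skip_special_tokens: true })[0]);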
Training
The model is a fine-tuned version of naver-clova-ix/donut-base on the DocVQA dataset. It is optimized for answering questions about visual document inputs, using its transformer encoder-decoder to read the document image and generate the answer as text.
Guide: Running Locally
To run the model locally using Transformers.js, follow these steps:
- Install the Transformers.js library via NPM:

npm i @huggingface/transformers
- Set up the pipeline: use the following example to create a document question answering pipeline:

import { pipeline } from '@huggingface/transformers';

const qa_pipeline = await pipeline('document-question-answering', 'Xenova/donut-base-finetuned-docvqa');
- Provide input: pass an image URL and a question to receive an answer:

const image = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/invoice.png';
const question = 'What is the invoice number?';
const output = await qa_pipeline(image, question);
// [{ answer: 'us-001' }]
- Cloud GPUs: for enhanced performance, consider cloud GPU services such as AWS, GCP, or Azure; a sketch of in-library runtime options follows this list.
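The cloud GPU note above concerns where you host the model; within Transformers.js itself, recent releases also accept device and dtype options when constructing a pipeline (for example to target WebGPU in the browser). The snippet below is a sketch under that assumption; the available option values depend on your library version and on which ONNX weight variants the repository provides.

// Sketch: requesting a specific device / precision for the ONNX weights.
// The device and dtype options are assumptions based on recent Transformers.js releases;
// check the library documentation for the options available in your version.
import { pipeline } from '@huggingface/transformers';

const qa_pipeline = await pipeline('document-question-answering', 'Xenova/donut-base-finetuned-docvqa', {
  device: 'webgpu', // e.g. run in the browser on GPU; fall back to 'cpu' if unavailable
  dtype: 'fp16',    // smaller weights; 'q8' trades some accuracy for size and speed
});

// Reuse the same pipeline instance for several questions about one document,
// so the weights are only loaded once.
const image = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/invoice.png';
for (const question of ['What is the invoice number?', 'What is the total amount?']) {
  const [result] = await qa_pipeline(image, question);
  console.log(question, '->', result.answer);
}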
License
The model and associated tools are available under the terms specified in the model's repository on Hugging Face. Ensure compliance with these terms when utilizing the model in your projects.