dit base finetuned rvlcdip
microsoft
Introduction
The Document Image Transformer (DiT) is a model for document images; this checkpoint is set up for document image classification. It was pre-trained on the IIT-CDIP dataset (42 million document images) and fine-tuned on the RVL-CDIP dataset (400,000 grayscale images in 16 classes). DiT was introduced in the paper "DiT: Self-supervised Pre-training for Document Image Transformer" by Li et al. and is based on the BEiT architecture.
Architecture
DiT is a BERT-like Transformer encoder pre-trained in a self-supervised fashion. During pre-training, some image patches are masked, and the model learns to predict the visual tokens that the encoder of a discrete VAE (dVAE) assigns to those masked patches. Each image is presented to the model as a sequence of fixed-size 16x16 patches, which are linearly embedded before being processed by the Transformer encoder layers.
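To make the patch step concrete, here is a minimal sketch in plain Python. It only shows how an image is cut into non-overlapping 16x16 patches and how many patches result. The 224x224 input size and the helper names `patch_grid` and `split_into_patches` are assumptions made for illustration, not taken from the DiT code, and the learned linear embedding and Transformer layers are left out.

```python
def patch_grid(height, width, patch_size=16):
    """Return (rows, cols) of non-overlapping patches covering the image."""
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be a multiple of the patch size")
    return height // patch_size, width // patch_size


def split_into_patches(image, patch_size=16):
    """Cut a 2D image (a list of pixel rows) into flattened patches.

    Patches are returned in row-major order, each as a flat list of
    patch_size * patch_size pixel values.
    """
    rows, cols = patch_grid(len(image), len(image[0]), patch_size)
    patches = []
    for r in range(rows):
        for c in range(cols):
            patch = [
                image[r * patch_size + y][c * patch_size + x]
                for y in range(patch_size)
                for x in range(patch_size)
            ]
            patches.append(patch)
    return patches


# Hypothetical 224x224 single-channel image filled with dummy values.
image = [[(y + x) % 256 for x in range(224)] for y in range(224)]
patches = split_into_patches(image)
print(len(patches), len(patches[0]))  # 196 patches of 256 values each
```

In the real model, each flattened patch would then be multiplied by a learned embedding matrix to produce one input vector per patch for the encoder.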
Training
Pre-training on a large collection of document images teaches the encoder an internal representation of such images. The resulting features are useful for downstream tasks like document image classification and layout analysis. For fine-tuning, a linear layer is placed on top of the pre-trained encoder and the resulting classifier is trained on labeled document images.
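The idea of a linear layer on top of encoder features can be sketched as follows. This is an illustration in plain Python, not the training code used for DiT: the `LinearHead` class, the tiny feature size of 4, and the random weights are all assumptions chosen to keep the example small. In an actual fine-tuning run, the head would map the encoder's feature vector to one logit per class (16 for RVL-CDIP), and its weights would be learned from the labeled images.

```python
import random


class LinearHead:
    """Hypothetical linear classification head: logits = W @ features + b."""

    def __init__(self, in_features, num_classes, seed=0):
        rng = random.Random(seed)
        # Small random weights stand in for weights learned during fine-tuning.
        self.weight = [
            [rng.uniform(-0.1, 0.1) for _ in range(in_features)]
            for _ in range(num_classes)
        ]
        self.bias = [0.0] * num_classes

    def __call__(self, features):
        # One dot product per class, plus that class's bias.
        return [
            sum(w * f for w, f in zip(row, features)) + b
            for row, b in zip(self.weight, self.bias)
        ]


# Stand-in for the feature vector produced by the pre-trained encoder.
features = [0.5, -1.0, 0.25, 2.0]
head = LinearHead(in_features=4, num_classes=16)
logits = head(features)

# The predicted class is the index of the largest logit.
predicted_class = max(range(len(logits)), key=logits.__getitem__)
print(len(logits), predicted_class)
```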
Guide: Running Locally
To use the DiT model in PyTorch, follow these steps:
- Install the Transformers library, together with PyTorch and Pillow, which the example below imports:
    pip install transformers torch pillow
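- Load an image and the model, then classify the image:
    from transformers import AutoImageProcessor, AutoModelForImageClassification
    import torch
    from PIL import Image

    # Load the document image and convert it to 3-channel RGB
    image = Image.open('path_to_your_document_image').convert('RGB')

    # Load the image processor and the fine-tuned classifier from the Hugging Face Hub
    processor = AutoImageProcessor.from_pretrained("microsoft/dit-base-finetuned-rvlcdip")
    model = AutoModelForImageClassification.from_pretrained("microsoft/dit-base-finetuned-rvlcdip")

    # Preprocess the image into PyTorch tensors and run inference without tracking gradients
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    logits = outputs.logits

    # The model predicts one of the RVL-CDIP classes; map the best index to its label
    predicted_class_idx = logits.argmax(-1).item()
    print("Predicted class:", model.config.id2label[predicted_class_idx])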
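- Cloud GPUs: For compute-intensive workloads, consider running the model on a cloud GPU from a provider like AWS, Google Cloud, or Azure to reduce processing time. To use the GPU, move both the model and the inputs to it, for example with .to("cuda").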
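The inference example above reports only the single best class. As a possible extension, the sketch below converts a list of logits into probabilities with a softmax and returns the k most likely labels. The function name `top_k_labels`, the three-entry label mapping, and the logit values are made up for illustration; with the real model you could pass `logits[0].tolist()` and `model.config.id2label` instead.

```python
import math


def top_k_labels(logits, id2label, k=3):
    """Return the k most probable (label, probability) pairs."""
    # Subtract the max logit for numerical stability before exponentiating.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Rank class indices from most to least probable.
    ranked = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    return [(id2label[i], probs[i]) for i in ranked[:k]]


# Made-up logits and label mapping, for illustration only.
example_logits = [2.0, 0.5, -1.0]
example_labels = {0: "letter", 1: "form", 2: "email"}
for label, prob in top_k_labels(example_logits, example_labels, k=2):
    print(f"{label}: {prob:.3f}")
```

Showing several ranked candidates with their probabilities can help spot documents where the classifier is unsure between two classes.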
License
The DiT model is released under the licensing terms specified by Microsoft and published with the model on Hugging Face. Review these terms before use to make sure your usage complies with their guidelines and restrictions.