dit-base-finetuned-rvlcdip

microsoft

Introduction

The Document Image Transformer (DiT) is a model for document images; this checkpoint is fine-tuned for document image classification. It was pre-trained on IIT-CDIP, a collection of 42 million document images, and fine-tuned on RVL-CDIP, a dataset of 400,000 grayscale images in 16 classes. The model was introduced in the paper "DiT: Self-supervised Pre-training for Document Image Transformer" by Li et al. and is based on the BEiT architecture.

Architecture

DiT is a Transformer encoder model similar to BERT. It is pre-trained in a self-supervised fashion: some image patches are masked, and the model must predict the visual tokens that the encoder of a discrete VAE (dVAE) assigns to those patches. Each image is divided into a sequence of fixed-size patches (16x16 pixels), which are linearly embedded before being processed by the Transformer encoder layers.
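The patch arithmetic and the masking step can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the 224x224 input size, the 0.4 mask ratio, and the uniform random sampling are placeholders chosen for the example, not values taken from this model card, and the helper names `patch_grid` and `choose_masked_patches` are hypothetical.

```python
import random


def patch_grid(height, width, patch_size=16):
    """Return (rows, cols) of non-overlapping patches covering the image."""
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be a multiple of the patch size")
    return height // patch_size, width // patch_size


def choose_masked_patches(num_patches, mask_ratio, seed=0):
    """Pick the patch indices to hide from the encoder.

    Uniform sampling is a simplification used for illustration only.
    """
    rng = random.Random(seed)
    num_masked = int(num_patches * mask_ratio)
    return sorted(rng.sample(range(num_patches), num_masked))


# Assumed input resolution of 224x224 pixels (not stated in the text above).
rows, cols = patch_grid(224, 224)
num_patches = rows * cols  # 14 * 14 = 196 patches in the sequence

# Assumed mask ratio; the encoder would be asked to predict the visual
# tokens of exactly these positions.
masked = choose_masked_patches(num_patches, mask_ratio=0.4)
print(rows, cols, num_patches, len(masked))  # 14 14 196 78
```

Under these assumptions the encoder sees a sequence of 196 patch embeddings, and the pre-training loss is computed only at the masked positions.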

Training

Pre-training on a large dataset of document images lets the model learn an internal representation of such images. Features extracted from this representation are useful for downstream tasks such as document image classification and layout analysis. Fine-tuning for classification places a linear layer on top of the pre-trained encoder and trains it on labeled document images.
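A minimal PyTorch sketch of such a classification head is shown here. It is not the authors' training code: the `LinearHead` class is hypothetical, the hidden size of 768 is an assumption for a base-size encoder, the 16 output classes match RVL-CDIP, and random tensors stand in for real encoder features and labels.

```python
import torch
from torch import nn


class LinearHead(nn.Module):
    """Hypothetical head: a single linear layer over pooled encoder features."""

    def __init__(self, hidden_size=768, num_labels=16):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, pooled_features):
        # (batch, hidden_size) -> (batch, num_labels)
        return self.classifier(pooled_features)


torch.manual_seed(0)
head = LinearHead()

# Random stand-ins for pooled encoder outputs and class labels.
features = torch.randn(4, 768)
labels = torch.tensor([0, 3, 7, 15])

logits = head(features)
loss = nn.functional.cross_entropy(logits, labels)
loss.backward()  # gradients reach the head's weight and bias
print(logits.shape, loss.item())
```

In an actual fine-tuning run the features would come from the pre-trained encoder, and an optimizer step would follow each backward pass.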

Guide: Running Locally

To use the DiT model in PyTorch, follow these steps:

  1. Install the Transformers library together with PyTorch and Pillow, which the code in step 2 imports:

    pip install transformers torch pillow
    
  2. Load an image and the model, then classify the image:

    from transformers import AutoImageProcessor, AutoModelForImageClassification
    import torch
    from PIL import Image

    # Load the document image and convert it to three-channel RGB
    image = Image.open('path_to_your_document_image').convert('RGB')
    processor = AutoImageProcessor.from_pretrained("microsoft/dit-base-finetuned-rvlcdip")
    model = AutoModelForImageClassification.from_pretrained("microsoft/dit-base-finetuned-rvlcdip")

    # Preprocess the image into a batch of PyTorch tensors
    inputs = processor(images=image, return_tensors="pt")

    # Inference only, so gradient tracking is not needed
    with torch.no_grad():
        outputs = model(**inputs)
    logits = outputs.logits

    # Pick the highest-scoring class and map its index to a label
    predicted_class_idx = logits.argmax(-1).item()
    print("Predicted class:", model.config.id2label[predicted_class_idx])
    
  3. Cloud GPUs: Classifying a single image works on a CPU. For large batches of documents, consider a cloud GPU from a provider such as AWS, Google Cloud, or Azure to shorten processing time.
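The code in step 2 reports only the single best class. When a ranked list with probabilities is more useful, the logits can be passed through a softmax. The helper below is a sketch in plain Python; `softmax` and `top_k` are hypothetical names, and the example logits and label mapping are made-up placeholders, not output from the model.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(logits, id2label, k=3):
    """Return the k most probable (label, probability) pairs, best first."""
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(id2label[i], probs[i]) for i in ranked[:k]]


# Placeholder values for illustration only.
example_logits = [2.0, 0.5, -1.0, 0.1]
example_labels = {0: "letter", 1: "form", 2: "email", 3: "memo"}

for label, prob in top_k(example_logits, example_labels):
    print(f"{label}: {prob:.3f}")
```

With the real model, `logits[0].tolist()` and `model.config.id2label` from step 2 could be passed in place of the placeholder values.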

License

The DiT model is available under the licensing terms specified by Microsoft and Hugging Face. Review these terms before use to confirm that your intended use complies with their guidelines and restrictions.
