TrOCR Large Printed

Microsoft

Introduction

TrOCR is a Transformer-based model for Optical Character Recognition (OCR). This checkpoint is the large-sized variant fine-tuned on the SROIE dataset and is intended for OCR on single text-line images of printed text. The model was introduced in the paper "TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models" by Li et al.

Architecture

TrOCR is an encoder-decoder model. The encoder is an image Transformer initialized from BEiT weights, while the decoder is a text Transformer initialized from RoBERTa weights. Images are processed as sequences of 16x16 patches, which are linearly embedded with additional absolute position embeddings before passing through the Transformer encoder. The text decoder generates tokens autoregressively.
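The patch-embedding step above fixes the encoder's sequence length. A minimal sketch of the arithmetic, assuming the 384x384 input resolution used by the large TrOCR checkpoints:

```python
# The encoder splits the (resized) input image into fixed-size patches,
# and each patch becomes one token in the Transformer encoder's input.
image_size = 384   # height and width after the processor resizes the image (assumed)
patch_size = 16    # each patch covers 16x16 pixels

patches_per_side = image_size // patch_size   # 24 patches along each axis
num_patches = patches_per_side ** 2           # 576 patch embeddings in total

print(num_patches)  # 576
```

Each of these 576 patch embeddings receives an absolute position embedding before entering the encoder.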

Training

The TrOCR model has been fine-tuned on the SROIE dataset to enhance its OCR capabilities. The initial weights for the encoder and decoder components were sourced from pre-trained BEiT and RoBERTa models, respectively.

Guide: Running Locally

To run the model locally using PyTorch:

  1. Install Dependencies: Ensure you have the transformers library installed, along with PIL and requests for handling images and HTTP requests.

    pip install transformers pillow requests
    
  2. Load and Process Image: Use the following code to load an image and process it:

    from transformers import TrOCRProcessor, VisionEncoderDecoderModel
    from PIL import Image
    import requests
    
    # Fetch an example single text-line image
    url = 'https://fki.tic.heia-fr.ch/static/img/a01-122-02-00.jpg'
    image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
    
    # The processor handles image resizing/normalization and token decoding
    processor = TrOCRProcessor.from_pretrained('microsoft/trocr-large-printed')
    model = VisionEncoderDecoderModel.from_pretrained('microsoft/trocr-large-printed')
    pixel_values = processor(images=image, return_tensors="pt").pixel_values
    
    # Generate token IDs autoregressively, then decode them into a string
    generated_ids = model.generate(pixel_values)
    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
    
  3. Output: The generated_text variable will contain the recognized text from the image.
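The steps above can be collected into a small helper. A sketch only: the function name `ocr_line` is illustrative, and the checkpoint is downloaded on first use.

```python
from transformers import TrOCRProcessor, VisionEncoderDecoderModel
from PIL import Image


def ocr_line(image_path: str,
             checkpoint: str = "microsoft/trocr-large-printed") -> str:
    """Recognize the text in a single text-line image file."""
    # Load the processor and model for the chosen checkpoint
    processor = TrOCRProcessor.from_pretrained(checkpoint)
    model = VisionEncoderDecoderModel.from_pretrained(checkpoint)

    # Preprocess the image and generate the recognized text
    image = Image.open(image_path).convert("RGB")
    pixel_values = processor(images=image, return_tensors="pt").pixel_values
    generated_ids = model.generate(pixel_values)
    return processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
```

For repeated calls, load the processor and model once outside the function rather than on every invocation.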

For better performance, consider using cloud GPUs such as those offered by AWS, Google Cloud, or Azure.
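Whether on a cloud instance or a local machine with a GPU, the usual PyTorch device-selection pattern applies. A sketch, assuming PyTorch is installed; it falls back to CPU when no GPU is present:

```python
import torch

# Pick the fastest available device; GPU inference is markedly faster
# for the large checkpoint.
device = "cuda" if torch.cuda.is_available() else "cpu"

# The model and inputs must live on the same device before generate(), e.g.:
#   model = model.to(device)
#   pixel_values = pixel_values.to(device)
print(device)
```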

License

The model and its components are released by Microsoft and made available through Hugging Face. Users should refer to the licensing terms provided within the Hugging Face platform and Microsoft's repository for specific usage rights and restrictions.
