microsoft/trocr-base-printed
Introduction
The TrOCR model, developed by Microsoft, is a Transformer-based model for Optical Character Recognition (OCR). It was introduced in the paper "TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models" by Li et al., and this checkpoint is fine-tuned on the SROIE dataset for printed-text recognition.
Architecture
TrOCR is an encoder-decoder model consisting of an image Transformer as the encoder and a text Transformer as the decoder. The encoder is initialized with BEiT weights, and the decoder is initialized with RoBERTa weights. The model processes an image as a sequence of fixed-size patches (16x16 resolution), which are linearly embedded; absolute position embeddings are added before the sequence is fed to the Transformer encoder layers. The text decoder then generates tokens autoregressively.
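To make the patch arithmetic concrete, here is a minimal sketch using the Hugging Face transformers library. It reads the input resolution and patch size from the checkpoint's encoder configuration (attribute names follow the standard ViT-style config) and computes the encoder's sequence length; the expected values are an assumption based on the released 384x384 configuration.

from transformers import VisionEncoderDecoderModel

model = VisionEncoderDecoderModel.from_pretrained('microsoft/trocr-base-printed')

# A 384x384 image cut into 16x16 patches yields (384/16)^2 = 576 tokens
enc = model.config.encoder
num_patches = (enc.image_size // enc.patch_size) ** 2
print(enc.image_size, enc.patch_size, num_patches)  # expected: 384 16 576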
Training
The model is fine-tuned on the SROIE dataset (Scanned Receipts OCR and Information Extraction), a benchmark for recognizing printed text on receipts. Starting from pre-trained Transformers for both the image and text components gives the model strong visual and language priors, which the fine-tuning stage adapts to text recognition.
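The paper describes the full training recipe; as a rough sketch of the objective only, the following example (assuming the Hugging Face transformers API, with an invented transcription and a blank placeholder image) shows how the decoder is supervised with cross-entropy over the target tokens:

from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

processor = TrOCRProcessor.from_pretrained('microsoft/trocr-base-printed')
model = VisionEncoderDecoderModel.from_pretrained('microsoft/trocr-base-printed')

# Placeholder training pair (illustrative only): a blank image and a made-up transcription
image = Image.new("RGB", (384, 384), "white")
pixel_values = processor(images=image, return_tensors="pt").pixel_values
labels = processor.tokenizer("TOTAL 12.50", return_tensors="pt").input_ids

# Passing labels makes the model compute the cross-entropy loss over decoder tokens
outputs = model(pixel_values=pixel_values, labels=labels)
outputs.loss.backward()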
Guide: Running Locally
Here is a basic guide to run the model locally using PyTorch:
from transformers import TrOCRProcessor, VisionEncoderDecoderModel
from PIL import Image
import requests
# Load an example image (from the IAM handwriting database; this checkpoint itself targets printed text)
url = 'https://fki.tic.heia-fr.ch/static/img/a01-122-02-00.jpg'
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
# Load processor and model
processor = TrOCRProcessor.from_pretrained('microsoft/trocr-base-printed')
model = VisionEncoderDecoderModel.from_pretrained('microsoft/trocr-base-printed')
# Process image and generate text
pixel_values = processor(images=image, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(generated_text)
For optimal performance, a cloud GPU such as an NVIDIA T4 or V100 can help, especially when processing large volumes of data or high-resolution images.
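As a minimal sketch continuing the example above, the model and input tensors can be moved to a GPU in the usual PyTorch way:

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

# Inputs must live on the same device as the model
pixel_values = pixel_values.to(device)
generated_ids = model.generate(pixel_values)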
License
The TrOCR model is released under the MIT License, which permits broad use, modification, and redistribution, provided the copyright and license notice are preserved.