trocr-large-stage1
microsoft
Introduction
TrOCR is a large-sized pre-trained model designed for Optical Character Recognition (OCR) tasks. It was introduced in the paper "TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models" by Li et al. The model is based on a transformer architecture, leveraging pre-trained weights from BEiT for the image encoder and RoBERTa for the text decoder.
Architecture
TrOCR employs an encoder-decoder framework. The encoder is an image transformer initialized with BEiT weights, while the decoder is a text transformer initialized with RoBERTa weights. Images are split into fixed-size patches (16x16) and linearly embedded, with absolute position embeddings added before processing through the transformer encoder. The text transformer decoder then generates tokens autoregressively.
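To see these pieces concretely, the checkpoint's configuration can be inspected without downloading the weights. This is a minimal sketch; the patch_size and image_size attribute names are assumptions about the ViT-style encoder config used by transformers, so they are read defensively with getattr:

from transformers import AutoConfig

# Fetch only the configuration (no weights) and look at the two halves.
config = AutoConfig.from_pretrained("microsoft/trocr-large-stage1")

encoder_cfg = config.encoder   # image transformer, initialized from BEiT weights
decoder_cfg = config.decoder   # text transformer, initialized from RoBERTa weights

print("encoder:", encoder_cfg.model_type)
print("decoder:", decoder_cfg.model_type)

# Patch and input sizes of the image encoder (attribute names are an assumption).
print("patch_size:", getattr(encoder_cfg, "patch_size", "n/a"))
print("image_size:", getattr(encoder_cfg, "image_size", "n/a"))
print("decoder vocab_size:", decoder_cfg.vocab_size)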
Usage
To use this model for OCR, images are processed into pixel values using the TrOCRProcessor, and the model generates outputs from these pixel values. The model is suitable for single text-line images, and fine-tuned versions for specific tasks can be found on the Hugging Face model hub.
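As an illustration, the following is a minimal sketch of that flow using model.generate and processor.batch_decode, with the same example image as in the guide below. Because trocr-large-stage1 is the pre-trained stage 1 checkpoint, the decoded text may not be accurate until the model is fine-tuned; fine-tuned checkpoints (e.g. microsoft/trocr-large-handwritten) use the same API.

from transformers import TrOCRProcessor, VisionEncoderDecoderModel
from PIL import Image
import requests

# Any single text-line image works; this is the example image from the guide below.
url = 'https://fki.tic.heia-fr.ch/static/img/a01-122-02-00.jpg'
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

processor = TrOCRProcessor.from_pretrained('microsoft/trocr-large-stage1')
model = VisionEncoderDecoderModel.from_pretrained('microsoft/trocr-large-stage1')

# Preprocess the image and generate token IDs autoregressively.
pixel_values = processor(image, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values)

# Turn the token IDs back into a string.
text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(text)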
Guide: Running Locally
- Setup Environment:
  - Ensure you have Python installed.
  - Install the necessary libraries (requests is used to fetch the example image):

pip install transformers pillow torch requests
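As an optional sanity check (not part of the original guide), the following prints the installed versions and whether PyTorch can see a GPU:

import torch
import transformers
import PIL

# Print library versions and GPU availability.
print("transformers:", transformers.__version__)
print("torch:", torch.__version__)
print("pillow:", PIL.__version__)
print("CUDA available:", torch.cuda.is_available())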
- Load Model and Processor:

from transformers import TrOCRProcessor, VisionEncoderDecoderModel
from PIL import Image
import requests
import torch

# Download an example single text-line image.
url = 'https://fki.tic.heia-fr.ch/static/img/a01-122-02-00.jpg'
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

# Load the processor (image pre-processing + tokenizer) and the encoder-decoder model.
processor = TrOCRProcessor.from_pretrained('microsoft/trocr-large-stage1')
model = VisionEncoderDecoderModel.from_pretrained('microsoft/trocr-large-stage1')

# Convert the image to pixel values and run a single forward pass,
# feeding the decoder its start token.
pixel_values = processor(image, return_tensors="pt").pixel_values
decoder_input_ids = torch.tensor([[model.config.decoder.decoder_start_token_id]])
outputs = model(pixel_values=pixel_values, decoder_input_ids=decoder_input_ids)
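Note that this forward pass returns raw model outputs (logits over the decoder vocabulary for the next token) rather than a transcription; to obtain text, use model.generate together with processor.batch_decode as in the Usage sketch above.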
- Consider Cloud GPUs:
  - For large-scale tasks or faster processing, consider cloud GPU services such as AWS, Google Cloud, or Azure to run the model efficiently.
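If a GPU is available, locally or on such a service, the same code runs on it by moving the model and inputs to that device. A minimal sketch, assuming model, processor, and pixel_values from the previous step:

import torch

# Pick a GPU if one is visible, otherwise fall back to CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

model.to(device)
pixel_values = pixel_values.to(device)

# Generation works the same way on either device.
generated_ids = model.generate(pixel_values)
text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(text)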
License
The license terms are set by the original authors; check the model card on the Hugging Face model hub and ensure compliance with those terms when using or distributing the model.