microsoft/trocr-large-str
Introduction
TrOCR is a large-sized model fine-tuned on standard scene text recognition benchmarks like IC13, IC15, IIIT5K, and SVT. Developed by Microsoft, this model belongs to the category of Transformer-based Optical Character Recognition (OCR) models. It is designed to process images of text and convert them into digital text efficiently.
Architecture
The TrOCR model follows an encoder-decoder architecture, where the encoder is a vision Transformer initialized with BEiT weights, and the decoder is a text Transformer initialized with RoBERTa weights. The model processes images by first converting them into sequences of fixed-size patches (16x16 resolution), which are linearly embedded and supplemented with absolute position embeddings. These sequences are then fed into the Transformer encoder, while the text decoder generates tokens autoregressively.
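The patch arithmetic above can be sketched in a few lines. This is a minimal illustration, assuming the model's default 384x384 input resolution; it only shows how many patches the encoder sees and how large each flattened patch is, not the actual embedding weights.

```python
# Sketch: how an input image becomes a patch sequence for the ViT encoder.
# Assumes a 384x384 input (TrOCR's default preprocessing resolution).
image_size = 384   # height and width after the processor resizes the image
patch_size = 16    # fixed-size patches, as described above

# Number of non-overlapping 16x16 patches in a 384x384 image
num_patches = (image_size // patch_size) ** 2   # 24 * 24 = 576
# Each patch is flattened to a vector before linear embedding (RGB = 3 channels)
patch_dim = patch_size * patch_size * 3         # 768 values per patch

print(num_patches, patch_dim)  # -> 576 768
```

Each of these 576 patch embeddings, plus absolute position embeddings, forms the token sequence consumed by the Transformer encoder.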
Training
The TrOCR model was fine-tuned using datasets from IC13, IC15, IIIT5K, and SVT, which are standard OCR benchmarks. The model's training involved leveraging pre-trained weights from BEiT for the image encoder and RoBERTa for the text decoder, optimizing it for OCR tasks.
Guide: Running Locally
To use TrOCR in a local environment with PyTorch, follow these steps:
- Install dependencies: ensure the Transformers library is installed, along with PyTorch, Pillow, and requests, which the example below uses.

  pip install transformers torch pillow requests

- Load and process an image (here, a sample scene-text image):

  from transformers import TrOCRProcessor, VisionEncoderDecoderModel
  from PIL import Image
  import requests

  url = 'https://i.postimg.cc/ZKwLg2Gw/367-14.png'
  image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

- Initialize and run the model:

  processor = TrOCRProcessor.from_pretrained('microsoft/trocr-large-str')
  model = VisionEncoderDecoderModel.from_pretrained('microsoft/trocr-large-str')
  pixel_values = processor(images=image, return_tensors="pt").pixel_values
  generated_ids = model.generate(pixel_values)
  generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

- Output: the recognized text is stored in generated_text.
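Under the hood, model.generate produces token ids one at a time, feeding each prediction back into the decoder until an end-of-sequence token appears. A toy sketch of that greedy loop, where the hypothetical fake_decoder function stands in for the real text decoder:

```python
# Toy illustration of greedy autoregressive decoding, the strategy
# generate() uses by default. fake_decoder is a hypothetical stand-in:
# it maps the tokens produced so far to the "most likely" next token id.
EOS = 0  # end-of-sequence token id (hypothetical)

def fake_decoder(prefix):
    # Scripted predictions: emit 5, 3, 7, then stop.
    script = [5, 3, 7, EOS]
    return script[len(prefix)] if len(prefix) < len(script) else EOS

def greedy_generate(decoder, max_len=20):
    tokens = []
    for _ in range(max_len):
        nxt = decoder(tokens)      # pick the highest-scoring next token
        if nxt == EOS:             # stop once end-of-sequence is predicted
            break
        tokens.append(nxt)         # feed the prediction back in
    return tokens

print(greedy_generate(fake_decoder))  # -> [5, 3, 7]
```

The real decoder scores tokens with RoBERTa-initialized Transformer layers conditioned on the encoder's patch embeddings, but the control flow is the same.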
For efficiency, consider using cloud GPUs like those provided by AWS or Google Cloud to handle large-scale image processing tasks.
License
For specific licensing details, refer to the original repository or Hugging Face model card for any usage restrictions or licensing agreements.