TrOCR Large STR

Microsoft

Introduction
TrOCR is a large-sized Transformer-based Optical Character Recognition (OCR) model developed by Microsoft and fine-tuned on standard scene text recognition benchmarks: IC13, IC15, IIIT5K, and SVT. It takes an image containing text and transcribes it into digital text efficiently.

Architecture
The TrOCR model follows an encoder-decoder architecture: the encoder is a vision Transformer initialized with BEiT weights, and the decoder is a text Transformer initialized with RoBERTa weights. The model first splits an input image into a sequence of fixed-size patches (16x16 resolution), which are linearly embedded and supplemented with absolute position embeddings. The resulting patch sequence is fed into the Transformer encoder, and the text decoder then generates output tokens autoregressively.
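As a rough sketch of the patching arithmetic described above (the 384x384 input resolution is an assumption here; check the model's configuration for the actual value, while the 16x16 patch size comes from the text):

```python
# Sketch of the patch-embedding arithmetic for the vision encoder.
# NOTE: image_size = 384 is an assumption; patch_size = 16 is stated above.
image_size = 384
patch_size = 16

patches_per_side = image_size // patch_size   # patches along each edge
num_patches = patches_per_side ** 2           # length of the encoder's input sequence

# Each RGB patch is flattened to 16 * 16 * 3 pixel values before the
# linear embedding projects it into the model dimension.
patch_dim = patch_size * patch_size * 3

print(patches_per_side, num_patches, patch_dim)  # 24 576 768
```

Under these assumptions the encoder sees a sequence of 576 patch embeddings per image, regardless of how much text the image contains.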

Training
The TrOCR model was fine-tuned using datasets from IC13, IC15, IIIT5K, and SVT, which are standard OCR benchmarks. The model's training involved leveraging pre-trained weights from BEiT for the image encoder and RoBERTa for the text decoder, optimizing it for OCR tasks.

Guide: Running Locally
To use TrOCR in a local environment with PyTorch, follow these steps:

  1. Install Dependencies: Install the Transformers library, along with PyTorch, Pillow, and Requests, which the code below relies on.

    pip install transformers torch pillow requests
    
  2. Load and Process Image: Load an example text image, here fetched from a URL.

    from transformers import TrOCRProcessor, VisionEncoderDecoderModel
    from PIL import Image
    import requests
    
    url = 'https://i.postimg.cc/ZKwLg2Gw/367-14.png'
    image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
    
  3. Initialize and Run Model:

    processor = TrOCRProcessor.from_pretrained('microsoft/trocr-large-str')
    model = VisionEncoderDecoderModel.from_pretrained('microsoft/trocr-large-str')
    pixel_values = processor(images=image, return_tensors="pt").pixel_values
    
    generated_ids = model.generate(pixel_values)
    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
    
  4. Output: The recognized text from the image is stored in generated_text.

For efficiency, consider using cloud GPUs like those provided by AWS or Google Cloud to handle large-scale image processing tasks.
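On a GPU machine, a minimal device-selection sketch (assuming PyTorch is installed) moves the inputs, and likewise the model, onto the accelerator when one is available; illustrated here with a dummy batch rather than real pixel values:

```python
import torch

# Pick a GPU when available, otherwise fall back to CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

# The model and its inputs must live on the same device, e.g.:
#   model = model.to(device)
#   pixel_values = pixel_values.to(device)
# Demonstrated with a dummy batch of 1 RGB image at 384x384 (the
# resolution is an assumption; the processor handles resizing anyway).
pixel_values = torch.zeros(1, 3, 384, 384).to(device)
print(pixel_values.device)
```

The same pattern applies unchanged on cloud GPU instances, since torch.cuda.is_available() simply reports whether a CUDA device is visible to the process.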

License
For specific licensing details, refer to the original repository or Hugging Face model card for any usage restrictions or licensing agreements.
