TrOCR Base (Stage 1)

Maintained by Microsoft

Introduction

TrOCR is a base-sized, pre-trained (stage 1) model for optical character recognition (OCR) on single text-line images. It was introduced in the paper "TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models" by Li et al. and is available on the Hugging Face model hub. Because this checkpoint is pre-trained only, it is intended to be fine-tuned on a downstream OCR task.

Architecture

TrOCR is an encoder-decoder model consisting of two main components:

  • Image Transformer Encoder: Initialized from BEiT weights. The input image is split into fixed-size patches (16x16 resolution), which are linearly embedded; absolute position embeddings are added before the sequence is fed into the Transformer encoder layers.
  • Text Transformer Decoder: Initialized from RoBERTa weights. It autoregressively generates text tokens while attending to the encoder's representation of the image patches.
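The patch layout above can be sketched with a little arithmetic. This is a minimal back-of-the-envelope calculation, assuming the processor's default 384x384 input resolution (an assumption, not stated in the architecture notes above):

```python
# Sketch of the encoder's patch layout.
# Assumption: inputs are resized to 384x384, the default for the base model's processor.
image_size = 384       # assumed input resolution (height = width)
patch_size = 16        # fixed patch resolution from the architecture notes
channels = 3           # RGB

patches_per_side = image_size // patch_size            # 24 patches along each side
num_patches = patches_per_side ** 2                    # 576 patch tokens total
values_per_patch = patch_size * patch_size * channels  # 768 pixel values flattened per patch

print(num_patches, values_per_patch)  # 576 768
```

Each of the 576 flattened patches is then projected by a linear layer into the encoder's hidden dimension before position embeddings are added.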

Training

This checkpoint is the stage 1 (pre-trained) model, so no fine-tuning recipe is provided here. Basic usage consists of converting an image into pixel values and passing them through the encoder-decoder architecture for OCR.

Guide: Running Locally

To use the TrOCR model in a local environment with PyTorch, follow these steps:

  1. Install Dependencies: Ensure you have the transformers, torch, and Pillow packages installed.

    pip install transformers torch pillow
    
  2. Load and Process Image: Use the PIL library to load and convert the image to RGB format.

    from PIL import Image
    import requests
    
    url = 'https://fki.tic.heia-fr.ch/static/img/a01-122-02-00.jpg'
    image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
    
  3. Initialize Processor and Model: Load the TrOCR processor and model from Hugging Face.

    from transformers import TrOCRProcessor, VisionEncoderDecoderModel
    
    processor = TrOCRProcessor.from_pretrained('microsoft/trocr-base-stage1')
    model = VisionEncoderDecoderModel.from_pretrained('microsoft/trocr-base-stage1')
    
  4. Perform Inference: Convert the image to pixel values and run a forward pass with the decoder start token; the returned logits contain the model's next-token predictions.

    import torch
    
    pixel_values = processor(image, return_tensors="pt").pixel_values
    decoder_input_ids = torch.tensor([[model.config.decoder.decoder_start_token_id]])
    outputs = model(pixel_values=pixel_values, decoder_input_ids=decoder_input_ids)
    
  5. Cloud GPUs: For faster processing, consider using cloud GPU services such as AWS EC2 with NVIDIA GPUs, Google Cloud Platform, or Azure.
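The forward pass in step 4 yields logits for a single decoding step. To produce a full transcription, the decoder can be run autoregressively with `model.generate()`. A minimal sketch, assuming network access to download the checkpoint; note that since this stage 1 model is pre-trained only, the generated text may be less accurate than that of the fine-tuned TrOCR variants:

```python
from PIL import Image
import requests
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

# Load the same sample text-line image as in step 2.
url = 'https://fki.tic.heia-fr.ch/static/img/a01-122-02-00.jpg'
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

processor = TrOCRProcessor.from_pretrained('microsoft/trocr-base-stage1')
model = VisionEncoderDecoderModel.from_pretrained('microsoft/trocr-base-stage1')

# Some versions of transformers require these on the top-level config for generate().
model.config.decoder_start_token_id = model.config.decoder.decoder_start_token_id
model.config.pad_token_id = processor.tokenizer.pad_token_id

pixel_values = processor(image, return_tensors="pt").pixel_values
# generate() runs the decoder autoregressively from the start token.
generated_ids = model.generate(pixel_values)
text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(text)
```

`processor.batch_decode` maps the generated token ids back to a string, skipping special tokens such as the start and end markers.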

License

No license is specified here. Models and code distributed through Hugging Face are subject to the licenses listed on their respective repository or model pages, so check the model page for specific license information.
