trocr base stage1
microsoft
Introduction
trocr-base-stage1 is a pre-trained, base-sized TrOCR model for optical character recognition (OCR) on single text-line images. It was introduced in the paper "TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models" by Li et al. and is available on the Hugging Face model hub.
Architecture
TrOCR is an encoder-decoder model consisting of two main components:
- Image Transformer Encoder: Initialized from BEiT weights. It splits each image into fixed-size patches (16x16 pixels) and linearly embeds them. Absolute position embeddings are added before the patch sequence is fed into the Transformer encoder layers.
- Text Transformer Decoder: Initialized from RoBERTa weights. It autoregressively generates text tokens, conditioned on the encoder's representation of the image patches.
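To make the patching step concrete, the sketch below counts how many 16x16 patches the encoder would see for a given image size. The function name `count_patches` is hypothetical, and the 384x384 input size is an assumption used only for illustration; the actual size the processor resizes to should be read from the model's configuration.

```python
def count_patches(height, width, patch_size=16):
    """Return the number of non-overlapping square patches in an image.

    Assumes the image dimensions are exact multiples of patch_size,
    which is the usual situation after the processor has resized the image.
    """
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be multiples of patch_size")
    return (height // patch_size) * (width // patch_size)


# Assumed 384x384 input: 24 patches per side, so 24 * 24 patches in total.
print(count_patches(384, 384))  # 576
```

Each patch becomes one position in the encoder's input sequence, so the patch count is a rough guide to the encoder's sequence length.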
Training
No training instructions are given for this checkpoint, since it is distributed already pre-trained. In basic usage, the input image is converted to pixel values and passed through the encoder-decoder architecture to perform OCR.
Guide: Running Locally
To use the TrOCR model in a local environment with PyTorch, follow these steps:
- Install Dependencies: Ensure you have the transformers, torch, and Pillow (PIL) packages installed.
    pip install transformers torch pillow
- Load and Process Image: Use the PIL library to load the image and convert it to RGB format.
    from PIL import Image
    import requests

    # Download a sample text-line image and convert it to RGB
    url = 'https://fki.tic.heia-fr.ch/static/img/a01-122-02-00.jpg'
    image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
- Initialize Processor and Model: Load the TrOCR processor and model from Hugging Face.
    from transformers import TrOCRProcessor, VisionEncoderDecoderModel

    processor = TrOCRProcessor.from_pretrained('microsoft/trocr-base-stage1')
    model = VisionEncoderDecoderModel.from_pretrained('microsoft/trocr-base-stage1')
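- Perform Inference: Convert the image to pixel values and run a forward pass, starting the decoder from its start token. The returned outputs contain the logits for the next text token.
    import torch

    # Resize and normalize the image into a batch of pixel values
    pixel_values = processor(image, return_tensors="pt").pixel_values
    # Begin decoding from the decoder's start token
    decoder_input_ids = torch.tensor([[model.config.decoder.decoder_start_token_id]])
    outputs = model(pixel_values=pixel_values, decoder_input_ids=decoder_input_ids)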
- Cloud GPUs: For faster processing, consider using cloud GPU services such as AWS EC2 with NVIDIA GPUs, Google Cloud Platform, or Azure.
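The forward pass in the inference step scores only the next token. Producing a full text line means repeating that step, appending the chosen token each time. The following is a minimal, library-free sketch of such a greedy loop, not the model's own decoding routine. The names `greedy_decode` and `step_logits` are hypothetical; with the real model, `step_logits` would presumably wrap a call like `model(pixel_values=..., decoder_input_ids=...)` and return the scores at the last position of `outputs.logits`.

```python
def greedy_decode(step_logits, start_id, eos_id, max_len=20):
    """Greedy autoregressive decoding.

    step_logits: callable taking the token ids generated so far and
                 returning a list of scores, one per vocabulary entry,
                 for the next token (hypothetical interface).
    """
    ids = [start_id]
    for _ in range(max_len):
        scores = step_logits(ids)
        # Pick the highest-scoring token as the next token
        next_id = max(range(len(scores)), key=scores.__getitem__)
        ids.append(next_id)
        if next_id == eos_id:
            break  # stop once the end-of-sequence token is produced
    return ids


def toy_step(ids):
    """Toy stand-in for the model: always favors token id len(ids)."""
    scores = [0.0] * 5
    scores[len(ids)] = 1.0
    return scores


print(greedy_decode(toy_step, start_id=0, eos_id=3))  # [0, 1, 2, 3]
```

In practice, the `generate` method in transformers implements this kind of loop (along with alternatives such as beam search), and the processor's `batch_decode` turns the resulting ids back into text.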
License
No license is specified here. Models and code on Hugging Face are subject to the licenses stated on their repository or model page, so check the model page for the terms that apply.