yelpfeast

BYT5-Base-English-OCR-Correction

Introduction

BYT5-Base-English-OCR-Correction is a fine-tuned version of the ByT5 model designed specifically for correcting errors from Optical Character Recognition (OCR) systems. The model takes input sentences with OCR errors and outputs corrected versions. This fine-tuning process utilized the wikitext dataset with synthetic OCR errors generated using the nlpaug library.

Architecture

The model is based on the ByT5 architecture, introduced in "ByT5: Towards a Token-Free Future with Pre-trained Byte-to-Byte Models" (arXiv:2105.13626). ByT5 is a byte-level variant of T5, the transformer-based text-to-text model: it operates directly on raw UTF-8 bytes rather than subword tokens, which makes it well suited to character-level noise such as OCR errors.
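
Because ByT5 is tokenizer-free, its encoding scheme is easy to illustrate: each UTF-8 byte maps to a token id offset by the three reserved special ids (pad=0, eos=1, unk=2). A minimal sketch of this mapping (illustrative only, not the library implementation):

```python
def byt5_encode(text: str) -> list:
    # Each UTF-8 byte becomes token id byte + 3; ids 0-2 are
    # reserved for the pad, eos, and unk special tokens.
    return [b + 3 for b in text.encode("utf-8")]

def byt5_decode(ids: list) -> str:
    # Drop special ids and shift the rest back to raw bytes.
    return bytes(i - 3 for i in ids if i > 2).decode("utf-8", errors="ignore")

print(byt5_encode("OCR"))  # 'O' is byte 79, so its id is 82
```

This is why ByT5 sequences are longer than subword-tokenized ones: every character costs at least one token, and non-ASCII characters cost several.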

Training

The model was fine-tuned using the wikitext dataset, which was enhanced with synthetic OCR errors. The nlpaug library was employed to introduce these errors. This approach allows the model to learn to correct typical OCR transcription mistakes.
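
The exact augmentation settings are not published here, but the idea behind generating (noisy, clean) training pairs can be sketched with a toy visual-confusion map. `OCR_CONFUSIONS` and `corrupt` below are illustrative names, not part of nlpaug, whose OcrAug ships a much larger empirically derived table:

```python
import random

# Toy visual-confusion map, illustrative only.
OCR_CONFUSIONS = {"o": "0", "l": "1", "i": "1", "e": "c", "s": "5"}

def corrupt(text, p=0.3, rng=None):
    """Swap confusable characters with probability p, producing the
    noisy half of a (noisy, clean) training pair."""
    rng = rng or random.Random(0)
    return "".join(
        OCR_CONFUSIONS[c] if c in OCR_CONFUSIONS and rng.random() < p else c
        for c in text
    )

print(corrupt("chocolates", p=1.0))  # every confusable char swapped: ch0c01atc5
```

During fine-tuning, the corrupted string serves as the model input and the original string as the target.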

Guide: Running Locally

  1. Install Required Libraries: Ensure that transformers, torch, and nlpaug are installed in your environment.
  2. Load Model and Tokenizer:
    from transformers import T5ForConditionalGeneration, AutoTokenizer
    import nlpaug.augmenter.char as nac
    
    model = T5ForConditionalGeneration.from_pretrained("yelpfeast/byt5-base-english-ocr-correction")
    tokenizer = AutoTokenizer.from_pretrained("yelpfeast/byt5-base-english-ocr-correction")
    
  3. Augment and Correct Text:
    aug = nac.OcrAug(aug_char_p=0.4, aug_word_p=0.6)
    clean_text = "Life is like a box of chocolates"
    # Recent nlpaug versions return a list of augmented strings
    augmented_text = aug.augment(clean_text)
    
    inputs = tokenizer(augmented_text, return_tensors="pt", padding=True)
    # Set max_length explicitly: the default generation limit of 20 tokens
    # covers only ~17 bytes of output, which would truncate the correction
    output_sequences = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        do_sample=False,
        max_length=128,
    )
    
    print(tokenizer.batch_decode(output_sequences, skip_special_tokens=True))
    
  4. Cloud GPUs: For large-scale processing or faster inference, consider using cloud-based GPU services like AWS, GCP, or Azure.
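
One practical caveat for the steps above: ByT5 sequence lengths are measured in bytes, so inputs grow several times longer than with subword models. A simple whitespace-aware chunker can keep each piece within a byte budget before calling generate(). `chunk_text` is an illustrative helper, not part of transformers:

```python
def chunk_text(text, max_bytes=256):
    """Split text into whitespace-delimited chunks whose UTF-8 size
    stays within max_bytes (a single oversize word passes through intact)."""
    chunks, current, size = [], [], 0
    for word in text.split():
        wlen = len(word.encode("utf-8")) + 1  # +1 for the joining space
        if current and size + wlen > max_bytes:
            chunks.append(" ".join(current))
            current, size = [], 0
        current.append(word)
        size += wlen
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Each chunk can then be corrected independently and the results rejoined, trading some cross-chunk context for bounded memory and latency.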

License

The license for this model is listed on its Hugging Face model page. Please review the specific terms and conditions there before use.
