yelpfeast/BYT5-Base-English-OCR-Correction
Introduction
BYT5-Base-English-OCR-Correction is a fine-tuned version of the ByT5 model designed specifically for correcting errors from Optical Character Recognition (OCR) systems. The model takes input sentences with OCR errors and outputs corrected versions. This fine-tuning process utilized the wikitext dataset with synthetic OCR errors generated using the nlpaug library.
Architecture
The model is based on the ByT5 architecture described in arXiv:2105.13626. ByT5 is a byte-level variant of the T5 text-to-text transformer that operates directly on raw UTF-8 bytes rather than subword tokens, which makes it well suited to character-level noise such as OCR errors; this checkpoint has been adjusted specifically for OCR error correction.
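To make the byte-level behaviour concrete, here is a minimal sketch (assuming the standard Hugging Face ByT5 tokenizer shipped with this checkpoint): every UTF-8 byte maps to its own ID, so OCR noise never produces out-of-vocabulary tokens.

from transformers import AutoTokenizer

# Sketch: ByT5 IDs are raw byte values offset by three special tokens
# (pad=0, eos=1, unk=2), so "OCR" encodes to one ID per byte plus </s>.
tokenizer = AutoTokenizer.from_pretrained("yelpfeast/byt5-base-english-ocr-correction")
print(tokenizer("OCR").input_ids)  # expected: [82, 70, 85, 1]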
Training
The model was fine-tuned on the wikitext dataset augmented with synthetic OCR errors, introduced with the nlpaug library, so that it learns to undo typical OCR transcription mistakes. A sketch of this kind of data generation is shown below.
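The exact preprocessing script is not published in this card, so the following is only a sketch of the data-generation approach described above; the wikitext configuration name, the slice size, and the augmentation probabilities are illustrative assumptions.

from datasets import load_dataset
import nlpaug.augmenter.char as nac

# Sketch: build (noisy, clean) pairs by corrupting wikitext sentences with
# OCR-style character substitutions. Config name and probabilities are assumed.
aug = nac.OcrAug(aug_char_p=0.4, aug_word_p=0.6)
wikitext = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")

pairs = []
for clean in wikitext["text"][:1000]:
    if not clean.strip():
        continue  # skip empty lines in the raw dump
    noisy = aug.augment(clean)[0]  # recent nlpaug versions return a list
    pairs.append({"input": noisy, "target": clean})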
Guide: Running Locally
- Install Required Libraries: Ensure that transformers, torch, and nlpaug are installed in your environment.
- Load Model and Tokenizer:
from transformers import T5ForConditionalGeneration, AutoTokenizer
import nlpaug.augmenter.char as nac

# Load the fine-tuned ByT5 checkpoint and its byte-level tokenizer.
model = T5ForConditionalGeneration.from_pretrained("yelpfeast/byt5-base-english-ocr-correction")
tokenizer = AutoTokenizer.from_pretrained("yelpfeast/byt5-base-english-ocr-correction")
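ByT5 checkpoints use the same sequence-to-sequence model class as T5, so T5ForConditionalGeneration loads this checkpoint directly, and AutoTokenizer resolves to the byte-level ByT5 tokenizer.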
- Augment and Correct Text:
# Introduce synthetic OCR errors, then ask the model to correct them.
aug = nac.OcrAug(aug_char_p=0.4, aug_word_p=0.6)
corrected_text = "Life is like a box of chocolates"
augmented_text = aug.augment(corrected_text)

inputs = tokenizer(augmented_text, return_tensors="pt", padding=True)
output_sequences = model.generate(
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    do_sample=False,
)
print(tokenizer.batch_decode(output_sequences, skip_special_tokens=True))
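With do_sample=False the model decodes greedily, so repeated runs on the same input yield the same correction. Note that recent nlpaug releases return a list from aug.augment, which the tokenizer call above simply treats as a batch of one sentence.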
- Cloud GPUs: For large-scale processing or faster inference, consider cloud GPU services such as AWS, GCP, or Azure; a minimal batched example is sketched after this list.
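As a starting point, below is a minimal sketch of batched inference on a GPU. It reuses the model and tokenizer loaded earlier; the device handling and the example sentences are illustrative and not part of the original guide.

import torch

# Move the already-loaded model to a GPU if one is available.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

# Illustrative batch of OCR-damaged sentences.
sentences = ["Life is 1ike a b0x of choco1ates", "T0 be or n0t to be"]
inputs = tokenizer(sentences, return_tensors="pt", padding=True).to(device)

with torch.no_grad():
    output_sequences = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        do_sample=False,
    )

print(tokenizer.batch_decode(output_sequences, skip_special_tokens=True))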
License
The model is distributed under the license terms listed on its Hugging Face model page. Please review those terms before using it.