Typo Detector DistilBERT EN
Introduction
The Typo Detector DistilBERT EN is a token classification model designed to detect typographical errors in English text. It is built using the DistilBERT transformer architecture and is compatible with both PyTorch and TensorFlow libraries.
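Since the model is published in the standard Transformers format, it can also be loaded from the TensorFlow side. The following is a minimal sketch, assuming the repository ships only PyTorch weights (hence the from_pt=True conversion flag); adjust as needed for your setup.

    from transformers import AutoTokenizer, TFAutoModelForTokenClassification

    model_name = "m3hrdadfi/typo-detector-distilbert-en"
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    # from_pt=True converts PyTorch weights on the fly if no native TF weights exist (assumption)
    model = TFAutoModelForTokenClassification.from_pretrained(model_name, from_pt=True)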
Architecture
This model utilizes the DistilBERT architecture, a distilled version of BERT, which is optimized for efficiency and speed. It is designed for token classification tasks, specifically to identify and highlight typographical errors in sentences.
Training
The model was trained on the NeuSpell dataset, a comprehensive corpus for spelling correction tasks. On evaluation it achieves approximately 0.989 precision, recall, and F1-score, indicating high accuracy in detecting typos.
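To make these figures concrete, the sketch below shows how token-level precision, recall, and F1 can be computed with scikit-learn; the label names ("TYPO" vs. "O") and the toy predictions are illustrative assumptions, not the model's actual evaluation script.

    from sklearn.metrics import precision_recall_fscore_support

    # One label per token: "TYPO" for a misspelled token, "O" otherwise (assumed label scheme)
    y_true = ["O", "TYPO", "O", "O", "TYPO", "O"]
    y_pred = ["O", "TYPO", "O", "O", "O", "O"]

    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, labels=["TYPO"], average="micro"
    )
    print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")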
Guide: Running Locally
Basic Steps
- Install Requirements:
  Ensure you have the transformers library installed (the snippet below also imports torch, so PyTorch should be available as well):

      pip install transformers
- Load the Model:
  Use the Transformers library to load the model and tokenizer:

      import torch
      from transformers import AutoConfig, AutoTokenizer, AutoModelForTokenClassification, pipeline

      model_name_or_path = "m3hrdadfi/typo-detector-distilbert-en"

      config = AutoConfig.from_pretrained(model_name_or_path)
      tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
      model = AutoModelForTokenClassification.from_pretrained(model_name_or_path, config=config)

      nlp = pipeline(
          'token-classification',
          model=model,
          tokenizer=tokenizer,
          aggregation_strategy="average",
      )
- Run Predictions:
  Prepare and run sentences through the model to detect typos (the structure of the raw pipeline output is shown after these steps):

      sentences = [
          "He had also stgruggled with addiction during his time in Congress .",
          # Add more sentences as needed
      ]

      for sentence in sentences:
          # Collect the character spans the pipeline flags as typos
          typos = [sentence[r["start"]:r["end"]] for r in nlp(sentence)]

          # Wrap each detected typo in <i> tags for display
          detected = sentence
          for typo in typos:
              detected = detected.replace(typo, f'<i>{typo}</i>')

          print(" [Input]: ", sentence)
          print("[Detected]: ", detected)
          print("-" * 130)
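For reference, with an aggregation_strategy set, each pipeline result is a dictionary containing the aggregated entity group, confidence score, matched text, and character offsets; the actual label names come from the checkpoint's configuration. A quick way to inspect both, continuing from the variables defined in the steps above:

    # Label set defined by the checkpoint's config
    print(model.config.id2label)

    # Each result is a dict with keys: entity_group, score, word, start, end
    for r in nlp("He had also stgruggled with addiction during his time in Congress ."):
        print(r["entity_group"], round(r["score"], 3), repr(r["word"]), r["start"], r["end"])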
Cloud GPUs
For enhanced performance, consider using cloud GPU services such as Google Cloud Platform, AWS, or Azure for faster inference times, especially when processing large datasets.
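If a GPU is available, the same pipeline can be placed on it by passing a device index, which is standard Transformers behavior; the sketch below is a minimal illustration rather than a tuned deployment setup.

    import torch
    from transformers import pipeline

    # Use the first GPU if available, otherwise fall back to CPU
    device = 0 if torch.cuda.is_available() else -1

    nlp = pipeline(
        "token-classification",
        model="m3hrdadfi/typo-detector-distilbert-en",
        aggregation_strategy="average",
        device=device,
    )

    # The pipeline also accepts a list of sentences for batched inference
    results = nlp(["He had also stgruggled with addiction during his time in Congress ."])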
License
For licensing information, refer to the model's repository on Hugging Face; for questions, open an issue on the TypoDetector Issues page.