Typo Detector DistilBERT EN

m3hrdadfi

Introduction

The Typo Detector DistilBERT EN is a token classification model designed to detect typographical errors in English text. It is built using the DistilBERT transformer architecture and is compatible with both PyTorch and TensorFlow libraries.

Architecture

This model uses DistilBERT, a distilled version of BERT that is optimized for efficiency and speed. A token classification head labels each token, so that typographical errors in a sentence can be identified and highlighted.

Training

The model is trained on the NeuSpell dataset, a corpus for spelling correction tasks. On evaluation, precision, recall, and F1-score are each approximately 0.989, which indicates that the model flags few correct words as typos and misses few actual typos on this data.
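
As a reminder of how these three metrics relate, here is a minimal sketch that computes them from counts of true positives, false positives, and false negatives. The function name and the counts are made up for illustration; they are not taken from the model's evaluation.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) computed from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts, chosen only to illustrate the arithmetic.
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

F1 is the harmonic mean of precision and recall, so it can only be close to 0.989 when both of the other two values are also high.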

Guide: Running Locally

Basic Steps

  1. Install Requirements:
    Install the transformers library together with PyTorch, which the code below uses as its backend:

    pip install transformers torch
    
  2. Load the Model:
    Use the Transformers library to load the model and tokenizer:

    from transformers import AutoConfig, AutoTokenizer, AutoModelForTokenClassification, pipeline
    
    model_name_or_path = "m3hrdadfi/typo-detector-distilbert-en"
    # Load the configuration, tokenizer, and token classification model.
    config = AutoConfig.from_pretrained(model_name_or_path)
    tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
    model = AutoModelForTokenClassification.from_pretrained(model_name_or_path, config=config)
    # "average" aggregation groups sub-word tokens into word-level spans and averages their scores.
    nlp = pipeline('token-classification', model=model, tokenizer=tokenizer, aggregation_strategy="average")
    
  3. Run Predictions:
    Prepare and run sentences through the model to detect typos:

    sentences = [
        "He had also stgruggled with addiction during his time in Congress .",
        # Add more sentences as needed
    ]
    
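    for sentence in sentences:
        # Each result carries the character offsets ("start", "end") of a flagged span.
        typos = [sentence[r["start"]:r["end"]] for r in nlp(sentence)]
        detected = sentence
        for typo in typos:
            # Note: str.replace marks every occurrence of the substring, not only the flagged one.
            detected = detected.replace(typo, f'<i>{typo}</i>')
        print("   [Input]: ", sentence)
        print("[Detected]: ", detected)
        print("-" * 130)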
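
Because each pipeline result includes character offsets, the highlighting can also be built directly from `start` and `end` instead of substring replacement, so that an identical substring elsewhere in the sentence is left alone. The following is a minimal sketch of that idea. The helper `highlight_spans` and the hand-written `mock_results` list are hypothetical stand-ins, used here so the example runs without downloading the model.

```python
def highlight_spans(sentence, spans, open_tag="<i>", close_tag="</i>"):
    """Wrap each (start, end) character span of `sentence` in tags."""
    pieces = []
    cursor = 0
    for start, end in sorted(spans):
        if start < cursor:
            # Skip a span that overlaps one already emitted.
            continue
        pieces.append(sentence[cursor:start])
        pieces.append(open_tag + sentence[start:end] + close_tag)
        cursor = end
    pieces.append(sentence[cursor:])
    return "".join(pieces)


sentence = "He had also stgruggled with addiction during his time in Congress ."

# Hand-written stand-in for pipeline output; only "start" and "end" are used.
start = sentence.index("stgruggled")
mock_results = [{"start": start, "end": start + len("stgruggled")}]

highlighted = highlight_spans(sentence, [(r["start"], r["end"]) for r in mock_results])
print(highlighted)
```

With the real pipeline, `mock_results` would be replaced by `nlp(sentence)`.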
    

Cloud GPUs

For enhanced performance, consider using cloud GPU services such as Google Cloud Platform, AWS, or Azure for faster inference times, especially when processing large datasets.
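
On a machine with a GPU, the pipeline has to be told to use it. The sketch below picks a device index in the form the Transformers `pipeline` function accepts for its `device` argument (`-1` for CPU, `0` for the first CUDA GPU). The helper name `pick_device` and the batch size in the comments are assumptions for illustration, not settings from the model card.

```python
def pick_device():
    """Return 0 (first CUDA GPU) if PyTorch can see one, otherwise -1 (CPU)."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed, so fall back to CPU.
        return -1
    return 0 if torch.cuda.is_available() else -1


device = pick_device()
print("pipeline device index:", device)

# With the model and tokenizer loaded as in the guide, the pipeline could be built as:
#   nlp = pipeline("token-classification", model=model, tokenizer=tokenizer,
#                  aggregation_strategy="average", device=device)
# and called on a list of sentences in batches (32 is an arbitrary example value):
#   results = nlp(sentences, batch_size=32)
```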

License

For licensing information, please refer to the model's repository on Hugging Face or the TypoDetector Issues page for inquiries.
