punctuate all

kredor

Punctuate-All

Introduction

Punctuate-All is a punctuation restoration model that builds on Oliver Guhr's work and is fine-tuned from the xlm-roberta-base checkpoint. It restores punctuation in twelve languages: English, German, French, Spanish, Bulgarian, Italian, Polish, Dutch, Czech, Portuguese, Slovak, and Slovenian. It extends the original work, which supported four languages.

Architecture

The model is built on the multilingual xlm-roberta-base transformer and is loaded through Hugging Face's Transformers library. It treats punctuation restoration as a token classification task: each token is assigned a label that says which punctuation mark, if any, to insert at that position. The same model handles all twelve supported languages. The training data includes the wmt/europarl dataset.

Training

The model was evaluated with precision, recall, and F1-score, and reaches an overall accuracy of 98%. Performance varies by punctuation mark: periods, commas, and question marks are restored reliably, while dashes and colons score much lower, mainly because of low recall. The per-mark scores are as follows:

  • .: Precision 0.94, Recall 0.95, F1 0.95
  • ,: Precision 0.86, Recall 0.86, F1 0.86
  • ?: Precision 0.88, Recall 0.85, F1 0.86
  • -: Precision 0.60, Recall 0.29, F1 0.39
  • :: Precision 0.71, Recall 0.49, F1 0.58
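
The F1 values in the list above are the harmonic mean of precision and recall. The sketch below recomputes them from the rounded figures as a sanity check; the function name `f1_score` and the `reported` dictionary are illustrative and not part of the model's code. Because the inputs are rounded to two decimals, a recomputed value can differ from the reported one in the last digit.

```python
def f1_score(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded (precision, recall) pairs copied from the list above.
reported = {
    ".": (0.94, 0.95),
    ",": (0.86, 0.86),
    "?": (0.88, 0.85),
    "-": (0.60, 0.29),
    ":": (0.71, 0.49),
}

for mark, (precision, recall) in reported.items():
    print(f"{mark!r}: F1 = {f1_score(precision, recall):.2f}")
```

The dash illustrates why F1 is reported alongside precision: a precision of 0.60 looks moderate on its own, but the low recall of 0.29 pulls the harmonic mean down sharply.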

Guide: Running Locally

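  1. Clone the Repository (optional): Clone the model repository from Hugging Face if you want a local copy of the files. The weights are stored with Git LFS, so Git LFS must be installed for the clone to fetch them. You can skip this step, because the pipeline in step 3 downloads the model by name.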
    git clone https://huggingface.co/kredor/punctuate-all
    
  2. Install Dependencies: Make sure Python is installed, then install the required libraries:
    pip install transformers torch
    
  3. Load the Model: Use the Transformers library to load the model.
    from transformers import pipeline
    
    model_name = "kredor/punctuate-all"
    nlp = pipeline("token-classification", model=model_name)
    
  4. Run Inference: Pass unpunctuated text to the pipeline. The result is a list of per-token predictions, not a punctuated string.
    text = "your text here"
    result = nlp(text)
    print(result)
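
The pipeline returns token-level predictions, so one more step is needed to get readable text. The following is a minimal sketch of that step, not the author's method. It assumes each prediction is a dictionary with `entity`, `start`, and `end` keys (the shape the token-classification pipeline produces without aggregation), that the label `"0"` means "no punctuation", and that the label on the last sub-token of a word is the mark to place after that word. The function name `restore_punctuation` is hypothetical.

```python
from typing import Dict, List


def restore_punctuation(text: str, predictions: List[Dict], no_punct_label: str = "0") -> str:
    """Append the predicted mark after each whitespace-delimited word.

    Assumption: the prediction whose `end` offset matches the end of a word
    carries the label for that word; words with no matching prediction are
    left unchanged.
    """
    # Map the character offset where a token ends to its predicted label.
    label_at_end = {p["end"]: p["entity"] for p in predictions}

    pieces = []
    position = 0
    for word in text.split():
        start = text.index(word, position)
        end = start + len(word)
        position = end
        label = label_at_end.get(end, no_punct_label)
        pieces.append(word if label == no_punct_label else word + label)
    return " ".join(pieces)


# Hand-written predictions standing in for real pipeline output.
example_text = "hello how are you"
example_predictions = [
    {"entity": ",", "start": 0, "end": 5},
    {"entity": "0", "start": 6, "end": 9},
    {"entity": "0", "start": 10, "end": 13},
    {"entity": "?", "start": 14, "end": 17},
]
print(restore_punctuation(example_text, example_predictions))
```

With the real model, `result` from step 4 would take the place of `example_predictions`. Check the label names in the model's configuration before relying on the `"0"` default, and note that the sketch does not restore capitalization.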
    

For enhanced performance, especially with large datasets, consider using cloud GPUs such as those offered by AWS, Google Cloud, or Azure.
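
When processing many texts, it usually helps to feed them to the pipeline in groups rather than one at a time. The helper below is a small sketch of that idea; `iter_batches` and the batch size of 16 are illustrative choices, not settings recommended by the model's author.

```python
from typing import Iterator, List, Sequence


def iter_batches(texts: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `texts` with at most `batch_size` items each."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(texts), batch_size):
        yield list(texts[start:start + batch_size])


# Usage sketch, assuming `nlp` is the pipeline created in step 3:
#
#     for batch in iter_batches(all_texts, 16):
#         results = nlp(batch)  # a list of inputs yields one result list per input
```

Grouping inputs this way keeps memory use bounded and makes it easy to save partial results between batches. The best batch size depends on text length and the available hardware.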

License

Punctuate-All is released under the MIT License. The license permits use, modification, and distribution, including commercial use, provided the copyright notice and license text are included with copies of the software.
