Punctuate-All

Introduction

Punctuate-All is a punctuation restoration model based on Oliver Guhr's work and fine-tuned from xlm-roberta-base. It supports twelve languages: English, German, French, Spanish, Bulgarian, Italian, Polish, Dutch, Czech, Portuguese, Slovak, and Slovenian. It extends the original work, which supported four languages.

Architecture

The model uses the xlm-roberta-base transformer through Hugging Face's Transformers library. It treats punctuation restoration as token classification: each token receives a label indicating which punctuation mark, if any, should be inserted. The training data includes the wmt/europarl dataset.

Training

The model was evaluated with precision, recall, and F1-score. Overall accuracy is 98%, but performance varies by punctuation mark: periods, commas, and question marks score well, while dashes and colons score much lower. The per-mark results are as follows:

Period (.): Precision 0.94, Recall 0.95, F1 0.95
Comma (,): Precision 0.86, Recall 0.86, F1 0.86
Question mark (?): Precision 0.88, Recall 0.85, F1 0.86
Dash (-): Precision 0.60, Recall 0.29, F1 0.39
Colon (:): Precision 0.71, Recall 0.49, F1 0.58
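
The F1 values above can be sanity-checked from the listed precision and recall, since F1 is their harmonic mean. The following is a minimal sketch, not the evaluation script used for the model; the helper name `f1_score` and the `reported` table are illustrative. Because the published numbers are rounded to two decimals, a recomputed value can differ from the reported one by about 0.01.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the scores listed above.
reported = {
    ".": (0.94, 0.95, 0.95),
    ",": (0.86, 0.86, 0.86),
    "?": (0.88, 0.85, 0.86),
    "-": (0.60, 0.29, 0.39),
    ":": (0.71, 0.49, 0.58),
}

for mark, (p, r, f1) in reported.items():
    print(f"{mark!r}: recomputed F1 = {f1_score(p, r):.3f}, reported = {f1:.2f}")
```

The dash row shows why a low recall dominates the result: with precision 0.60 and recall 0.29, the harmonic mean comes out near 0.39, well below the arithmetic mean of the two.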

Guide: Running Locally

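Clone the Repository (optional): Clone the model repository from Hugging Face. Fetching the weight files this way requires Git LFS. Cloning is not strictly necessary, because the loading step below downloads the model from the Hub by name.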
```
git clone https://huggingface.co/kredor/punctuate-all
```
Install Dependencies: Ensure you have Python and the required libraries. You can install dependencies using:
```
pip install transformers torch
```

Load the Model: Use the Transformers library to load the model.

```
from transformers import pipeline

model_name = "kredor/punctuate-all"
nlp = pipeline("token-classification", model=model_name)
```

Run Inference: Use the model to predict punctuation for given text inputs.
```
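text = "your text here"
# The pipeline returns a list of per-token predictions (label and score),
# not a punctuated string.
result = nlp(text)
print(result)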
```
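
Because the pipeline returns per-token labels rather than finished text, a small post-processing step is needed to produce a punctuated string. The sketch below shows one way to do that, under stated assumptions: each prediction is a dict with `entity`, `start`, and `end` keys (as produced by the token-classification pipeline with a fast tokenizer), and the label `"0"` means "no punctuation". The helper `restore_punctuation` and the hand-written sample predictions are hypothetical and stand in for real model output; check the model's `id2label` mapping before relying on the label names.

```python
def restore_punctuation(text, predictions, no_punct_label="0"):
    """Insert predicted punctuation marks into `text`.

    `predictions` is assumed to be a list of dicts with "entity" (the
    predicted label) and "start"/"end" (character offsets into `text`).
    `no_punct_label` is an assumption about the model's label set.
    """
    # Collect insert positions. Only word ends qualify, so a label on an
    # inner subword token never puts a mark in the middle of a word.
    inserts = {}
    for pred in predictions:
        end = pred["end"]
        at_word_end = end == len(text) or text[end].isspace()
        if at_word_end and pred["entity"] != no_punct_label:
            inserts[end] = pred["entity"]

    # Rebuild the string, splicing the marks in from left to right.
    pieces = []
    last = 0
    for pos in sorted(inserts):
        pieces.append(text[last:pos])
        pieces.append(inserts[pos])
        last = pos
    pieces.append(text[last:])
    return "".join(pieces)


# Hand-written predictions standing in for `nlp(sample_text)`. Real output
# also carries "score", "index", and "word" keys, which this sketch ignores.
sample_text = "hello how are you"
sample_predictions = [
    {"entity": ",", "start": 0, "end": 5},
    {"entity": "0", "start": 6, "end": 9},
    {"entity": "0", "start": 10, "end": 13},
    {"entity": "?", "start": 14, "end": 17},
]
print(restore_punctuation(sample_text, sample_predictions))  # hello, how are you?
```

This sketch does not restore capitalization and does not split long inputs; both would need separate handling.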

For faster inference, especially on large datasets, consider running the model on a GPU, for example a cloud GPU instance from AWS, Google Cloud, or Azure.
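
When a GPU is available, the pipeline can be placed on it through the `device` argument of `pipeline`, which takes `-1` for CPU or a CUDA device index such as `0`. The following is a minimal sketch; the helper name `pick_device` is hypothetical, and it simply falls back to CPU when PyTorch or CUDA is not present.

```python
def pick_device():
    """Return 0 (first CUDA GPU) if one is usable, otherwise -1 (CPU)."""
    try:
        import torch
    except ImportError:
        # Without PyTorch there is no CUDA check to run; stay on CPU.
        return -1
    return 0 if torch.cuda.is_available() else -1


device = pick_device()
print("pipeline device:", device)

# Usage (downloads the model on first run, so it is left commented out):
# from transformers import pipeline
# nlp = pipeline("token-classification", model="kredor/punctuate-all", device=device)
```

Passing a list of strings to the pipeline instead of one string at a time generally makes better use of a GPU.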

License

Punctuate-All is released under the MIT License. This permits use, modification, and distribution, provided the copyright and license notice are included with copies of the work.