Punctuate-All
Introduction
Punctuate-All is a model based on Oliver Guhr's work, specifically fine-tuned on the xlm-roberta-base architecture. It supports punctuation restoration across twelve languages: English, German, French, Spanish, Bulgarian, Italian, Polish, Dutch, Czech, Portuguese, Slovak, and Slovenian. This model is an extension of the original work, which supported four languages.
Architecture
The model utilizes the xlm-roberta-base transformer from Hugging Face's Transformers library. Punctuate-All frames punctuation restoration as token classification: each token is assigned either "no punctuation" or the mark that should follow it. The multilingual backbone is what allows a single model to cover twelve languages. The training data includes the wmt/europarl dataset.
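Because the task is token classification, training pairs can be derived from already-punctuated text by stripping each trailing mark and recording it as the word's label. The following is a minimal sketch of that idea (the label set and the "0" no-punctuation convention here are illustrative assumptions, not the model's exact preprocessing):

```python
# Punctuation marks the model predicts, per the Training section below.
MARKS = ".?,-:"

def make_examples(sentence):
    # For each word, record the punctuation mark that follows it;
    # "0" (an assumed convention) means no punctuation.
    examples = []
    for token in sentence.split():
        if token and token[-1] in MARKS:
            examples.append((token[:-1], token[-1]))
        else:
            examples.append((token, "0"))
    return examples

print(make_examples("Hello, how are you?"))
# [('Hello', ','), ('how', '0'), ('are', '0'), ('you', '?')]
```

At inference time the process runs in reverse: the model labels unpunctuated tokens, and the predicted marks are reattached.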
Training
The model has been evaluated using precision, recall, and F1-score, with an overall accuracy of 98%. Per-mark results:

- ".": Precision 0.94, Recall 0.95, F1 0.95
- ",": Precision 0.86, Recall 0.86, F1 0.86
- "?": Precision 0.88, Recall 0.85, F1 0.86
- "-": Precision 0.60, Recall 0.29, F1 0.39
- ":": Precision 0.71, Recall 0.49, F1 0.58
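As a quick sanity check, F1 is the harmonic mean of precision and recall, and recomputing it reproduces the reported values for the two weaker classes:

```python
def f1(precision, recall):
    # F1 is the harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)

print(round(f1(0.60, 0.29), 2))  # 0.39, matching the "-" row
print(round(f1(0.71, 0.49), 2))  # 0.58, matching the ":" row
```

The low recall for "-" and ":" means these marks are often omitted by the model, which drags their F1 well below the overall 98% accuracy.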
Guide: Running Locally
- Clone the Repository: Clone the model repository from Hugging Face.
git clone https://huggingface.co/kredor/punctuate-all
- Install Dependencies: Ensure you have Python and the required libraries. You can install dependencies using:
pip install transformers torch
- Load the Model: Use the Transformers library to load the model.
from transformers import pipeline

model_name = "kredor/punctuate-all"
nlp = pipeline("token-classification", model=model_name)
- Run Inference: Use the model to predict punctuation for given text inputs.
text = "your text here"
result = nlp(text)
print(result)
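The pipeline returns per-token predictions rather than finished text. A minimal sketch of stitching predictions back into punctuated output, assuming each word is paired with either "0" (no punctuation) or the mark to append (the actual label names may differ; check the model's config):

```python
def restore_punctuation(tokens_with_labels):
    # tokens_with_labels: list of (word, label) pairs, where the label
    # is "0" for no punctuation or the mark to append to the word.
    out = []
    for word, label in tokens_with_labels:
        out.append(word + ("" if label == "0" else label))
    return " ".join(out)

# Hypothetical predictions for the input "hello how are you"
preds = [("hello", ","), ("how", "0"), ("are", "0"), ("you", "?")]
print(restore_punctuation(preds))  # hello, how are you?
```

In practice you would also merge subword pieces back into whole words before this step, since xlm-roberta tokenizes below the word level.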
For faster inference, especially on large volumes of text, consider running on a GPU, such as cloud instances from AWS, Google Cloud, or Azure.
License
Punctuate-All is released under the MIT License. This allows for flexibility in use, modification, and distribution, provided the original license terms are maintained.