dominguesm/bert-restore-punctuation-ptbr
Introduction
The BERT-RESTORE-PUNCTUATION-PTBR model is a variant of bert-base-portuguese-cased, fine-tuned for punctuation restoration in Portuguese text. It is trained to restore the punctuation marks ! ? . , - : ; ' and to correct word casing. The model is suitable for general-purpose punctuation restoration in Portuguese and can be further fine-tuned for domain-specific applications.
Architecture
The model is based on the BERT architecture, specifically the bert-base-portuguese-cased variant. It uses Hugging Face's Transformers library and is implemented in PyTorch. Trained on the WikiLingua dataset, it frames punctuation restoration as a token-classification (named-entity-recognition-style) task.
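The token-classification framing can be illustrated with a small sketch: each input token receives a label encoding its casing and any trailing punctuation, and the punctuated sentence is rebuilt from those labels. The label scheme below (casing + "_" + mark, with "O" meaning no mark) is an illustrative assumption, not the model's actual tag set.

```python
# Illustrative sketch: punctuation restoration as token classification.
# The "CASING_MARK" label scheme here is an assumption for demonstration,
# not the tag set used by bert-restore-punctuation-ptbr.

def apply_labels(tokens, labels):
    """Rebuild a punctuated, cased sentence from per-token labels."""
    out = []
    for token, label in zip(tokens, labels):
        casing, punct = label.split("_")  # e.g. "UPPER_." -> ("UPPER", ".")
        word = token.capitalize() if casing == "UPPER" else token
        out.append(word + (punct if punct != "O" else ""))
    return " ".join(out)

tokens = ["henrique", "foi", "pescar", "mais", "tarde", "voltou"]
labels = ["UPPER_O", "lower_O", "lower_.", "UPPER_O", "lower_,", "lower_."]
print(apply_labels(tokens, labels))
# -> Henrique foi pescar. Mais tarde, voltou.
```

A classifier head over BERT's token embeddings predicts one such label per token; decoding is then a deterministic reconstruction like the one above.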
Training
The model was fine-tuned on WikiLingua, a large multilingual dataset built from WikiHow articles. It achieves an overall F1 score of 55.7, with precision of 57.72 and recall of 53.83, and is optimized to restore punctuation and casing in Portuguese sentences; per-mark accuracy figures are reported for each punctuation symbol.
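To make the reported precision/recall/F1 figures concrete, the sketch below shows one common way such per-mark metrics are computed from predicted and reference punctuation labels. This is a generic illustration; the exact evaluation protocol behind the reported scores is not specified here.

```python
# Sketch: per-mark precision, recall, and F1 for punctuation labels.
# Generic token-level scoring; not necessarily the exact protocol
# used to produce the model card's reported metrics.

def punct_f1(pred_labels, true_labels, mark):
    """Score one punctuation mark against reference labels ('O' = no mark)."""
    tp = sum(p == t == mark for p, t in zip(pred_labels, true_labels))
    fp = sum(p == mark and t != mark for p, t in zip(pred_labels, true_labels))
    fn = sum(t == mark and p != mark for p, t in zip(pred_labels, true_labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

true = [".", "O", ",", "O", "."]
pred = [".", "O", "O", ",", "."]
print(punct_f1(pred, true, "."))  # both periods recovered -> (1.0, 1.0, 1.0)
print(punct_f1(pred, true, ","))  # comma misplaced -> (0.0, 0.0, 0.0)
```

Averaging these per-mark scores over all restored punctuation symbols yields aggregate figures of the kind quoted above.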
Guide: Running Locally
- Install the package:

  pip install respunct

- Sample Python code:

  from respunct import RestorePuncts

  model = RestorePuncts()
  print(model.restore_puncts(
      """henrique foi no lago pescar com o pedro mais tarde foram para a casa do pedro fritar os peixes"""))
  # Output: Henrique foi no lago pescar com o Pedro.
  #         Mais tarde, foram para a casa do Pedro fritar os peixes.

- Suggested cloud GPUs: for faster inference, consider cloud platforms such as AWS, Google Cloud, or Azure, which offer GPU instances for model inference.
License
The model is licensed under the Creative Commons Attribution 4.0 International License (cc-by-4.0), allowing for sharing and adaptation with appropriate credit.