bert-restore-punctuation

felflare

Introduction

BERT-RESTORE-PUNCTUATION is a fine-tuned BERT model designed for punctuation restoration in English, particularly useful for text that lacks punctuation, such as the output of automatic speech recognition (ASR) systems. The model restores the punctuation marks ! ? . , - : ; and ', and also restores the upper-casing of words.

Architecture

The model is based on bert-base-uncased, fine-tuned on the Yelp Reviews dataset to restore punctuation and upper-casing in plain, lowercase text. It uses the Transformers library with PyTorch support.
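Punctuation restoration of this kind is typically framed as token classification: each input token receives a label encoding the punctuation that follows it and whether it should be capitalized. The sketch below uses a hypothetical label scheme (`"UPPER|."`-style tags, not the model's actual tag set) to show how such per-token predictions map back to punctuated text:

```python
# Minimal sketch of decoding per-token labels into punctuated, cased text.
# The label scheme here (case flag + trailing punctuation, "O" for no change)
# is illustrative only, not the model's actual tag set.

def decode_predictions(tokens, labels):
    """Rebuild punctuated text from aligned (token, label) pairs."""
    words = []
    for token, label in zip(tokens, labels):
        case, _, punct = label.partition("|")   # e.g. "UPPER|." or "O|,"
        word = token.capitalize() if case == "UPPER" else token
        words.append(word + (punct if punct != "O" else ""))
    return " ".join(words)

tokens = ["my", "name", "is", "clara", "and", "i", "live", "in", "berkeley"]
labels = ["UPPER|O", "O|O", "O|O", "UPPER|O", "O|O", "UPPER|O", "O|O", "O|O", "UPPER|."]
print(decode_predictions(tokens, labels))
# -> My name is Clara and I live in Berkeley.
```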

Training

The model was fine-tuned on 560,000 English text samples from the Yelp Reviews dataset, with optimal performance reached after 3 epochs. On a held-out set of 45,990 texts it achieves an overall accuracy of 91% and an F1 score of 90%, with detailed precision and recall scores reported for each punctuation mark.
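Per-mark precision and recall of this kind are computed by comparing predicted labels against reference labels token by token. A minimal sketch of that bookkeeping (illustrative, not the evaluation script used for this model):

```python
# Illustrative per-class precision/recall/F1 over aligned label sequences.
# Not the actual evaluation code; label names here are placeholders.

from collections import Counter

def per_class_f1(true_labels, pred_labels):
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(true_labels, pred_labels):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1   # predicted p where it wasn't
            fn[t] += 1   # missed the true label t
    scores = {}
    for label in set(true_labels) | set(pred_labels):
        prec = tp[label] / (tp[label] + fp[label]) if tp[label] + fp[label] else 0.0
        rec = tp[label] / (tp[label] + fn[label]) if tp[label] + fn[label] else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        scores[label] = {"precision": prec, "recall": rec, "f1": f1}
    return scores

true = [".", ",", "O", "O", ".", "?"]
pred = [".", "O", "O", ",", ".", "?"]
print(per_class_f1(true, pred)["."])   # the period class is predicted perfectly here
```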

Guide: Running Locally

  1. Installation: Use the following command to install necessary packages:

    pip install rpunct
    
  2. Usage: Run the following sample Python code to restore punctuation:

    from rpunct import RestorePuncts

    # Instantiate the restorer (loads the model; uses a GPU if one is available)
    rpunct = RestorePuncts()

    # Input should be plain, lowercase, unpunctuated English text
    result = rpunct.punctuate("your text here")
    print(result)
    

    This code works with arbitrarily large English text inputs and utilizes a GPU if available. For enhanced performance, consider using cloud GPUs from providers like AWS, Google Cloud, or Azure.
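rpunct accepts long inputs on its own, but if you need to pre-split a very large document yourself (for example, to batch pieces across workers before punctuating each one), a simple word-window splitter can be sketched as follows. The helper name and the 400-word window size are hypothetical, not part of the rpunct API:

```python
# Hypothetical helper: split long unpunctuated text into fixed-size word
# windows for custom batching. rpunct itself already handles long inputs,
# so this is only needed for your own pre-processing pipelines.

def split_into_windows(text, max_words=400):
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

chunks = split_into_windows("word " * 1000, max_words=400)
print(len(chunks))  # -> 3 windows: 400 + 400 + 200 words
```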

License

This model is released under the MIT License, which permits reuse, modification, and distribution with minimal restrictions.
