fullstop-punctuation-multilang-large

oliverguhr

Introduction

The FullStop model, developed by Oliver Guhr, is a multilingual deep learning model for punctuation prediction. It restores punctuation in transcribed spoken language for English, Italian, French, and German. Because it was trained on the Europarl Dataset, which consists of political speeches, it may perform less consistently on text from other domains.

Architecture

The FullStop model is based on the XLM-RoBERTa architecture and is available for multiple libraries, including Transformers, PyTorch, TensorFlow, and ONNX. It treats punctuation prediction as token classification: for each token it predicts which marker, if any, follows it, covering periods, commas, question marks, hyphens, and colons.
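
As a rough illustration of what token classification means here, the sketch below attaches one label to each word and rebuilds the sentence. The label names ("0" for no punctuation, otherwise the mark itself) and the helper `apply_labels` are assumptions made for illustration, not the package's API, and the hand-written labels stand in for real model output.

```python
# Assumed label scheme: "0" = no punctuation after the word,
# any other label is the punctuation mark to append.
NO_PUNCT = "0"

def apply_labels(words, labels):
    """Append each word's label (if it is a punctuation mark) and join the words."""
    if len(words) != len(labels):
        raise ValueError("need exactly one label per word")
    pieces = []
    for word, label in zip(words, labels):
        pieces.append(word if label == NO_PUNCT else word + label)
    return " ".join(pieces)

words = "Ist das eine Frage Frau Müller".split()
labels = ["0", "0", "0", ",", "0", "?"]  # hand-written stand-in for model output
restored = apply_labels(words, labels)
print(restored)
```

Keeping the labels separate from the words, as in this sketch, makes it possible to inspect or filter predictions before they are merged back into the text.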

Training

The model was trained on the Europarl Dataset provided by the SEPP-NLG Shared Task. It achieves high F1 scores across the supported languages, but performance varies by punctuation type, because some markers are optional in many contexts.
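
To make the per-marker comparison concrete, here is a minimal sketch of how an F1 score can be computed separately for each punctuation label. The function `per_class_f1` and the toy label sequences are hypothetical and are not results from the model; the shared task's official scoring may differ in detail.

```python
from collections import Counter

def per_class_f1(gold, predicted, no_punct="0"):
    """Return {label: F1} for each punctuation label, skipping the no-punctuation class."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label p where it did not belong
            fn[g] += 1  # missed the gold label g
    scores = {}
    for label in (set(gold) | set(predicted)) - {no_punct}:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores

# Toy data, invented for illustration only.
gold      = [",", "0", ".", "0", ",", "?"]
predicted = ["0", "0", ".", "0", ",", "?"]
scores = per_class_f1(gold, predicted)
print(scores)
```

In this toy example one of the two commas is missed, so the comma score drops while the period and question mark stay perfect, which is the kind of per-marker difference described above.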

Guide: Running Locally

  1. Install the Package:
    Install the required Python package using pip:

    pip install deepmultilingualpunctuation
    
  2. Restore Punctuation:
    Use the following code to restore punctuation in a text:

    from deepmultilingualpunctuation import PunctuationModel
    
    model = PunctuationModel()
    text = "My name is Clara and I live in Berkeley California Ist das eine Frage Frau Müller"
    result = model.restore_punctuation(text)
    print(result)
    
  3. Predict Labels:
    The model can also predict punctuation labels for each word:

    from deepmultilingualpunctuation import PunctuationModel
    
    model = PunctuationModel()
    text = "My name is Clara and I live in Berkeley California Ist das eine Frage Frau Müller"
    clean_text = model.preprocess(text)
    labeled_words = model.predict(clean_text)
    print(labeled_words)
    
  4. Cloud GPUs:
    For large-scale or resource-intensive tasks, consider using cloud GPUs from providers like AWS, Google Cloud, or Azure.
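
The per-word predictions from step 3 can be post-processed before they are merged back into text. The sketch below assumes each prediction is a `[word, label, score]` triple in which "0" means no punctuation; that output shape is an assumption here, and the sample triples and the helper `drop_uncertain` are made up for illustration.

```python
def drop_uncertain(predictions, threshold=0.5, no_punct="0"):
    """Replace punctuation labels scored below `threshold` with the no-punctuation label."""
    cleaned = []
    for word, label, score in predictions:
        if label != no_punct and score < threshold:
            label = no_punct
        cleaned.append([word, label, score])
    return cleaned

# Invented sample in the assumed [word, label, score] shape.
sample = [
    ["Berkeley", ",", 0.42],
    ["California", ".", 0.97],
    ["Ist", "0", 0.99],
]
cleaned = drop_uncertain(sample, threshold=0.5)
print(cleaned)
```

A threshold like this trades recall for precision; a suitable value depends on the text domain and would need to be chosen empirically.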

License

The FullStop model is licensed under the MIT License, allowing for flexibility in usage and modification.
