xx_sent_ud_sm

Introduction

The xx_sent_ud_sm model is a multilingual spaCy pipeline optimized for CPU usage. It primarily focuses on sentence segmentation, leveraging Universal Dependencies datasets to support a wide range of languages.

Architecture

The pipeline contains a single trained component, senter, for sentence segmentation. It ships no pre-trained word vectors, which keeps it lightweight and efficient for CPU environments. The model was built with spaCy 3.7.0 and is compatible with spaCy versions >=3.7.0 and <3.8.0.

Training

The model is trained on a collection of Universal Dependencies corpora covering numerous languages, including Afrikaans, Croatian, Czech, Danish, Dutch, English, Finnish, French, German, and more. It achieves a sentence F-score of 85.88, with precision and recall of 90.66 and 81.58, respectively. Its focus on token accuracy and sentence segmentation makes it well suited to multilingual sentence-boundary detection.

Guide: Running Locally

To run the xx_sent_ud_sm model locally, follow these steps:

  1. Install spaCy: Ensure you have spaCy installed. You can install it using pip:

    pip install spacy
    
  2. Install the Model: Download and install the xx_sent_ud_sm model:

    python -m spacy download xx_sent_ud_sm
    
  3. Load and Use the Model: In your Python script, load the model and use it for sentence segmentation:

    import spacy
    nlp = spacy.load("xx_sent_ud_sm")
    doc = nlp("Your text here.")
    for sent in doc.sents:
        print(sent.text)
    
  4. Cloud Compute: Although the model is optimized for CPU, cloud providers such as AWS, GCP, or Azure can supply scalable resources for segmenting large text collections efficiently.

License

The xx_sent_ud_sm model is released under the Creative Commons Attribution-ShareAlike 3.0 License (CC BY-SA 3.0). This allows for sharing and adaptation with appropriate credit and distribution under the same license.
