xx_sent_ud_sm
Introduction
The xx_sent_ud_sm model is a multilingual spaCy pipeline optimized for CPU usage. It focuses on sentence segmentation and leverages Universal Dependencies treebanks to support a wide range of languages.
Architecture
The model is part of spaCy's pipeline suite and contains a single trained component, senter, for sentence segmentation. It ships no pre-trained word vectors, keeping it lightweight and efficient for CPU environments. The model was built with spaCy 3.7.0 and is compatible with spaCy versions >=3.7.0 and <3.8.0.
Training
The model is trained on a collection of Universal Dependencies treebanks covering numerous languages, including Afrikaans, Croatian, Czech, Danish, Dutch, English, Finnish, French, German, and more. It achieves a sentence F-score of 85.88, with precision and recall of 90.66 and 81.58, respectively. Its focus on token accuracy and sentence segmentation makes it suitable for multilingual token classification tasks.
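The reported F-score is the harmonic mean of the precision and recall figures above, which can be verified with a few lines of arithmetic (the numbers are taken from this model card):

```python
# Sentence segmentation metrics reported for xx_sent_ud_sm
precision = 90.66
recall = 81.58

# F-score is the harmonic mean of precision and recall
f_score = 2 * precision * recall / (precision + recall)
print(round(f_score, 2))  # 85.88
```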
Guide: Running Locally
To run the xx_sent_ud_sm model locally, follow these steps:
- Install spaCy: Ensure you have spaCy installed. You can install it using pip:

  ```bash
  pip install spacy
  ```
- Install the model: Download and install the xx_sent_ud_sm model:

  ```bash
  python -m spacy download xx_sent_ud_sm
  ```
- Load and use the model: In your Python script, load the model and use it for sentence segmentation:

  ```python
  import spacy

  nlp = spacy.load("xx_sent_ud_sm")
  doc = nlp("Your text here.")
  for sent in doc.sents:
      print(sent.text)
  ```
- Cloud GPUs: For enhanced performance on large workloads, consider cloud-based GPU services such as AWS, GCP, or Azure, which provide scalable resources for processing large datasets efficiently.
License
The xx_sent_ud_sm model is released under the Creative Commons Attribution-ShareAlike 3.0 License (CC BY-SA 3.0), which permits sharing and adaptation with appropriate credit, provided derivative works are distributed under the same license.