pl_core_news_lg
Introduction
The pl_core_news_lg model is a large Polish language pipeline for spaCy, designed for natural language processing tasks such as token classification and named entity recognition (NER). It is optimized for CPU usage and includes components for token vectorization, morphological analysis, parsing, lemmatization, tagging, sentence segmentation, and entity recognition.
Architecture
The model features the following components:
- tok2vec: Converts tokens into vectors.
- morphologizer: Analyzes morphological attributes.
- parser: Identifies syntactic dependencies.
- lemmatizer: Provides base forms of words.
- tagger: Assigns part-of-speech tags.
- senter: Segments sentences.
- ner: Identifies named entities.
The model includes 500,000 unique vectors with 300 dimensions each. Its data sources include the UD Polish PDB v2.8 treebank and the National Corpus of Polish.
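For a rough sense of scale, the raw size of the vector table follows from the two figures above. The sketch below assumes each value is stored as a 4-byte float32, which the source does not state; the function name `vector_table_bytes` is illustrative and not part of spaCy.

```python
def vector_table_bytes(n_vectors: int, n_dims: int, bytes_per_value: int = 4) -> int:
    """Estimate the raw size of a dense vector table in bytes.

    bytes_per_value=4 assumes float32 storage (an assumption, not from the model card).
    """
    return n_vectors * n_dims * bytes_per_value


size = vector_table_bytes(500_000, 300)
print(f"{size / 1e6:.0f} MB")  # 600 MB under the float32 assumption
```

This counts only the dense vector values; the installed package is likely larger because it also holds the trained components.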
Training
The model's performance metrics across various tasks are as follows:
- NER Precision: 0.847
- NER Recall: 0.836
- NER F Score: 0.841
- TAG (XPOS) Accuracy: 0.983
- POS (UPOS) Accuracy: 0.978
- Morph (UFeats) Accuracy: 0.910
- Lemma Accuracy: 0.942
- Unlabeled Attachment Score (UAS): 0.895
- Labeled Attachment Score (LAS): 0.824
- Sentences F-Score: 0.963
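The three NER figures above are related: assuming the reported F score is the standard F1 (the harmonic mean of precision and recall), it can be recomputed from the other two. The helper name `f_score` below is illustrative.

```python
def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both inputs are 0
    return 2 * precision * recall / (precision + recall)


ner_f = f_score(0.847, 0.836)
print(round(ner_f, 3))  # 0.841
```

Because the published precision and recall are themselves rounded, the recomputed value can only be expected to agree to about three decimal places.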
Guide: Running Locally
- Install spaCy: Ensure you have Python and pip installed, then run `pip install spacy`.
- Download the model: Use the command `python -m spacy download pl_core_news_lg`.
- Load the model: In Python, run `import spacy`, then load the model using `nlp = spacy.load("pl_core_news_lg")`.
- Process text: Use the model to process Polish text with `doc = nlp("Your Polish text here.")`.
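The steps above can be combined into one short script. This is a minimal sketch, assuming spaCy and the model are already installed; the helper names (`load_pipeline`, `token_rows`, `entity_rows`) and the sample sentence are illustrative and not part of the model card.

```python
def load_pipeline(name="pl_core_news_lg"):
    """Load the pipeline, or return None if spaCy or the model is missing."""
    try:
        import spacy  # imported here so the helpers below work without spaCy
        return spacy.load(name)
    except (ImportError, OSError):
        # spacy.load raises OSError when the named package is not installed
        return None


def token_rows(doc):
    """Return (text, lemma, UPOS tag, dependency label) for each token."""
    return [(t.text, t.lemma_, t.pos_, t.dep_) for t in doc]


def entity_rows(doc):
    """Return (text, label) for each named entity found in the document."""
    return [(ent.text, ent.label_) for ent in doc.ents]


nlp = load_pipeline()
if nlp is None:
    print("Install spaCy and download pl_core_news_lg first.")
else:
    doc = nlp("Adam Mickiewicz urodził się w Zaosiu.")
    for row in token_rows(doc):
        print(row)
    print(entity_rows(doc))
```

The exact lemmas, tags, and entity labels printed depend on the installed model version, so no expected output is shown here.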
Suggested Cloud GPUs
The pipeline is optimized for CPU, so a GPU is not required to run it. For larger datasets or batch processing tasks, cloud GPUs from providers such as AWS, Google Cloud, or Azure offer scalable options.
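For batch processing, spaCy can stream many texts through `nlp.pipe`, and `spacy.prefer_gpu()` requests a GPU when one is available. The sketch below is illustrative only: the helper `batch_entities`, the batch size of 64, and the sample texts are assumptions, and the source does not establish how much a GPU speeds up this CPU-optimized pipeline.

```python
def batch_entities(nlp, texts, batch_size=64):
    """Stream texts through nlp.pipe and collect (text, label) pairs per document."""
    return [
        [(ent.text, ent.label_) for ent in doc.ents]
        for doc in nlp.pipe(texts, batch_size=batch_size)
    ]


try:
    import spacy

    # prefer_gpu() returns True only if a GPU was activated; call it before loading.
    using_gpu = spacy.prefer_gpu()
    nlp = spacy.load("pl_core_news_lg")
except (ImportError, OSError):
    nlp = None  # spaCy or the model is not installed

if nlp is not None:
    texts = [
        "Maria Skłodowska-Curie urodziła się w Warszawie.",
        "Wisła przepływa przez Kraków.",
    ]
    print("GPU active:", using_gpu)
    print(batch_entities(nlp, texts))
```

Passing an iterable to `nlp.pipe` lets spaCy process documents in batches instead of one call per text, which is usually the first thing to try before adding hardware.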
License
The model is licensed under the GNU General Public License v3.0. It is free to use, modify, and distribute, provided that all copies and derivatives remain under the same license.