pl_core_news_lg

spacy

Introduction

The pl_core_news_lg model is a large Polish pipeline for spaCy. It covers token-level tasks such as part-of-speech tagging, morphological analysis, and lemmatization, along with dependency parsing, sentence segmentation, and named entity recognition (NER). The pipeline is optimized for CPU.

Architecture

The model features the following components:

  • tok2vec: Converts tokens into vectors.
  • morphologizer: Analyzes morphological attributes.
  • parser: Identifies syntactic dependencies.
  • lemmatizer: Provides base forms of words.
  • tagger: Assigns part-of-speech tags.
  • senter: Segments sentences.
  • ner: Identifies named entities.

The model includes 500,000 unique vectors with 300 dimensions each. The data sources listed for the pipeline are UD Polish PDB v2.8 and the National Corpus of Polish.
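One way to check the component list and the vector table against an installed copy is to inspect the loaded pipeline. The sketch below is illustrative only: summarize_pipeline is a hypothetical helper, not part of spaCy, and it reads three attributes that spaCy exposes on a loaded pipeline (pipe_names, disabled, and vocab.vectors.shape).

```python
def summarize_pipeline(nlp):
    """Summarize a loaded spaCy pipeline (hypothetical helper, not a spaCy API).

    `nlp` is expected to be the object returned by spacy.load(...).
    """
    # vectors.shape is (number of vector rows, vector width)
    rows, dims = nlp.vocab.vectors.shape
    return {
        "active": list(nlp.pipe_names),    # components that run by default
        "disabled": list(nlp.disabled),    # components shipped but switched off
        "vector_rows": rows,
        "vector_dims": dims,
    }
```

Passing the result of spacy.load("pl_core_news_lg") to this helper should report 300 for vector_dims if the installed model matches the figures above. Which components are active and which are disabled by default can differ between model versions.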

Training

The model reports the following evaluation scores, each on a scale from 0 to 1:

  • NER Precision: 0.847
  • NER Recall: 0.836
  • NER F-Score: 0.841
  • TAG (XPOS) Accuracy: 0.983
  • POS (UPOS) Accuracy: 0.978
  • Morph (UFeats) Accuracy: 0.910
  • Lemma Accuracy: 0.942
  • Unlabeled Attachment Score (UAS): 0.895
  • Labeled Attachment Score (LAS): 0.824
  • Sentences F-Score: 0.963
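The three NER figures are related: assuming the F-score here is the standard F1 measure, it is the harmonic mean of precision and recall. The short sketch below recomputes it from the two reported values; f_score is an illustrative helper name.

```python
def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both inputs are 0
    return 2 * precision * recall / (precision + recall)


# Precision and recall taken from the NER rows above.
ner_f = f_score(0.847, 0.836)
print(round(ner_f, 3))
```

Rounded to three decimals, the result agrees with the reported NER F-score of 0.841.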

Guide: Running Locally

  1. Install spaCy: Ensure Python and pip are installed, then run pip install spacy.
  2. Download the model: Run python -m spacy download pl_core_news_lg.
  3. Load the model: In Python, run import spacy, then load the model using nlp = spacy.load("pl_core_news_lg").
  4. Process text: Pass Polish text to the pipeline with doc = nlp("Your Polish text here."). The returned doc holds the tokens together with their annotations.
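Steps 3 and 4 can be combined into a small script. This is a minimal sketch that assumes steps 1 and 2 have already been completed; load_polish_pipeline, token_rows, entity_rows, and analyze are hypothetical helper names introduced for illustration, and they read only standard spaCy attributes (text, lemma_, pos_, dep_, ents, sents).

```python
def load_polish_pipeline():
    """Load pl_core_news_lg; assumes spaCy and the model are installed."""
    import spacy  # imported here so the helpers below need no spaCy at import time

    return spacy.load("pl_core_news_lg")


def token_rows(doc):
    """One (text, lemma, coarse POS, dependency label) tuple per token."""
    return [(tok.text, tok.lemma_, tok.pos_, tok.dep_) for tok in doc]


def entity_rows(doc):
    """One (text, label) tuple per named entity."""
    return [(ent.text, ent.label_) for ent in doc.ents]


def analyze(nlp, text):
    """Run the pipeline on one text and collect the main annotations."""
    doc = nlp(text)
    return {
        "tokens": token_rows(doc),
        "entities": entity_rows(doc),
        "sentences": [sent.text for sent in doc.sents],
    }
```

For example, analyze(load_polish_pipeline(), "Maria Skłodowska-Curie urodziła się w Warszawie.") returns a dictionary with the token annotations, any entities the model finds, and the sentence split. The example sentence is an assumption chosen for illustration.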

Suggested Cloud GPUs

The pipeline is optimized for CPU, so a GPU is not required to run it. For larger datasets or batch processing tasks, cloud instances from providers like AWS, Google Cloud, or Azure offer scalable options, and GPU instances may accelerate processing.
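For batch workloads, spaCy provides spacy.prefer_gpu() to request a GPU when one is available and nlp.pipe() to stream many texts through the pipeline in batches. The sketch below shows one way to use them. The helper names and the batch size of 64 are assumptions for illustration, and GPU use additionally assumes spaCy was installed with GPU support.

```python
def load_pipeline(try_gpu: bool = False):
    """Load pl_core_news_lg, optionally requesting a GPU first."""
    import spacy  # assumes spaCy and the model are installed

    # prefer_gpu() has to be called before the pipeline is loaded. It returns
    # False and leaves spaCy on the CPU when no usable GPU is found.
    on_gpu = spacy.prefer_gpu() if try_gpu else False
    return spacy.load("pl_core_news_lg"), on_gpu


def extract_entities(nlp, texts, batch_size=64):
    """Stream texts through nlp.pipe and collect (text, label) pairs per document."""
    results = []
    for doc in nlp.pipe(texts, batch_size=batch_size):
        results.append([(ent.text, ent.label_) for ent in doc.ents])
    return results
```

Because extract_entities only relies on nlp.pipe, the same function works whether the pipeline ended up on CPU or GPU. A suitable batch size depends on text length and available memory.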

License

The model is licensed under the GNU General Public License v3.0. It is free to use, modify, and distribute, provided that distributed copies and derivative works remain under the same license.
