Introduction

tr_core_news_md is a medium-sized Turkish NLP pipeline for spaCy. It combines tokenization, part-of-speech tagging, morphological analysis, lemmatization, dependency parsing, and named entity recognition (NER). The model supports token classification tasks and ships with pre-trained word vectors derived from Turkish-language datasets.

Architecture

The pipeline consists of the following components:

  • tok2vec: Converts tokens into vector representations.
  • tagger: Assigns part-of-speech tags to tokens.
  • morphologizer: Analyzes the morphological features of tokens.
  • trainable_lemmatizer: Reduces tokens to their base forms (lemmas).
  • parser: Constructs syntactic dependency trees.
  • ner: Identifies and classifies named entities in text.

The model ships with 50,000 unique vectors of 300 dimensions each. Its data sources are UD Turkish BOUN, the Turkish Wiki NER dataset, PANX/WikiANN, and the medium-sized Turkish Floret word vectors.
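
As a rough sense of scale, the vector figures above can be turned into a memory estimate. This is only a back-of-the-envelope sketch: the 4-byte (32-bit float) storage size is an assumption, not something the model card states, and the helper name `vector_table_bytes` is made up for illustration.

```python
# Figures quoted in the text above.
N_VECTORS = 50_000
N_DIMS = 300

# Assumption (not stated in the model card): each value is a 32-bit float.
BYTES_PER_VALUE = 4


def vector_table_bytes(n_vectors, n_dims, bytes_per_value=BYTES_PER_VALUE):
    """Estimate the raw size of a dense n_vectors x n_dims table in bytes."""
    return n_vectors * n_dims * bytes_per_value


size = vector_table_bytes(N_VECTORS, N_DIMS)
print(f"{size:,} bytes (about {size / 1e6:.0f} MB)")
```

Under that assumption the table alone takes about 60 MB, before counting the weights of the other components.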

Training

The model was trained on several Turkish linguistic resources and datasets, covering NER, part-of-speech tagging, morphological analysis, lemmatization, and dependency parsing. Reported performance metrics:

  • NER Precision: 0.889
  • NER Recall: 0.890
  • NER F-Score: 0.889
  • TAG (XPOS) Accuracy: 0.914
  • POS (UPOS) Accuracy: 0.905
  • Morph (UFeats) Accuracy: 0.889
  • Lemma Accuracy: 0.817
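
The NER F-Score above can be cross-checked against the listed precision and recall. The sketch below assumes the reported F-Score is the standard balanced F1 (the harmonic mean of precision and recall); the model card does not say so explicitly, and `f_score` is an illustrative helper, not part of spaCy.

```python
def f_score(precision, recall):
    """Balanced F1: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # avoid dividing by zero when both inputs are 0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as listed above.
ner_f = f_score(0.889, 0.890)
print(round(ner_f, 3))
```

Rounded to three decimals, the result agrees with the listed F-Score of 0.889.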

Guide: Running Locally

To run the tr_core_news_md model on your local machine, follow these steps:

  1. Install spaCy: Ensure that you have spaCy installed in your environment.

    pip install spacy
    
  2. Download and Install the Model: Use pip to install the model from Hugging Face.

    pip install https://huggingface.co/turkish-nlp-suite/tr_core_news_md/resolve/main/tr_core_news_md-1.0-py3-none-any.whl
    
  3. Load the Model in spaCy: Load and use the model for your NLP tasks.

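    import spacy

    # Load the installed pipeline by its package name.
    nlp = spacy.load("tr_core_news_md")

    # Run every component on a piece of text; replace with real Turkish input.
    doc = nlp("Example text in Turkish.")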
    
  4. Consider Cloud GPUs: For intensive processing tasks, consider using cloud-based GPUs to speed up the computation.
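
Once the steps above are done, the annotations written by each component can be read from the resulting `Doc`. The following is a minimal sketch rather than an official example: the helper names `token_rows` and `entity_rows` and the sample sentence are illustrative, and the load is guarded so the script still runs when spaCy or the model is missing.

```python
def token_rows(doc):
    """Collect the per-token annotations produced by the pipeline components."""
    return [
        (tok.text, tok.lemma_, tok.pos_, tok.tag_, str(tok.morph), tok.dep_)
        for tok in doc
    ]


def entity_rows(doc):
    """Collect (text, label) pairs for the spans found by the ner component."""
    return [(ent.text, ent.label_) for ent in doc.ents]


try:
    import spacy

    # Ask for a GPU before loading; spaCy stays on the CPU if none is usable.
    spacy.prefer_gpu()
    nlp = spacy.load("tr_core_news_md")
except (ImportError, OSError):
    # ImportError: spaCy is not installed. OSError: the model is not installed.
    nlp = None

if nlp is not None:
    print(nlp.pipe_names)  # names of the components in this pipeline
    doc = nlp("Ankara, Türkiye'nin başkentidir.")  # illustrative sentence
    for row in token_rows(doc):
        print(row)
    print(entity_rows(doc))
else:
    print("spaCy or tr_core_news_md is not installed; see steps 1 and 2.")
```

For many documents, `nlp.pipe(texts)` processes the texts in batches and is usually faster than calling `nlp` on each one separately.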

License

The tr_core_news_md model is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License (cc-by-sa-4.0). Users are free to share and adapt the model, provided appropriate credit is given and any derivative works are licensed under the same terms.
