Introduction

tr_core_news_trf is a Turkish transformer-based pipeline for spaCy, focused on token classification tasks such as Named Entity Recognition (NER), part-of-speech (POS) tagging, and morphological analysis. It is tailored to the Turkish language and bundles a transformer, tagger, morphologizer, lemmatizer, parser, and named entity recognizer.

Architecture

The pipeline comprises several components: a transformer for feature extraction, a tagger for POS tagging, a morphologizer for morphological analysis, a trainable lemmatizer, a parser for syntactic dependency parsing, and a named entity recognizer (NER). It is trained on data from UD Turkish BOUN, the Turkish Wiki NER dataset, and PANX/WikiANN, and uses the dbmdz Turkish BERT model as its underlying transformer. The package is compatible with spaCy version 3.4.2 and is distributed under the cc-by-sa-4.0 license.
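
To see which of these components are active, the loaded pipeline can be inspected directly. The snippet below is a minimal sketch that assumes the model has already been installed as described in the guide further down.

  import spacy

  # Load the packaged Turkish pipeline and list its components in run order
  nlp = spacy.load("tr_core_news_trf")
  print(nlp.pipe_names)  # e.g. ['transformer', 'tagger', 'morphologizer', ...]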

Training

The model achieves strong performance across token classification tasks: NER precision of 0.9135, NER recall of 0.9127, and an F-score of 0.9131; POS tagging accuracy of 0.9094; and morphological analysis accuracy of 0.9145. The training data mixes Wikipedia articles, customer reviews, and other diverse genres, providing a broad linguistic base.
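
These scores are also recorded in the installed package's metadata. The sketch below reads them programmatically; the exact keys (ents_f, tag_acc, morph_acc) are assumed from spaCy's usual meta layout rather than taken from the model card.

  import spacy

  nlp = spacy.load("tr_core_news_trf")
  # spaCy packages typically store evaluation results under meta["performance"]
  perf = nlp.meta.get("performance", {})
  print(perf.get("ents_f"), perf.get("tag_acc"), perf.get("morph_acc"))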

Guide: Running Locally

  1. Installation:

    • Due to changes in setuptools, install the model directly from the released wheel using pip:
      pip install https://huggingface.co/turkish-nlp-suite/tr_core_news_trf/resolve/main/tr_core_news_trf-1.0-py3-none-any.whl
      
  2. Requirements:

    • Ensure you have Python 3 and spaCy installed.

  3. Setup:

    • Load the model in your spaCy environment using the snippet below; a fuller usage example follows this list:
      import spacy
      nlp = spacy.load("tr_core_news_trf")
      
  4. Cloud GPUs:

    • For efficient model execution, particularly on large datasets, consider using cloud GPU services like AWS, Google Cloud, or Azure.
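
Once installed and loaded, the pipeline can be applied to Turkish text. The example below is an illustrative sketch; the sentence and the printed attributes are chosen for demonstration and are not taken from the model card.

  import spacy

  nlp = spacy.load("tr_core_news_trf")
  doc = nlp("Mustafa Kemal Atatürk 1919'da Samsun'a çıktı.")

  # Named entities predicted by the NER component
  for ent in doc.ents:
      print(ent.text, ent.label_)

  # Per-token POS tag, morphological features, lemma, and dependency relation
  for token in doc:
      print(token.text, token.pos_, token.morph, token.lemma_, token.dep_)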

License

The model is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License (cc-by-sa-4.0), allowing for sharing and adaptation with proper attribution.
