en_core_web_trf

Introduction

The en_core_web_trf model is an English-language transformer pipeline from spaCy, built on the roberta-base transformer. It is designed for a range of natural language processing tasks, including named entity recognition (NER), part-of-speech tagging, dependency parsing, and more.

Architecture

The model utilizes a transformer with the following configuration:

  • Name: roberta-base
  • Components: Transformer, tagger, parser, NER, attribute ruler, lemmatizer
  • Configuration: A byte-BPE piece encoder, a stride of 104, a width of 768, a window of 144, and a vocabulary size of 50,265.

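Because the transformer can only attend over a fixed window of pieces, longer documents are split into overlapping spans using the window and stride sizes above. The following is a minimal sketch of that slicing idea, not spaCy's actual implementation; the example piece count of 300 is hypothetical.

```python
def strided_spans(n_pieces, window, stride):
    """Split n_pieces token pieces into overlapping (start, end) spans."""
    spans = []
    start = 0
    while start < n_pieces:
        spans.append((start, min(start + window, n_pieces)))
        if start + window >= n_pieces:
            break  # last span reaches the end of the sequence
        start += stride
    return spans

# With the window/stride from the card and a hypothetical 300-piece document:
print(strided_spans(300, window=144, stride=104))
# → [(0, 144), (104, 248), (208, 300)]
```

Each span overlaps the previous one by window minus stride pieces (40 here), so every piece is seen with context on both sides.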
The model is trained using data sources such as OntoNotes 5, ClearNLP Constituent-to-Dependency Conversion, WordNet 3.0, and roberta-base.

Training

The model achieves high accuracy across its evaluation tasks:

  • NER Precision: 0.9008
  • NER Recall: 0.9029
  • NER F Score: 0.9019
  • TAG (XPOS) Accuracy: 0.9813
  • Unlabeled Attachment Score (UAS): 0.9526
  • Labeled Attachment Score (LAS): 0.9390
  • Sentences F-Score: 0.9011

General token accuracy is also very high, with a TOKEN_ACC of 99.86%.
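The NER F-score above is the harmonic mean of precision and recall, which can be checked directly from the reported figures:

```python
# Reported NER metrics from the model card
precision = 0.9008
recall = 0.9029

# F-score = harmonic mean of precision and recall
f_score = 2 * precision * recall / (precision + recall)
print(round(f_score, 4))  # matches the reported 0.9019 to within rounding
```

The small discrepancy in the last digit comes from the card's precision and recall themselves being rounded to four places.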

Guide: Running Locally

To run the en_core_web_trf model locally, follow these steps:

  1. Install spaCy: Ensure spaCy is installed in your environment.

    pip install spacy
    
  2. Download the Model: Use spaCy’s CLI to download the model.

    python -m spacy download en_core_web_trf
    
  3. Load the Model: In your Python script, load the model.

    import spacy
    nlp = spacy.load("en_core_web_trf")
    
  4. Process Text: Use the model to process text data.

    doc = nlp("Your text here.")
    for token in doc:
        print(token.text, token.pos_, token.dep_)
    

For better performance, consider using cloud GPUs from providers like AWS, Google Cloud, or Azure.

License

The en_core_web_trf model is distributed under the MIT License, which allows for flexible reuse and modification with proper attribution.
