en_core_web_trf

Introduction

The en_core_web_trf model is an English-language transformer pipeline from spaCy, built on the roberta-base transformer. It is designed for a range of natural language processing tasks, including named entity recognition (NER), part-of-speech tagging, dependency parsing, and more.

Architecture

The model utilizes a transformer with the following configuration:

  • Name: roberta-base
  • Components: Transformer, tagger, parser, NER, attribute ruler, lemmatizer
  • Configuration: A byte-BPE piece encoder, a stride of 104, a width of 768, a window of 144, and a vocabulary size of 50,265.

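Because the transformer can only attend over a fixed window of pieces, longer documents are split into overlapping spans using the window and stride sizes above. The following is a minimal sketch of that slicing idea, not spaCy's actual implementation; the example piece count of 300 is hypothetical.

```python
def strided_spans(n_pieces, window, stride):
    """Split n_pieces token pieces into overlapping (start, end) spans."""
    spans = []
    start = 0
    while start < n_pieces:
        spans.append((start, min(start + window, n_pieces)))
        if start + window >= n_pieces:
            break  # last span reaches the end of the sequence
        start += stride
    return spans

# With the window/stride from the card and a hypothetical 300-piece document:
print(strided_spans(300, window=144, stride=104))
# → [(0, 144), (104, 248), (208, 300)]
```

Each span overlaps the previous one by window minus stride pieces (40 here), so every piece is seen with context on both sides.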
The model is trained using data sources such as OntoNotes 5, ClearNLP Constituent-to-Dependency Conversion, WordNet 3.0, and roberta-base.

Training

The model achieves high accuracy across its evaluation tasks:

  • NER Precision: 0.9008
  • NER Recall: 0.9029
  • NER F Score: 0.9019
  • TAG (XPOS) Accuracy: 0.9813
  • Unlabeled Attachment Score (UAS): 0.9526
  • Labeled Attachment Score (LAS): 0.9390
  • Sentences F-Score: 0.9011

General token accuracy is also very high, with a TOKEN_ACC of 99.86%.
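The NER F-score above is the harmonic mean of precision and recall, which can be checked directly from the reported figures:

```python
# Reported NER metrics from the model card
precision = 0.9008
recall = 0.9029

# F-score = harmonic mean of precision and recall
f_score = 2 * precision * recall / (precision + recall)
print(round(f_score, 4))  # matches the reported 0.9019 to within rounding
```

The small discrepancy in the last digit comes from the card's precision and recall themselves being rounded to four places.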

Guide: Running Locally

To run the en_core_web_trf model locally, follow these steps:

  1. Install spaCy: Ensure spaCy is installed in your environment.

    pip install spacy
    
  2. Download the Model: Use spaCy’s CLI to download the model.

    python -m spacy download en_core_web_trf
    
  3. Load the Model: In your Python script, load the model.

    import spacy
    nlp = spacy.load("en_core_web_trf")
    
  4. Process Text: Use the model to process text data.

    doc = nlp("Your text here.")
    for token in doc:
        print(token.text, token.pos_, token.dep_)
    

For better performance, consider using cloud GPUs from providers like AWS, Google Cloud, or Azure.

License

The en_core_web_trf model is distributed under the MIT License, which allows for flexible reuse and modification with proper attribution.
