en_core_web_trf

Introduction

The en_core_web_trf model is an English transformer pipeline for spaCy, built primarily on the roberta-base model. It is designed for a range of natural language processing tasks, including named entity recognition (NER), part-of-speech tagging, dependency parsing, and more.
Architecture
The model uses a transformer with the following configuration:
- Name: roberta-base
- Components: transformer, tagger, parser, ner, attribute_ruler, lemmatizer
- Configuration: a byte-bpe piece encoder, a stride of 104, a width of 768, a window of 144, and a vocabulary size of 50,265
The model is trained using data sources such as OntoNotes 5, ClearNLP Constituent-to-Dependency Conversion, WordNet 3.0, and roberta-base.
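These settings map onto spaCy's config system roughly as follows. This is a sketch, not the model's exact shipped config; the section and key names follow spacy-transformers conventions, with the window and stride values taken from the list above:

```ini
[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v3"
name = "roberta-base"

[components.transformer.model.get_spans]
@span_getters = "spacy-transformers.strided_spans.v1"
window = 144
stride = 104
```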
Training
The model achieves high accuracy across various tasks:
- NER Precision: 0.9008
- NER Recall: 0.9029
- NER F Score: 0.9019
- TAG (XPOS) Accuracy: 0.9813
- Unlabeled Attachment Score (UAS): 0.9526
- Labeled Attachment Score (LAS): 0.9390
- Sentences F-Score: 0.9011
General token accuracy is also very high, with a TOKEN_ACC of 99.86%.
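As a sanity check, the F score above is the harmonic mean of precision and recall; any discrepancy in the last digit comes from P and R being rounded to four decimals. A quick check in Python:

```python
# NER F score is the harmonic mean of precision (P) and recall (R):
# F = 2 * P * R / (P + R). Values taken from the metrics above.
precision = 0.9008
recall = 0.9029

f_score = 2 * precision * recall / (precision + recall)
print(round(f_score, 4))  # close to the reported 0.9019, up to rounding
```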
Guide: Running Locally

To run the en_core_web_trf model locally, follow these steps:

- Install spaCy: Ensure spaCy is installed in your environment.

```shell
pip install spacy
```

- Download the Model: Use spaCy's CLI to download the model.

```shell
python -m spacy download en_core_web_trf
```

- Load the Model: In your Python script, load the model.

```python
import spacy

nlp = spacy.load("en_core_web_trf")
```

- Process Text: Use the model to process text data.

```python
doc = nlp("Your text here.")
for token in doc:
    print(token.text, token.pos_, token.dep_)
```
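The steps above print per-token attributes; the named entities mentioned in the introduction are exposed on doc.ents. The sketch below ties the steps together, guarded so it simply produces an empty result when spaCy or the model is not installed (the sample sentence is an illustrative placeholder):

```python
# End-to-end sketch of the steps above, extended to named entities.
# Assumes spaCy and en_core_web_trf are installed; otherwise the block
# falls back to an empty result list instead of raising.
import importlib.util

entities = []
if importlib.util.find_spec("spacy") is not None:
    import spacy
    try:
        nlp = spacy.load("en_core_web_trf")
    except OSError:
        nlp = None  # model not downloaded yet
    if nlp is not None:
        doc = nlp("Apple is looking at buying a U.K. startup for $1 billion.")
        # doc.ents holds the spans found by the ner component
        entities = [(ent.text, ent.label_) for ent in doc.ents]

print(entities)
```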
For better performance, consider using cloud GPUs from providers like AWS, Google Cloud, or Azure.
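If a GPU is available, locally or on a cloud instance, spaCy can be told to use it before the pipeline is loaded. A minimal sketch using spaCy's prefer_gpu helper, guarded so it also runs where spaCy is absent:

```python
# Ask spaCy to use the GPU (if any) before calling spacy.load().
# spacy.prefer_gpu() returns True when a GPU was activated, False otherwise.
import importlib.util

gpu_active = False
if importlib.util.find_spec("spacy") is not None:
    import spacy
    gpu_active = spacy.prefer_gpu()

print("GPU active:", gpu_active)
```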
License
The en_core_web_trf model is distributed under the MIT License, which allows for flexible reuse and modification with proper attribution.