en_spacy_pii_distilbert
Introduction
The en_spacy_pii_distilbert model is designed for token classification, specifically Named Entity Recognition (NER) with the spaCy library. It identifies Personally Identifiable Information (PII) in English text.
Architecture
- Model: DistilBERT
- Library: spaCy
- Pipeline Components: Transformer and NER
- Label Scheme:
  - DATE_TIME (Date or Time)
  - LOC (Location)
  - NRP (Nationality, Religious or Political group)
  - ORG (Organization)
  - PER (Person)
Training
The model was trained on the beki/privy dataset, which is specifically curated for structured PII detection. The dataset was generated by Privy, a tool for generating PII data. The model achieved the following metrics:
- NER Precision: 95.30%
- NER Recall: 95.54%
- NER F Score: 95.42%
- Transformer Loss: 61154.85
- NER Loss: 56001.88
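The three NER scores above can be cross-checked against each other. The sketch below assumes the reported F score is the standard F1, that is, the harmonic mean of precision and recall. The model card does not state the formula explicitly.

```python
# Assumption: "NER F Score" is the standard F1 (harmonic mean of precision and recall).
precision = 95.30  # reported NER precision, in percent
recall = 95.54     # reported NER recall, in percent

f_score = 2 * precision * recall / (precision + recall)
print(f"{f_score:.2f}")  # 95.42
```

The result matches the reported 95.42%, so the three figures are mutually consistent under that assumption.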
Guide: Running Locally
To run the en_spacy_pii_distilbert model locally, follow these steps:
- Install spaCy and Model:
    pip install spacy
    pip install en_spacy_pii_distilbert==0.0.0
- Load the Model:
    import spacy
    nlp = spacy.load("en_spacy_pii_distilbert")
- Run Inference:
    doc = nlp("SELECT shipping FROM users WHERE shipping = '201 Thayer St Providence RI 02912'")
    for ent in doc.ents:
        print(ent.text, ent.label_)
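Once entities are detected, a common next step is to mask them. The following is a minimal sketch of that idea, not part of the model itself: `redact` is a hypothetical helper that works on character offsets, which spaCy exposes on each entity as `ent.start_char` and `ent.end_char`. The span used in the demonstration is hand-written for illustration and is not actual model output.

```python
from typing import Iterable, Tuple

Span = Tuple[int, int, str]  # (start_char, end_char, label)


def redact(text: str, spans: Iterable[Span]) -> str:
    """Replace each (non-overlapping) character span with a <LABEL> placeholder."""
    # Work from right to left so earlier offsets stay valid after each replacement.
    for start, end, label in sorted(spans, reverse=True):
        text = text[:start] + f"<{label}>" + text[end:]
    return text


# With the loaded pipeline, spans could be collected like this:
#   spans = [(ent.start_char, ent.end_char, ent.label_) for ent in nlp(query).ents]
# Here the span is written by hand (assumed LOC label) so the example runs without the model.
query = "SELECT shipping FROM users WHERE shipping = '201 Thayer St Providence RI 02912'"
start = query.index("201")
end = query.index("'", start)  # closing quote after the address
print(redact(query, [(start, end, "LOC")]))
# SELECT shipping FROM users WHERE shipping = '<LOC>'
```

Processing spans from the end of the string backwards avoids having to recompute offsets after each substitution.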
For optimal performance, especially on large datasets, consider using cloud GPUs from providers such as AWS, Google Cloud, or Azure.
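One way to act on the GPU suggestion is sketched below. It relies on two spaCy calls, `spacy.prefer_gpu()` and `nlp.pipe(..., batch_size=...)`. The helper names `load_pipeline` and `extract_entities` and the batch size of 32 are illustrative assumptions, not values recommended by the model author. GPU use also assumes a GPU-enabled spaCy installation.

```python
from typing import Iterable, List, Tuple


def load_pipeline():
    """Load the model, using a GPU when spaCy can find one."""
    import spacy  # imported here so the helper below stays usable without spaCy

    # prefer_gpu() should be called before the pipeline is loaded;
    # it returns False and spaCy stays on CPU if no GPU is available.
    spacy.prefer_gpu()
    return spacy.load("en_spacy_pii_distilbert")


def extract_entities(
    nlp, texts: Iterable[str], batch_size: int = 32
) -> List[List[Tuple[str, str]]]:
    """Return (text, label) pairs for each input, processing texts in batches."""
    return [
        [(ent.text, ent.label_) for ent in doc.ents]
        for doc in nlp.pipe(texts, batch_size=batch_size)
    ]
```

A call such as `extract_entities(load_pipeline(), list_of_strings)` would then stream the texts through the pipeline in batches instead of calling `nlp` on each string separately.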
License
The en_spacy_pii_distilbert model is provided under the MIT License. This allows for broad use and modification, subject to the license terms. The model was authored by Benjamin Kilimnik.