en_spacy_pii_distilbert

beki

Introduction

en_spacy_pii_distilbert is a spaCy pipeline for token classification, specifically Named Entity Recognition (NER). It identifies Personally Identifiable Information (PII) in English text.

Architecture

  • Model: DistilBERT
  • Library: spaCy
  • Pipeline Components: Transformer and NER
  • Label Scheme:
    • DATE_TIME
    • LOC (Location)
    • NRP (Nationality, Religious, or Political group)
    • ORG (Organization)
    • PER (Person)

Training

The model was trained on the beki/privy dataset, which targets PII detection in structured text. The dataset was generated with Privy, a tool for generating PII data. The model reports the following metrics:

  • NER Precision: 95.30%
  • NER Recall: 95.54%
  • NER F Score: 95.42%
  • Transformer Loss: 61154.85
  • NER Loss: 56001.88
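
As a quick consistency check on the reported scores, the F score can be recomputed from precision and recall. This is a minimal sketch that assumes the reported F score is the usual F1 (the harmonic mean of precision and recall); the helper name `f_score` is illustrative and not part of the model or spaCy.

```python
def f_score(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Reported values from the list above, in percent.
precision = 95.30
recall = 95.54

f1 = f_score(precision, recall)
print(f"{f1:.2f}")  # 95.42
```

The recomputed value agrees with the reported NER F Score to two decimal places.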

Guide: Running Locally

To run the en_spacy_pii_distilbert model locally, follow these steps:

  1. Install spaCy and the Model (the Transformer component requires the spacy-transformers package):

    pip install spacy spacy-transformers
    pip install en_spacy_pii_distilbert==0.0.0
    
  2. Load the Model:

    import spacy
    nlp = spacy.load("en_spacy_pii_distilbert")
    
  3. Run Inference:

    doc = nlp("SELECT shipping FROM users WHERE shipping = '201 Thayer St Providence RI 02912'")
    for ent in doc.ents:
        print(ent.text, ent.label_)
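
Once entities have been detected, a common next step is to mask them. The sketch below is one possible approach, not part of the model: it takes character-offset spans of the form (start, end, label) and replaces each with a placeholder. The function name `redact`, the `<LABEL>` placeholder format, and the sample text and spans are all hypothetical; they are not output from the model.

```python
from typing import Iterable, Tuple

Span = Tuple[int, int, str]  # (start_char, end_char, label)


def redact(text: str, spans: Iterable[Span]) -> str:
    """Replace each (start, end, label) span in text with "<label>".

    Assumes the spans do not overlap, which holds for spaCy's doc.ents.
    """
    # Work from the end of the string so earlier offsets stay valid.
    for start, end, label in sorted(spans, key=lambda s: s[0], reverse=True):
        text = text[:start] + f"<{label}>" + text[end:]
    return text


# Hypothetical spans, written by hand for illustration.
sample = "Contact Jane Doe at Acme Corp."
spans = [(8, 16, "PER"), (20, 29, "ORG")]
print(redact(sample, spans))  # Contact <PER> at <ORG>.
```

With the pipeline loaded, the spans could be built from the document's entities as `[(ent.start_char, ent.end_char, ent.label_) for ent in doc.ents]`.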
    

Transformer-based pipelines run considerably faster on a GPU. For large datasets, consider using cloud GPUs from providers such as AWS, Google Cloud, or Azure.
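
For large datasets, processing texts in batches with spaCy's `nlp.pipe` is generally more efficient than calling `nlp` on each text separately. The following is a minimal sketch; the helper name `extract_entities` and the default `batch_size` of 32 are assumptions chosen for illustration, not recommended settings for this model.

```python
from typing import Iterable, List, Tuple


def extract_entities(nlp, texts: Iterable[str], batch_size: int = 32) -> List[List[Tuple[str, str]]]:
    """Run a spaCy pipeline over texts in batches.

    Returns one list of (entity_text, entity_label) pairs per input text.
    """
    results = []
    # nlp.pipe streams documents and batches them internally.
    for doc in nlp.pipe(texts, batch_size=batch_size):
        results.append([(ent.text, ent.label_) for ent in doc.ents])
    return results
```

With the model installed, this could be called as `extract_entities(spacy.load("en_spacy_pii_distilbert"), texts)`. Calling `spacy.prefer_gpu()` before loading the model asks spaCy to use a GPU when one is available and to fall back to the CPU otherwise.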

License

The en_spacy_pii_distilbert model is provided under the MIT License. This allows for broad use and modification, subject to the license terms. The model was authored by Benjamin Kilimnik.
