piiranha v1 detect personal information

iiiorg

Introduction

Piiranha is a token classification model that detects 17 types of Personally Identifiable Information (PII) across six languages, including passwords, email addresses, phone numbers, and usernames. It reports an overall classification accuracy of 99.44%, which makes it well suited to redaction tasks.

Architecture

Piiranha is a fine-tuned version of Microsoft's DeBERTa-v3-base model and supports a context length of 256 tokens. It is trained to recognize PII in English, Spanish, French, German, Italian, and Dutch.
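Because of the 256-token context length, longer inputs have to be split before they reach the model. The following is a minimal sketch of one way to do that on a list of token IDs. The function name chunk_token_ids and the 32-token overlap are illustrative assumptions, not part of the model's documentation. The source also does not say whether the 256-token limit includes special tokens, so it may be safer to pass a slightly smaller max_len.

```python
from typing import List


def chunk_token_ids(
    token_ids: List[int], max_len: int = 256, overlap: int = 32
) -> List[List[int]]:
    """Split token IDs into windows of at most max_len tokens.

    Consecutive windows share `overlap` tokens so that a PII span cut
    at one window boundary is seen whole in the next window.
    The overlap value is an assumption chosen for illustration.
    """
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must satisfy 0 <= overlap < max_len")

    step = max_len - overlap
    chunks: List[List[int]] = []
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + max_len])
        # Stop once a window has reached the end of the input.
        if start + max_len >= len(token_ids):
            break
    return chunks
```

Each window can then be classified on its own. Predictions in the overlapping region appear twice and need to be merged, for example by keeping the higher-scoring one.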

Training

The model was trained using a dataset of approximately 73,000 sentences containing PII, achieving high precision (93.16%), recall (93.08%), and F1-score (93.12%). Training was conducted using H100 GPUs sponsored by the Akash Network, with the following hyperparameters:

  • Learning rate: 5e-05
  • Batch size: 128
  • Optimizer: Adam
  • Number of epochs: 5
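As a quick sanity check on the figures above, the sketch below recomputes the F1-score from the reported precision and recall and estimates the number of optimizer steps implied by the hyperparameters. The step count rests on assumptions the source does not confirm: that all of the roughly 73,000 sentences were used for training, that no gradient accumulation was applied, and that the last partial batch was kept.

```python
import math


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def steps_per_epoch(num_examples: int, batch_size: int) -> int:
    """Number of batches per epoch, keeping the last partial batch."""
    return math.ceil(num_examples / batch_size)


# Reported metrics: precision 93.16%, recall 93.08%.
f1 = f1_score(0.9316, 0.9308)
print(f"F1 = {f1:.2%}")  # F1 = 93.12%

# Assumed: ~73,000 training sentences, batch size 128, 5 epochs.
per_epoch = steps_per_epoch(73_000, 128)
total_steps = per_epoch * 5
print(per_epoch, total_steps)  # 571 2855
```

The recomputed F1 matches the reported 93.12%, so the three metrics are consistent with each other.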

Guide: Running Locally

To run Piiranha locally, follow these steps:

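  1. Optionally clone the repository from Hugging Face. This step can be skipped, because from_pretrained (step 3) downloads the model files automatically.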
  2. Install the required dependencies:
    pip install transformers torch datasets
    
  3. Load the model and tokenizer using the Transformers library:
    from transformers import AutoTokenizer, AutoModelForTokenClassification
    tokenizer = AutoTokenizer.from_pretrained("iiiorg/piiranha-v1-detect-personal-information")
    model = AutoModelForTokenClassification.from_pretrained("iiiorg/piiranha-v1-detect-personal-information")
    
  4. Use the model to detect PII in your text.
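Step 4 could look like the sketch below, which uses the Transformers token-classification pipeline and then masks each detected span with its label. The helper names load_detector and redact are hypothetical, and the label names that appear in the output depend on the model's configuration. Span boundaries come from the tokenizer's character offsets and may need trimming in practice.

```python
from typing import Dict, List

MODEL_ID = "iiiorg/piiranha-v1-detect-personal-information"


def load_detector():
    """Build a token-classification pipeline for the model.

    Imported lazily so that redact() can be used and tested without
    transformers installed. The first call downloads the model files.
    """
    from transformers import pipeline

    return pipeline(
        "token-classification",
        model=MODEL_ID,
        aggregation_strategy="simple",  # merge sub-word tokens into spans
    )


def redact(text: str, entities: List[Dict]) -> str:
    """Replace each detected span with its label in square brackets.

    Expects dicts with "start", "end" and "entity_group" keys, as
    returned by the pipeline when an aggregation strategy is set.
    """
    # Work from right to left so earlier character offsets stay valid.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        label = f"[{ent['entity_group']}]"
        text = text[: ent["start"]] + label + text[ent["end"]:]
    return text
```

A typical call would be detector = load_detector(), followed by redact(text, detector(text)) for each input. Building the pipeline once and reusing it avoids reloading the model for every text.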

The model can run on a CPU, but for large volumes of text a GPU speeds up inference considerably. If no local GPU is available, consider a cloud GPU service such as AWS EC2, Google Cloud, or Azure.

License

Piiranha is released under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License (cc-by-nc-nd-4.0). This means you may use and share the model for non-commercial purposes as long as you give attribution, but you may not distribute modified versions of it.
