piiranha v1 detect personal information
iiiorg
Introduction
Piiranha is a model designed to detect 17 types of Personally Identifiable Information (PII) across six languages. It identifies sensitive information such as passwords, email addresses, phone numbers, and usernames. With a reported classification accuracy of 99.44%, it is well suited to redaction tasks.
Architecture
Piiranha is a fine-tuned version of Microsoft's DeBERTa-v3-base model. It works with a context length of 256 tokens, so longer text has to be split before inference. The model is trained to recognize PII in English, Spanish, French, German, Italian, and Dutch.
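Because inputs are limited to 256 tokens, longer documents have to be processed in pieces. The sketch below is one possible way to do that and is not part of Piiranha itself: it splits a list of token IDs into overlapping windows. The function name `split_into_windows`, the 32-token overlap, and the two positions reserved for special tokens are assumptions chosen for illustration.

```python
def split_into_windows(token_ids, max_tokens=256, overlap=32, reserved=2):
    """Split token_ids into overlapping windows that fit the context length.

    max_tokens: context length stated for the model (256).
    overlap:    tokens shared by consecutive windows (assumed value).
    reserved:   positions kept free for special tokens (assumed value).
    """
    if overlap < 0 or reserved < 0:
        raise ValueError("overlap and reserved must be non-negative")
    budget = max_tokens - reserved  # usable tokens per window
    if budget <= overlap:
        raise ValueError("overlap must be smaller than the usable window size")
    if not token_ids:
        return []

    step = budget - overlap  # how far each new window advances
    windows = []
    start = 0
    while True:
        windows.append(token_ids[start:start + budget])
        if start + budget >= len(token_ids):
            break  # this window reached the end of the input
        start += step
    return windows


# Example with dummy token IDs standing in for real tokenizer output.
windows = split_into_windows(list(range(600)))
print(len(windows), [len(w) for w in windows])
```

The overlap gives an entity that straddles a boundary a chance to appear whole in at least one window. Predictions from overlapping regions then need to be merged or de-duplicated.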
Training
The model was trained on a dataset of approximately 73,000 sentences containing PII and achieves a precision of 93.16%, a recall of 93.08%, and an F1-score of 93.12%. Training was conducted on H100 GPUs sponsored by the Akash Network, with the following hyperparameters:
- Learning rate: 5e-05
- Batch size: 128
- Optimizer: Adam
- Number of epochs: 5
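As a quick consistency check on the reported metrics, the F1-score is the harmonic mean of precision and recall. The short sketch below recomputes it from the two figures given in this section. The helper name `f1_score` is illustrative and does not come from the model's training code.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (any consistent unit, here percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


reported_precision = 93.16  # percent, as reported above
reported_recall = 93.08     # percent, as reported above

f1 = f1_score(reported_precision, reported_recall)
print(f"F1 = {f1:.2f}%")  # prints "F1 = 93.12%"
```

The result agrees with the reported F1-score of 93.12%.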
Guide: Running Locally
To run Piiranha locally, follow these steps:
- Clone the repository from Hugging Face.
- Install the required dependencies:
pip install transformers torch datasets
- Load the model and tokenizer using the Transformers library:
from transformers import AutoTokenizer, AutoModelForTokenClassification
tokenizer = AutoTokenizer.from_pretrained("iiiorg/piiranha-v1-detect-personal-information")
model = AutoModelForTokenClassification.from_pretrained("iiiorg/piiranha-v1-detect-personal-information")
- Use the model to detect PII in your text.
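The last step can be sketched as follows. This is a minimal example under stated assumptions, not the authors' reference code: it uses the Transformers `pipeline` helper with `aggregation_strategy="simple"` to group tokens into entity spans, then replaces each span with its label. The helper names `detect_pii` and `redact` are hypothetical, and the `EMAIL` label in the sample data is illustrative; check the model's configuration for the actual label set.

```python
MODEL_NAME = "iiiorg/piiranha-v1-detect-personal-information"


def detect_pii(text, model_name=MODEL_NAME):
    """Return entity spans for text; downloads the model on first use."""
    # Imported lazily so the redaction helper below works without transformers.
    from transformers import pipeline

    tagger = pipeline(
        "token-classification",
        model=model_name,
        aggregation_strategy="simple",  # merge sub-word tokens into spans
    )
    return tagger(text)


def redact(text, entities):
    """Replace each detected span with a bracketed label such as [EMAIL]."""
    # Work from right to left so earlier character offsets stay valid.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        label = f"[{ent['entity_group']}]"
        text = text[:ent["start"]] + label + text[ent["end"]:]
    return text


# Hand-written spans shaped like the pipeline's output (label is illustrative).
sample_text = "Contact jane@example.com today"
sample_entities = [{"entity_group": "EMAIL", "start": 8, "end": 24}]
print(redact(sample_text, sample_entities))  # prints "Contact [EMAIL] today"
```

With the model available, `redact(text, detect_pii(text))` would run both steps together. Span offsets come from the tokenizer, so it is worth checking them on your own data before relying on the redacted output.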
For optimal performance, consider using a cloud GPU service such as AWS EC2, Google Cloud, or Azure.
License
Piiranha is released under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License (cc-by-nc-nd-4.0). This means you may use and share the model for non-commercial purposes with attribution, but you may not distribute modified versions.