stanford deidentifier base
StanfordAIMI
Introduction
The Stanford De-Identifier is a model that automates the de-identification of radiology and biomedical documents, with accuracy intended to be high enough for production use. It identifies protected health information (PHI) so that it can be replaced with realistic surrogates.
Architecture
The model uses a transformer-based architecture built on PubMedBERT and treats de-identification as a token classification task: each token in the input receives a label indicating whether it is PHI. It is fine-tuned on a mix of radiology and biomedical datasets.
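To make the token classification idea concrete, the following minimal sketch shows how per-token labels can be grouped into PHI spans after the model has labeled each token. The function name `group_phi_spans` and the label names (`HCW`, `DATE`, and `O` for non-PHI) are hypothetical and chosen for illustration; the model's actual label set is not listed here.

```python
from typing import List, Tuple


def group_phi_spans(
    tokens: List[str], labels: List[str], outside: str = "O"
) -> List[Tuple[str, str]]:
    """Merge adjacent tokens that share the same PHI label into one span."""
    spans: List[Tuple[str, str]] = []
    current_words: List[str] = []
    current_label = None
    for token, label in zip(tokens, labels):
        # Extend the open span while the PHI label stays the same.
        if label == current_label and label != outside:
            current_words.append(token)
            continue
        # Label changed: close the open span if it was PHI.
        if current_label not in (None, outside):
            spans.append((current_label, " ".join(current_words)))
        current_words, current_label = [token], label
    # Close a span that runs to the end of the input.
    if current_label not in (None, outside):
        spans.append((current_label, " ".join(current_words)))
    return spans


# Hypothetical labels, as a token classifier might emit them.
tokens = ["Seen", "by", "Dr.", "Curt", "Smith", "on", "06/23"]
labels = ["O", "O", "O", "HCW", "HCW", "O", "DATE"]
print(group_phi_spans(tokens, labels))
```

Grouping at the span level matters because a name or date usually covers several tokens, and a surrogate should replace the whole span rather than each token separately.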
Training
The model was trained on a large, multi-institutional dataset of 6,193 documents, including chest X-ray reports, CT reports, and medical notes. Several PHI detection models were developed using different training datasets, fine-tuning approaches, and data augmentation techniques. The models were compared on precision, recall, and F1 score, and the best model performed strongly across multiple test sets.
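For reference, precision, recall, and F1 follow directly from counts of true positives, false positives, and false negatives. The sketch below uses a hypothetical helper, `precision_recall_f1`, and made-up counts; the numbers are illustrative only and are not results reported for this model.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Compute precision, recall, and F1 from raw counts.

    tp: PHI items correctly flagged
    fp: non-PHI items wrongly flagged
    fn: PHI items that were missed
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative counts only, not taken from any evaluation of the model.
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

In de-identification, recall is often the more sensitive number, since every false negative is a piece of PHI left in the document.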
Guide: Running Locally
To run the Stanford De-Identifier locally, follow these steps:
- Install Dependencies: Make sure Python is installed, then use pip to install PyTorch and the Hugging Face Transformers library.
pip install transformers torch
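- Clone the Repository: Download the accompanying code from the associated GitHub repository. The model weights themselves are fetched from the Hugging Face Hub in the Load the Model step.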
git clone https://github.com/MIDRC/Stanford_Penn_Deidentifier
- Load the Model: Use the Transformers library to load the model and tokenizer.
from transformers import BertTokenizer, BertForTokenClassification
tokenizer = BertTokenizer.from_pretrained("StanfordAIMI/stanford-deidentifier-base")
model = BertForTokenClassification.from_pretrained("StanfordAIMI/stanford-deidentifier-base")
- Inference: Preprocess your text data and run inference to de-identify documents.
For optimal performance, consider using cloud GPUs available on platforms like AWS, Google Cloud, or Azure.
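The inference step above can be sketched with the Transformers `pipeline` API. This is a minimal illustration and not the project's official pipeline: the helper names `load_phi_detector` and `redact` are hypothetical, and the sketch replaces detected spans with bracketed labels rather than realistic surrogates. It assumes the pipeline returns character offsets (`start`, `end`) and an `entity_group` for each detected span, which is the behavior of token-classification pipelines when `aggregation_strategy` is set.

```python
from typing import Any, Iterable, Mapping


def redact(text: str, entities: Iterable[Mapping[str, Any]]) -> str:
    """Replace each detected span with a bracketed label such as [DATE]."""
    # Work right to left so earlier character offsets stay valid.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        label = ent.get("entity_group", "PHI")
        text = text[: ent["start"]] + f"[{label}]" + text[ent["end"]:]
    return text


def load_phi_detector(device: int = -1):
    """Build a token-classification pipeline (downloads weights on first use)."""
    # Imported lazily because transformers is a heavy dependency.
    from transformers import pipeline

    return pipeline(
        "token-classification",
        model="StanfordAIMI/stanford-deidentifier-base",
        aggregation_strategy="simple",  # merge word pieces into whole spans
        device=device,  # -1 runs on CPU, 0 selects the first GPU
    )
```

With the model downloaded, `detector = load_phi_detector()` followed by `redact(report, detector(report))` returns the report text with each detected span replaced by its label. Passing `device=0` runs the model on a GPU if one is available.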
License
The Stanford De-Identifier is licensed under the MIT License, which permits modification and distribution provided the copyright notice and license text are included in all copies or substantial portions of the software.