stanford deidentifier only radiology reports
StanfordAIMI
Introduction
The Stanford de-identifier is an automated tool for de-identifying radiology and biomedical documents. It protects patient privacy by removing protected health information (PHI), and it is intended to be accurate enough for production use.
Architecture
The model is a transformer built on PubMedBERT and set up for token classification: each token in a report receives a label that marks whether it belongs to a PHI span. Detection combines this transformer-based method with rule-based techniques. The hybrid approach lets the tool replace detected PHI with realistic surrogates, so that any PHI that goes undetected is "hidden in plain sight" among them.
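The surrogate step can be illustrated with a minimal sketch. It assumes the detector has already produced character spans with a label each; the label names, the surrogate pools, and the `replace_with_surrogates` and `span_of` helpers are hypothetical and are not taken from the tool itself.

```python
import random

# Hypothetical surrogate pools keyed by hypothetical label names.
# The real tool's labels and surrogate lists are not described in this card.
SURROGATES = {
    "PATIENT": ["Jane Roe", "John Doe"],
    "HOSPITAL": ["Lakeside Medical Center", "Northfield Clinic"],
    "DATE": ["03/14/2015", "11/02/2012"],
}


def span_of(text, phrase, label):
    """Build a (start, end, label) span for the first occurrence of phrase."""
    start = text.index(phrase)
    return (start, start + len(phrase), label)


def replace_with_surrogates(text, spans, seed=0):
    """Replace each (start, end, label) span with a surrogate of the same label.

    Labels without a surrogate pool fall back to a bracketed tag such as [ID].
    Spans are assumed not to overlap.
    """
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        pieces.append(text[cursor:start])  # keep the text before the span
        pool = SURROGATES.get(label)
        pieces.append(rng.choice(pool) if pool else f"[{label}]")
        cursor = end
    pieces.append(text[cursor:])  # keep the text after the last span
    return "".join(pieces)


report = "Patient Maria Lopez was seen at Stanford Hospital on 01/02/2020."
spans = [
    span_of(report, "Maria Lopez", "PATIENT"),
    span_of(report, "Stanford Hospital", "HOSPITAL"),
    span_of(report, "01/02/2020", "DATE"),
]
deidentified = replace_with_surrogates(report, spans)
print(deidentified)
```

Because the replacements look like real names, places, and dates, a reader cannot easily tell a surrogate from a PHI item the detector missed, which is the point of hiding in plain sight.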
Training
The de-identifier was trained on 6,193 documents, including chest X-ray reports, CT reports, and medical notes, drawn from multiple institutions and domains. Several PHI detection models were developed using different datasets, fine-tuning techniques, and data augmentation strategies. Performance was measured with precision, recall, and F1 score, and paired-samples Wilcoxon tests were used in the evaluation. The model achieved top-tier results on multiple test sets.
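As a reminder of how the three reported metrics relate, here is a minimal sketch that computes them at the token level. Whether the original evaluation scored tokens or whole entities is not stated here, so the token-level setup, the function name `precision_recall_f1`, and the toy labels are assumptions for illustration only.

```python
def precision_recall_f1(gold, predicted):
    """Token-level scores for binary PHI detection.

    gold and predicted are equal-length lists of booleans (True = PHI token).
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    tp = sum(1 for g, p in zip(gold, predicted) if g and p)      # PHI found
    fp = sum(1 for g, p in zip(gold, predicted) if not g and p)  # false alarm
    fn = sum(1 for g, p in zip(gold, predicted) if g and not p)  # PHI missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Toy example: 3 gold PHI tokens, 3 predicted, 2 of them correct.
gold = [True, True, False, False, True]
predicted = [True, False, False, True, True]
precision, recall, f1 = precision_recall_f1(gold, predicted)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

For de-identification, recall is usually the metric to watch most closely, since every false negative is a piece of PHI left in the output.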
Guide: Running Locally
- Clone the Repository: Clone the associated GitHub repository:
  git clone https://github.com/MIDRC/Stanford_Penn_Deidentifier
- Install Dependencies: Ensure you have PyTorch and Transformers installed:
  pip install torch transformers
- Load the Model: Use the provided scripts or Hugging Face's Transformers library to load the model for de-identification tasks.
- Run Inference: Input your radiology reports into the model to receive de-identified outputs.
- Utilize Cloud GPUs: For improved performance, consider using cloud-based GPU services such as AWS EC2, Google Cloud, or Azure.
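The load and inference steps above might look roughly like the following when done through the Transformers `pipeline` API instead of the repository's own scripts. This is a sketch under stated assumptions: the model ID is inferred from this card's title and organization name and should be checked on the Hugging Face Hub, the helper names `load_tagger`, `mask_phi`, and `deidentify` are hypothetical, and PHI is masked with bracketed labels here, not with the surrogate replacement the repository performs.

```python
# Assumption: model ID inferred from this card's title; verify it on the Hub.
MODEL_ID = "StanfordAIMI/stanford-deidentifier-only-radiology-reports"


def load_tagger(model_id=MODEL_ID):
    """Create a token-classification pipeline (downloads weights on first use)."""
    # Imported lazily so mask_phi can be used without transformers installed.
    from transformers import pipeline

    # aggregation_strategy="simple" merges word pieces into entity spans that
    # carry "entity_group", "start", and "end" keys.
    return pipeline(
        "token-classification",
        model=model_id,
        aggregation_strategy="simple",
    )


def mask_phi(text, entities):
    """Replace each detected span with its bracketed label, e.g. [PATIENT]."""
    pieces = []
    cursor = 0
    for ent in sorted(entities, key=lambda e: e["start"]):
        if ent["start"] < cursor:
            continue  # skip spans that overlap one already masked
        pieces.append(text[cursor:ent["start"]])
        pieces.append(f"[{ent['entity_group']}]")
        cursor = ent["end"]
    pieces.append(text[cursor:])
    return "".join(pieces)


def deidentify(text, tagger=None):
    """Run the tagger over one report and return the masked text."""
    if tagger is None:
        tagger = load_tagger()
    return mask_phi(text, tagger(text))
```

Calling `deidentify(report_text)` downloads the model on first use; for many reports, build the tagger once with `load_tagger()` and pass it in as `tagger`. The character offsets used by `mask_phi` depend on the model shipping a fast tokenizer, which is also an assumption here.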
License
The Stanford de-identifier is released under the MIT License, permitting free use, modification, and distribution of the software, provided that the original license and copyright notice are included in all copies or substantial portions of the software.