Bio_ClinicalBERT

emilyalsentzer

Introduction

The Bio_ClinicalBERT model is a BERT-based model specialized for clinical language, trained on notes from the MIMIC-III database. It is initialized from BioBERT and further pretrained on clinical notes, making it well suited for tasks involving medical and clinical text.

Architecture

Bio_ClinicalBERT uses the BERT-Base architecture and is initialized from BioBERT. It consists of 12 Transformer layers, a hidden size of 768, and 12 attention heads; continued pretraining on clinical text adapts its contextual representations to the medical domain.
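
These dimensions can be confirmed by inspecting the published configuration with the Transformers library:

    from transformers import AutoConfig
    
    # Load the published configuration and print the core architecture hyperparameters.
    config = AutoConfig.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
    print(config.num_hidden_layers)    # 12
    print(config.hidden_size)          # 768
    print(config.num_attention_heads)  # 12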

Training

Pretraining Data: The model was pretrained on the MIMIC-III database, which contains deidentified electronic health records from intensive care unit admissions. All notes (~880M words) from the NOTEEVENTS table were used.
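
A minimal sketch of reading the note text, assuming a local MIMIC-III CSV export obtained through PhysioNet credentialing (the file name and columns follow the standard MIMIC-III release):

    import pandas as pd
    
    # Assumed local path to the MIMIC-III NOTEEVENTS export; not distributed with the model.
    notes = pd.read_csv("NOTEEVENTS.csv", usecols=["ROW_ID", "CATEGORY", "TEXT"])
    print(f"{len(notes)} notes loaded")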

Note Preprocessing: Notes were split into sections and then into sentences using the SciSpacy sentence tokenizer.
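
A minimal sentence-splitting sketch with SciSpacy, assuming the en_core_sci_md pipeline (requires pip install scispacy plus the model package):

    import spacy
    
    # SciSpacy biomedical pipeline; the specific model name is an assumption.
    nlp = spacy.load("en_core_sci_md")
    
    section = "Patient admitted with chest pain. ECG showed ST elevations in leads II, III, and aVF."
    sentences = [sent.text for sent in nlp(section).sents]
    print(sentences)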

Pretraining Procedures: Training was conducted on a single GeForce GTX TITAN X GPU, starting from BioBERT-initialized parameters. Pretraining used a batch size of 32, a maximum sequence length of 128, and a learning rate of 5e-5, and ran for 150,000 steps, with a masked language modeling probability of 0.15 and the standard BERT example-duplication settings.
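
The original pretraining used Google's TensorFlow BERT code; a rough Hugging Face equivalent of the reported hyperparameters might look like the following sketch (the BioBERT checkpoint name and the dataset variable are assumptions):

    from transformers import (
        AutoTokenizer,
        AutoModelForMaskedLM,
        DataCollatorForLanguageModeling,
        Trainer,
        TrainingArguments,
    )
    
    # BioBERT initialization; the checkpoint name is an assumed Hugging Face mirror.
    checkpoint = "dmis-lab/biobert-v1.1"
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForMaskedLM.from_pretrained(checkpoint)
    
    # Masked language modeling with the reported 15% masking probability.
    collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)
    
    args = TrainingArguments(
        output_dir="bio_clinical_bert_pretraining",
        per_device_train_batch_size=32,  # reported batch size
        learning_rate=5e-5,              # reported learning rate
        max_steps=150_000,               # reported number of steps
    )
    
    # mimic_dataset: tokenized MIMIC-III sentences truncated to 128 tokens (preparation not shown).
    # trainer = Trainer(model=model, args=args, data_collator=collator, train_dataset=mimic_dataset)
    # trainer.train()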

Guide: Running Locally

  1. Install Transformers Library:

    pip install transformers
    
  2. Load the Model in Python (a fill-mask usage example follows this list):

    from transformers import AutoTokenizer, AutoModel
    
    tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
    model = AutoModel.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
    
  3. Cloud GPU Recommendation: For efficient training and inference, consider using cloud services offering NVIDIA GPUs such as AWS EC2 (P2 or P3 instances), Google Cloud Platform, or Azure.
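
As a quick sanity check after loading, the model can be exercised through the fill-mask pipeline; the example sentence and its predictions are illustrative only:

    from transformers import pipeline
    
    # Fill-mask pipeline; [MASK] is BERT's mask placeholder token.
    fill_mask = pipeline("fill-mask", model="emilyalsentzer/Bio_ClinicalBERT")
    
    for prediction in fill_mask("The patient was admitted with shortness of [MASK]."):
        print(prediction["token_str"], round(prediction["score"], 3))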

License

The Bio_ClinicalBERT model is released under the MIT License, allowing for flexibility in both academic and commercial use.
