ClinicalBERT
medicalai
Introduction
ClinicalBERT is a specialized language model designed for the medical domain. It was pre-trained on a vast multicenter dataset comprising 1.2 billion words from diverse disease-related texts and fine-tuned on electronic health records (EHRs) from over 3 million patient records.
Architecture
ClinicalBERT is built upon the BERT architecture, employing a masked language model approach. This involves randomly masking tokens in a text and training the model to predict the original tokens based on the surrounding context.
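The masking step described above can be illustrated with a minimal sketch. It is not the authors' training code: the 15% masking rate follows the convention of the original BERT paper and is an assumption here, since the model card does not state the rate used for ClinicalBERT. The helper name `mask_tokens` and the sample sentence are hypothetical.

```python
import random

MASK_TOKEN = "[MASK]"  # placeholder token used by BERT-style vocabularies


def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Randomly replace tokens with [MASK] and record the originals as labels.

    mask_prob=0.15 is an assumed rate (the BERT convention), not a value
    taken from the ClinicalBERT model card.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)
            labels.append(token)  # the model is trained to recover this token
        else:
            masked.append(token)
            labels.append(None)   # unmasked positions are not scored
    return masked, labels


# Hypothetical example sentence, split on whitespace for simplicity.
tokens = "the patient was admitted with chest pain".split()
masked, labels = mask_tokens(tokens, mask_prob=0.15, seed=0)
print(masked)
print(labels)
```

This sketch always substitutes `[MASK]`. The original BERT recipe additionally replaces some selected tokens with a random token or leaves them unchanged, which is omitted here for brevity.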
Training
- Pretraining Data: The model was pretrained on a 1.2 billion-word corpus covering diverse diseases, then fine-tuned on EHRs from over 3 million patient records.
- Pretraining Procedures: ClinicalBERT was initialized from BERT and trained using masked language modeling.
- Pretraining Hyperparameters: Training used a batch size of 32, a maximum sequence length of 256, and a learning rate of 5e-5.
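For reference, the reported hyperparameters can be collected in a small configuration object. This is only a sketch: the class name `PretrainingConfig` and the derived `max_tokens_per_batch` property are hypothetical, and only the three values come from the model card.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PretrainingConfig:
    """Hyperparameters as reported in the model card (class name is hypothetical)."""

    batch_size: int = 32
    max_seq_length: int = 256
    learning_rate: float = 5e-5

    @property
    def max_tokens_per_batch(self) -> int:
        # Upper bound on tokens processed per optimization step.
        return self.batch_size * self.max_seq_length


config = PretrainingConfig()
print(config.max_tokens_per_batch)  # 32 * 256 = 8192
```

The maximum sequence length also matters at inference time: inputs longer than 256 tokens exceed what the model saw during training, so truncating to that length is a reasonable default.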
Guide: Running Locally
To use ClinicalBERT locally, follow these steps:
- Install the Hugging Face transformers library.
- Load the model and tokenizer with the following code:
    from transformers import AutoTokenizer, AutoModel

    tokenizer = AutoTokenizer.from_pretrained("medicalai/ClinicalBERT")
    model = AutoModel.from_pretrained("medicalai/ClinicalBERT")
- Optionally, run the model on a cloud GPU, such as those offered by AWS, Google Cloud, or Azure, for faster inference.
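Once the model is loaded, one common use is turning clinical text into fixed-size embeddings. The sketch below assumes PyTorch is installed alongside transformers. Mean pooling over token vectors is a choice made for this example, not something the model card prescribes, and the helper names `mean_pool` and `embed` are hypothetical. Truncation to 256 tokens mirrors the maximum sequence length reported under Training.

```python
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "medicalai/ClinicalBERT"


def mean_pool(last_hidden_state, attention_mask):
    """Average token vectors per sequence, ignoring padding positions."""
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1.0)  # avoid division by zero
    return summed / counts


def embed(texts, device=None):
    """Return one embedding per input text (downloads the model on first call)."""
    # Use a GPU when one is available, otherwise fall back to the CPU.
    device = device or ("cuda" if torch.cuda.is_available() else "cpu")
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModel.from_pretrained(MODEL_ID).to(device).eval()

    batch = tokenizer(
        texts,
        padding=True,
        truncation=True,
        max_length=256,  # matches the reported pretraining sequence length
        return_tensors="pt",
    ).to(device)

    with torch.no_grad():  # inference only, no gradients needed
        output = model(**batch)
    return mean_pool(output.last_hidden_state, batch["attention_mask"]).cpu()
```

Calling `embed(["Patient presents with chest pain."])` returns a tensor with one row per input text. The same code runs unchanged on a local CPU or a cloud GPU, since the device is detected at call time.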
License
The model's usage and distribution are subject to the terms specified in its associated license, which should be reviewed for compliance.