MedBERT (Charangan/MedBERT)
Introduction
MedBERT is a pre-trained transformer-based language model designed for biomedical named entity recognition. It is initialized from Bio_ClinicalBERT and further pre-trained on datasets including N2C2, BioNLP, and CRAFT.
Architecture
MedBERT is built on the BERT architecture and initialized from the Bio_ClinicalBERT checkpoint. It is distributed through the Hugging Face Transformers library, with PyTorch as the underlying framework.
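Because MedBERT retains BERT's architecture, its configuration can be inspected directly through the Transformers API. The snippet below is a minimal sketch using the model ID shown in the usage guide; it only reports the encoder layout and downloads no weights.

  from transformers import AutoConfig

  # Load only the model configuration (architecture hyperparameters, no weights)
  config = AutoConfig.from_pretrained("Charangan/MedBERT")
  print(config.model_type)  # expected to report a BERT-style encoder
  print(config.num_hidden_layers, config.hidden_size, config.num_attention_heads)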
Training
Data
The training data includes:
- N2C2: Clinical notes from the N2C2 2018 and 2022 challenges.
- BioNLP: Articles covering biomedical disciplines including molecular biology and infectious diseases.
- CRAFT: Full-text biomedical journal articles from PubMed Central.
- Wikipedia: Crawled articles related to medical topics.
Procedures
The training code is based on Google's BERT repository, and model parameters were initialized from Bio_ClinicalBERT. An illustrative sketch of a comparable setup using the Transformers library follows the hyperparameter list below.
Hyperparameters
- Batch Size: 32
- Maximum Sequence Length: 256
- Learning Rate: 1e-4
- Training Steps: 200,000
- Duplication Factor: 5
- Masked Language Model Probability: 0.15
- Max Predictions Per Sequence: 22
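The sketch below shows how a comparable masked-language-model pre-training run could be expressed with the Hugging Face Transformers and Datasets libraries rather than the original TensorFlow BERT code; it reuses the hyperparameters listed above. The corpus file name and the Bio_ClinicalBERT Hub ID are assumptions for illustration, and the duplication factor and max-predictions settings belong to Google's data-generation pipeline, so they are not reproduced here.

  from datasets import load_dataset
  from transformers import (
      AutoTokenizer,
      AutoModelForMaskedLM,
      DataCollatorForLanguageModeling,
      Trainer,
      TrainingArguments,
  )

  # Assumed Hub checkpoint for Bio_ClinicalBERT (used only as the starting point)
  base = "emilyalsentzer/Bio_ClinicalBERT"
  tokenizer = AutoTokenizer.from_pretrained(base)
  model = AutoModelForMaskedLM.from_pretrained(base)

  # Placeholder corpus: one document per line (clinical notes, articles, etc.)
  dataset = load_dataset("text", data_files={"train": "corpus.txt"})["train"]

  def tokenize(batch):
      # Maximum sequence length of 256, as listed above
      return tokenizer(batch["text"], truncation=True, max_length=256)

  tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

  # Mask 15% of tokens for the masked language modelling objective
  collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

  args = TrainingArguments(
      output_dir="medbert-pretraining",
      per_device_train_batch_size=32,  # batch size 32
      learning_rate=1e-4,              # learning rate 1e-4
      max_steps=200_000,               # 200,000 training steps
  )

  trainer = Trainer(
      model=model,
      args=args,
      train_dataset=tokenized,
      data_collator=collator,
  )
  trainer.train()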
Guide: Running Locally
To use MedBERT locally, follow these steps:
- Install the transformers library if not already installed:
  pip install transformers
- Initialize the tokenizer and model:
  from transformers import AutoTokenizer, AutoModel
  tokenizer = AutoTokenizer.from_pretrained("Charangan/MedBERT")
  model = AutoModel.from_pretrained("Charangan/MedBERT")
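- Optionally, verify the setup by encoding a sentence into contextual token embeddings (a minimal sketch; the example sentence is arbitrary):
  import torch

  inputs = tokenizer("The patient was started on metformin for type 2 diabetes.", return_tensors="pt")
  with torch.no_grad():
      outputs = model(**inputs)
  # One contextual vector per token: (batch size, sequence length, hidden size)
  print(outputs.last_hidden_state.shape)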
For more efficient processing, consider using cloud GPUs such as those provided by Google Cloud, AWS, or Azure.
License
MedBERT is licensed under the MIT License, allowing for open use and modification.