Introduction

MedBERT is a pre-trained, transformer-based language model designed for biomedical named entity recognition (NER). It is initialized from Bio_ClinicalBERT and further pre-trained on biomedical corpora including N2C2, BioNLP, and CRAFT.

Architecture

MedBERT is built on the BERT architecture and is initialized from the Bio_ClinicalBERT checkpoint. The released model is distributed through the Hugging Face transformers library, with PyTorch as the underlying framework.
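
Because the weights are published on the Hugging Face Hub, the inherited BERT configuration can be inspected directly. The following is a minimal sketch; the printed fields are standard BERT configuration attributes:

    from transformers import AutoConfig

    # Load the configuration shipped with the MedBERT checkpoint
    config = AutoConfig.from_pretrained("Charangan/MedBERT")
    print(config.model_type)          # model family ("bert")
    print(config.num_hidden_layers)   # number of transformer layers
    print(config.hidden_size)         # hidden representation size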

Training

Data

The training data includes:

  • N2C2: Clinical notes from the N2C2 2018 and 2022 challenges.
  • BioNLP: Articles covering biomedical disciplines including molecular biology and infectious diseases.
  • CRAFT: Full-text biomedical journal articles from PubMed Central.
  • Wikipedia: Crawled articles related to medical topics.

Procedures

The training code is based on Google's BERT repository, and model parameters were initialized from Bio_ClinicalBERT.
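
Google's original BERT repository is TensorFlow-based; purely as an illustration, the same initialization step can be sketched with the Hugging Face transformers library (the model ID below is the publicly released Bio_ClinicalBERT checkpoint):

    from transformers import AutoTokenizer, AutoModelForMaskedLM

    # Continue masked-language-model pre-training from Bio_ClinicalBERT weights
    tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
    model = AutoModelForMaskedLM.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")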

Hyperparameters

  • Batch Size: 32
  • Maximum Sequence Length: 256
  • Learning Rate: 1e-4
  • Training Steps: 200,000
  • Duplication Factor: 5
  • Masked Language Model Probability: 0.15
  • Max Predictions Per Sequence: 22
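
For readers working with the Hugging Face transformers library, the settings above can be approximated as follows. This is a hedged sketch, not the authors' script (the original run used Google's TensorFlow BERT code, and the output directory name here is hypothetical):

    from transformers import AutoTokenizer, DataCollatorForLanguageModeling, TrainingArguments

    tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")

    # Mask 15% of input tokens, matching the masked language model probability above
    collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

    args = TrainingArguments(
        output_dir="medbert-pretraining",   # hypothetical output path
        per_device_train_batch_size=32,     # batch size
        learning_rate=1e-4,                 # learning rate
        max_steps=200_000,                  # training steps
    )
    # The maximum sequence length (256) is applied when tokenizing the corpus,
    # e.g. tokenizer(text, truncation=True, max_length=256)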

Guide: Running Locally

To use MedBERT locally, follow these steps:

  1. Install the transformers library, along with PyTorch, which is required to load the model weights:
    pip install transformers torch
    
  2. Initialize the tokenizer and model:
    from transformers import AutoTokenizer, AutoModel
    tokenizer = AutoTokenizer.from_pretrained("Charangan/MedBERT")
    model = AutoModel.from_pretrained("Charangan/MedBERT")
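
  3. Optionally, run a quick fill-mask prediction to verify the setup. This is a minimal sketch using the transformers pipeline API; the example sentence is purely illustrative:
    from transformers import pipeline

    # MedBERT is a fill-mask model, so it predicts the token behind [MASK]
    fill_mask = pipeline("fill-mask", model="Charangan/MedBERT")
    print(fill_mask("The patient was prescribed [MASK] for hypertension."))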
    

For more efficient processing, consider using cloud GPUs such as those provided by Google Cloud, AWS, or Azure.

License

MedBERT is licensed under the MIT License, allowing for open use and modification.
