Biomed N L P Biomed B E R T base uncased abstract fulltext
microsoftIntroduction
BiomedBERT, developed by Microsoft, is a domain-specific language model pretrained for biomedical natural language processing tasks. It is crafted from scratch using abstracts and full-text articles from PubMed and PubMedCentral, achieving state-of-the-art performance in biomedical NLP tasks.
Architecture
BiomedBERT is built on the BERT architecture, specifically designed for the biomedical domain. It utilizes a bidirectional transformer mechanism and is provided in an uncased version, implying it does not distinguish between uppercase and lowercase letters. The model supports various frameworks including PyTorch and JAX.
Training
The model was pretrained using vast amounts of biomedical data, particularly abstracts from PubMed and full-text articles from PubMedCentral. The approach of pretraining from scratch, as opposed to fine-tuning a general-domain model, results in significant performance improvements for biomedical tasks. This model is benchmarked against the Biomedical Language Understanding and Reasoning Benchmark (BLURB).
Guide: Running Locally
To run BiomedBERT locally, follow these basic steps:
- Install the Transformers library:
pip install transformers
- Load the model:
from transformers import AutoModel, AutoTokenizer tokenizer = AutoTokenizer.from_pretrained("microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext") model = AutoModel.from_pretrained("microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext")
- Use the model for inference:
inputs = tokenizer("DNA is the [MASK] of life.", return_tensors="pt") outputs = model(**inputs)
For more computationally intensive tasks, consider using cloud-based solutions like AWS, Google Cloud, or Azure, which provide access to GPUs and TPUs.
License
BiomedBERT is released under the MIT License, allowing for flexible use and distribution.