BioBERT-Large-Cased-v1.1-SQuAD
dmis-lab
Introduction
BioBERT-Large-Cased-v1.1-SQuAD is a question-answering model developed by the DMIS-lab at Korea University. It keeps the BERT architecture but adds pre-training on biomedical corpora and is fine-tuned on the SQuAD question-answering dataset, aiming to improve performance on biomedical text-mining tasks.
Architecture
BioBERT uses the same Transformer encoder architecture as BERT; this variant adds a span-prediction head and is fine-tuned for extractive question answering. Additional pre-training on biomedical corpora improves its understanding and accuracy in the biomedical domain.
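To make the extractive question-answering setup concrete, the sketch below (assuming PyTorch and the transformers library are installed) runs a question/context pair through the model and decodes the answer span from the start and end logits produced by the span-prediction head. The question and context are illustrative placeholders, not examples from the model card.

```python
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

model_id = "dmis-lab/biobert-large-cased-v1.1-squad"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForQuestionAnswering.from_pretrained(model_id)

# Illustrative inputs (placeholders, not from the model card).
question = "What causes scurvy?"
context = "Scurvy is a disease resulting from a deficiency of vitamin C."

# Encode the pair as one sequence: [CLS] question [SEP] context [SEP]
inputs = tokenizer(question, context, return_tensors="pt",
                   truncation=True, max_length=512)

with torch.no_grad():
    outputs = model(**inputs)

# The QA head emits one start logit and one end logit per token;
# a simple decoding takes the highest-scoring start and end positions.
start_idx = int(torch.argmax(outputs.start_logits))
end_idx = int(torch.argmax(outputs.end_logits))
answer_ids = inputs["input_ids"][0, start_idx : end_idx + 1]
print(tokenizer.decode(answer_ids, skip_special_tokens=True))
```

This greedy argmax decoding is the simplest possible strategy; production pipelines typically score candidate spans jointly and filter out invalid ones (for example, spans that end before they start or fall inside the question).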
Training
Training Data
The model is initialized from BERT-Base weights pre-trained on English Wikipedia and BooksCorpus, then further pre-trained on PubMed abstracts and PMC full-text articles. The optimal number of additional pre-training steps was found to be 200K for PubMed and 270K for PMC.
Training Procedure
BioBERT was pre-trained using Naver Smart Machine Learning (NSML), a platform that supports large-scale experiments across multiple GPUs. The maximum sequence length was set to 512 and the mini-batch size to 192.
Environmental Impact
Pre-training was conducted on eight NVIDIA V100 (32GB) GPUs, while fine-tuning used a single NVIDIA Titan Xp (12GB) GPU. No specific figures for carbon emissions are provided.
Guide: Running Locally
To run the BioBERT model locally, follow these steps:
- Install dependencies: Ensure you have Python and the transformers library installed, along with a backend such as PyTorch.
- Load the model: Use the following code to load the model and tokenizer:

    from transformers import AutoTokenizer, AutoModelForQuestionAnswering

    tokenizer = AutoTokenizer.from_pretrained("dmis-lab/biobert-large-cased-v1.1-squad")
    model = AutoModelForQuestionAnswering.from_pretrained("dmis-lab/biobert-large-cased-v1.1-squad")

- Inference: Use the loaded model and tokenizer for question-answering tasks on biomedical texts; a runnable sketch is shown below.
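As a minimal end-to-end sketch of the inference step, the example below uses the transformers pipeline API, which wraps tokenization, model inference, and answer-span decoding in a single call. The question and context are illustrative placeholders, and the device argument assumes a CUDA GPU is available; omit it to run on CPU.

```python
from transformers import pipeline

# High-level question-answering pipeline around the BioBERT SQuAD checkpoint.
qa = pipeline(
    "question-answering",
    model="dmis-lab/biobert-large-cased-v1.1-squad",
    tokenizer="dmis-lab/biobert-large-cased-v1.1-squad",
    device=0,  # assumes a CUDA GPU; omit this argument to run on CPU
)

# Illustrative biomedical question/context pair (placeholders).
result = qa(
    question="Which gene is mutated in cystic fibrosis?",
    context=(
        "Cystic fibrosis is caused by mutations in the CFTR gene, "
        "which encodes a chloride channel expressed in epithelial cells."
    ),
)
print(result)  # dict with 'score', 'start', 'end', and the extracted 'answer'
```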
For optimal performance, consider using cloud GPUs such as those provided by AWS, Google Cloud, or Azure.
License
The specific license details for BioBERT-Large-Cased-v1.1-SQuAD are not provided. Users should verify the licensing terms before use.