BioBERT-Large-Cased-v1.1-SQuAD
dmis-lab
Introduction
BioBERT-Large-Cased-v1.1-SQuAD is a question-answering model developed by the DMIS-lab at Korea University. It keeps the BERT architecture but adds pre-training on biomedical corpora and is fine-tuned on the SQuAD question-answering dataset, aiming to improve performance on biomedical text-mining tasks.
Architecture
BioBERT uses the same Transformer encoder architecture as BERT; this variant adds a span-prediction head and is fine-tuned for extractive question answering. Additional pre-training on biomedical corpora improves its understanding and accuracy in the biomedical domain.
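To make the extractive question-answering setup concrete, the sketch below (assuming PyTorch and the transformers library are installed) runs a question/context pair through the model and decodes the answer span from the start and end logits produced by the span-prediction head. The question and context are illustrative placeholders, not examples from the model card.

```python
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

model_id = "dmis-lab/biobert-large-cased-v1.1-squad"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForQuestionAnswering.from_pretrained(model_id)

# Illustrative inputs (placeholders, not from the model card).
question = "What causes scurvy?"
context = "Scurvy is a disease resulting from a deficiency of vitamin C."

# Encode the pair as one sequence: [CLS] question [SEP] context [SEP]
inputs = tokenizer(question, context, return_tensors="pt",
                   truncation=True, max_length=512)

with torch.no_grad():
    outputs = model(**inputs)

# The QA head emits one start logit and one end logit per token;
# a simple decoding takes the highest-scoring start and end positions.
start_idx = int(torch.argmax(outputs.start_logits))
end_idx = int(torch.argmax(outputs.end_logits))
answer_ids = inputs["input_ids"][0, start_idx : end_idx + 1]
print(tokenizer.decode(answer_ids, skip_special_tokens=True))
```

This greedy argmax decoding is the simplest possible strategy; production pipelines typically score candidate spans jointly and filter out invalid ones (for example, spans that end before they start or fall inside the question).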
Training
Training Data
The model is initialized from BERT-Base weights pre-trained on English Wikipedia and BooksCorpus, then further pre-trained on PubMed abstracts and PMC full-text articles. The optimal number of additional pre-training steps was found to be 200K for PubMed and 270K for PMC.
Training Procedure
BioBERT was pre-trained using Naver Smart Machine Learning (NSML), a platform that supports large-scale experiments across multiple GPUs. The maximum sequence length was set to 512 and the mini-batch size to 192.
Environmental Impact
Pre-training was conducted on eight NVIDIA V100 (32GB) GPUs, while fine-tuning used a single NVIDIA Titan Xp (12GB) GPU. No specific figures for carbon emissions are provided.
Guide: Running Locally
To run the BioBERT model locally, follow these steps:
- Install dependencies: Ensure you have Python and the transformers library installed, along with a backend such as PyTorch.
- Load the model: Use the following code to load the model and tokenizer:

    from transformers import AutoTokenizer, AutoModelForQuestionAnswering

    tokenizer = AutoTokenizer.from_pretrained("dmis-lab/biobert-large-cased-v1.1-squad")
    model = AutoModelForQuestionAnswering.from_pretrained("dmis-lab/biobert-large-cased-v1.1-squad")

- Inference: Use the loaded model and tokenizer for question-answering tasks on biomedical texts; a runnable sketch is shown below.
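As a minimal end-to-end sketch of the inference step, the example below uses the transformers pipeline API, which wraps tokenization, model inference, and answer-span decoding in a single call. The question and context are illustrative placeholders, and the device argument assumes a CUDA GPU is available; omit it to run on CPU.

```python
from transformers import pipeline

# High-level question-answering pipeline around the BioBERT SQuAD checkpoint.
qa = pipeline(
    "question-answering",
    model="dmis-lab/biobert-large-cased-v1.1-squad",
    tokenizer="dmis-lab/biobert-large-cased-v1.1-squad",
    device=0,  # assumes a CUDA GPU; omit this argument to run on CPU
)

# Illustrative biomedical question/context pair (placeholders).
result = qa(
    question="Which gene is mutated in cystic fibrosis?",
    context=(
        "Cystic fibrosis is caused by mutations in the CFTR gene, "
        "which encodes a chloride channel expressed in epithelial cells."
    ),
)
print(result)  # dict with 'score', 'start', 'end', and the extracted 'answer'
```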
For optimal performance, consider using cloud GPUs such as those provided by AWS, Google Cloud, or Azure.
License
The specific license details for BioBERT-Large-Cased-v1.1-SQuAD are not provided. Users should verify the licensing terms before use.