Bioformer-8L

Introduction

Bioformer-8L is a lightweight BERT model designed for biomedical text mining. It uses a biomedical-specific vocabulary and is pre-trained solely on biomedical-domain corpora. It is three times faster than BERT-base while offering comparable or superior performance to BioBERT and PubMedBERT on a variety of biomedical NLP tasks. The model has 8 transformer layers, a hidden size of 512, and 8 self-attention heads, for a total of 42,820,610 parameters.
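
To check the reported size locally, one can load the model and count its parameters (a minimal sketch; the exact total may differ slightly from 42,820,610 depending on which heads AutoModel loads):

    from transformers import AutoModel

    # Load Bioformer-8L and count its parameters
    model = AutoModel.from_pretrained("bioformers/bioformer-8L")
    total = sum(p.numel() for p in model.parameters())
    print(f"{total:,} parameters")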

Architecture

Bioformer-8L uses a cased WordPiece vocabulary derived from a biomedical corpus of 33 million PubMed abstracts and 1 million PMC full-text articles. The vocabulary size is 32,768, similar to that of the original BERT. Pre-training uses whole-word masking with a 15% masking rate and retains the Next Sentence Prediction (NSP) objective in case downstream tasks require it.
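
The effect of the biomedical vocabulary can be inspected directly with the tokenizer (a minimal sketch; the example sentence and its token splits are illustrative):

    from transformers import AutoTokenizer

    # Load the cased WordPiece tokenizer that ships with Bioformer-8L
    tokenizer = AutoTokenizer.from_pretrained("bioformers/bioformer-8L")
    print(tokenizer.vocab_size)  # 32768, per the description above
    print(tokenizer.tokenize("Metformin lowers hepatic glucose production."))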

Training

Bioformer-8L was pre-trained on a single Cloud TPU device with a maximum sequence length of 512 and a batch size of 256, for 2 million steps over approximately 8.3 days. SciSpacy was used for sentence segmentation of the pre-training corpus.
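
For reference, sentence segmentation with SciSpacy looks like the following (a minimal sketch; en_core_sci_sm is an assumed model choice, as the card does not name the specific SciSpacy model):

    import spacy

    # SciSpacy sentence segmentation; requires the scispacy package
    # plus the en_core_sci_sm model (assumed here)
    nlp = spacy.load("en_core_sci_sm")
    doc = nlp("Diabetes affects glucose metabolism. It has several subtypes.")
    for sent in doc.sents:
        print(sent.text)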

Guide: Running Locally

Prerequisites:

  • Python 3
  • PyTorch
  • Transformers
  • Datasets

Installation Steps:

  1. Install PyTorch by following the official installation instructions at pytorch.org.
  2. Install the transformers and datasets libraries:
    pip install transformers
    pip install datasets
    
  3. Use the model with the transformers library:
    from transformers import pipeline

    # Load a fill-mask pipeline backed by Bioformer-8L
    unmasker8L = pipeline('fill-mask', model='bioformers/bioformer-8L')
    results = unmasker8L("[MASK] refers to a group of diseases that affect how the body uses blood sugar (glucose)")
    print(results)
    
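Beyond fill-mask, the same checkpoint can serve as a feature extractor; the sketch below (an assumption, not part of the original guide) pulls the 512-dimensional [CLS] embedding:

    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bioformers/bioformer-8L")
    model = AutoModel.from_pretrained("bioformers/bioformer-8L")

    inputs = tokenizer("Insulin regulates glucose metabolism.", return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # One 512-dimensional [CLS] vector per input sequence
    print(outputs.last_hidden_state[:, 0, :].shape)  # torch.Size([1, 512])
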

Cloud GPUs:
For faster inference and fine-tuning, consider using a cloud GPU service; the snippet below shows how to target a GPU.
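
A minimal sketch for placing the pipeline on a GPU (assumes at least one CUDA device is available):

    from transformers import pipeline

    # device=0 selects the first GPU; device=-1 falls back to CPU
    unmasker8L = pipeline('fill-mask', model='bioformers/bioformer-8L', device=0)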

License

Bioformer-8L is licensed under the Apache-2.0 License.
