Bioformer-8L
Introduction
Bioformer-8L is a lightweight BERT model for biomedical text mining. It uses a biomedical-specific vocabulary and is pre-trained solely on biomedical domain corpora. It is roughly three times faster than BERT-base while delivering comparable or superior performance to BioBERT and PubMedBERT on a range of biomedical NLP tasks. The model has 8 transformer layers, a hidden embedding size of 512, and 8 self-attention heads, for a total of 42,820,610 parameters.
Architecture
Bioformer-8L uses a cased WordPiece vocabulary trained on a biomedical corpus of 33 million PubMed abstracts and 1 million PMC full-text articles. The vocabulary size is 32,768, similar in size to the original BERT vocabulary. Pre-training uses whole-word masking with a 15% masking rate and retains the Next Sentence Prediction (NSP) objective in case downstream tasks require it.
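As a quick check on the vocabulary described above, the following sketch loads the tokenizer from the Hugging Face Hub and inspects it. It assumes the transformers library is installed and uses the bioformers/bioformer-8L model id from the guide below.

```python
from transformers import AutoTokenizer

# Load the cased WordPiece tokenizer shipped with Bioformer-8L.
tokenizer = AutoTokenizer.from_pretrained("bioformers/bioformer-8L")

# Expected to print 32768 (the vocabulary size described above),
# barring any added special tokens.
print(len(tokenizer))

# Tokenize a biomedical sentence to see domain-specific wordpieces.
print(tokenizer.tokenize("Metformin is used to treat type 2 diabetes."))
```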
Training
Bioformer-8L was pre-trained on a single Cloud TPU device with a maximum sequence length of 512 and a batch size of 256 for 2 million steps, taking approximately 8.3 days. Sentence segmentation of the pre-training corpus was performed with SciSpacy.
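For context, here is a minimal sketch of SciSpacy-style sentence segmentation. It assumes the scispacy package and its small biomedical pipeline en_core_sci_sm are installed; the exact pipeline used for pre-training is not specified here.

```python
import spacy

# Assumes: pip install scispacy, plus the en_core_sci_sm model package.
nlp = spacy.load("en_core_sci_sm")

text = ("Diabetes mellitus is a group of metabolic disorders. "
        "It is characterized by high blood sugar over a prolonged period.")

# Emit one sentence per line, the usual input format for
# BERT-style pre-training corpora.
for sent in nlp(text).sents:
    print(sent.text)
```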
Guide: Running Locally
Prerequisites:
- Python 3
- PyTorch
- Transformers
- Datasets
Installation Steps:
- Install PyTorch by following the official installation instructions at pytorch.org.
- Install the transformers and datasets libraries:

  ```
  pip install transformers
  pip install datasets
  ```
- Use the model with the transformers library:

  ```python
  from transformers import pipeline

  unmasker8L = pipeline('fill-mask', model='bioformers/bioformer-8L')
  unmasker8L("[MASK] refers to a group of diseases that affect how the body uses blood sugar (glucose)")
  ```
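The fill-mask call returns the top-scoring completions for the [MASK] token as a list of dictionaries (score, token, and filled sequence). Beyond fill-mask, the same checkpoint can also serve as a plain encoder; the sketch below is an assumption, not part of the original guide, and extracts a sentence embedding with AutoModel.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bioformers/bioformer-8L")
model = AutoModel.from_pretrained("bioformers/bioformer-8L")

inputs = tokenizer("Insulin regulates glucose metabolism.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Take the [CLS] token's final hidden state as a sentence embedding;
# its width (512) matches the hidden size noted in the Architecture section.
cls_embedding = outputs.last_hidden_state[:, 0, :]
print(cls_embedding.shape)  # torch.Size([1, 512])
```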
Cloud GPUs:
Inference with this lightweight model runs comfortably on CPU, but fine-tuning and large-scale inference are faster on a GPU; consider a cloud GPU service if local hardware is limited.
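If a GPU is available, the pipeline API accepts a device argument; the sketch below (an illustration, assuming a CUDA-capable machine) places the model on the first GPU and falls back to CPU otherwise.

```python
import torch
from transformers import pipeline

# device=0 selects the first CUDA GPU; -1 keeps the pipeline on CPU.
device = 0 if torch.cuda.is_available() else -1
unmasker8L = pipeline('fill-mask', model='bioformers/bioformer-8L', device=device)
```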
License
Bioformer-8L is licensed under the Apache-2.0 License.