BiodivBERT Documentation

Introduction

BiodivBERT is a domain-specific, BERT-based model for biodiversity literature. It is pre-trained on a large corpus of biodiversity-related abstracts and full texts and can be fine-tuned for downstream tasks such as Named Entity Recognition (NER) and Relation Extraction (RE) in the biodiversity domain.

Architecture

  • Model Type: Domain-specific BERT-based cased model.
  • Tokenizer: Uses the BERT base cased tokenizer (see the tokenization check after this list).
  • Pre-training: Conducted on abstracts and full texts from biodiversity literature, specifically from Springer and Elsevier publications spanning 1990-2020.
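
Because the vocabulary is the standard BERT base cased WordPiece vocabulary, biodiversity terms are typically split into sub-tokens rather than kept whole. A quick check (illustrative only; the example sentence is arbitrary):

    from transformers import AutoTokenizer

    # Load the tokenizer shipped with the BiodivBERT checkpoint.
    tokenizer = AutoTokenizer.from_pretrained("NoYo25/BiodivBERT")

    # Domain terms such as species names are split into WordPiece sub-tokens.
    print(tokenizer.tokenize("Pollinator decline affects Bombus terrestris populations."))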

Training

  • Training Data: Abstracts and full texts crawled via the Springer and Elsevier APIs using biodiversity-related keywords.
  • Hyperparameters (see the pre-training sketch after this list):
    • Maximum Length (MAX_LEN): 512
    • Masked Language Model Proportion (MLM_PROP): 0.15
    • Training Epochs: 3
    • Batch Size (Train/Eval): 16
    • Gradient Accumulation Steps: 4
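
A minimal sketch of how these hyperparameters could be wired into a Hugging Face masked-language-modelling run; the corpus file name and the tokenization helper are illustrative assumptions, not the released training code:

    from datasets import load_dataset
    from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                              DataCollatorForLanguageModeling,
                              Trainer, TrainingArguments)

    MAX_LEN = 512      # maximum sequence length
    MLM_PROP = 0.15    # proportion of tokens masked for the MLM objective

    # Start from BERT base cased, as described in the Architecture section.
    tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
    model = AutoModelForMaskedLM.from_pretrained("bert-base-cased")

    # "biodiv_corpus.txt" is a placeholder for the crawled abstracts and full texts.
    dataset = load_dataset("text", data_files={"train": "biodiv_corpus.txt"})

    def tokenize(batch):
        return tokenizer(batch["text"], truncation=True, max_length=MAX_LEN)

    tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

    # Dynamic masking with the 0.15 masking proportion listed above.
    collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=MLM_PROP)

    args = TrainingArguments(
        output_dir="biodivbert-pretraining",
        num_train_epochs=3,
        per_device_train_batch_size=16,
        per_device_eval_batch_size=16,
        gradient_accumulation_steps=4,
    )

    Trainer(model=model, args=args,
            train_dataset=tokenized["train"],
            data_collator=collator).train()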

Guide: Running Locally

Basic Steps

  1. Masked Language Model (a quick inference check follows these steps):
    from transformers import AutoTokenizer, AutoModelForMaskedLM
    
    tokenizer = AutoTokenizer.from_pretrained("NoYo25/BiodivBERT")
    model = AutoModelForMaskedLM.from_pretrained("NoYo25/BiodivBERT")
    
  2. Token Classification (NER):
    from transformers import AutoTokenizer, AutoModelForTokenClassification
    
    tokenizer = AutoTokenizer.from_pretrained("NoYo25/BiodivBERT")
    model = AutoModelForTokenClassification.from_pretrained("NoYo25/BiodivBERT")
    
  3. Sequence Classification (Relation Extraction):
    from transformers import AutoTokenizer, AutoModelForSequenceClassification
    
    tokenizer = AutoTokenizer.from_pretrained("NoYo25/BiodivBERT")
    model = AutoModelForSequenceClassification.from_pretrained("NoYo25/BiodivBERT")
    
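Once the pre-trained checkpoint is loaded, its masked-language-model head can be probed directly; a minimal check (the example sentence is arbitrary):

    from transformers import pipeline

    # Masked-token prediction works out of the box with the pre-trained weights.
    fill_mask = pipeline("fill-mask", model="NoYo25/BiodivBERT")
    for prediction in fill_mask("Coral [MASK] are threatened by ocean warming."):
        print(prediction["token_str"], round(prediction["score"], 3))

Note that the token- and sequence-classification models in steps 2 and 3 attach freshly initialized classification heads to the pre-trained encoder; they are intended to be fine-tuned on labeled NER or relation-extraction data before they produce meaningful predictions.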

Cloud GPUs

Pre-training and fine-tuning benefit substantially from GPU acceleration; consider cloud GPUs from providers such as AWS, Google Cloud, or Azure.

License

BiodivBERT is released under the Apache-2.0 license.
