BiodivBERT
NoYo25/BiodivBERT Documentation
Introduction
BiodivBERT is a domain-specific BERT-based model tailored to the biodiversity literature. It is pre-trained on a large corpus of biodiversity-related abstracts and full texts, and it is intended to be fine-tuned for downstream tasks such as Named Entity Recognition (NER) and Relation Extraction in the biodiversity domain.
Architecture
- Model Type: Domain-specific BERT-based cased model.
- Tokenizer: Utilizes the BERT base cased tokenizer.
- Pre-training: Conducted on abstracts and full texts from biodiversity literature, specifically from Springer and Elsevier publications spanning 1990-2020.
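Because BiodivBERT reuses the BERT base cased tokenizer, it can be loaded and tokenized with like any other cased BERT checkpoint. A minimal sketch (the example sentence is purely illustrative):

```python
from transformers import AutoTokenizer

# Load the tokenizer shipped with the checkpoint; it follows the BERT base cased
# scheme, so casing ("Quercus" vs. "quercus") is preserved.
tokenizer = AutoTokenizer.from_pretrained("NoYo25/BiodivBERT")

encoding = tokenizer(
    "Quercus robur is a keystone species in temperate forests.",
    truncation=True,
    max_length=512,  # matches the MAX_LEN used during pre-training (see Training)
)
print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))
```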
Training
- Training Data: Crawled using keywords related to biodiversity and sourced from Springer and Elsevier APIs.
- Hyperparameters (used in the pre-training sketch after this list):
  - Maximum Sequence Length (MAX_LEN): 512
  - Masked-Token Proportion (MLM_PROP): 0.15
  - Training Epochs: 3
  - Batch Size (Train/Eval): 16
  - Gradient Accumulation Steps: 4
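These hyperparameters map directly onto the Hugging Face Trainer API. The sketch below is not the authors' original training script: the corpus file name and the choice of bert-base-cased as the starting checkpoint are assumptions for illustration, while the numeric values are the ones listed above.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Assumption: continue pre-training from the BERT base cased checkpoint.
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-cased")

# Placeholder file; the actual corpus was crawled from the Springer and Elsevier APIs.
corpus = load_dataset("text", data_files={"train": "biodiv_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)  # MAX_LEN = 512

tokenized = corpus.map(tokenize, batched=True, remove_columns=["text"])

# MLM_PROP = 0.15: 15% of tokens are masked for the masked-language-modeling objective.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="biodivbert-pretraining",
    num_train_epochs=3,              # Training Epochs: 3
    per_device_train_batch_size=16,  # Batch Size (Train): 16
    per_device_eval_batch_size=16,   # Batch Size (Eval): 16
    gradient_accumulation_steps=4,   # Gradient Accumulation Steps: 4
)

Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=collator,
).train()
```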
Guide: Running Locally
Basic Steps
- Masked Language Model (see the usage sketch after this list):
    from transformers import AutoTokenizer, AutoModelForMaskedLM
    tokenizer = AutoTokenizer.from_pretrained("NoYo25/BiodivBERT")
    model = AutoModelForMaskedLM.from_pretrained("NoYo25/BiodivBERT")
- Token Classification (NER) (see the fine-tuning sketch after this list):
    from transformers import AutoTokenizer, AutoModelForTokenClassification
    tokenizer = AutoTokenizer.from_pretrained("NoYo25/BiodivBERT")
    model = AutoModelForTokenClassification.from_pretrained("NoYo25/BiodivBERT")
- Sequence Classification (Relation Extraction) (see the sketch after this list):
    from transformers import AutoTokenizer, AutoModelForSequenceClassification
    tokenizer = AutoTokenizer.from_pretrained("NoYo25/BiodivBERT")
    model = AutoModelForSequenceClassification.from_pretrained("NoYo25/BiodivBERT")
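For the masked-language-model checkpoint, predictions can be queried directly, for example through the fill-mask pipeline. The example sentence below is illustrative only:

```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="NoYo25/BiodivBERT")

# [MASK] is the mask token of the BERT base cased tokenizer used by BiodivBERT.
for prediction in fill_mask("Deforestation is a major driver of [MASK] loss."):
    print(prediction["token_str"], round(prediction["score"], 3))
```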
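For token classification, the published checkpoint provides only the pre-trained encoder, so the classification head is freshly initialized and must be fine-tuned on an annotated NER corpus. A minimal configuration sketch; the label set below is hypothetical:

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer

# Hypothetical tag set; use the label scheme of your own biodiversity NER corpus.
labels = ["O", "B-Species", "I-Species", "B-Habitat", "I-Habitat"]

tokenizer = AutoTokenizer.from_pretrained("NoYo25/BiodivBERT")
model = AutoModelForTokenClassification.from_pretrained(
    "NoYo25/BiodivBERT",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)
# The token-classification head is randomly initialized here and needs
# fine-tuning (e.g. with Trainer) before it produces meaningful entities.
```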
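Likewise, the sequence-classification head for relation extraction is not part of the released checkpoint; the number of relation classes below is an assumption, and the head must be fine-tuned before the logits are meaningful:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("NoYo25/BiodivBERT")
# num_labels is a placeholder for the number of relation types in your dataset.
model = AutoModelForSequenceClassification.from_pretrained("NoYo25/BiodivBERT", num_labels=3)

# Illustrative input: a sentence mentioning two candidate entities.
inputs = tokenizer(
    "The butterfly Pieris rapae feeds on Brassica oleracea.",
    truncation=True,
    max_length=512,
    return_tensors="pt",
)
with torch.no_grad():
    predicted_class = model(**inputs).logits.argmax(dim=-1)
print(predicted_class)
```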
Cloud GPUs
For optimal performance, consider using cloud GPUs from providers like AWS, Google Cloud, or Azure.
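Whichever provider you choose, moving the model and its inputs onto the GPU follows the standard PyTorch pattern (a sketch, assuming a CUDA device is available):

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = AutoTokenizer.from_pretrained("NoYo25/BiodivBERT")
model = AutoModelForMaskedLM.from_pretrained("NoYo25/BiodivBERT").to(device)

inputs = tokenizer("Coral [MASK] are threatened by ocean warming.", return_tensors="pt").to(device)
with torch.no_grad():
    outputs = model(**inputs)
```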
License
BiodivBERT is released under the Apache-2.0 license.