BiodivBERT Documentation

Introduction

BiodivBERT is a domain-specific, BERT-based model for biodiversity literature. It is pre-trained on a large corpus of biodiversity-related abstracts and full texts and can be fine-tuned for downstream tasks such as Named Entity Recognition (NER) and Relation Extraction (RE) in the biodiversity domain.

Architecture

  • Model Type: Domain-specific BERT-based cased model.
  • Tokenizer: Uses the BERT base cased tokenizer (see the tokenization check after this list).
  • Pre-training: Conducted on abstracts and full texts from biodiversity literature, specifically from Springer and Elsevier publications spanning 1990-2020.
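
Because the vocabulary is the standard BERT base cased WordPiece vocabulary, biodiversity terms are typically split into sub-tokens rather than kept whole. A quick check (illustrative only; the example sentence is arbitrary):

    from transformers import AutoTokenizer

    # Load the tokenizer shipped with the BiodivBERT checkpoint.
    tokenizer = AutoTokenizer.from_pretrained("NoYo25/BiodivBERT")

    # Domain terms such as species names are split into WordPiece sub-tokens.
    print(tokenizer.tokenize("Pollinator decline affects Bombus terrestris populations."))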

Training

  • Training Data: Abstracts and full texts crawled via the Springer and Elsevier APIs using biodiversity-related keywords.
  • Hyperparameters (see the pre-training sketch after this list):
    • Maximum Length (MAX_LEN): 512
    • Masked Language Model Proportion (MLM_PROP): 0.15
    • Training Epochs: 3
    • Batch Size (Train/Eval): 16
    • Gradient Accumulation Steps: 4
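
A minimal sketch of how these hyperparameters could be wired into a Hugging Face masked-language-modelling run; the corpus file name and the tokenization helper are illustrative assumptions, not the released training code:

    from datasets import load_dataset
    from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                              DataCollatorForLanguageModeling,
                              Trainer, TrainingArguments)

    MAX_LEN = 512      # maximum sequence length
    MLM_PROP = 0.15    # proportion of tokens masked for the MLM objective

    # Start from BERT base cased, as described in the Architecture section.
    tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
    model = AutoModelForMaskedLM.from_pretrained("bert-base-cased")

    # "biodiv_corpus.txt" is a placeholder for the crawled abstracts and full texts.
    dataset = load_dataset("text", data_files={"train": "biodiv_corpus.txt"})

    def tokenize(batch):
        return tokenizer(batch["text"], truncation=True, max_length=MAX_LEN)

    tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

    # Dynamic masking with the 0.15 masking proportion listed above.
    collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=MLM_PROP)

    args = TrainingArguments(
        output_dir="biodivbert-pretraining",
        num_train_epochs=3,
        per_device_train_batch_size=16,
        per_device_eval_batch_size=16,
        gradient_accumulation_steps=4,
    )

    Trainer(model=model, args=args,
            train_dataset=tokenized["train"],
            data_collator=collator).train()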

Guide: Running Locally

Basic Steps

  1. Masked Language Model (a quick inference check follows these steps):
    from transformers import AutoTokenizer, AutoModelForMaskedLM
    
    tokenizer = AutoTokenizer.from_pretrained("NoYo25/BiodivBERT")
    model = AutoModelForMaskedLM.from_pretrained("NoYo25/BiodivBERT")
    
  2. Token Classification (NER):
    from transformers import AutoTokenizer, AutoModelForTokenClassification
    
    tokenizer = AutoTokenizer.from_pretrained("NoYo25/BiodivBERT")
    model = AutoModelForTokenClassification.from_pretrained("NoYo25/BiodivBERT")
    
  3. Sequence Classification (Relation Extraction):
    from transformers import AutoTokenizer, AutoModelForSequenceClassification
    
    tokenizer = AutoTokenizer.from_pretrained("NoYo25/BiodivBERT")
    model = AutoModelForSequenceClassification.from_pretrained("NoYo25/BiodivBERT")
    
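Once the pre-trained checkpoint is loaded, its masked-language-model head can be probed directly; a minimal check (the example sentence is arbitrary):

    from transformers import pipeline

    # Masked-token prediction works out of the box with the pre-trained weights.
    fill_mask = pipeline("fill-mask", model="NoYo25/BiodivBERT")
    for prediction in fill_mask("Coral [MASK] are threatened by ocean warming."):
        print(prediction["token_str"], round(prediction["score"], 3))

Note that the token- and sequence-classification models in steps 2 and 3 attach freshly initialized classification heads to the pre-trained encoder; they are intended to be fine-tuned on labeled NER or relation-extraction data before they produce meaningful predictions.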

Cloud GPUs

Pre-training and fine-tuning benefit substantially from GPU acceleration; consider cloud GPUs from providers such as AWS, Google Cloud, or Azure.

License

BiodivBERT is released under the Apache-2.0 license.
