Introduction

GeoBERT is a Named Entity Recognition (NER) model fine-tuned from SciBERT on the Geoscientific Corpus dataset. It is designed to identify domain-specific semantic categories in geoscientific texts.

Architecture

GeoBERT is built upon the SciBERT architecture, adapted to understand and categorize entities within the geoscience domain. It is trained to recognize four main semantic types:

  1. GeoPetro: Entities related to geosciences.
  2. GeoMeth: Tools or methods in geosciences.
  3. GeoLoc: Geological locations.
  4. GeoTime: Geological time scale entities.
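
The exact tag strings stored in the checkpoint (for example, whether a B-/I- BIO scheme is used for these four categories) are not spelled out in this card; a quick way to check is to read id2label from the model configuration:

    from transformers import AutoConfig

    # Print the label set shipped with the GeoBERT checkpoint; the mapping
    # shows the exact tag names used for the four semantic types.
    config = AutoConfig.from_pretrained("botryan96/GeoBERT")
    print(config.id2label)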

Training

GeoBERT was trained on the Labeled Geoscientific Corpus dataset, which includes approximately 1 million sentences. The training utilized the following hyperparameters:

  • Optimizer: AdamWeightDecay with the learning rate governed by a PolynomialDecay schedule (a sketch of this setup appears after the framework versions below).
  • Training precision: mixed_float16.

Framework versions used include Transformers 4.22.1, TensorFlow 2.10.0, Datasets 2.4.0, and Tokenizers 0.12.1.
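
The card names the optimizer, schedule, and precision but not the full training script. A minimal sketch of that setup in TensorFlow, using the create_optimizer helper from Transformers (which pairs AdamWeightDecay with a polynomial learning-rate decay), could look like the following; the learning rate and step counts are placeholders, not the values used for GeoBERT.

    import tensorflow as tf
    from transformers import create_optimizer

    # Mixed-precision training, as stated in the card.
    tf.keras.mixed_precision.set_global_policy("mixed_float16")

    # create_optimizer pairs an AdamWeightDecay optimizer with a polynomial
    # learning-rate decay schedule. The numbers below are illustrative only,
    # not GeoBERT's actual hyperparameters.
    num_train_steps = 100_000
    optimizer, lr_schedule = create_optimizer(
        init_lr=2e-5,
        num_train_steps=num_train_steps,
        num_warmup_steps=int(0.1 * num_train_steps),
    )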

Model Performance

The model's entity-level performance, evaluated with seqeval, is as follows:

  • GeoLoc: Precision 0.9727, Recall 0.9591, F1 Score 0.9658
  • GeoMeth: Precision 0.9433, Recall 0.9447, F1 Score 0.9445
  • GeoPetro: Precision 0.9767, Recall 0.9745, F1 Score 0.9756
  • GeoTime: Precision 0.9695, Recall 0.9666, F1 Score 0.9680
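
As a reference for how such entity-level scores are computed, the snippet below applies the seqeval library to a pair of toy tag sequences; the tags are made up for illustration, and the use of a B-/I- prefix scheme here is an assumption rather than something stated in this card.

    from seqeval.metrics import classification_report, f1_score

    # Toy gold and predicted tag sequences in BIO format (illustrative only;
    # not taken from the GeoBERT evaluation data).
    y_true = [["B-GeoLoc", "I-GeoLoc", "O", "B-GeoTime", "O", "B-GeoMeth"]]
    y_pred = [["B-GeoLoc", "I-GeoLoc", "O", "B-GeoTime", "O", "O"]]

    print(f1_score(y_true, y_pred))
    print(classification_report(y_true, y_pred))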

Guide: Running Locally

To use GeoBERT locally, follow these steps:

  1. Install the Transformers library:

    pip install transformers
    
  2. Load GeoBERT and its tokenizer:

    from transformers import AutoTokenizer, AutoModelForTokenClassification
    
    tokenizer = AutoTokenizer.from_pretrained("botryan96/GeoBERT")
    model = AutoModelForTokenClassification.from_pretrained("botryan96/GeoBERT")
    
  3. Define the pipeline:

    from transformers import pipeline
    
    # aggregation_strategy="simple" merges sub-word tokens into whole-entity predictions
    ner_machine = pipeline('ner', model=model, tokenizer=tokenizer, aggregation_strategy="simple")
    
  4. Run the NER pipeline (ner_machine) on a sample sentence:

    sentence = "In North America, the water storage in the seepage face model is higher than the base case..."
    ner_results = ner_machine(sentence)
    print(ner_results)
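
With aggregation_strategy="simple", the pipeline returns a list of dictionaries whose fields include the entity group, the matched text span, and a confidence score. A small loop such as the one below (output not reproduced here) prints them in a more readable form:

    # Each aggregated result contains 'entity_group', 'word', 'score',
    # 'start' and 'end'.
    for entity in ner_results:
        print(f"{entity['entity_group']:>10}  {entity['word']}  ({entity['score']:.3f})")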
    

For optimal performance, consider using cloud GPUs such as those provided by AWS, Google Cloud, or Azure.

License

GeoBERT is made available under the terms specified in its repository, which set out the conditions for its use and modification. Review the license before deploying or modifying the model.
