GeoBERT
Introduction
GeoBERT is a Named Entity Recognition (NER) model fine-tuned from SciBERT, specifically developed to process the Geoscientific Corpus dataset. It is designed to identify specific semantic categories within geoscientific texts.
Architecture
GeoBERT is built upon the SciBERT architecture, adapted to understand and categorize entities within the geoscience domain. It is trained to recognize four main semantic types:
- GeoPetro: Entities related to geosciences.
- GeoMeth: Tools or methods in geosciences.
- GeoLoc: Geological locations.
- GeoTime: Geological time scale entities.
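As a quick check that the checkpoint carries these categories, the label set can be read directly from the model config. Whether the labels use a plain or BIO-prefixed scheme (e.g., B-GeoLoc/I-GeoLoc) depends on how the checkpoint was exported, so the comment below is illustrative only:

from transformers import AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained("botryan96/GeoBERT")
# Maps class indices to label strings, e.g. {0: 'O', 1: 'B-GeoLoc', ...}; exact names depend on the checkpoint.
print(model.config.id2label)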
Training
GeoBERT was trained on the Labeled Geoscientific Corpus dataset, which includes approximately 1 million sentences. The training utilized the following hyperparameters:
- Optimizer: AdamWeightDecay with a learning rate set by PolynomialDecay.
- Training precision: mixed_float16.
Framework versions used include Transformers 4.22.1, TensorFlow 2.10.0, Datasets 2.4.0, and Tokenizers 0.12.1.
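The card does not list the exact learning-rate values or step counts, so the following is only a minimal TensorFlow sketch of the stated setup (AdamWeightDecay driven by a PolynomialDecay schedule, with mixed_float16 precision); all numeric values are placeholders:

import tensorflow as tf
from transformers import AdamWeightDecay

# Enable mixed-precision training as stated above.
tf.keras.mixed_precision.set_global_policy("mixed_float16")

# Placeholder schedule; the actual initial rate and decay steps are not given in the card.
lr_schedule = tf.keras.optimizers.schedules.PolynomialDecay(
    initial_learning_rate=2e-5,
    decay_steps=10_000,
    end_learning_rate=0.0,
)
optimizer = AdamWeightDecay(learning_rate=lr_schedule, weight_decay_rate=0.01)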
Model Performance
The model's performance, evaluated using the SEQEVAL metric, is as follows:
- GeoLoc: Precision 0.9727, Recall 0.9591, F1 Score 0.9658
- GeoMeth: Precision 0.9433, Recall 0.9447, F1 Score 0.9445
- GeoPetro: Precision 0.9767, Recall 0.9745, F1 Score 0.9756
- GeoTime: Precision 0.9695, Recall 0.9666, F1 Score 0.9680
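Scores of this form can be computed on a labeled evaluation split with the seqeval metric, for example through the evaluate library; the tag sequences below are illustrative placeholders rather than the actual evaluation data:

import evaluate

seqeval = evaluate.load("seqeval")

# Illustrative tag sequences; the real evaluation uses the labeled geoscientific corpus.
references = [["B-GeoLoc", "I-GeoLoc", "O", "B-GeoMeth"]]
predictions = [["B-GeoLoc", "I-GeoLoc", "O", "B-GeoMeth"]]

results = seqeval.compute(predictions=predictions, references=references)
print(results["GeoLoc"])      # per-entity precision, recall, F1, and support
print(results["overall_f1"])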
Guide: Running Locally
To use GeoBERT locally, follow these steps:
- Install the Transformers library:

pip install transformers
- Load GeoBERT and its tokenizer:

from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("botryan96/GeoBERT")
model = AutoModelForTokenClassification.from_pretrained("botryan96/GeoBERT")
- Define the NER pipeline:

from transformers import pipeline

ner_machine = pipeline('ner', model=model, tokenizer=tokenizer, aggregation_strategy="simple")
- Run the NER machine on a sample sentence:

sentence = "In North America, the water storage in the seepage face model is higher than the base case..."
ner_results = ner_machine(sentence)
print(ner_results)
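The pipeline returns a list of dictionaries in the standard Transformers aggregated-NER format; the values below are purely illustrative, not actual model output:

# Illustrative output shape (entity, score, and offsets are hypothetical):
# [{'entity_group': 'GeoLoc', 'score': 0.99, 'word': 'North America', 'start': 3, 'end': 16}, ...]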
For optimal performance, consider using cloud GPUs such as those provided by AWS, Google Cloud, or Azure.
License
GeoBERT is made available under the terms specified in its repository. Review the license before deploying or modifying the model.