scibert_scivocab_uncased

allenai

Introduction

SciBERT is a pretrained language model specifically designed for scientific text. It is based on BERT and was trained on a corpus of 1.14 million papers (3.1 billion tokens) from Semantic Scholar. The model comes in both cased and uncased versions, each using a custom WordPiece vocabulary ("scivocab") built from the training data.

Architecture

SciBERT uses the same architecture as BERT. Its main departure is the tokenizer: a WordPiece tokenizer with a specialized vocabulary ("scivocab") constructed from scientific text, so that tokenization better matches the word distribution of the scientific literature.
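
As an illustration of the vocabulary difference, the sketch below compares how the scivocab tokenizer and the general-domain bert-base-uncased tokenizer split the same sentence. The example sentence is arbitrary, and the exact subword splits depend on the two vocabularies rather than on anything specific to this model card.

    from transformers import AutoTokenizer

    # Load SciBERT's scivocab tokenizer and the general-domain BERT tokenizer.
    scibert_tok = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
    bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")

    text = "The kinase phosphorylates the substrate protein."

    # scivocab typically needs fewer subword pieces for domain-specific terms.
    print(scibert_tok.tokenize(text))
    print(bert_tok.tokenize(text))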

Training

The model was trained on the full text of papers from the Semantic Scholar corpus. The dataset spans multiple scientific fields, with the bulk of the papers drawn from the biomedical domain and computer science, giving the model broad coverage of scientific writing. Both cased and uncased models were trained, so users can choose whichever matches the case-sensitivity requirements of their application.
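
As a minimal sketch, assuming the standard Hugging Face model IDs, the cased variant can be loaded simply by swapping the model name:

    from transformers import AutoModel, AutoTokenizer

    # The cased variant preserves capitalization, which can matter for tokens such as gene names.
    cased_id = "allenai/scibert_scivocab_cased"
    tokenizer = AutoTokenizer.from_pretrained(cased_id)
    model = AutoModel.from_pretrained(cased_id)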

Guide: Running Locally

  1. Prerequisites: Ensure you have Python and PyTorch installed. Install the transformers library from Hugging Face.

    pip install transformers
    
  2. Download the Model: Use Hugging Face's transformers library to download and load the model.

    from transformers import AutoModel, AutoTokenizer

    # Load the uncased SciBERT checkpoint and its matching scivocab tokenizer.
    model_name = "allenai/scibert_scivocab_uncased"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    
  3. Inference: Prepare your scientific text data and use the model for inference. A batched example with pooled sentence embeddings follows this list.

    # Tokenize the text and run a forward pass to obtain contextual embeddings.
    inputs = tokenizer("Your scientific text here", return_tensors="pt")
    outputs = model(**inputs)
    # outputs.last_hidden_state has shape (batch_size, sequence_length, hidden_size)
    
  4. Hardware Recommendations: For optimal performance, especially with large datasets, consider cloud GPUs such as AWS EC2 GPU instances, Google Cloud GPUs, or Azure GPU instances.
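
The sketch below continues steps 2 and 3: it reuses the model and tokenizer objects loaded above, moves the model to a GPU when one is available, and mean-pools the token embeddings into one vector per input. Mean pooling is one common way to obtain sentence-level embeddings, not something prescribed by the model itself.

    import torch

    # Use a GPU if available (see step 4); otherwise fall back to CPU.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device)
    model.eval()

    texts = [
        "The kinase phosphorylates the substrate protein.",
        "We evaluate the model on a citation intent classification task.",
    ]
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt").to(device)

    with torch.no_grad():
        outputs = model(**inputs)

    # Mean-pool token embeddings, ignoring padding, to get one vector per sentence.
    mask = inputs["attention_mask"].unsqueeze(-1).float()
    embeddings = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
    print(embeddings.shape)  # torch.Size([2, 768])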

License

The SciBERT model is distributed under the Apache 2.0 license, which permits personal and commercial use, modification, and distribution. For further details, refer to the original repository: https://github.com/allenai/scibert.
