allenai/scibert_scivocab_cased
Introduction
SciBERT is a pretrained language model for scientific text, developed by the Allen Institute for AI (AI2). It is based on BERT and was trained on a large corpus of scientific papers from Semantic Scholar: 1.14 million papers totaling 3.1 billion tokens. The training corpus consists of the full text of the papers, not just the abstracts.
Architecture
SciBERT uses the BERT architecture but is adapted to scientific text through a specialized WordPiece vocabulary, "scivocab", built from the scientific training corpus. Both cased and uncased versions are available; this model is the cased variant.
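As a quick illustration of what scivocab changes, the snippet below (a minimal sketch, assuming the `transformers` library is installed and the Hugging Face Hub is reachable) tokenizes the same sentence with SciBERT's cased tokenizer and with the general-domain `bert-base-cased` tokenizer; scientific terms typically split into fewer word pieces under scivocab.

```python
from transformers import AutoTokenizer

# SciBERT's cased tokenizer (scivocab) and a general-domain BERT tokenizer for comparison.
sci_tok = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_cased")
bert_tok = AutoTokenizer.from_pretrained("bert-base-cased")

text = "The phosphorylation of tau protein was quantified by immunoblotting."

# Print the word pieces produced by each vocabulary.
print("scivocab: ", sci_tok.tokenize(text))
print("basevocab:", bert_tok.tokenize(text))
```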
Training
The model was trained on the full text of scientific papers from Semantic Scholar, the same corpus used to build scivocab, so the vocabulary reflects the terminology of scientific writing. This domain-specific pretraining helps SciBERT perform well on tasks involving scientific literature.
Guide: Running Locally
To run SciBERT locally, follow these steps:
- Install Hugging Face Transformers: Ensure you have the `transformers` library installed:

  ```bash
  pip install transformers
  ```
- Load the Model: Use the `transformers` library to load SciBERT:

  ```python
  from transformers import AutoModel, AutoTokenizer

  tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_cased")
  model = AutoModel.from_pretrained("allenai/scibert_scivocab_cased")
  ```
- Inference: Prepare your input data and run the model (a fuller end-to-end sketch follows this list):

  ```python
  inputs = tokenizer("Your scientific text here", return_tensors="pt")
  outputs = model(**inputs)
  ```
- Use Cloud GPUs: For intensive tasks, consider leveraging cloud GPU services like AWS, GCP, or Azure to accelerate processing.
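Putting the steps together, here is a minimal end-to-end sketch (assuming PyTorch and `transformers` are installed) that encodes a sentence and extracts token embeddings, moving the model to a GPU when one is available:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Load the cased SciBERT checkpoint.
tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_cased")
model = AutoModel.from_pretrained("allenai/scibert_scivocab_cased")

# Use a GPU if one is available (e.g., on a cloud instance).
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
model.eval()

text = "The BRCA1 gene is implicated in DNA damage repair."
inputs = tokenizer(text, return_tensors="pt").to(device)

# Forward pass without tracking gradients.
with torch.no_grad():
    outputs = model(**inputs)

# outputs.last_hidden_state has shape (batch, sequence_length, hidden_size);
# the first token ([CLS]) is often used as a sentence-level representation.
cls_embedding = outputs.last_hidden_state[:, 0, :]
print(cls_embedding.shape)  # torch.Size([1, 768])
```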
License
SciBERT is released under the licensing terms specified by its developers at AI2. Refer to the original repository for details and comply with those terms when using the model.