MathBERT
Introduction
MathBERT is a pretrained transformer model for mathematics-related applications, trained with a masked language modeling (MLM) objective. It is uncased, meaning it does not distinguish between lowercase and uppercase letters.
Architecture
MathBERT is based on the BERT architecture and was pretrained on a large corpus of English mathematical text. The model is trained with two objectives: masked language modeling (MLM) and next sentence prediction (NSP). MLM randomly masks 15% of the tokens in the input and trains the model to predict them, which allows it to learn bidirectional sentence representations. NSP concatenates two masked sentences and trains the model to predict whether they were originally consecutive in the source text.
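As a quick illustration of the MLM objective, the following minimal sketch queries MathBERT through the Hugging Face fill-mask pipeline; the example sentence is illustrative and not taken from the training data.

  from transformers import pipeline

  fill_mask = pipeline('fill-mask', model='tbs17/MathBERT')
  # The model ranks candidate tokens for the [MASK] position.
  for prediction in fill_mask("The derivative of a constant is [MASK]."):
      print(prediction['token_str'], prediction['score'])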
Training
Training Data
MathBERT was pretrained on a diverse math corpus, including pre-K to high school curricula, college math textbooks, and graduate-level math from arXiv abstracts, comprising approximately 100 million tokens.
Training Procedure
Texts are tokenized using WordPiece with a vocabulary size of 30,522. Tokens are masked with a probability of 15%, and the model is trained on cloud TPUs for 600,000 steps with a batch size of 128, using the Adam optimizer.
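The masking step can be sketched with the transformers data collator. This is only a minimal illustration of the 15% masking probability described above, not the exact training pipeline.

  from transformers import BertTokenizer, DataCollatorForLanguageModeling

  tokenizer = BertTokenizer.from_pretrained('tbs17/MathBERT')  # WordPiece tokenizer
  collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

  # Each call randomly selects ~15% of tokens for masking; 'labels' holds the
  # original ids at masked positions and -100 (ignored by the loss) elsewhere.
  batch = collator([tokenizer("The area of a circle is pi times r squared.")])
  print(batch['input_ids'])
  print(batch['labels'])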
Guide: Running Locally
To use MathBERT locally, you can follow these steps:
- Install the transformers library:

  pip install transformers
- Load the model and tokenizer in PyTorch:

  from transformers import BertTokenizer, BertModel

  tokenizer = BertTokenizer.from_pretrained('tbs17/MathBERT', output_hidden_states=True)
  model = BertModel.from_pretrained('tbs17/MathBERT')

  text = "Replace me by any text you'd like."
  encoded_input = tokenizer(text, return_tensors='pt')
  output = model(**encoded_input)  # unpack the tensor dict for PyTorch
- Load the model and tokenizer in TensorFlow:

  from transformers import BertTokenizer, TFBertModel

  tokenizer = BertTokenizer.from_pretrained('tbs17/MathBERT', output_hidden_states=True)
  model = TFBertModel.from_pretrained('tbs17/MathBERT')

  text = "Replace me by any text you'd like."
  encoded_input = tokenizer(text, return_tensors='tf')
  output = model(encoded_input)  # Keras models accept the encoding dict directly
For more computationally intensive tasks, consider using cloud GPUs from providers such as AWS, Google Cloud, or Azure to accelerate processing.
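As a minimal sketch of GPU use with the PyTorch variant shown above (assuming a PyTorch build with CUDA support), the model and inputs simply need to be moved to the same device:

  import torch
  from transformers import BertTokenizer, BertModel

  device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
  tokenizer = BertTokenizer.from_pretrained('tbs17/MathBERT')
  model = BertModel.from_pretrained('tbs17/MathBERT').to(device)

  encoded_input = tokenizer("Replace me by any text you'd like.", return_tensors='pt')
  encoded_input = {k: v.to(device) for k, v in encoded_input.items()}
  output = model(**encoded_input)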
License
The MathBERT model and its associated code can be accessed on GitHub at https://github.com/tbs17/MathBERT. Please refer to the repository for specific licensing information.