MathBERT

tbs17

Introduction

MathBERT is a pretrained transformer model for mathematics-related applications, trained with a masked language modeling (MLM) objective. It is uncased, meaning it does not distinguish between lowercase and uppercase letters ("algebra" and "Algebra" are treated identically).

Architecture

MathBERT is based on the BERT architecture and was pretrained on a large corpus of English mathematical text. Pretraining uses two objectives: masked language modeling (MLM) and next sentence prediction (NSP). In MLM, 15% of the tokens in a sentence are masked and the model learns to predict them, which yields bidirectional sentence representations. In NSP, the model predicts whether two concatenated masked sentences were adjacent in the original text.
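
Because MathBERT is trained with the MLM objective, it can be queried through the Hugging Face fill-mask pipeline. Below is a minimal sketch; the example sentence is purely illustrative.

    from transformers import pipeline

    # Load MathBERT as a masked-token predictor (the MLM objective above).
    unmasker = pipeline('fill-mask', model='tbs17/MathBERT')

    # Print the top-scoring candidates for the [MASK] position.
    for prediction in unmasker("students will learn to [MASK] fractions."):
        print(prediction['token_str'], round(prediction['score'], 3))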

Training

Training Data

MathBERT was pretrained on a diverse math corpus, including pre-K to high school curricula, college math textbooks, and graduate-level math from arXiv abstracts, comprising approximately 100 million tokens.

Training Procedure

Texts are tokenized with WordPiece using a vocabulary of 30,522 tokens. Each token is masked with a probability of 15%, and the model is trained on cloud TPUs for 600,000 steps with a batch size of 128, using the Adam optimizer.
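
As a quick sanity check of the tokenization described above, the sketch below loads MathBERT's WordPiece tokenizer and inspects it; the example sentence is arbitrary.

    from transformers import BertTokenizer

    # Load MathBERT's WordPiece tokenizer.
    tokenizer = BertTokenizer.from_pretrained('tbs17/MathBERT')

    # The vocabulary size should match the 30,522 noted above.
    print(tokenizer.vocab_size)

    # WordPiece splits out-of-vocabulary words into '##'-prefixed subwords.
    print(tokenizer.tokenize("the antiderivative of a polynomial"))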

Guide: Running Locally

To use MathBERT locally, you can follow these steps:

  1. Install the transformers library:

    pip install transformers
    
  2. Load the model and tokenizer in PyTorch:

    from transformers import BertTokenizer, BertModel

    # output_hidden_states=True makes the model return the hidden states
    # of every layer, not just the last one.
    tokenizer = BertTokenizer.from_pretrained('tbs17/MathBERT')
    model = BertModel.from_pretrained('tbs17/MathBERT', output_hidden_states=True)

    text = "Replace me by any text you'd like."
    encoded_input = tokenizer(text, return_tensors='pt')
    output = model(**encoded_input)  # unpack the encoding into keyword arguments
    
  3. Load the model and tokenizer in TensorFlow:

    from transformers import BertTokenizer, TFBertModel

    # Same setup as above, using the TensorFlow model class.
    tokenizer = BertTokenizer.from_pretrained('tbs17/MathBERT')
    model = TFBertModel.from_pretrained('tbs17/MathBERT', output_hidden_states=True)

    text = "Replace me by any text you'd like."
    encoded_input = tokenizer(text, return_tensors='tf')
    output = model(encoded_input)  # TF models accept the encoding dict directly
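
In both snippets, output.last_hidden_state holds the final-layer token embeddings, and, because output_hidden_states=True is set on the model, output.hidden_states is a tuple of per-layer states. A minimal sketch of inspecting the PyTorch output:

    # Final-layer embeddings: (batch_size, sequence_length, hidden_size)
    print(output.last_hidden_state.shape)

    # One tensor per layer, plus one for the embedding layer.
    print(len(output.hidden_states))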
    

For more computationally intensive tasks, consider using cloud GPUs from providers such as AWS, Google Cloud, or Azure to accelerate processing.

License

The MathBERT model and its associated code can be accessed on GitHub at https://github.com/tbs17/MathBERT. Please refer to the repository for specific licensing information.
