MultiBERTs Seed 0 Checkpoint 0K
Introduction
The MultiBERTs Seed 0 Checkpoint 0K is an intermediate checkpoint (saved at 0k training steps) of the MultiBERTs Seed 0 model, a BERT model pretrained on English text with a masked language modeling (MLM) objective. It is uncased, meaning it does not distinguish between uppercase and lowercase English text. The checkpoint was introduced in the paper "The MultiBERTs: BERT Reproductions for Robustness Analysis" and released to support robustness analysis across multiple BERT pretraining runs.
Architecture
MultiBERTs models are transformer-based models pretrained on a large corpus of English text. They utilize self-supervised learning, specifically focusing on two primary objectives: masked language modeling (MLM) and next sentence prediction (NSP). MLM involves masking a portion of the input text tokens and training the model to predict those masked tokens, while NSP involves determining whether two sentences follow each other in the original text.
Training
The training data for MultiBERTs consists of BookCorpus and English Wikipedia. During preprocessing, texts are lowercased and tokenized with WordPiece using a 30,000-token vocabulary. During training, 15% of the tokens in each sequence are masked, and the model is trained on Cloud TPU v2 hardware with a sequence length of 512 tokens. The optimizer is Adam, with the learning rate and decay schedule following the original BERT setup.
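The 15% masking rate follows the standard BERT recipe. As a rough sketch (not the exact MultiBERTs preprocessing code), the dynamic masking step can be reproduced with Transformers' DataCollatorForLanguageModeling and mlm_probability=0.15:

from transformers import BertTokenizer, DataCollatorForLanguageModeling

# Sketch of the standard 15% masking scheme, assuming the usual BERT
# WordPiece vocabulary; the real MultiBERTs pipeline may differ in detail.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

examples = [tokenizer("the quick brown fox jumps over the lazy dog")]
batch = collator(examples)
print(tokenizer.convert_ids_to_tokens(batch["input_ids"][0].tolist()))
print(batch["labels"][0].tolist())  # -100 everywhere except masked positions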
Guide: Running Locally
To run the MultiBERTs model locally using PyTorch, follow these steps:
- Install the Hugging Face Transformers library using pip:
pip install transformers
- Load the model and tokenizer, then run a forward pass (a sketch of querying the MLM head with this checkpoint appears after this list):
from transformers import BertTokenizer, BertModel

# Load the tokenizer and the checkpoint weights
tokenizer = BertTokenizer.from_pretrained('multiberts-seed-0-0k')
model = BertModel.from_pretrained("multiberts-seed-0-0k")

# Encode a sample text and run a forward pass to get hidden states
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
- For optimal performance, especially for large inputs, consider using cloud GPUs such as those available from Google Cloud, AWS, or Azure.
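Since the model is pretrained with an MLM objective, you may also want to query the MLM head directly. The sketch below is an assumption rather than part of the official card: it loads the checkpoint into BertForMaskedLM, and if the saved weights do not include the MLM head, Transformers will warn that a new head was randomly initialized; in that case (and for this very early 0K checkpoint in general) the predictions will not be meaningful.

import torch
from transformers import BertTokenizer, BertForMaskedLM

# Assumes the checkpoint can be loaded into BertForMaskedLM; if the MLM head
# is missing from the saved weights, it will be randomly initialized.
tokenizer = BertTokenizer.from_pretrained('multiberts-seed-0-0k')
model = BertForMaskedLM.from_pretrained('multiberts-seed-0-0k')

text = "The capital of France is [MASK]."
inputs = tokenizer(text, return_tensors='pt')
with torch.no_grad():
    logits = model(**inputs).logits

# Find the [MASK] position and show the top-5 predicted tokens for it.
mask_pos = (inputs['input_ids'][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
top_ids = logits[0, mask_pos].topk(5, dim=-1).indices[0]
print(tokenizer.convert_ids_to_tokens(top_ids.tolist()))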
License
The MultiBERTs model is licensed under the Apache-2.0 License, allowing for broad use and distribution, provided that the license terms are met.