NOMIC-BERT-2048
Introduction
NOMIC-BERT-2048 is a pretrained BERT model developed by Nomic AI with a maximum sequence length of 2048 tokens. It incorporates several enhancements over the original BERT architecture: Rotary Position Embeddings (RoPE), which allow the context length to be extrapolated beyond what was seen during training; SwiGLU activations, which improve performance; and training with dropout disabled.
Architecture
The model employs Rotary Position Embeddings and SwiGLU activations, two changes inspired by MosaicBERT. RoPE encodes token positions as rotations applied inside the attention computation, which lets the model extrapolate to context lengths beyond those seen in training, while SwiGLU replaces the standard feed-forward activation to improve performance. Together these modifications let the model handle long sequences effectively, making it suitable for applications requiring extensive context.
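For intuition, here is a minimal, self-contained PyTorch sketch of these two components. The layer names, shapes, and hyperparameters are illustrative assumptions, not the model's actual implementation (the real code ships with the checkpoint via trust_remote_code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def rotary_embed(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Illustrative RoPE: rotate interleaved channel pairs of x (..., seq_len, dim)
    by position-dependent angles so attention scores depend on relative offsets."""
    seq_len, dim = x.shape[-2], x.shape[-1]
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * inv_freq[None, :]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

class SwiGLU(nn.Module):
    """Illustrative gated feed-forward block: down-project silu(x W_gate) * (x W_up)."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden_dim, bias=False)
        self.w_up = nn.Linear(dim, hidden_dim, bias=False)
        self.w_down = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))
```

Because RoPE encodes position as rotations applied inside attention rather than as learned absolute embeddings, nothing in the sketch is tied to a fixed maximum length, which is what makes context-length extrapolation possible.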
Training
NOMIC-BERT-2048 was trained on BookCorpus and a 2023 Wikipedia dump. Documents were tokenized and packed into fixed-length sequences of 2048 tokens: when a document fell short of 2048 tokens, further documents were appended to fill the sequence, and documents longer than 2048 tokens were split across sequences. Evaluated on the GLUE benchmark, the model performs comparably to other BERT models while offering the added benefit of handling much longer sequences.
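The packing scheme just described can be summarized in a short sketch. This is an illustration of the procedure, not Nomic's training code, and `tokenized_docs` is a hypothetical iterable of token-id lists:

```python
def pack_documents(tokenized_docs, seq_len=2048):
    """Concatenate tokenized documents and cut the stream into fixed-length
    training sequences: short docs are appended together, long docs are split."""
    buffer = []
    for doc in tokenized_docs:          # each doc: a list of token ids
        buffer.extend(doc)              # append docs until the buffer is full
        while len(buffer) >= seq_len:
            yield buffer[:seq_len]      # emit one full 2048-token sequence
            buffer = buffer[seq_len:]   # remainder of a long doc carries over
    if buffer:
        yield buffer                    # trailing partial sequence (padded in practice)
```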
Guide: Running Locally
To use NOMIC-BERT-2048 for masked language modeling:
- Install the transformers library.
- Load the model and tokenizer:
```python
from transformers import AutoModelForMaskedLM, AutoConfig, AutoTokenizer, pipeline

# nomic-bert-2048 reuses the bert-base-uncased tokenizer.
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
config = AutoConfig.from_pretrained('nomic-ai/nomic-bert-2048', trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained('nomic-ai/nomic-bert-2048', config=config, trust_remote_code=True)

classifier = pipeline('fill-mask', model=model, tokenizer=tokenizer, device="cpu")
print(classifier("I [MASK] to the store yesterday."))
```
- To fine-tune for a sequence classification task (a usage sketch follows this list):
```python
from transformers import AutoConfig, AutoModelForSequenceClassification

model_path = "nomic-ai/nomic-bert-2048"
config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
# strict=False tolerates the classification-head weights missing from the checkpoint.
model = AutoModelForSequenceClassification.from_pretrained(
    model_path, config=config, trust_remote_code=True, strict=False
)
```
- Consider using cloud GPUs such as those offered by AWS, Google Cloud, or Azure for more efficient training and inference.
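As a usage sketch for the fine-tuning setup above, the snippet below runs a single forward pass through the freshly loaded classification head. It assumes the remote-code model accepts standard BERT inputs; the example text is arbitrary, and the head's logits are meaningless until the model is actually fine-tuned:

```python
import torch
from transformers import AutoTokenizer

# Same tokenizer as in the masked language modeling example above.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Long inputs are supported up to the model's 2048-token limit.
inputs = tokenizer(
    "A long document to classify ...",
    truncation=True,
    max_length=2048,
    return_tensors="pt",
)

with torch.no_grad():
    logits = model(**inputs).logits   # shape: (1, num_labels); head is untrained
print(logits)
```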
License
NOMIC-BERT-2048 is released under the Apache 2.0 license, which allows for both personal and commercial use, modification, and distribution, provided that the license terms are met.