ModernBERT-large
answerdotai/ModernBERT-large
Introduction
ModernBERT is a modernized BERT-style Transformer model designed for long-context language processing. It supports tasks such as retrieval, classification, and semantic search, leveraging recent architectural improvements for efficiency. Trained on a mix of English text and code data, it is suitable for various downstream applications including code retrieval and hybrid semantic search.
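As a rough sketch of the semantic-search use case, the snippet below mean-pools the encoder's last hidden state into sentence embeddings and compares them with cosine similarity; the mean-pooling strategy and the example sentences are illustrative assumptions, not an official recipe from the model card.

    import torch
    from transformers import AutoTokenizer, AutoModel

    model_id = "answerdotai/ModernBERT-large"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id)

    # Example queries for a toy semantic-search comparison (illustrative only).
    sentences = ["How do I reset my password?", "Steps to recover a forgotten password."]
    inputs = tokenizer(sentences, padding=True, return_tensors="pt")

    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state        # (batch, seq_len, hidden)

    # Mean-pool over non-padding tokens to get one vector per sentence
    # (an assumed pooling choice, not prescribed by the model card).
    mask = inputs["attention_mask"].unsqueeze(-1)
    embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)

    similarity = torch.nn.functional.cosine_similarity(embeddings[0], embeddings[1], dim=0)
    print(f"Cosine similarity: {similarity.item():.3f}")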
Architecture
ModernBERT uses a bidirectional encoder-only Transformer architecture with enhancements like Rotary Positional Embeddings (RoPE) for long-context support and Local-Global Alternating Attention. It employs Unpadding and Flash Attention for efficient inference, with a context length of up to 8,192 tokens. The model is available in two sizes: ModernBERT-base (22 layers, 149M parameters) and ModernBERT-large (28 layers, 395M parameters).
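To make the Local-Global Alternating Attention pattern concrete, here is a small mask-building sketch: some layers attend globally while the rest use a sliding local window. The window size and the every-third-layer cadence below are illustrative defaults; the released configuration files define the exact values.

    import numpy as np

    def layer_attention_mask(seq_len: int, layer_idx: int,
                             window: int = 128, global_every: int = 3) -> np.ndarray:
        """Boolean mask (True = may attend) for one encoder layer.

        Sketch of local-global alternating attention: every `global_every`-th
        layer attends globally, the others use a sliding window of `window`
        tokens. Parameter values here are illustrative assumptions.
        """
        if layer_idx % global_every == 0:
            # Global layer: every token can attend to every other token.
            return np.ones((seq_len, seq_len), dtype=bool)
        # Local layer: token i attends only to tokens within window // 2 positions.
        positions = np.arange(seq_len)
        distance = np.abs(positions[:, None] - positions[None, :])
        return distance <= window // 2

    # First four layers of a 1,024-token sequence: global, local, local, global.
    for layer in range(4):
        mask = layer_attention_mask(seq_len=1024, layer_idx=layer)
        kind = "global" if mask.all() else "local"
        print(f"layer {layer}: {kind}, {int(mask.sum()):,} attendable pairs")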
Training
- Architecture: Encoder-only, Pre-Norm Transformer with GeGLU activations.
- Sequence Length: Pre-trained with sequences up to 1,024 tokens, extended to 8,192 tokens.
- Data: 2 trillion tokens of English text and code.
- Optimizer: StableAdamW with a trapezoidal learning rate schedule and 1-sqrt decay (sketched below this list).
- Hardware: Trained on 8x H100 GPUs.
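The trapezoidal (warmup-stable-decay) schedule with 1-sqrt final decay can be sketched as a simple function of the step count; the warmup/decay fractions and peak learning rate below are placeholders, not the values used for training.

    def trapezoidal_lr(step: int, total_steps: int, peak_lr: float = 5e-4,
                       warmup_frac: float = 0.05, decay_frac: float = 0.1) -> float:
        """Trapezoidal learning-rate schedule with a 1-sqrt final decay.

        Linear warmup, a long constant plateau, then a decay phase where the
        rate falls as 1 - sqrt(progress). All fractions and the peak rate are
        illustrative placeholders, not ModernBERT's actual hyperparameters.
        """
        warmup_steps = int(total_steps * warmup_frac)
        decay_steps = int(total_steps * decay_frac)
        decay_start = total_steps - decay_steps

        if step < warmup_steps:                            # linear warmup
            return peak_lr * step / max(warmup_steps, 1)
        if step < decay_start:                             # constant plateau
            return peak_lr
        progress = (step - decay_start) / max(decay_steps, 1)
        return peak_lr * (1.0 - progress ** 0.5)           # 1-sqrt decay to zero

    # Sample the schedule at a few points of a hypothetical 100,000-step run.
    for s in (0, 2_500, 5_000, 50_000, 95_000, 100_000):
        print(f"step {s:>7}: lr = {trapezoidal_lr(s, 100_000):.2e}")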
Guide: Running Locally
- Install Transformers:

    pip install git+https://github.com/huggingface/transformers.git
- Install Flash Attention (if supported by GPU):

    pip install flash-attn
- Load Model and Tokenizer:

    from transformers import AutoTokenizer, AutoModelForMaskedLM

    model_id = "answerdotai/ModernBERT-large"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForMaskedLM.from_pretrained(model_id)
- Inference Example:

    text = "The capital of France is [MASK]."
    inputs = tokenizer(text, return_tensors="pt")
    outputs = model(**inputs)

    # Locate the [MASK] position and decode the highest-scoring token.
    masked_index = inputs["input_ids"][0].tolist().index(tokenizer.mask_token_id)
    predicted_token_id = outputs.logits[0, masked_index].argmax(axis=-1)
    predicted_token = tokenizer.decode(predicted_token_id)
    print("Predicted token:", predicted_token)
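The same masked-token prediction can also be run through the fill-mask pipeline, which bundles tokenization, the forward pass, and decoding into one call; the top_k value below is an illustrative choice.

    from transformers import pipeline

    # Convenience alternative to the manual steps above.
    fill_mask = pipeline("fill-mask", model="answerdotai/ModernBERT-large")
    for prediction in fill_mask("The capital of France is [MASK].", top_k=3):
        print(prediction["token_str"], round(prediction["score"], 3))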
For optimal performance, consider running the model on cloud GPUs such as NVIDIA A100 or H100.
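On such hardware the model can be loaded in half precision and moved to the GPU, optionally selecting the Flash Attention backend installed earlier; the dtype and the explicit attn_implementation flag below are assumptions about one reasonable configuration, since the library can also fall back to other attention backends.

    import torch
    from transformers import AutoTokenizer, AutoModelForMaskedLM

    model_id = "answerdotai/ModernBERT-large"
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    # bfloat16 weights on a CUDA device; attn_implementation is the generic
    # Transformers switch for the flash-attn backend (assumed optional here;
    # omit it to let the library pick an available backend automatically).
    model = AutoModelForMaskedLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",
    ).to("cuda")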
License
ModernBERT is released under the Apache 2.0 license, including its architectures, model weights, and training codebase.