ModernBERT-base

answerdotai

Introduction

ModernBERT is a bidirectional encoder-only Transformer model designed for long-context tasks, utilizing advancements such as Rotary Positional Embeddings and Local-Global Alternating Attention. It is pre-trained on 2 trillion tokens of English and code data, making it suitable for a wide variety of tasks, including code retrieval and hybrid semantic search.

Architecture

ModernBERT integrates recent architectural improvements:

  • Rotary Positional Embeddings (RoPE): Replaces absolute position embeddings so the model generalizes to long contexts (see the sketch after this list).
  • Local-Global Alternating Attention: Alternates global attention layers with local sliding-window attention to keep long inputs efficient.
  • Unpadding and Flash Attention: Enables efficient inference.
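
For readers unfamiliar with RoPE, the following is a minimal sketch of how rotary embeddings rotate query/key channel pairs by position-dependent angles. It is a generic reference implementation with an illustrative `theta` base, not ModernBERT's internal code.

    import torch

    def rotary_embed(x: torch.Tensor, theta: float = 10000.0) -> torch.Tensor:
        # x: (seq_len, dim) query or key vectors; dim must be even.
        seq_len, dim = x.shape
        # One frequency per channel pair: theta^(-2i/dim).
        freqs = theta ** (-torch.arange(0, dim, 2, dtype=torch.float32) / dim)
        # Angle for position p and channel pair i is p * freqs[i].
        angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * freqs[None, :]
        cos, sin = angles.cos(), angles.sin()
        x1, x2 = x[:, 0::2], x[:, 1::2]
        # Rotate each (x1, x2) pair by its position-dependent angle.
        out = torch.empty_like(x)
        out[:, 0::2] = x1 * cos - x2 * sin
        out[:, 1::2] = x1 * sin + x2 * cos
        return out

    # Example: rotate an 8-token sequence of 64-dim query vectors.
    q = rotary_embed(torch.randn(8, 64))

Because only relative angles between positions matter in the attention dot product, this encoding carries relative position information without a learned position table.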

Training

  • Architecture: Encoder-only Pre-Norm Transformer with GeGLU activations.
  • Sequence Length: Initially trained up to 1,024 tokens, extended to 8,192 tokens.
  • Data: Trained on 2 trillion tokens of English text and code.
  • Optimizer: StableAdamW with a trapezoidal (warmup-stable-decay) LR schedule and 1-sqrt decay; the schedule is sketched after this list.
  • Hardware: Utilized 8x H100 GPUs for training.
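
A minimal sketch of a trapezoidal schedule with 1-sqrt decay is below; the helper name and constants are illustrative, not the exact values used in training.

    import math

    def trapezoidal_lr(step: int, peak_lr: float, warmup_steps: int,
                       total_steps: int, decay_steps: int) -> float:
        # Linear warmup from 0 to peak_lr.
        if step < warmup_steps:
            return peak_lr * step / warmup_steps
        decay_start = total_steps - decay_steps
        # Long constant plateau at peak_lr.
        if step < decay_start:
            return peak_lr
        # 1-sqrt decay: multiplier falls as 1 - sqrt(progress through decay).
        progress = (step - decay_start) / decay_steps
        return peak_lr * max(0.0, 1.0 - math.sqrt(progress))

The plateau makes the schedule easy to resume or extend, and the short 1-sqrt decay phase drops the learning rate quickly at the end of training.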

Model Stats

ModernBERT is available in two sizes:

  • ModernBERT-base: 22 layers, 149 million parameters.
  • ModernBERT-large: 28 layers, 395 million parameters.
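
To sanity-check these figures locally, you can load the bare encoder and count its parameters. This assumes the transformers setup from the guide below; the masked-LM head adds a few parameters on top of the encoder count, so expect the result to land near, not exactly on, the quoted numbers.

    from transformers import AutoModel

    # Load the bare encoder and count its parameters.
    model = AutoModel.from_pretrained("answerdotai/ModernBERT-base")
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{n_params / 1e6:.0f}M parameters")  # roughly 149M for the base model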

Guide: Running Locally

  1. Install the Transformers Library (ModernBERT requires a recent version; installing from source ensures support):

    pip install git+https://github.com/huggingface/transformers.git
    
  2. Install Flash Attention (optional, for faster inference on supported GPUs):

    pip install flash-attn
    
  3. Load and Use the Model (a pipeline-based alternative is shown after these steps):

    from transformers import AutoTokenizer, AutoModelForMaskedLM
    import torch
    
    model_id = "answerdotai/ModernBERT-base"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForMaskedLM.from_pretrained(model_id)
    
    text = "The capital of France is [MASK]."
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    
    # Find the position of the [MASK] token and take the highest-scoring prediction.
    masked_index = inputs["input_ids"][0].tolist().index(tokenizer.mask_token_id)
    predicted_token_id = outputs.logits[0, masked_index].argmax(dim=-1)
    predicted_token = tokenizer.decode(predicted_token_id)
    print("Predicted token:", predicted_token)
    
  4. Suggested Cloud GPUs: If local resources are insufficient, consider cloud providers such as AWS, Google Cloud, or Azure for access to high-performance GPUs.
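
As an alternative to the manual decoding in step 3, the same task can be run through the standard transformers fill-mask pipeline, which wraps tokenization, inference, and decoding in one call:

    from transformers import pipeline

    # The fill-mask pipeline handles tokenization, inference, and decoding.
    fill_mask = pipeline("fill-mask", model="answerdotai/ModernBERT-base")
    for pred in fill_mask("The capital of France is [MASK]."):
        print(pred["token_str"], round(pred["score"], 3))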

License

ModernBERT is released under the Apache 2.0 license, allowing for broad use and modification with proper attribution.
