DistilModernBERT
Introduction
DistilModernBERT is a distilled version of the ModernBERT-base model that reduces the number of transformer layers from 22 to 16. This cuts the total parameter count from 149M to 119M and the trunk (transformer-block) parameters from 110M to 80M, lowering latency by approximately 25% and raising throughput by around 33%.
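For a rough sanity check of these numbers, the sketch below (not part of the original repository) counts total and trunk parameters directly from the HuggingFace ModernBERT-base checkpoint:

from transformers import AutoModelForMaskedLM

# Count parameters of the original 22-layer ModernBERT-base checkpoint.
model = AutoModelForMaskedLM.from_pretrained("answerdotai/ModernBERT-base")

total = sum(p.numel() for p in model.parameters())
trunk = sum(p.numel() for p in model.model.layers.parameters())

print(f"total: {total / 1e6:.0f}M")  # roughly 149M
print(f"trunk: {trunk / 1e6:.0f}M")  # roughly 110M; keeping 16 of 22 layers leaves ~80M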
Architecture
The architecture is obtained by removing the last six local attention layers of ModernBERT-base, so the remaining layers keep a mix of global and local attention. Because the original global-local attention pattern is preserved across the surviving layers, the model configuration needs specific adjustments to work around limitations of the current HuggingFace modeling code.
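To make the removal pattern concrete, the sketch below (an illustration, assuming the global_attn_every_n_layers and num_hidden_layers fields of the HuggingFace ModernBERT configuration) reconstructs which layer indices use global versus local attention and checks that the six removed indices, the same ones used in the guide below, are all local:

from transformers import AutoConfig

# ModernBERT-base: 22 layers, with global attention on every third layer (indices 0, 3, 6, ...).
cfg = AutoConfig.from_pretrained("answerdotai/ModernBERT-base")
pattern = [
    "global" if i % cfg.global_attn_every_n_layers == 0 else "local"
    for i in range(cfg.num_hidden_layers)
]

# The six removed layers are the last six local attention layers.
layers_to_remove = [13, 14, 16, 17, 19, 20]
assert all(pattern[i] == "local" for i in layers_to_remove)

kept = [p for i, p in enumerate(pattern) if i not in layers_to_remove]
print(len(kept), kept)  # 16 layers, still mixing global and local attention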
Training
The model was distilled on the MiniPile dataset, which contains English text and code. Distillation ran for one epoch over 1 million samples using an MSE loss on the logits, with a batch size of 16, the AdamW optimizer, and a constant learning rate of 1.0e-5. The embeddings and LM head were frozen and shared between the teacher and student, so training was focused on the transformer blocks.
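A minimal sketch of this setup follows. It is not the author's training script: the head and decoder attribute names on the HuggingFace masked-LM class are assumptions, and the MiniPile data loading (batch size 16, one epoch over 1M samples) is omitted.

import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_id = "answerdotai/ModernBERT-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
teacher = AutoModelForMaskedLM.from_pretrained(model_id).eval()
student = AutoModelForMaskedLM.from_pretrained(model_id)

# Truncate the student: drop the last six local attention layers (22 -> 16).
layers_to_remove = [13, 14, 16, 17, 19, 20]
student.model.layers = nn.ModuleList(
    layer for i, layer in enumerate(student.model.layers) if i not in layers_to_remove
)

# Freeze the embeddings and LM head and share them with the teacher, so only
# the transformer blocks receive gradients. Attribute names here are assumed.
student.model.embeddings = teacher.model.embeddings
student.head = teacher.head
student.decoder = teacher.decoder
for module in (student.model.embeddings, student.head, student.decoder):
    for p in module.parameters():
        p.requires_grad = False

# AdamW with a constant learning rate of 1.0e-5, as described above.
optimizer = torch.optim.AdamW(
    (p for p in student.parameters() if p.requires_grad), lr=1.0e-5
)
mse = nn.MSELoss()

def distill_step(batch_texts):
    """One step: match the student's logits to the teacher's with MSE."""
    inputs = tokenizer(batch_texts, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        teacher_logits = teacher(**inputs).logits
    student_logits = student(**inputs).logits
    loss = mse(student_logits, teacher_logits)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()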
Guide: Running Locally
- Download the checkpoint: Obtain the model.pt file from the repository.
- Initialize ModernBERT-base:
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Start from the full 22-layer ModernBERT-base checkpoint.
model_id = "answerdotai/ModernBERT-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)
- Remove specific layers:
# Drop the last six local attention layers, leaving 16 of the original 22.
layers_to_remove = [13, 14, 16, 17, 19, 20]
model.model.layers = nn.ModuleList([
    layer for idx, layer in enumerate(model.model.layers)
    if idx not in layers_to_remove
])
- Load the state dictionary:
# Load the distilled weights into the truncated model.
state_dict = torch.load("model.pt")
model.model.load_state_dict(state_dict)
- Use the model: The truncated model now holds the distilled weights and is ready for inference, as shown in the sketch below.
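As a quick functional check, here is a sketch of masked-token prediction that reuses the tokenizer and model objects built in the steps above (the example sentence is arbitrary):

import torch

text = "The capital of France is [MASK]."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Decode the highest-scoring token at the [MASK] position.
mask_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = logits[0, mask_index].argmax(dim=-1)
print(tokenizer.decode(predicted_id))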
Cloud GPUs: For faster inference and training, consider running the model on cloud GPU services such as AWS EC2, Google Cloud, or Azure.
License
The licensing details for DistilModernBERT are not explicitly outlined in the provided documentation. It is recommended to refer to the repository or contact the author for specific licensing information.