ModernBERT-base-ColBERT

Y-J-Ju

Introduction

ModernBERT-base-ColBERT is a PyLate model built on the ModernBERT-base architecture and specialized for semantic textual similarity and retrieval. It maps sentences and paragraphs to sequences of 128-dimensional dense vectors and scores query-document similarity with the MaxSim operator.
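
As an illustration, MaxSim matches each query token vector against its best-scoring document token vector and sums the maxima. Here is a minimal NumPy sketch of the operator, for intuition only; PyLate computes this in batched PyTorch:

    import numpy as np

    def maxsim(query_embeddings: np.ndarray, document_embeddings: np.ndarray) -> float:
        # Token-level similarity matrix: (num_query_tokens, num_doc_tokens).
        similarities = query_embeddings @ document_embeddings.T
        # For each query token, keep its best document-token match, then sum.
        return float(similarities.max(axis=1).sum())

    # Toy usage with random stand-ins for token embeddings.
    query = np.random.randn(5, 128)      # 5 query tokens, 128 dims each
    document = np.random.randn(40, 128)  # 40 document tokens
    print(maxsim(query, document))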

Architecture

The model employs a ColBERT architecture, which includes:

  • A Transformer component with a maximum sequence length of 179 tokens.
  • A Dense layer reducing features from 768 to 128 dimensions without bias, using an identity activation function.
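
For orientation, this stack corresponds roughly to the following Sentence Transformers modules (a schematic sketch only; the base checkpoint name is an assumption, and PyLate wraps these modules in its own ColBERT class):

    import torch
    from sentence_transformers import models as st_models

    # Transformer backbone (checkpoint name assumed, not stated in this card).
    transformer = st_models.Transformer(
        "answerdotai/ModernBERT-base",
        max_seq_length=179,
    )

    # Projection from the 768-dim hidden size down to 128 dimensions,
    # with no bias and an identity activation, as described above.
    dense = st_models.Dense(
        in_features=768,
        out_features=128,
        bias=False,
        activation_function=torch.nn.Identity(),
    )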

Training

ModernBERT-base-ColBERT is fine-tuned on the lightonai/ms-marco-en-bge dataset, which comprises 808,728 samples, using a knowledge-distillation loss. Key hyperparameters include a batch size of 16, a learning rate of 8e-05, and a warmup ratio of 0.05. Reported results show improved nDCG@10 scores on several evaluation datasets compared to baseline models.
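
A minimal training sketch in PyLate, assuming the setup above (the dataset subset names follow PyLate's documented knowledge-distillation recipe; the base checkpoint and output path are assumptions, not confirmed by this card):

    from datasets import load_dataset
    from sentence_transformers import (
        SentenceTransformerTrainer,
        SentenceTransformerTrainingArguments,
    )

    from pylate import losses, models, utils

    # Distillation triples plus the queries/documents they reference.
    train = load_dataset("lightonai/ms-marco-en-bge", "train", split="train")
    queries = load_dataset("lightonai/ms-marco-en-bge", "queries", split="train")
    documents = load_dataset("lightonai/ms-marco-en-bge", "documents", split="train")
    train.set_transform(
        utils.KDProcessing(queries=queries, documents=documents).transform
    )

    # Base checkpoint is an assumption; the card only names the architecture.
    model = models.ColBERT(model_name_or_path="answerdotai/ModernBERT-base")

    args = SentenceTransformerTrainingArguments(
        output_dir="output/ModernBERT-base-ColBERT",  # hypothetical path
        per_device_train_batch_size=16,
        learning_rate=8e-5,
        warmup_ratio=0.05,
    )

    trainer = SentenceTransformerTrainer(
        model=model,
        args=args,
        train_dataset=train,
        loss=losses.Distillation(model=model),
        data_collator=utils.ColBERTCollator(tokenize_fn=model.tokenize),
    )
    trainer.train()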

Guide: Running Locally

  1. Install PyLate:

    pip install -U pylate
    
  2. Model Loading and Indexing:

    • Load the ColBERT model and initialize the Voyager index.
    • Encode documents and add them to the index to prepare for retrieval tasks.
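
    A minimal sketch (the model ID Y-J-Ju/ModernBERT-base-ColBERT, index folder, and sample documents are assumptions; adjust to your setup):

      from pylate import indexes, models

      # Load the ColBERT model (model ID assumed from this card).
      model = models.ColBERT(
          model_name_or_path="Y-J-Ju/ModernBERT-base-ColBERT",
      )

      # Initialize an on-disk Voyager index (folder and name are hypothetical).
      index = indexes.Voyager(
          index_folder="pylate-index",
          index_name="index",
          override=True,  # overwrite an existing index with the same name
      )

      documents_ids = ["1", "2", "3"]
      documents = [
          "ColBERT scores queries against documents token by token.",
          "PyLate builds on Sentence Transformers.",
          "Voyager provides the approximate nearest-neighbor index.",
      ]

      # Encode documents; is_query=False selects document-side encoding.
      documents_embeddings = model.encode(
          documents,
          batch_size=32,
          is_query=False,
          show_progress_bar=True,
      )

      # Add the encoded documents to the index for later retrieval.
      index.add_documents(
          documents_ids=documents_ids,
          documents_embeddings=documents_embeddings,
      )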
  3. Retrieving Documents:

    • Use the ColBERT retriever to encode queries and retrieve top-k relevant documents from the indexed dataset.
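
    Continuing the sketch above (the retriever wraps the populated index; the query text is made up):

      from pylate import retrieve

      retriever = retrieve.ColBERT(index=index)

      queries = ["how does colbert score documents?"]

      # Encode queries; is_query=True selects query-side encoding.
      queries_embeddings = model.encode(
          queries,
          batch_size=32,
          is_query=True,
          show_progress_bar=True,
      )

      # Retrieve the top-k document ids and scores per query.
      scores = retriever.retrieve(queries_embeddings=queries_embeddings, k=10)
      print(scores)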
  4. Reranking:

    • For reranking without indexing, encode and rank documents directly based on their similarity to queries.
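
    A reranking sketch that skips the index entirely (query and candidate texts are made up):

      from pylate import models, rank

      model = models.ColBERT(
          model_name_or_path="Y-J-Ju/ModernBERT-base-ColBERT",
      )

      queries = ["how does colbert score documents?"]
      documents = [["Candidate passage one.", "Candidate passage two."]]
      documents_ids = [[1, 2]]

      queries_embeddings = model.encode(queries, is_query=True)
      documents_embeddings = model.encode(documents, is_query=False)

      # Rank each query's candidates by their MaxSim scores.
      reranked_documents = rank.rerank(
          documents_ids=documents_ids,
          queries_embeddings=queries_embeddings,
          documents_embeddings=documents_embeddings,
      )
      print(reranked_documents)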
  5. Suggested Environment:

    • Use a cloud GPU, such as an AWS EC2 instance with an NVIDIA GPU, to encode large document collections and run retrieval efficiently.

License

The documentation does not specify a license. For precise licensing information, refer to the model's official repository or contact the author.
