ModernBERT-base-ColBERT
Y-J-Ju
Introduction
ModernBERT-base-ColBERT is a PyLate model built on the ModernBERT-base architecture and specialized for semantic textual similarity. It maps sentences and paragraphs to sequences of 128-dimensional dense vectors and uses the MaxSim operator to score similarity between them.
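The MaxSim (late interaction) operator can be illustrated with a minimal pure-Python sketch: for each query token embedding, take the maximum similarity over all document token embeddings, then sum across query tokens. The toy 2-dimensional vectors below are illustrative only, not real embeddings.

```python
def maxsim(query_embeddings, document_embeddings):
    """Late-interaction score: sum over query tokens of the max dot product
    against all document token embeddings."""
    score = 0.0
    for q in query_embeddings:
        best = max(
            sum(qi * di for qi, di in zip(q, d))
            for d in document_embeddings
        )
        score += best
    return score

query = [[1.0, 0.0], [0.0, 1.0]]   # two toy query token vectors
doc = [[1.0, 0.0], [0.5, 0.5]]     # two toy document token vectors
print(maxsim(query, doc))          # 1.0 (first token) + 0.5 (second) = 1.5
```

Because scoring is per query token, MaxSim rewards documents that match each part of the query somewhere, rather than requiring one pooled vector to capture everything.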
Architecture
The model employs a ColBERT architecture, which includes:
- A Transformer component with a maximum sequence length of 179 tokens.
- A Dense layer reducing features from 768 to 128 dimensions without bias, using an identity activation function.
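The dense head described above is a bias-free linear map from 768 to 128 dimensions with identity activation. A pure-Python stand-in (a real model uses a learned weight matrix) makes the shape transformation concrete:

```python
def dense_projection(token_embedding, weight):
    # weight: 128 rows x 768 columns; no bias term, identity activation,
    # so the output is just the matrix-vector product.
    return [
        sum(w_i * x_i for w_i, x_i in zip(row, token_embedding))
        for row in weight
    ]

hidden = [1.0] * 768                               # one 768-d token embedding
weight = [[1.0 / 768] * 768 for _ in range(128)]   # toy weights, not learned
out = dense_projection(hidden, weight)
print(len(out))  # 128
```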
Training
ModernBERT-base-ColBERT is fine-tuned on the lightonai/ms-marco-en-bge dataset, which comprises 808,728 samples, using a distillation loss. Key hyperparameters include a batch size of 16, a learning rate of 8e-05, and a warmup ratio of 0.05. Reported results show improved nDCG@10 scores on several datasets relative to baseline models.
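A comparable fine-tuning setup could be sketched as follows, assuming PyLate's `losses.Distillation` loss and its collator alongside the sentence-transformers trainer. This is a hedged outline, not the authors' exact script: dataset preprocessing for lightonai/ms-marco-en-bge (queries, documents, and teacher scores) is simplified, and argument names should be checked against the PyLate documentation.

```python
# Hedged sketch of distillation fine-tuning with PyLate; not the authors'
# exact training script. Dataset preprocessing is simplified.
from datasets import load_dataset
from sentence_transformers import (
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from pylate import losses, models, utils

model = models.ColBERT(model_name_or_path="answerdotai/ModernBERT-base")
train_dataset = load_dataset("lightonai/ms-marco-en-bge", split="train")

args = SentenceTransformerTrainingArguments(
    output_dir="output",
    per_device_train_batch_size=16,  # hyperparameters reported above
    learning_rate=8e-5,
    warmup_ratio=0.05,
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=losses.Distillation(model=model),
    data_collator=utils.ColBERTCollator(tokenize_fn=model.tokenize),
)
trainer.train()
```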
Guide: Running Locally
- Install PyLate:
  pip install -U pylate
- Model loading and indexing:
  - Load the ColBERT model and initialize the Voyager index.
  - Encode documents and add them to the index to prepare for retrieval.
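This step can be sketched with PyLate's `models`/`indexes` modules. The Hugging Face id "Y-J-Ju/ModernBERT-base-ColBERT" is assumed from the model name above; adjust it to the actual repository id.

```python
# Hedged sketch: load the ColBERT model and build a Voyager index.
from pylate import indexes, models

model = models.ColBERT(model_name_or_path="Y-J-Ju/ModernBERT-base-ColBERT")
index = indexes.Voyager(
    index_folder="pylate-index",
    index_name="index",
    override=True,  # rebuild the index from scratch
)

documents_ids = ["1", "2"]
documents = [
    "ColBERT scores query and document token pairs with MaxSim.",
    "Voyager provides an approximate nearest-neighbor index.",
]

# is_query=False selects the document-side encoding path.
documents_embeddings = model.encode(documents, is_query=False)
index.add_documents(
    documents_ids=documents_ids,
    documents_embeddings=documents_embeddings,
)
```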
- Retrieving documents:
  - Use the ColBERT retriever to encode queries and retrieve the top-k most relevant documents from the index.
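Retrieval can then be sketched with PyLate's `retrieve.ColBERT` retriever, assuming an index built as in the previous step (again with the assumed model id "Y-J-Ju/ModernBERT-base-ColBERT"):

```python
# Hedged sketch: encode queries and retrieve top-k documents from the index.
from pylate import indexes, models, retrieve

model = models.ColBERT(model_name_or_path="Y-J-Ju/ModernBERT-base-ColBERT")
index = indexes.Voyager(index_folder="pylate-index", index_name="index")
retriever = retrieve.ColBERT(index=index)

# is_query=True selects the query-side encoding path.
queries_embeddings = model.encode(["how does maxsim work"], is_query=True)
scores = retriever.retrieve(queries_embeddings=queries_embeddings, k=10)
print(scores)  # one ranked candidate list per query
```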
- Reranking:
  - To rerank without building an index, encode queries and candidate documents and rank the documents directly by their similarity to each query.
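The index-free path can be sketched with PyLate's `rank.rerank`, which scores pre-encoded candidate documents against each query directly (model id again assumed):

```python
# Hedged sketch: rerank candidate documents without building an index.
from pylate import models, rank

model = models.ColBERT(model_name_or_path="Y-J-Ju/ModernBERT-base-ColBERT")

queries = ["what is late interaction"]
# One candidate list (and matching id list) per query.
documents = [["ColBERT uses MaxSim over token embeddings.", "Unrelated text."]]
documents_ids = [["d1", "d2"]]

queries_embeddings = model.encode(queries, is_query=True)
documents_embeddings = model.encode(documents, is_query=False)

reranked = rank.rerank(
    documents_ids=documents_ids,
    queries_embeddings=queries_embeddings,
    documents_embeddings=documents_embeddings,
)
```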
- Suggested environment:
  - Use cloud GPUs, such as AWS EC2 instances with NVIDIA GPUs, for efficient processing of large datasets and model computations.
License
The documentation does not specify a license. For precise licensing information, refer to the official PyLate repository or contact the authors.