ModernBERT-base-ColBERT

Y-J-Ju

Introduction

ModernBERT-base-ColBERT is a PyLate model built on the ModernBERT-base architecture and specialized for semantic textual similarity and retrieval. It maps sentences and paragraphs to sequences of 128-dimensional dense vectors and scores query-document similarity with the MaxSim operator.
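
As an illustration, MaxSim matches each query token vector against its best-scoring document token vector and sums the maxima. Here is a minimal NumPy sketch of the operator, for intuition only; PyLate computes this in batched PyTorch:

    import numpy as np

    def maxsim(query_embeddings: np.ndarray, document_embeddings: np.ndarray) -> float:
        # Token-level similarity matrix: (num_query_tokens, num_doc_tokens).
        similarities = query_embeddings @ document_embeddings.T
        # For each query token, keep its best document-token match, then sum.
        return float(similarities.max(axis=1).sum())

    # Toy usage with random stand-ins for token embeddings.
    query = np.random.randn(5, 128)      # 5 query tokens, 128 dims each
    document = np.random.randn(40, 128)  # 40 document tokens
    print(maxsim(query, document))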

Architecture

The model employs a ColBERT architecture, which includes:

  • A Transformer component with a maximum sequence length of 179 tokens.
  • A Dense layer reducing features from 768 to 128 dimensions without bias, using an identity activation function.
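
For orientation, this stack corresponds roughly to the following Sentence Transformers modules (a schematic sketch only; the base checkpoint name is an assumption, and PyLate wraps these modules in its own ColBERT class):

    import torch
    from sentence_transformers import models as st_models

    # Transformer backbone (checkpoint name assumed, not stated in this card).
    transformer = st_models.Transformer(
        "answerdotai/ModernBERT-base",
        max_seq_length=179,
    )

    # Projection from the 768-dim hidden size down to 128 dimensions,
    # with no bias and an identity activation, as described above.
    dense = st_models.Dense(
        in_features=768,
        out_features=128,
        bias=False,
        activation_function=torch.nn.Identity(),
    )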

Training

ModernBERT-base-ColBERT is fine-tuned on the lightonai/ms-marco-en-bge dataset, which comprises 808,728 samples, using a knowledge-distillation loss. Key hyperparameters include a batch size of 16, a learning rate of 8e-05, and a warmup ratio of 0.05. Reported results show improved nDCG@10 scores on several evaluation datasets compared to baseline models.
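
A minimal training sketch in PyLate, assuming the setup above (the dataset subset names follow PyLate's documented knowledge-distillation recipe; the base checkpoint and output path are assumptions, not confirmed by this card):

    from datasets import load_dataset
    from sentence_transformers import (
        SentenceTransformerTrainer,
        SentenceTransformerTrainingArguments,
    )

    from pylate import losses, models, utils

    # Distillation triples plus the queries/documents they reference.
    train = load_dataset("lightonai/ms-marco-en-bge", "train", split="train")
    queries = load_dataset("lightonai/ms-marco-en-bge", "queries", split="train")
    documents = load_dataset("lightonai/ms-marco-en-bge", "documents", split="train")
    train.set_transform(
        utils.KDProcessing(queries=queries, documents=documents).transform
    )

    # Base checkpoint is an assumption; the card only names the architecture.
    model = models.ColBERT(model_name_or_path="answerdotai/ModernBERT-base")

    args = SentenceTransformerTrainingArguments(
        output_dir="output/ModernBERT-base-ColBERT",  # hypothetical path
        per_device_train_batch_size=16,
        learning_rate=8e-5,
        warmup_ratio=0.05,
    )

    trainer = SentenceTransformerTrainer(
        model=model,
        args=args,
        train_dataset=train,
        loss=losses.Distillation(model=model),
        data_collator=utils.ColBERTCollator(tokenize_fn=model.tokenize),
    )
    trainer.train()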

Guide: Running Locally

  1. Install PyLate:

    pip install -U pylate
    
  2. Model Loading and Indexing:

    • Load the ColBERT model and initialize the Voyager index.
    • Encode documents and add them to the index to prepare for retrieval tasks.
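
    A minimal sketch (the model ID Y-J-Ju/ModernBERT-base-ColBERT, index folder, and sample documents are assumptions; adjust to your setup):

      from pylate import indexes, models

      # Load the ColBERT model (model ID assumed from this card).
      model = models.ColBERT(
          model_name_or_path="Y-J-Ju/ModernBERT-base-ColBERT",
      )

      # Initialize an on-disk Voyager index (folder and name are hypothetical).
      index = indexes.Voyager(
          index_folder="pylate-index",
          index_name="index",
          override=True,  # overwrite an existing index with the same name
      )

      documents_ids = ["1", "2", "3"]
      documents = [
          "ColBERT scores queries against documents token by token.",
          "PyLate builds on Sentence Transformers.",
          "Voyager provides the approximate nearest-neighbor index.",
      ]

      # Encode documents; is_query=False selects document-side encoding.
      documents_embeddings = model.encode(
          documents,
          batch_size=32,
          is_query=False,
          show_progress_bar=True,
      )

      # Add the encoded documents to the index for later retrieval.
      index.add_documents(
          documents_ids=documents_ids,
          documents_embeddings=documents_embeddings,
      )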
  3. Retrieving Documents:

    • Use the ColBERT retriever to encode queries and retrieve top-k relevant documents from the indexed dataset.
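
    Continuing the sketch above (the retriever wraps the populated index; the query text is made up):

      from pylate import retrieve

      retriever = retrieve.ColBERT(index=index)

      queries = ["how does colbert score documents?"]

      # Encode queries; is_query=True selects query-side encoding.
      queries_embeddings = model.encode(
          queries,
          batch_size=32,
          is_query=True,
          show_progress_bar=True,
      )

      # Retrieve the top-k document ids and scores per query.
      scores = retriever.retrieve(queries_embeddings=queries_embeddings, k=10)
      print(scores)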
  4. Reranking:

    • For reranking without indexing, encode and rank documents directly based on their similarity to queries.
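
    A reranking sketch that skips the index entirely (query and candidate texts are made up):

      from pylate import models, rank

      model = models.ColBERT(
          model_name_or_path="Y-J-Ju/ModernBERT-base-ColBERT",
      )

      queries = ["how does colbert score documents?"]
      documents = [["Candidate passage one.", "Candidate passage two."]]
      documents_ids = [[1, 2]]

      queries_embeddings = model.encode(queries, is_query=True)
      documents_embeddings = model.encode(documents, is_query=False)

      # Rank each query's candidates by their MaxSim scores.
      reranked_documents = rank.rerank(
          documents_ids=documents_ids,
          queries_embeddings=queries_embeddings,
          documents_embeddings=documents_embeddings,
      )
      print(reranked_documents)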
  5. Suggested Environment:

    • Use a cloud GPU, such as an AWS EC2 instance with an NVIDIA GPU, to encode large document collections and run retrieval efficiently.

License

The documentation does not specify a license. For precise licensing information, refer to the model's official repository or contact the author.
