Fairly Multilingual ModernBERT Embedding Model (Belgian Edition)

Introduction

The Fairly Multilingual ModernBERT Embedding Model (Belgian Edition) is designed to efficiently embed texts written in French, Dutch, German, or English. It uses four distinct tokenizers and embedding tables, which lets it produce cross-lingual sentence embeddings while maintaining performance and speed. The model is particularly useful for tasks such as semantic textual similarity and paraphrase mining.
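
To make the paraphrase mining use case concrete, here is a minimal sketch in plain Python. It assumes embeddings are already available as lists of floats; the toy vectors, the helper names, and the 0.8 threshold are illustrative assumptions, not values taken from the model.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mine_paraphrases(embeddings, threshold=0.8):
    """Return (score, i, j) for every pair scoring at least `threshold`, best first."""
    pairs = []
    for i, j in combinations(range(len(embeddings)), 2):
        score = cosine_similarity(embeddings[i], embeddings[j])
        if score >= threshold:
            pairs.append((score, i, j))
    return sorted(pairs, reverse=True)


# Hypothetical 3-dimensional vectors standing in for real 768-dimensional embeddings.
toy_embeddings = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 0.0, 1.0],
]
paraphrase_pairs = mine_paraphrases(toy_embeddings)
print(paraphrase_pairs)
```

Only the first two toy vectors point in nearly the same direction, so they form the single pair that passes the threshold. The pairwise loop is quadratic in the number of sentences, which is fine for a sketch but would need batching or an index for large collections.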

Architecture

The model employs a SentenceTransformer architecture with a base model derived from ModernBERT-Embed-Base. It supports sequences up to 8192 tokens and outputs embeddings in a 768-dimensional vector space. The architecture includes a Transformer module and a pooling step that turns token-level outputs into a single sentence embedding.
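
The text does not say which pooling mode is used. As an illustration only, the sketch below shows mean pooling, a common choice in SentenceTransformer models, where the vectors of real tokens are averaged and padding positions are skipped. The function name and the toy inputs are assumptions.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the vectors of real tokens (mask == 1), ignoring padding (mask == 0)."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for k in range(dim):
                total[k] += vector[k]
    if count == 0:
        raise ValueError("attention_mask selects no tokens")
    return [value / count for value in total]


# Two real tokens and one padding token, with 2-dimensional toy vectors.
tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
sentence_vector = mean_pool(tokens, mask)
print(sentence_vector)
```

The padding vector does not affect the result because its mask entry is 0. In the real model the same idea would apply to 768-dimensional token vectors.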

Training

The model was trained on a parallel dataset of over 8 million sentence pairs in English, French, Dutch, and German. Training used the MultipleNegativesRankingLoss function with cosine similarity as the similarity measure, with the hyperparameters listed below.
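
The idea behind this loss can be sketched in plain Python: each anchor sentence should score higher against its own translation than against the other translations in the same batch. This is a simplified illustration rather than the library implementation. The scale factor of 20 is assumed here (it is the default in sentence-transformers), and the toy vectors are made up.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mnr_loss(anchors, positives, scale=20.0):
    """Cross-entropy over in-batch similarities; positives[i] is the match for anchors[i]."""
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, p) for p in positives]
        # Numerically stable log-sum-exp over all candidates in the batch.
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
        # Negative log-probability assigned to the correct candidate.
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)


anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = mnr_loss(anchors, [[1.0, 0.0], [0.0, 1.0]])   # each anchor matches its positive
shuffled = mnr_loss(anchors, [[0.0, 1.0], [1.0, 0.0]])  # positives swapped
print(aligned, shuffled)
```

When the pairs are aligned the loss is close to zero, and when they are swapped it is large, which is the signal that pulls translations of the same sentence together in the embedding space.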

Training Hyperparameters

Learning Rate: 2e-05
Batch Size: 256
Epochs: 1
Loss Function: MultipleNegativesRankingLoss
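
As a rough back-of-the-envelope check, these settings imply the following number of optimizer steps. The sketch assumes exactly 8,000,000 pairs (the text only says "over 8 million", so this is a lower bound) and no gradient accumulation, neither of which is confirmed by the source.

```python
import math

NUM_PAIRS = 8_000_000  # assumed lower bound; the text says "over 8 million"
BATCH_SIZE = 256
EPOCHS = 1

steps_per_epoch = math.ceil(NUM_PAIRS / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS
# With in-batch negatives, every anchor is contrasted with the other pairs in its batch.
in_batch_negatives = BATCH_SIZE - 1

print(total_steps, in_batch_negatives)
```

Under these assumptions a single epoch corresponds to at least 31,250 steps, with 255 in-batch negatives per anchor.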

Guide: Running Locally

To use this model locally, the following steps are recommended:

Install the Required Libraries:

pip install -U sentence-transformers
pip install --upgrade git+https://github.com/huggingface/transformers.git

Load and Use the Model:

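from sentence_transformers import SentenceTransformer

# trust_remote_code=True allows custom code shipped with the model repository to run.
model = SentenceTransformer("Parallia/Fairly-Multilingual-ModernBERT-Embed-BE", trust_remote_code=True)
sentences = [
    'These three mysterious men came to our help.',
    'Three strange guys helped us then.',
]
# encode() returns one embedding per input sentence.
embeddings = model.encode(sentences)
print(embeddings.shape)  # expected: (2, 768), given the 768-dimensional output space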
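
Once embeddings are available, a typical next step is ranking candidate sentences against a query, including across languages. The sketch below is written against any `encode` callable that maps a list of texts to a list of vectors, so it runs offline with a stand-in encoder. `rank_candidates`, `fake_encode`, and the 2-dimensional vectors are hypothetical. With the real model, `model.encode` would be passed instead.

```python
import math


def rank_candidates(encode, query, candidates):
    """Sort candidates by cosine similarity to the query, best first."""
    vectors = [[float(x) for x in v] for v in encode([query] + candidates)]
    query_vec, candidate_vecs = vectors[0], vectors[1:]

    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

    scored = [(cos(query_vec, v), text) for text, v in zip(candidates, candidate_vecs)]
    return sorted(scored, reverse=True)


# Stand-in encoder with made-up vectors so the sketch runs without downloading the model.
FAKE_VECTORS = {
    "Three strange guys helped us then.": [0.9, 0.1],
    "Trois hommes étranges nous ont aidés.": [0.8, 0.2],
    "Het regent vandaag in Brussel.": [0.1, 0.9],
}


def fake_encode(texts):
    return [FAKE_VECTORS[t] for t in texts]


ranking = rank_candidates(
    fake_encode,
    "Three strange guys helped us then.",
    ["Het regent vandaag in Brussel.", "Trois hommes étranges nous ont aidés."],
)
for score, text in ranking:
    print(f"{score:.3f}  {text}")
```

Taking the encoder as a parameter keeps the ranking logic independent of how the vectors are produced, which also makes it easy to test.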

Cloud GPUs: For enhanced performance, consider using cloud GPU services like AWS, Google Cloud, or Azure. These platforms provide scalable resources suited for intensive computational tasks.

License

This model is available under the Apache 2.0 License, allowing for both personal and commercial use.
