Parallia/Fairly-Multilingual-ModernBERT-Embed-BE
Fairly Multilingual ModernBERT Embedding Model (Belgian Edition)
Introduction
The Fairly Multilingual ModernBERT Embedding Model (Belgian Edition) is designed to efficiently embed texts written in French, Dutch, German, or English. It uses four distinct tokenizers and embedding tables, allowing for cross-lingual sentence embeddings while maintaining performance and speed. This model is particularly useful for tasks like semantic textual similarity and paraphrase mining.
Architecture
The model uses a SentenceTransformer architecture with a base model derived from ModernBERT-Embed-Base. It supports sequences of up to 8192 tokens and outputs embeddings in a 768-dimensional vector space. The architecture consists of a Transformer layer followed by a pooling layer that reduces the per-token outputs to a single fixed-size sentence embedding.
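The pooling step can be illustrated with a small sketch. This is not the model's actual implementation; it assumes mean pooling over non-padding tokens, a common configuration for sentence-embedding models:

```python
import numpy as np

def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token embeddings into one sentence vector, skipping padding.

    token_embeddings: (seq_len, hidden_dim); attention_mask: (seq_len,) of 0/1.
    """
    mask = attention_mask[:, None].astype(float)    # (seq_len, 1)
    summed = (token_embeddings * mask).sum(axis=0)  # sum over real tokens only
    count = mask.sum()                              # number of real tokens
    return summed / count

# Toy example: 4 tokens (last one is padding), hidden_dim = 768
tokens = np.random.randn(4, 768)
mask = np.array([1, 1, 1, 0])
sentence_embedding = mean_pool(tokens, mask)
print(sentence_embedding.shape)  # (768,)
```

The padding mask matters: without it, shorter sentences in a batch would have their embeddings diluted by meaningless padding vectors.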
Training
The model is trained on a parallel dataset of more than 8 million sentence pairs in English, French, Dutch, and German. Training uses the MultipleNegativesRankingLoss function with cosine similarity as the similarity measure; the main hyperparameters are listed below.
Training Hyperparameters
- Learning Rate: 2e-05
- Batch Size: 256
- Epochs: 1
- Loss Function: MultipleNegativesRankingLoss
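MultipleNegativesRankingLoss treats each translation pair in a batch as a positive example, while every other sentence in the same batch serves as an in-batch negative. A minimal numpy sketch of the idea (not the sentence-transformers implementation; the scale factor of 20 is an illustrative assumption):

```python
import numpy as np

def multiple_negatives_ranking_loss(anchors: np.ndarray,
                                    positives: np.ndarray,
                                    scale: float = 20.0) -> float:
    """Cross-entropy over scaled cosine similarities: for each anchor, its
    own positive is the correct 'class'; all other positives in the batch
    act as negatives."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    scores = scale * (a @ p.T)  # (batch, batch) cosine similarity matrix
    # log-softmax over each row, then pick the diagonal (true-pair) entries
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

# Toy batch of 3 pairs, 8-dimensional embeddings
rng = np.random.default_rng(0)
anchors = rng.normal(size=(3, 8))
print(multiple_negatives_ranking_loss(anchors, anchors.copy()))
```

When each anchor matches its own positive exactly, the loss is near zero; mismatched pairs drive it up. This is why the loss works well with large batches: more in-batch negatives make the ranking task harder and the embeddings sharper.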
Guide: Running Locally
To use this model locally, the following steps are recommended:
- Install the required libraries:

pip install -U sentence-transformers
pip install --upgrade git+https://github.com/huggingface/transformers.git
- Load and use the model:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Parallia/Fairly-Multilingual-ModernBERT-Embed-BE", trust_remote_code=True)
sentences = [
    'These three mysterious men came to our help.',
    'Three strange guys helped us then.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
- Cloud GPUs: For enhanced performance, consider using cloud GPU services like AWS, Google Cloud, or Azure. These platforms provide scalable resources suited for intensive computational tasks.
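For paraphrase mining, sentence-transformers provides a built-in utility, but the core idea can be sketched over precomputed embeddings without downloading the model. The function below is a hypothetical stand-in that ranks all sentence pairs by cosine similarity:

```python
import numpy as np

def top_paraphrase_pairs(embeddings: np.ndarray, top_k: int = 3):
    """Return the top_k most similar (i, j, score) pairs by cosine similarity.

    A small-scale stand-in for paraphrase mining over model.encode() output.
    """
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T  # pairwise cosine similarities
    n = len(embeddings)
    pairs = [(sims[i, j], i, j) for i in range(n) for j in range(i + 1, n)]
    pairs.sort(reverse=True)  # highest similarity first
    return [(i, j, float(score)) for score, i, j in pairs[:top_k]]

# Toy embeddings: rows 0 and 2 point the same way, row 1 is orthogonal
emb = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]])
print(top_paraphrase_pairs(emb, top_k=1))  # [(0, 2, 1.0)]
```

In practice, the embeddings would come from model.encode() on sentences in any of the four supported languages, which is what enables cross-lingual paraphrase detection.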
License
This model is available under the Apache 2.0 License, allowing for both personal and commercial use.