multilingual e5 base
intfloat
Introduction
The multilingual-e5-base model is a text embedding model designed for sentence similarity tasks. It supports 94 languages and is based on xlm-roberta-base. It is fine-tuned on a range of datasets so that its embeddings can serve classification, retrieval, clustering, and other NLP tasks.
Architecture
The model has 12 layers and an embedding size of 768. It is initialized from xlm-roberta-base, which covers 100 languages. Training uses contrastive pre-training followed by supervised fine-tuning on several labeled datasets.
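To make "contrastive pre-training" concrete, the sketch below computes an InfoNCE-style loss with in-batch negatives in plain Python. It is an illustration only: the source does not state the exact loss or temperature, so the function name `info_nce_loss` and the value `temperature=0.05` are assumptions chosen for the example.

```python
import math

def info_nce_loss(sim_matrix, temperature=0.05):
    """Mean InfoNCE loss over a batch (illustrative, not the authors' code).

    sim_matrix[i][j] is the similarity between query i and passage j.
    Diagonal entries are the positive pairs; every other entry in a row
    acts as an in-batch negative for that query.
    """
    losses = []
    for i, row in enumerate(sim_matrix):
        logits = [s / temperature for s in row]
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the positive pair as the target class.
        losses.append(log_denom - logits[i])
    return sum(losses) / len(losses)

# Well-separated pairs give a lower loss than indistinguishable ones.
separated = info_nce_loss([[1.0, 0.0], [0.0, 1.0]])
uniform = info_nce_loss([[0.5, 0.5], [0.5, 0.5]])
```

The loss falls as each query scores its own passage higher than the other passages in the batch, which is the behavior a contrastive objective rewards.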
Training
The training process consists of two stages:
- Contrastive pre-training with weak supervision using datasets like mC4, CC News, Wikipedia, and others.
- Supervised fine-tuning with datasets such as MS MARCO, NQ, Trivia QA, and several others across multiple languages. Training texts are prefixed with "query: " or "passage: ", so inputs at inference time should carry the same prefixes to stay consistent with the training data.
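Because the prefixes are easy to forget, a small helper can apply them before tokenization. The following is a minimal sketch; `add_e5_prefix` is a hypothetical name, not part of any library, and the example sentences are made up for illustration.

```python
def add_e5_prefix(texts, role):
    """Prepend the "query: " or "passage: " prefix to each text."""
    if role not in ("query", "passage"):
        raise ValueError(f"role must be 'query' or 'passage', got {role!r}")
    prefix = f"{role}: "
    # Leave texts that already carry the prefix untouched so it is never doubled.
    return [t if t.startswith(prefix) else prefix + t for t in texts]

queries = add_e5_prefix(["how do embeddings work"], "query")
passages = add_e5_prefix(["passage: Embeddings map text to vectors."], "passage")
```

The returned lists can be passed directly to the tokenizer in place of raw strings.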
Guide: Running Locally
- Install the required packages:
pip install transformers torch sentence_transformers
- Load the model and tokenizer:
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained('intfloat/multilingual-e5-base')
model = AutoModel.from_pretrained('intfloat/multilingual-e5-base')
- Tokenize and encode the input texts:
inputs = tokenizer(['query: example text'], return_tensors='pt', padding=True, truncation=True)
outputs = model(**inputs)
- For large datasets or larger models, run inference on a GPU, for example a GPU instance from a cloud provider such as AWS, Google Cloud, or Azure.
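The `outputs` object from the encoding step holds one hidden vector per token, not one embedding per text. A common way to reduce it to sentence embeddings, assumed here and not spelled out in the steps above, is masked average pooling followed by L2 normalization. The sketch below uses small synthetic tensors so it runs without downloading the model; `average_pool` is a helper defined for this example.

```python
import torch
import torch.nn.functional as F

def average_pool(last_hidden_state, attention_mask):
    """Average token vectors per sequence, ignoring padding positions."""
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)
    # Guard against division by zero for an all-padding row.
    counts = mask.sum(dim=1).clamp(min=1.0)
    return summed / counts

# Synthetic stand-in for model output: 1 text, 3 token positions, 2 dims.
# The third position is padding and must not affect the result.
hidden = torch.tensor([[[1.0, 3.0], [3.0, 5.0], [100.0, 100.0]]])
mask = torch.tensor([[1, 1, 0]])

pooled = average_pool(hidden, mask)
# Unit-length vectors make the dot product equal to cosine similarity.
embeddings = F.normalize(pooled, p=2, dim=1)
```

With the real model, the same helper would be called as `average_pool(outputs.last_hidden_state, inputs['attention_mask'])`, and similarity scores between texts can then be computed as `embeddings @ embeddings.T`.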
License
The multilingual-e5-base model is licensed under the MIT License, allowing for wide use and modification.