multilingual-e5-base

intfloat

Introduction
The multilingual-e5-base model is a text embedding model designed for sentence similarity tasks. It supports 94 languages and is based on xlm-roberta. The model is fine-tuned on a mix of labeled datasets so that its embeddings work well for classification, retrieval, clustering, and other NLP tasks.

Architecture
The model comprises 12 layers with an embedding size of 768. It is initialized from xlm-roberta-base, which covers 100 languages. Training uses a contrastive pre-training method followed by supervised fine-tuning on several labeled datasets.
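
These dimensions can be verified from the published configuration. Below is a minimal sketch using the transformers AutoConfig API; the expected values in the comments follow from the architecture described above:

    from transformers import AutoConfig

    # Load only the model configuration; no weights are downloaded
    config = AutoConfig.from_pretrained('intfloat/multilingual-e5-base')
    print(config.num_hidden_layers)  # 12 layers
    print(config.hidden_size)        # embedding size of 768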

Training
The training process consists of two stages:

  1. Contrastive pre-training with weak supervision on large corpora such as mC4, CC News, and Wikipedia.
  2. Supervised fine-tuning on labeled datasets such as MS MARCO, NQ, and TriviaQA, spanning multiple languages. Every training text is prefixed with "query: " or "passage: ", and the same prefixes must be used at inference time so inputs stay consistent with training; a short example follows this list.
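
As a concrete illustration of the prefix convention (the strings below are hypothetical examples, not drawn from the training data):

    # Queries and passages each carry their own prefix
    queries = ['query: how do I reset my password?']
    passages = ['passage: To reset your password, open the account settings page.']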

Guide: Running Locally

  1. Install the required packages:
    pip install transformers torch sentence-transformers
    
  2. Load the model and tokenizer:
    from transformers import AutoTokenizer, AutoModel
    tokenizer = AutoTokenizer.from_pretrained('intfloat/multilingual-e5-base')
    model = AutoModel.from_pretrained('intfloat/multilingual-e5-base')
    
  3. Tokenize the input texts and run the encoder (this yields token-level hidden states; the pooling step that turns them into sentence embeddings is sketched after this list):
    inputs = tokenizer(['query: example text'], return_tensors='pt', padding=True, truncation=True)
    outputs = model(**inputs)
    
  4. For large datasets, consider running on GPU instances from cloud providers such as AWS, Google Cloud, or Azure, since encoding many texts on CPU is slow.
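
Putting the pieces together: E5-family models are commonly used with average pooling over the non-padding token states followed by L2 normalization. The sketch below assumes that recipe; the example strings are hypothetical:

    import torch
    import torch.nn.functional as F
    from transformers import AutoTokenizer, AutoModel

    tokenizer = AutoTokenizer.from_pretrained('intfloat/multilingual-e5-base')
    model = AutoModel.from_pretrained('intfloat/multilingual-e5-base')

    texts = ['query: how do I reset my password?',
             'passage: To reset your password, open the account settings page.']
    batch = tokenizer(texts, max_length=512, padding=True,
                      truncation=True, return_tensors='pt')
    with torch.no_grad():
        outputs = model(**batch)

    # Average-pool the last hidden states, ignoring padding tokens
    mask = batch['attention_mask'].unsqueeze(-1).bool()
    summed = outputs.last_hidden_state.masked_fill(~mask, 0.0).sum(dim=1)
    embeddings = summed / batch['attention_mask'].sum(dim=1, keepdim=True)

    # L2-normalize so dot products equal cosine similarities
    embeddings = F.normalize(embeddings, p=2, dim=1)
    print(embeddings[:1] @ embeddings[1:].T)  # query-passage similarity

Alternatively, the sentence-transformers package installed in step 1 wraps the same steps in a single call:

    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer('intfloat/multilingual-e5-base')
    embeddings = model.encode(['query: how do I reset my password?'],
                              normalize_embeddings=True)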

License
The multilingual-e5-base model is licensed under the MIT License, allowing for wide use and modification.
