paraphrase-multilingual-MiniLM-L12-v2

Introduction

The paraphrase-multilingual-MiniLM-L12-v2 model from Sentence Transformers maps sentences and paragraphs to a 384-dimensional dense vector space, making it suitable for tasks such as clustering and semantic search. Because it embeds text from many languages into the same vector space, it also works well for multilingual and cross-lingual applications.
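
As a quick illustration of the clustering use case, the sketch below embeds a few sentences and groups them with k-means. The sentences, the cluster count, and the use of scikit-learn are illustrative assumptions; installing sentence-transformers is covered in the guide below.

from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

model = SentenceTransformer('sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2')

sentences = [
    "The cat sits on the mat.",
    "A feline rests on a rug.",
    "Stocks fell sharply on Monday.",
    "Markets dropped at the start of the week.",
]

# encode() returns one 384-dimensional vector per sentence as a numpy array
embeddings = model.encode(sentences)

# Group the vectors into two clusters; paraphrases should land together
kmeans = KMeans(n_clusters=2, n_init=10).fit(embeddings)
for label, sentence in zip(kmeans.labels_, sentences):
    print(label, sentence)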

Architecture

The model is a SentenceTransformer composed of a Transformer module wrapping a BERT model and a Pooling module. The Transformer module produces contextualized token embeddings; the Pooling module derives a fixed-size sentence embedding by averaging (mean pooling) those token embeddings, as shown in the configuration below.

SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)
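
The module listing above can be reproduced by printing a loaded model. A minimal sketch, assuming the sentence-transformers package from the guide below:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2')

# Printing the model shows the Transformer + Pooling stack above
print(model)

# Confirms the 384-dimensional output of the mean-pooling layer
print(model.get_sentence_embedding_dimension())  # 384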

Training

The model was developed by the Sentence Transformers team using the Sentence-BERT (siamese BERT-networks) approach, in which paired sentences are encoded by the same network and their embeddings are compared directly. The resulting embeddings were evaluated with the Sentence Embeddings Benchmark (https://seb.sbert.net).
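
The original training recipe is not reproduced here, but the siamese setup can be sketched with the library's classic fine-tuning API. The toy pairs, labels, and hyperparameters below are illustrative assumptions, not the actual training configuration:

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer('sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2')

# Toy paraphrase pairs with similarity labels (illustrative only)
train_examples = [
    InputExample(texts=['A man is eating food.', 'A man eats something.'], label=0.9),
    InputExample(texts=['A man is eating food.', 'The sky is blue.'], label=0.1),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)

# Both sentences in a pair pass through the same (siamese) network;
# the loss pushes their embeddings' cosine similarity toward the label
train_loss = losses.CosineSimilarityLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)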

Guide: Running Locally

Basic Steps

  1. Installation: Install the sentence-transformers library; it pulls in transformers and torch, which step 3 uses directly:

    pip install -U sentence-transformers
    
  2. Using Sentence Transformers (a similarity example follows these steps):

    from sentence_transformers import SentenceTransformer
    
    sentences = ["This is an example sentence", "Each sentence is converted"]
    model = SentenceTransformer('sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2')
    embeddings = model.encode(sentences)
    print(embeddings)
    
  3. Using Hugging Face Transformers:

    from transformers import AutoTokenizer, AutoModel
    import torch
    
    def mean_pooling(model_output, attention_mask):
        # First element of model_output holds the token embeddings
        token_embeddings = model_output[0]
        # Expand the attention mask so padding tokens are excluded from the average
        input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
        # Sum the real-token embeddings and divide by the number of real tokens
        return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
    
    sentences = ['This is an example sentence', 'Each sentence is converted']
    tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2')
    model = AutoModel.from_pretrained('sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2')
    
    encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
    with torch.no_grad():
        model_output = model(**encoded_input)
    
    sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
    print("Sentence embeddings:")
    print(sentence_embeddings)
    

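With embeddings from either route, sentence pairs can be scored directly. The snippet below builds on step 2 and uses util.cos_sim from sentence-transformers; the query and candidate texts are illustrative:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2')

query = model.encode("How do I reset my password?", convert_to_tensor=True)
candidates = model.encode(
    ["Instructions for changing your account password",
     "Today's weather forecast"],
    convert_to_tensor=True,
)

# Cosine similarity between the query and each candidate: shape (1, 2)
scores = util.cos_sim(query, candidates)
print(scores)
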
Cloud GPUs

For faster encoding and for handling larger datasets, consider cloud GPU services such as AWS EC2 with NVIDIA GPUs, Google Cloud Platform, or Azure; see the sketch below.
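
On a machine with a CUDA GPU, the model can be placed on the device explicitly. The device argument and encode() parameters below are part of the sentence-transformers API; the batch size is an illustrative choice:

import torch
from sentence_transformers import SentenceTransformer

# Use the GPU when available, otherwise fall back to CPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = SentenceTransformer(
    'sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2',
    device=device,
)

# Larger batches generally improve GPU throughput; 256 is an example value
embeddings = model.encode(["An example sentence"] * 1000, batch_size=256, show_progress_bar=True)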

License

The paraphrase-multilingual-MiniLM-L12-v2 model is licensed under the Apache 2.0 License. This permissive license allows for both personal and commercial use, modification, and distribution.
