paraphrase-multilingual-MiniLM-L12-v2

Introduction
The paraphrase-multilingual-MiniLM-L12-v2 model from Sentence Transformers maps sentences and paragraphs into a 384-dimensional dense vector space. It is suited to tasks such as clustering and semantic search, and its multilingual support makes it versatile across a wide range of languages.
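To make the multilingual claim concrete, here is a minimal sketch comparing paraphrases across languages (the sentences are invented for illustration; util.cos_sim is the library's cosine-similarity helper):

from sentence_transformers import SentenceTransformer, util

# Load the multilingual model (weights are downloaded on first use).
model = SentenceTransformer('sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2')

# The same meaning expressed in English, German, and Spanish.
sentences = [
    "The cat sits on the mat.",
    "Die Katze sitzt auf der Matte.",
    "El gato se sienta en la alfombra.",
]
embeddings = model.encode(sentences)  # array of shape (3, 384)

# Cross-lingual paraphrases should land close together in the vector space.
print(util.cos_sim(embeddings[0], embeddings[1]))
print(util.cos_sim(embeddings[0], embeddings[2]))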
Architecture
The model is a SentenceTransformer composed of a Transformer module wrapping a BERT model and a pooling module. The Transformer module produces contextualized token embeddings, and the pooling module computes a fixed-size sentence embedding by mean pooling over those token embeddings.
SentenceTransformer(
(0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: BertModel
(1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)
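The printed module structure can be verified from Python; a quick sketch, assuming the attribute names of current sentence-transformers releases:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2')

print(model.max_seq_length)                      # 128
print(model.get_sentence_embedding_dimension())  # 384
print(type(model[0].auto_model).__name__)        # BertModel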
Training
The model was developed by the Sentence Transformers team using the Sentence-BERT approach, in which Siamese BERT networks are fine-tuned to produce semantically meaningful sentence embeddings. Its quality was evaluated through the Sentence Embeddings Benchmark.
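The card does not include the original training script, but the Siamese fine-tuning pattern can be sketched with the library's public API. The pairs, loss choice, and hyperparameters below are illustrative assumptions, not the recipe used to train this model:

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer('sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2')

# Tiny invented paraphrase pairs; real training used large parallel corpora.
train_examples = [
    InputExample(texts=["A man is eating food.", "A man is eating a meal."]),
    InputExample(texts=["The weather is nice.", "It is a beautiful day."]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)

# Both sides of a pair pass through the same (Siamese) encoder;
# in-batch negatives push unrelated sentences apart.
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)],
          epochs=1, warmup_steps=10)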
Guide: Running Locally
Basic Steps
- Installation: Ensure you have the sentence-transformers or transformers library installed:

  pip install -U sentence-transformers
- Using Sentence Transformers:

  from sentence_transformers import SentenceTransformer

  sentences = ["This is an example sentence", "Each sentence is converted"]

  model = SentenceTransformer('sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2')
  embeddings = model.encode(sentences)
  print(embeddings)
- Using Hugging Face Transformers:

  from transformers import AutoTokenizer, AutoModel
  import torch

  # Mean pooling: average the token embeddings, weighted by the attention
  # mask so that padding tokens do not contribute.
  def mean_pooling(model_output, attention_mask):
      token_embeddings = model_output[0]  # first element contains token embeddings
      input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
      return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

  sentences = ['This is an example sentence', 'Each sentence is converted']

  tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2')
  model = AutoModel.from_pretrained('sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2')

  encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
  with torch.no_grad():
      model_output = model(**encoded_input)

  sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
  print("Sentence embeddings:")
  print(sentence_embeddings)
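Building on either loading path above, here is a short sketch of semantic search over a small corpus (corpus and query are invented; util.semantic_search ranks corpus entries by cosine similarity):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2')

corpus = [
    "A man is eating food.",
    "A man is riding a horse.",
    "Ein Mädchen spielt Geige.",  # corpus entries may mix languages
]
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

query_embedding = model.encode("Someone is eating a meal.", convert_to_tensor=True)

# Return the top-2 most similar corpus entries for the query.
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)[0]
for hit in hits:
    print(corpus[hit['corpus_id']], hit['score'])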
Cloud GPUs
For faster encoding and for handling larger datasets, consider cloud GPU services such as AWS EC2 instances with NVIDIA GPUs, Google Cloud Platform, or Microsoft Azure.
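On such an instance, the model can target the GPU directly; a minimal sketch, assuming CUDA is available (the batch size is an illustrative value to tune):

import torch
from sentence_transformers import SentenceTransformer

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = SentenceTransformer('sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2',
                            device=device)

# Larger batches amortize per-batch overhead when encoding big datasets.
embeddings = model.encode(["An example sentence."] * 1000,
                          batch_size=256,
                          show_progress_bar=True)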
License
The paraphrase-multilingual-MiniLM-L12-v2 model is licensed under the Apache License 2.0. This permissive license allows personal and commercial use, modification, and distribution.