paraphrase-xlm-r-multilingual-v1
sentence-transformers
Introduction
paraphrase-xlm-r-multilingual-v1 is a Sentence-Transformers model that maps sentences and paragraphs to a 768-dimensional dense vector space. The resulting embeddings can be used for tasks such as clustering and semantic search. The model is available for multiple frameworks, including PyTorch, TensorFlow, and ONNX, and is suited to sentence-similarity and feature-extraction tasks.
Architecture
The model is a SentenceTransformer pipeline made up of two modules:
- Transformer: an XLMRobertaModel with a maximum sequence length of 128 and lowercasing disabled.
- Pooling: mean pooling over the token embeddings, with a word embedding dimension of 768.
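To make the pooling step concrete, here is a minimal NumPy sketch of masked mean pooling. It is an illustration only, not the library's own code: the function name mean_pool and the toy 2-dimensional vectors are assumptions chosen for readability, whereas the real model produces 768-dimensional token embeddings.

```python
import numpy as np


def mean_pool(token_embeddings, attention_mask):
    """Average token vectors per sentence, ignoring padded positions.

    token_embeddings: float array of shape (batch, seq_len, dim)
    attention_mask:   array of shape (batch, seq_len), 1 = real token, 0 = padding
    """
    # Add a trailing axis so the mask broadcasts over the embedding dimension.
    mask = attention_mask[..., np.newaxis].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    # Guard against division by zero for an all-padding row.
    counts = np.clip(mask.sum(axis=1), 1e-9, None)
    return summed / counts


# Toy batch (hypothetical values): 2 sentences, 3 token slots, 2 dimensions.
tokens = np.array([
    [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]],  # third slot is padding
    [[2.0, 0.0], [4.0, 2.0], [6.0, 4.0]],
])
mask = np.array([[1, 1, 0], [1, 1, 1]])

pooled = mean_pool(tokens, mask)
print(pooled)  # one vector per sentence, shape (2, 2)
```

Because the padded slot is multiplied by zero and excluded from the count, the first sentence is averaged over two tokens rather than three.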
Training
The model was developed with the Sentence-Transformers framework. It follows the Siamese BERT-Networks approach and applies mean pooling over token embeddings to create sentence embeddings. For additional details, refer to the publication "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks."
Guide: Running Locally
Basic Steps
- Install Sentence-Transformers:
pip install -U sentence-transformers
- Using the Model:
from sentence_transformers import SentenceTransformer

sentences = ["This is an example sentence", "Each sentence is converted"]

model = SentenceTransformer('sentence-transformers/paraphrase-xlm-r-multilingual-v1')
embeddings = model.encode(sentences)
print(embeddings)
- Without Sentence-Transformers:
from transformers import AutoTokenizer, AutoModel
import torch


# Mean pooling: take the attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # first element holds all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


sentences = ['This is an example sentence', 'Each sentence is converted']

# Load the tokenizer and model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/paraphrase-xlm-r-multilingual-v1')
model = AutoModel.from_pretrained('sentence-transformers/paraphrase-xlm-r-multilingual-v1')

# Tokenize the sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings without tracking gradients
with torch.no_grad():
    model_output = model(**encoded_input)

# Pool token embeddings into one vector per sentence
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings:")
print(sentence_embeddings)
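Once embeddings are available, a common next step for semantic search is to rank candidates by cosine similarity. The sketch below assumes NumPy and uses small stand-in vectors in place of real 768-dimensional embeddings; the helper name cosine_similarity_matrix is hypothetical and not part of either library.

```python
import numpy as np


def cosine_similarity_matrix(a, b):
    """Return the matrix of cosine similarities between rows of a and rows of b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Normalize each row to unit length; the clip avoids dividing by zero.
    a_unit = a / np.clip(np.linalg.norm(a, axis=1, keepdims=True), 1e-12, None)
    b_unit = b / np.clip(np.linalg.norm(b, axis=1, keepdims=True), 1e-12, None)
    return a_unit @ b_unit.T


# Stand-in vectors (hypothetical). In practice these would be the arrays
# returned by model.encode(...) for the corpus and the query.
corpus_embeddings = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
query_embedding = np.array([[2.0, 0.0]])

scores = cosine_similarity_matrix(query_embedding, corpus_embeddings)[0]
best = int(np.argmax(scores))  # index of the most similar corpus entry
print(best, scores)
```

The Sentence-Transformers package also ships a util.cos_sim helper that serves the same purpose on tensors; the NumPy version here is only meant to show what the computation does.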
Cloud GPU
For large-scale tasks or optimal performance, consider using cloud GPUs such as those available from AWS, GCP, or Azure.
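Whether the model runs on a GPU or on the CPU can be decided at load time. The following is a minimal sketch, assuming PyTorch may or may not be installed; pick_device and load_model are hypothetical helper names introduced here for illustration.

```python
def pick_device():
    """Return "cuda" when PyTorch can see a GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # Without PyTorch there is no GPU backend to query.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


def load_model(model_name="sentence-transformers/paraphrase-xlm-r-multilingual-v1"):
    """Load the model onto the selected device (downloads weights on first use)."""
    # Imported lazily so that pick_device() works without the package installed.
    from sentence_transformers import SentenceTransformer
    return SentenceTransformer(model_name, device=pick_device())


device = pick_device()
print(device)
```

On a GPU machine, load_model().encode(sentences, batch_size=64) would then encode in larger batches; the batch size of 64 is an arbitrary example and should be tuned to the available memory.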
License
This model is licensed under the Apache 2.0 License, allowing for both academic and commercial use.