Paraphrase-XLM-R-Multilingual-V1

sentence-transformers

Introduction

The Paraphrase-XLM-R-Multilingual-V1 model from the Sentence-Transformers library maps sentences and paragraphs to a dense 768-dimensional vector space, which makes it useful for tasks like clustering and semantic search. It can be used with PyTorch, TensorFlow, and ONNX, and is suitable for sentence similarity and feature extraction tasks.
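As a concrete illustration of the clustering use case, the sketch below encodes a handful of made-up sentences and groups them with k-means. It assumes scikit-learn is installed alongside sentence-transformers; the sentences and cluster count are purely illustrative.

    from sentence_transformers import SentenceTransformer
    from sklearn.cluster import KMeans  # assumes scikit-learn is installed

    model = SentenceTransformer('sentence-transformers/paraphrase-xlm-r-multilingual-v1')

    # Illustrative sentences covering two rough topics
    sentences = [
        "The cat sits on the mat.",
        "A kitten is sleeping on the sofa.",
        "The stock market fell sharply today.",
        "Investors are worried about inflation.",
    ]

    embeddings = model.encode(sentences)  # array of shape (4, 768)
    labels = KMeans(n_clusters=2, n_init=10).fit_predict(embeddings)
    for sentence, label in zip(sentences, labels):
        print(label, sentence)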

Architecture

The model is a SentenceTransformer composed of two modules:

  • Transformer: an XLMRobertaModel backbone with a maximum sequence length of 128 tokens and no lowercasing.
  • Pooling: mean pooling over the token embeddings, with a word embedding dimension of 768.
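For reference, an equivalent module stack can be assembled by hand with the sentence-transformers models API. The snippet below is a sketch of that composition, not the exact configuration used for training; it builds the stack from the base xlm-roberta-base checkpoint rather than the released weights.

    from sentence_transformers import SentenceTransformer, models

    # XLM-RoBERTa backbone with the settings listed above: max_seq_length=128, no lowercasing
    word_embedding_model = models.Transformer('xlm-roberta-base', max_seq_length=128, do_lower_case=False)

    # Mean pooling over the 768-dimensional token embeddings
    pooling_model = models.Pooling(
        word_embedding_model.get_word_embedding_dimension(),  # 768
        pooling_mode_mean_tokens=True,
    )

    model = SentenceTransformer(modules=[word_embedding_model, pooling_model])
    print(model)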

Training

The model was trained with the Sentence-Transformers framework. It follows the Siamese BERT-Networks approach and uses mean pooling over token embeddings to produce sentence embeddings. For additional details, refer to the publication "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks."
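To make the Siamese setup concrete: both sentences are passed through the same encoder and the resulting vectors are compared with cosine similarity. The sketch below scores a made-up English/German paraphrase pair this way; util.cos_sim is assumed to be available, as in recent sentence-transformers releases.

    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer('sentence-transformers/paraphrase-xlm-r-multilingual-v1')

    # Both sentences go through the same network (the Siamese setup)
    emb_en = model.encode("The weather is lovely today.", convert_to_tensor=True)
    emb_de = model.encode("Das Wetter ist heute herrlich.", convert_to_tensor=True)

    # Cosine similarity of the two sentence embeddings
    print(util.cos_sim(emb_en, emb_de).item())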

Guide: Running Locally

Basic Steps

  1. Install Sentence-Transformers:

    pip install -U sentence-transformers
    
  2. Using the Model:

    from sentence_transformers import SentenceTransformer
    
    # Sentences to encode
    sentences = ["This is an example sentence", "Each sentence is converted"]
    
    # Load the model and compute the sentence embeddings
    model = SentenceTransformer('sentence-transformers/paraphrase-xlm-r-multilingual-v1')
    embeddings = model.encode(sentences)
    print(embeddings)
    
  3. Without Sentence-Transformers:

    from transformers import AutoTokenizer, AutoModel
    import torch
    
    # Mean pooling: average the token embeddings, using the attention mask
    # so that padding tokens do not contribute to the sentence embedding
    def mean_pooling(model_output, attention_mask):
        token_embeddings = model_output[0]  # first element contains the token embeddings
        input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
        return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
    
    sentences = ['This is an example sentence', 'Each sentence is converted']
    
    # Load the tokenizer and the underlying XLM-R model from the Hugging Face Hub
    tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/paraphrase-xlm-r-multilingual-v1')
    model = AutoModel.from_pretrained('sentence-transformers/paraphrase-xlm-r-multilingual-v1')
    
    # Tokenize sentences
    encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
    
    # Compute token embeddings
    with torch.no_grad():
        model_output = model(**encoded_input)
    
    # Perform mean pooling to obtain sentence embeddings
    sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
    print("Sentence embeddings:")
    print(sentence_embeddings)
    

Cloud GPU

For large-scale encoding jobs or the best throughput, consider using cloud GPUs such as those available from AWS, GCP, or Azure.
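On a GPU instance, encoding can be batched and placed on the device explicitly. The sketch below shows one way to do this with encode's standard parameters; the corpus and batch size are placeholders.

    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer('sentence-transformers/paraphrase-xlm-r-multilingual-v1', device='cuda')

    corpus = ["..."] * 100_000  # placeholder for a large corpus
    embeddings = model.encode(
        corpus,
        batch_size=256,          # larger batches make better use of the GPU
        show_progress_bar=True,
        convert_to_numpy=True,
    )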

License

This model is licensed under the Apache 2.0 License, allowing for both academic and commercial use.
