xlm-r-base-en-ko-nli-ststb

sentence-transformers

Introduction

The sentence-transformers/xlm-r-base-en-ko-nli-ststb model is a Sentence Transformers model that maps sentences and paragraphs to a 768-dimensional dense vector space. The resulting vectors can be used for tasks such as clustering or semantic search. The model is deprecated, however, because it produces low-quality sentence embeddings; users are advised to choose one of the recommended models on SBERT.net instead.
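As a rough illustration of how dense sentence vectors support semantic search, the sketch below ranks a few candidates against a query by cosine similarity. The three-dimensional vectors and the helper names `cosine_similarity` and `rank_by_similarity` are made up for illustration; real embeddings from this model would have 768 dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_by_similarity(query, corpus):
    """Return (index, score) pairs, most similar first."""
    scores = [(i, cosine_similarity(query, vec)) for i, vec in enumerate(corpus)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Toy stand-ins for sentence embeddings (hypothetical values, not model output).
query = [1.0, 0.0, 0.0]
corpus = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [2.0, 0.0, 0.0],  # same direction as the query
    [1.0, 1.0, 0.0],  # 45 degrees from the query
]

ranking = rank_by_similarity(query, corpus)
print(ranking)
```

Cosine similarity ignores vector length, so the second candidate ranks first even though its magnitude differs from the query's.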

Architecture

The model employs a SentenceTransformer architecture with two main components:

  1. Transformer: Uses an XLMRobertaModel with a maximum sequence length of 128 tokens.
  2. Pooling: Applies mean pooling over the token embeddings to produce a single sentence embedding.
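The pooling step can be sketched in plain Python as follows. This is a minimal illustration rather than the library's implementation: the function name `mean_pool`, the two-dimensional token vectors, and the mask values are all hypothetical. It assumes the usual attention-mask convention in which 1 marks a real token and 0 marks padding.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose mask entry is 1."""
    dim = len(token_embeddings[0])
    totals = [0.0] * dim
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            for j, value in enumerate(vector):
                totals[j] += value
    count = max(sum(attention_mask), 1)  # guard against an all-padding row
    return [t / count for t in totals]

# Hypothetical 2-dimensional token vectors; the last position is padding.
tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]

sentence_vector = mean_pool(tokens, mask)
print(sentence_vector)  # [2.0, 3.0]
```

Because the padded position is masked out, its values do not affect the result, which is why sentences of different lengths can share a batch.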

Training

This model was trained with the Sentence Transformers framework, which uses Siamese networks to produce sentence embeddings, as described in the publication "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks" by Nils Reimers and Iryna Gurevych.

Guide: Running Locally

To run the model locally, follow these steps:

  1. Install the Sentence Transformers library:

    pip install -U sentence-transformers
    
  2. Load and use the model with Sentence Transformers:

    from sentence_transformers import SentenceTransformer
    sentences = ["This is an example sentence", "Each sentence is converted"]
    model = SentenceTransformer('sentence-transformers/xlm-r-base-en-ko-nli-ststb')
    embeddings = model.encode(sentences)
    print(embeddings)
    
  3. Alternatively, use Hugging Face Transformers:

    from transformers import AutoTokenizer, AutoModel
    import torch
    
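    def mean_pooling(model_output, attention_mask):
        # The first element of model_output holds the per-token embeddings
        token_embeddings = model_output[0]
        # Expand the mask to the embedding shape so padding tokens contribute zero
        input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
        # Sum the unmasked embeddings and divide by the token count (clamped to avoid division by zero)
        return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)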
    
    sentences = ['This is an example sentence', 'Each sentence is converted']
    tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/xlm-r-base-en-ko-nli-ststb')
    model = AutoModel.from_pretrained('sentence-transformers/xlm-r-base-en-ko-nli-ststb')
    
    # Truncate to the 128-token maximum sequence length noted under Architecture
    encoded_input = tokenizer(sentences, padding=True, truncation=True, max_length=128, return_tensors='pt')
    with torch.no_grad():
        model_output = model(**encoded_input)
    
    sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
    print("Sentence embeddings:")
    print(sentence_embeddings)
    
  4. Cloud GPUs:

    • Consider using cloud services such as AWS, Google Cloud, or Azure for GPU support to speed up processing, especially for large datasets.
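Whether the code runs on a cloud GPU or a local one, the device still has to be selected when the model is loaded. The sketch below shows one way to do that, assuming PyTorch is the backend. The helper names `pick_device` and `encode_corpus` and the batch size of 32 are illustrative choices, not part of the model card.

```python
def pick_device():
    """Return "cuda" when PyTorch reports a usable GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"

def encode_corpus(sentences, batch_size=32):
    """Encode sentences on the selected device (downloads the model on first use)."""
    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer(
        "sentence-transformers/xlm-r-base-en-ko-nli-ststb",
        device=pick_device(),
    )
    # Larger batches usually help on a GPU, within the limits of its memory.
    return model.encode(sentences, batch_size=batch_size)

print(pick_device())
```

Calling `encode_corpus(["This is an example sentence"])` would then return the embeddings computed on whichever device was selected.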

License

The model is released under the Apache 2.0 License.
