facebook-dpr-ctx_encoder-multiset-base

sentence-transformers

Introduction

facebook-dpr-ctx_encoder-multiset-base is a Sentence Transformers port of the context encoder from Facebook's Dense Passage Retrieval (DPR) model. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for tasks such as clustering and semantic search.
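
To make the semantic search use case concrete, the sketch below ranks a few passages against a query by vector similarity. It is only an illustration: the 3-dimensional vectors are hand-made stand-ins for real 768-dimensional embeddings, and the helper names (`dot`, `cosine`, `rank`) are hypothetical, not part of any library. Dot product is used as the default score, with cosine similarity as an alternative.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    """Cosine similarity: the dot product of the length-normalised vectors."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def rank(query_vec, passage_vecs, score=dot):
    """Return passage indices ordered from most to least similar to the query."""
    scores = [score(query_vec, p) for p in passage_vecs]
    return sorted(range(len(passage_vecs)), key=lambda i: scores[i], reverse=True)

# Toy vectors (assumed values, NOT model output) standing in for embeddings.
passages = [
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.2],
    [0.5, 0.5, 0.5],
]
query = [1.0, 0.0, 0.1]

ranking = rank(query, passages)
print(ranking)
```

With real data, `passages` would hold the rows returned by the model's `encode` call, and the ranking would be read the same way: the first index is the best match.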

Architecture

The model architecture consists of two main components:

  • Transformer: A BertModel with a maximum sequence length of 509 tokens; input text is not lowercased.
  • Pooling: CLS-token pooling over 768-dimensional token embeddings, so each input yields a single 768-dimensional vector.
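
As a rough illustration of what CLS-token pooling does, the sketch below takes a batch of per-token vectors and keeps only the vector at position 0 of each sequence. It uses plain Python lists and a toy hidden size of 4 instead of 768; the values are made up for the example and the function is a simplified stand-in, not the library's own pooling code.

```python
def cls_pooling(token_embeddings):
    """Keep the vector at position 0 (the [CLS] token) of every sequence.

    token_embeddings is a nested list shaped (batch, seq_len, hidden).
    The result is shaped (batch, hidden).
    """
    return [sequence[0] for sequence in token_embeddings]

# Toy batch (assumed values): 2 sequences, 3 token positions, hidden size 4.
batch = [
    [[0.1, 0.2, 0.3, 0.4], [1.0, 1.0, 1.0, 1.0], [2.0, 2.0, 2.0, 2.0]],
    [[0.5, 0.6, 0.7, 0.8], [3.0, 3.0, 3.0, 3.0], [0.0, 0.0, 0.0, 0.0]],
]

pooled = cls_pooling(batch)
print(pooled)
```

The other token vectors are discarded entirely, which is why no attention mask is needed for this pooling mode.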

Training

This model is a port of the Dense Passage Retrieval (DPR) context encoder released by Facebook Research, packaged so that it can be loaded directly with sentence-transformers. It produces one embedding per input sentence or paragraph.

Guide: Running Locally

Basic Steps

  1. Install Sentence Transformers:

    pip install -U sentence-transformers
    
  2. Use the model with Sentence Transformers:

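    from sentence_transformers import SentenceTransformer
    
    # Sentences to embed
    sentences = ["This is an example sentence", "Each sentence is converted"]
    
    # Load the model from the Hugging Face Hub (downloaded on first use)
    model = SentenceTransformer('sentence-transformers/facebook-dpr-ctx_encoder-multiset-base')
    
    # Returns one 768-dimensional vector per sentence
    embeddings = model.encode(sentences)
    print(embeddings)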
    
  3. Use the model with Hugging Face Transformers directly, applying the pooling step yourself:

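    from transformers import AutoTokenizer, AutoModel
    import torch
    
    def cls_pooling(model_output, attention_mask):
        # model_output[0] is the last hidden state, shaped (batch, seq_len, hidden).
        # CLS pooling keeps the first token's vector; attention_mask is not needed.
        return model_output[0][:, 0]
    
    # Sentences to embed
    sentences = ['This is an example sentence', 'Each sentence is converted']
    
    # Load the tokenizer and model from the Hugging Face Hub
    tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/facebook-dpr-ctx_encoder-multiset-base')
    model = AutoModel.from_pretrained('sentence-transformers/facebook-dpr-ctx_encoder-multiset-base')
    
    # Tokenize, padding to the longest sentence and truncating over-long inputs
    encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
    
    # Compute token embeddings without tracking gradients
    with torch.no_grad():
        model_output = model(**encoded_input)
    
    # Pool the token embeddings into one vector per sentence
    sentence_embeddings = cls_pooling(model_output, encoded_input['attention_mask'])
    print("Sentence embeddings:")
    print(sentence_embeddings)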
    

Cloud GPUs

Encoding large collections of text is typically much faster on a GPU. If no local GPU is available, cloud providers such as AWS, Google Cloud, or Azure offer GPU instances that can run the model.
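
Whether the code runs locally or on a cloud instance, it helps to check that a GPU is actually visible before encoding. The snippet below is a minimal sketch of that check; `pick_device` is a hypothetical helper name, and it falls back to the CPU when PyTorch is missing or reports no CUDA device.

```python
def pick_device():
    """Return "cuda" when PyTorch reports a usable GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed, so nothing can run on a GPU here.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"

device = pick_device()
print(f"Encoding would run on: {device}")

# The chosen device can then be passed when loading the model, for example:
#   model = SentenceTransformer(
#       'sentence-transformers/facebook-dpr-ctx_encoder-multiset-base',
#       device=device,
#   )
```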

License

The model is distributed under the Apache 2.0 License.
