multi-qa-distilbert-cos-v1

sentence-transformers

Introduction

multi-qa-distilbert-cos-v1 is a sentence-transformers model that maps sentences and paragraphs to a 768-dimensional dense vector space. It was trained on 215 million question-answer pairs drawn from diverse datasets and is intended for semantic search and related sentence-similarity tasks.
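
As a minimal sketch of what this mapping looks like in practice (assuming the Sentence Transformers library is installed, as described in the guide below), encoding a single sentence yields one 768-dimensional vector:

    from sentence_transformers import SentenceTransformer
    
    model = SentenceTransformer("sentence-transformers/multi-qa-distilbert-cos-v1")
    
    # A single sentence is mapped to one 768-dimensional vector
    emb = model.encode("How many people live in London?")
    print(emb.shape)  # (768,)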

Architecture

This model is based on the distilbert-base-uncased architecture, fine-tuned for semantic search. It applies mean pooling over token embeddings and L2-normalizes the result, so dot-product and cosine similarity yield identical scores. Queries and text passages are encoded into the same dense vector space, enabling efficient document retrieval.
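
Because the embeddings are length-normalized, cosine similarity and dot product produce the same scores. The short sketch below illustrates this; it assumes the model is loaded as in the guide further down:

    from sentence_transformers import SentenceTransformer, util
    
    model = SentenceTransformer("sentence-transformers/multi-qa-distilbert-cos-v1")
    emb = model.encode(
        ["How many people live in London?", "Around 9 Million people live in London"],
        convert_to_tensor=True,
    )
    
    # With length-normalized embeddings, the two scores match (up to floating-point error)
    print(util.cos_sim(emb[0], emb[1]))
    print(util.dot_score(emb[0], emb[1]))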

Training

Starting from the pre-trained distilbert-base-uncased checkpoint, the model was fine-tuned on 215 million question-answer pairs drawn from a large, diverse collection of datasets, including MS MARCO, PAQ, and WikiAnswers, using a contrastive learning objective: MultipleNegativesRankingLoss with mean pooling and cosine similarity.
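
The original training pipeline is not reproduced here, but the sketch below illustrates how fine-tuning with MultipleNegativesRankingLoss (in-batch negatives, cosine similarity) is typically set up with the Sentence Transformers API; the example pairs and hyperparameters are placeholders, not the actual training configuration:

    from sentence_transformers import SentenceTransformer, InputExample, losses, util
    from torch.utils.data import DataLoader
    
    # Placeholder (query, answer) pairs; the real training data covers 215M pairs
    train_examples = [
        InputExample(texts=["How many people live in London?",
                            "Around 9 Million people live in London"]),
        InputExample(texts=["What is the capital of France?",
                            "Paris is the capital and largest city of France"]),
    ]
    train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
    
    # Start from the pre-trained checkpoint; mean pooling is applied by default
    model = SentenceTransformer("distilbert-base-uncased")
    
    # In-batch negatives, scored with cosine similarity
    train_loss = losses.MultipleNegativesRankingLoss(model, scale=20.0, similarity_fct=util.cos_sim)
    
    model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)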

Guide: Running Locally

To use this model locally, you can follow these steps:

  1. Install the Sentence Transformers library:

    pip install -U sentence-transformers
    
  2. Load and use the model:

    from sentence_transformers import SentenceTransformer, util
    
    model = SentenceTransformer('sentence-transformers/multi-qa-distilbert-cos-v1')
    query = "How many people live in London?"
    docs = ["Around 9 Million people live in London", "London is known for its financial district"]
    
    # Encode query and documents
    query_emb = model.encode(query)
    doc_emb = model.encode(docs)
    
    # Compute dot score between query and document embeddings
    scores = util.dot_score(query_emb, doc_emb)[0].cpu().tolist()
    
    # Combine and sort documents by score
    doc_score_pairs = sorted(zip(docs, scores), key=lambda x: x[1], reverse=True)
    
    # Output passages and scores
    for doc, score in doc_score_pairs:
        print(score, doc)
    
  3. Alternatively, use the Transformers library directly (with manual pooling and normalization):

    from transformers import AutoTokenizer, AutoModel
    import torch
    import torch.nn.functional as F
    
    # Mean pooling: average the token embeddings, weighted by the attention mask
    def mean_pooling(model_output, attention_mask):
        token_embeddings = model_output.last_hidden_state
        input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
        return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
    
    # Tokenize, embed, mean-pool, and L2-normalize a string or list of strings
    def encode(texts):
        encoded_input = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
        with torch.no_grad():
            model_output = model(**encoded_input, return_dict=True)
        embeddings = mean_pooling(model_output, encoded_input["attention_mask"])
        return F.normalize(embeddings, p=2, dim=1)
    
    # Load model and tokenizer
    tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/multi-qa-distilbert-cos-v1")
    model = AutoModel.from_pretrained("sentence-transformers/multi-qa-distilbert-cos-v1")
    
    query = "How many people live in London?"
    docs = ["Around 9 Million people live in London", "London is known for its financial district"]
    
    # Encode and score
    query_emb = encode(query)
    doc_emb = encode(docs)
    scores = torch.mm(query_emb, doc_emb.transpose(0, 1))[0].cpu().tolist()
    
    # Output sorted results
    doc_score_pairs = sorted(zip(docs, scores), key=lambda x: x[1], reverse=True)
    for doc, score in doc_score_pairs:
        print(score, doc)
    

For enhanced performance, consider using cloud GPU services like AWS, GCP, or Azure.
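
On a machine with a GPU, encoding can be batched onto the device. The sketch below is illustrative; the batch size is an arbitrary placeholder value:

    import torch
    from sentence_transformers import SentenceTransformer
    
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = SentenceTransformer("sentence-transformers/multi-qa-distilbert-cos-v1", device=device)
    
    docs = ["Around 9 Million people live in London", "London is known for its financial district"]
    
    # Batched encoding on the selected device; batch_size here is an example value
    doc_emb = model.encode(docs, batch_size=64, convert_to_tensor=True)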

License

The multi-qa-distilbert-cos-v1 model is distributed under the Apache License 2.0.
