multi-qa-distilbert-cos-v1
Introduction
The multi-qa-distilbert-cos-v1 model is a sentence-transformers model designed for semantic search. It maps sentences and paragraphs to a 768-dimensional dense vector space and was trained on 215 million question-answer pairs drawn from diverse datasets, making it suitable for tasks such as semantic search and sentence similarity.
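For instance, encoding a single sentence returns a 768-dimensional vector; a minimal sketch using the Sentence Transformers API (installation is covered in the guide below):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('sentence-transformers/multi-qa-distilbert-cos-v1')
embedding = model.encode("How many people live in London?")
print(embedding.shape)  # (768,)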
Architecture
This model is based on the distilbert-base-uncased architecture, adapted for semantic search. It uses mean pooling over token embeddings and normalizes the resulting sentence embeddings to unit length, so dot-product and cosine similarity produce identical rankings. Queries and text passages are encoded into the same dense vector space, enabling efficient document retrieval.
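To see why the two scores coincide for unit-length vectors, here is a small illustration using random embeddings (purely synthetic values, not model outputs):

import torch
import torch.nn.functional as F

# Synthetic, unnormalized 768-dimensional embeddings
q = torch.randn(1, 768)
d = torch.randn(1, 768)

# After L2-normalization, the dot product equals cosine similarity
q_n, d_n = F.normalize(q, p=2, dim=1), F.normalize(d, p=2, dim=1)
dot = (q_n * d_n).sum()
cos = F.cosine_similarity(q, d)[0]
print(torch.allclose(dot, cos))  # True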
Training
The model was fine-tuned on a large, diverse collection of datasets using a self-supervised contrastive learning objective. Training covered 215 million question-answer pairs from datasets such as MS MARCO, PAQ, and WikiAnswers, using MultipleNegativesRankingLoss with mean pooling and cosine similarity as the scoring function. The pre-trained distilbert-base-uncased model served as the starting point.
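A minimal sketch of what such contrastive fine-tuning looks like with the Sentence Transformers training API; the model assembly mirrors the description above, but the question-answer pairs, batch size, and epoch count are illustrative placeholders, not the actual training setup:

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses, models, util

# Assemble the model: distilbert-base-uncased backbone + mean pooling + normalization
word_emb = models.Transformer('distilbert-base-uncased')
pooling = models.Pooling(word_emb.get_word_embedding_dimension(), pooling_mode='mean')
normalize = models.Normalize()
model = SentenceTransformer(modules=[word_emb, pooling, normalize])

# Illustrative (question, answer) pairs; with MultipleNegativesRankingLoss,
# the answers of the other pairs in a batch act as in-batch negatives
train_examples = [
    InputExample(texts=["How many people live in London?",
                        "Around 9 Million people live in London"]),
    InputExample(texts=["What is the capital of France?",
                        "Paris is the capital of France"]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)

# Contrastive objective with cosine similarity as the scoring function
train_loss = losses.MultipleNegativesRankingLoss(model, similarity_fct=util.cos_sim)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1)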
Guide: Running Locally
To use this model locally, you can follow these steps:
- Install the Sentence Transformers library:

pip install -U sentence-transformers
- Load and use the model:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('sentence-transformers/multi-qa-distilbert-cos-v1')

query = "How many people live in London?"
docs = ["Around 9 Million people live in London", "London is known for its financial district"]

# Encode query and documents
query_emb = model.encode(query)
doc_emb = model.encode(docs)

# Compute dot score between query and document embeddings
scores = util.dot_score(query_emb, doc_emb)[0].cpu().tolist()

# Combine and sort documents by score
doc_score_pairs = sorted(zip(docs, scores), key=lambda x: x[1], reverse=True)

# Output passages and scores
for doc, score in doc_score_pairs:
    print(score, doc)
- Alternative using the Transformers library:

from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

# Mean pooling: average the token embeddings, ignoring padding tokens
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output.last_hidden_state
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

# Encode texts: tokenize, run the model, mean-pool, and L2-normalize
def encode(texts):
    encoded_input = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
    with torch.no_grad():
        model_output = model(**encoded_input, return_dict=True)
    embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
    return F.normalize(embeddings, p=2, dim=1)

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/multi-qa-distilbert-cos-v1")
model = AutoModel.from_pretrained("sentence-transformers/multi-qa-distilbert-cos-v1")

query = "How many people live in London?"
docs = ["Around 9 Million people live in London", "London is known for its financial district"]

# Encode and score
query_emb = encode(query)
doc_emb = encode(docs)
scores = torch.mm(query_emb, doc_emb.transpose(0, 1))[0].cpu().tolist()

# Output sorted results
doc_score_pairs = sorted(zip(docs, scores), key=lambda x: x[1], reverse=True)
for doc, score in doc_score_pairs:
    print(score, doc)
For enhanced performance, consider using cloud GPU services like AWS, GCP, or Azure.
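If a GPU is available, either locally or on one of those cloud instances, encoding can be moved onto it; a minimal sketch, assuming a CUDA device is present:

from sentence_transformers import SentenceTransformer

# Load the model directly onto the GPU
model = SentenceTransformer('sentence-transformers/multi-qa-distilbert-cos-v1', device='cuda')

docs = ["Around 9 Million people live in London", "London is known for its financial district"]

# Larger batch sizes generally improve throughput when encoding many documents
doc_emb = model.encode(docs, batch_size=64, show_progress_bar=True)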
License
The multi-qa-distilbert-cos-v1 model is distributed under the Apache License 2.0.