multi-qa-distilbert-cos-v1
Introduction
The multi-qa-distilbert-cos-v1 model is a sentence-transformers model designed for semantic search. It maps sentences and paragraphs to a 768-dimensional dense vector space and was trained on 215 million question-answer pairs drawn from diverse datasets, making it suitable for tasks such as semantic search and sentence similarity.
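For instance, encoding a single sentence returns a 768-dimensional vector; a minimal sketch using the Sentence Transformers API (installation is covered in the guide below):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('sentence-transformers/multi-qa-distilbert-cos-v1')
embedding = model.encode("How many people live in London?")
print(embedding.shape)  # (768,)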
Architecture
This model is based on the distilbert-base-uncased architecture, adapted for semantic search. It uses mean pooling over token embeddings and normalizes the resulting sentence embeddings to unit length, so dot-product and cosine similarity produce identical rankings. Queries and text passages are encoded into the same dense vector space, enabling efficient document retrieval.
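To see why the two scores coincide for unit-length vectors, here is a small illustration using random embeddings (purely synthetic values, not model outputs):

import torch
import torch.nn.functional as F

# Synthetic, unnormalized 768-dimensional embeddings
q = torch.randn(1, 768)
d = torch.randn(1, 768)

# After L2-normalization, the dot product equals cosine similarity
q_n, d_n = F.normalize(q, p=2, dim=1), F.normalize(d, p=2, dim=1)
dot = (q_n * d_n).sum()
cos = F.cosine_similarity(q, d)[0]
print(torch.allclose(dot, cos))  # True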
Training
The model was fine-tuned on a large, diverse collection of datasets using a self-supervised contrastive learning objective. Training covered 215 million question-answer pairs from datasets such as MS MARCO, PAQ, and WikiAnswers, using MultipleNegativesRankingLoss with mean pooling and cosine similarity as the scoring function. The pre-trained distilbert-base-uncased model served as the starting point.
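A minimal sketch of what such contrastive fine-tuning looks like with the Sentence Transformers training API; the model assembly mirrors the description above, but the question-answer pairs, batch size, and epoch count are illustrative placeholders, not the actual training setup:

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses, models, util

# Assemble the model: distilbert-base-uncased backbone + mean pooling + normalization
word_emb = models.Transformer('distilbert-base-uncased')
pooling = models.Pooling(word_emb.get_word_embedding_dimension(), pooling_mode='mean')
normalize = models.Normalize()
model = SentenceTransformer(modules=[word_emb, pooling, normalize])

# Illustrative (question, answer) pairs; with MultipleNegativesRankingLoss,
# the answers of the other pairs in a batch act as in-batch negatives
train_examples = [
    InputExample(texts=["How many people live in London?",
                        "Around 9 Million people live in London"]),
    InputExample(texts=["What is the capital of France?",
                        "Paris is the capital of France"]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)

# Contrastive objective with cosine similarity as the scoring function
train_loss = losses.MultipleNegativesRankingLoss(model, similarity_fct=util.cos_sim)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1)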
Guide: Running Locally
To use this model locally, you can follow these steps:
- Install the Sentence Transformers library:

pip install -U sentence-transformers
- Load and use the model:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('sentence-transformers/multi-qa-distilbert-cos-v1')

query = "How many people live in London?"
docs = ["Around 9 Million people live in London", "London is known for its financial district"]

# Encode query and documents
query_emb = model.encode(query)
doc_emb = model.encode(docs)

# Compute dot score between query and document embeddings
scores = util.dot_score(query_emb, doc_emb)[0].cpu().tolist()

# Combine and sort documents by score
doc_score_pairs = sorted(zip(docs, scores), key=lambda x: x[1], reverse=True)

# Output passages and scores
for doc, score in doc_score_pairs:
    print(score, doc)
- Alternative using the Transformers library:

from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

# Mean pooling: average the token embeddings, ignoring padding tokens
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output.last_hidden_state
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

# Encode texts: tokenize, run the model, mean-pool, and L2-normalize
def encode(texts):
    encoded_input = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
    with torch.no_grad():
        model_output = model(**encoded_input, return_dict=True)
    embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
    return F.normalize(embeddings, p=2, dim=1)

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/multi-qa-distilbert-cos-v1")
model = AutoModel.from_pretrained("sentence-transformers/multi-qa-distilbert-cos-v1")

query = "How many people live in London?"
docs = ["Around 9 Million people live in London", "London is known for its financial district"]

# Encode and score
query_emb = encode(query)
doc_emb = encode(docs)
scores = torch.mm(query_emb, doc_emb.transpose(0, 1))[0].cpu().tolist()

# Output sorted results
doc_score_pairs = sorted(zip(docs, scores), key=lambda x: x[1], reverse=True)
for doc, score in doc_score_pairs:
    print(score, doc)
For enhanced performance, consider using cloud GPU services like AWS, GCP, or Azure.
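If a GPU is available, either locally or on one of those cloud instances, encoding can be moved onto it; a minimal sketch, assuming a CUDA device is present:

from sentence_transformers import SentenceTransformer

# Load the model directly onto the GPU
model = SentenceTransformer('sentence-transformers/multi-qa-distilbert-cos-v1', device='cuda')

docs = ["Around 9 Million people live in London", "London is known for its financial district"]

# Larger batch sizes generally improve throughput when encoding many documents
doc_emb = model.encode(docs, batch_size=64, show_progress_bar=True)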
License
The multi-qa-distilbert-cos-v1 model is distributed under the Apache License 2.0.