msmarco-bert-co-condensor
Introduction
The msmarco-bert-co-condensor model is a Sentence Transformers port of Luyu/co-condenser-marco-retriever. It maps sentences and paragraphs into a 768-dimensional dense vector space and is optimized for semantic search. The model is based on the paper "Unsupervised Corpus Aware Language Model Pre-training for Dense Passage Retrieval."
Architecture
The model is a BERT-based SentenceTransformer composed of two modules: a Transformer component with a maximum sequence length of 256 tokens, followed by a pooling layer that takes the CLS token as the sentence embedding. The configuration does not lower-case the input text.
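For illustration, the same two-module stack can be assembled by hand with the sentence_transformers.models building blocks. This is a minimal sketch mirroring the configuration described above, not how the published checkpoint is packaged:

from sentence_transformers import SentenceTransformer, models

# Transformer module: BERT backbone, 256-token limit, no lower-casing
word_embedding_model = models.Transformer(
    'sentence-transformers/msmarco-bert-co-condensor',
    max_seq_length=256,
    do_lower_case=False,
)

# Pooling module: use the CLS token (768 dimensions) as the sentence embedding
pooling_model = models.Pooling(
    word_embedding_model.get_word_embedding_dimension(),
    pooling_mode_cls_token=True,
    pooling_mode_mean_tokens=False,
)

model = SentenceTransformer(modules=[word_embedding_model, pooling_model])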
Training
The model is trained on the MS MARCO passage-ranking dataset, where the task is to retrieve relevant passages for a given query. Note, however, that the training setup suffers from information leakage: passage titles taken from a different benchmark were included, even though titles are not part of the official MS MARCO passage corpus. Because the model can learn to associate the presence of a title with relevance, the reported performance scores are likely inflated.
Guide: Running Locally
To run the model locally, follow these steps:
- Install Sentence Transformers:

pip install -U sentence-transformers
- Load and use the model (a scaled-up variant using util.semantic_search follows this list):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('sentence-transformers/msmarco-bert-co-condensor')

query = "How many people live in London?"
docs = ["Around 9 Million people live in London", "London is known for its financial district"]

# Encode the query and documents into 768-dimensional vectors
query_emb = model.encode(query)
doc_emb = model.encode(docs)

# Score each document against the query with the dot product
scores = util.dot_score(query_emb, doc_emb)[0].cpu().tolist()

# Print documents in order of decreasing relevance
doc_score_pairs = sorted(zip(docs, scores), key=lambda x: x[1], reverse=True)
for doc, score in doc_score_pairs:
    print(score, doc)
- Using Hugging Face Transformers directly, with manual CLS pooling (a sanity check comparing both paths follows this list):

from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/msmarco-bert-co-condensor")
model = AutoModel.from_pretrained("sentence-transformers/msmarco-bert-co-condensor")

def encode(texts):
    # Tokenize, run the model, and take the CLS token as the embedding
    encoded_input = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
    with torch.no_grad():
        model_output = model(**encoded_input, return_dict=True)
    return model_output.last_hidden_state[:, 0]

query = "How many people live in London?"
docs = ["Around 9 Million people live in London", "London is known for its financial district"]

query_emb = encode(query)
doc_emb = encode(docs)

# Dot-product scores between the query and each document
scores = torch.mm(query_emb, doc_emb.transpose(0, 1))[0].cpu().tolist()

doc_score_pairs = sorted(zip(docs, scores), key=lambda x: x[1], reverse=True)
for doc, score in doc_score_pairs:
    print(score, doc)
- Cloud GPUs: for large-scale or high-throughput workloads, consider cloud GPUs from providers such as AWS, GCP, or Azure (see the GPU sketch below).
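For retrieval over more than a handful of documents, the util.semantic_search helper from Sentence Transformers batches the scoring for you. A minimal sketch, assuming the same model as above; the three-passage corpus is purely illustrative:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('sentence-transformers/msmarco-bert-co-condensor')

# Illustrative corpus; replace with your own passages
corpus = [
    "Around 9 Million people live in London",
    "London is known for its financial district",
    "Paris is the capital of France",
]
corpus_emb = model.encode(corpus, convert_to_tensor=True)
query_emb = model.encode("How many people live in London?", convert_to_tensor=True)

# Use the dot product as the score function, matching the examples above
hits = util.semantic_search(query_emb, corpus_emb, top_k=2, score_function=util.dot_score)[0]
for hit in hits:
    print(hit['score'], corpus[hit['corpus_id']])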
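Because the pooling layer simply takes the CLS token, the two code paths above should produce the same embeddings up to floating-point tolerance. A quick sanity check, as a sketch:

from sentence_transformers import SentenceTransformer
from transformers import AutoTokenizer, AutoModel
import torch
import numpy as np

texts = ["Around 9 Million people live in London"]

# Sentence Transformers path
st_model = SentenceTransformer('sentence-transformers/msmarco-bert-co-condensor')
st_emb = st_model.encode(texts)

# Plain Transformers path with manual CLS pooling
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/msmarco-bert-co-condensor")
hf_model = AutoModel.from_pretrained("sentence-transformers/msmarco-bert-co-condensor")
encoded = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
with torch.no_grad():
    hf_emb = hf_model(**encoded).last_hidden_state[:, 0]

# Expect True (within tolerance) if both paths implement the same pooling
print(np.allclose(st_emb, hf_emb.numpy(), atol=1e-4))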
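When a GPU is available, Sentence Transformers can place the model on it directly; larger encode batches then improve throughput. A minimal sketch (the batch size and passages are illustrative):

from sentence_transformers import SentenceTransformer
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = SentenceTransformer('sentence-transformers/msmarco-bert-co-condensor', device=device)

# Tune batch_size to the GPU memory budget
doc_emb = model.encode(
    ["Around 9 Million people live in London", "London is known for its financial district"],
    batch_size=64,
    convert_to_tensor=True,
)
print(doc_emb.shape, doc_emb.device)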
License
The msmarco-bert-co-condensor model is licensed under the Apache License 2.0, which permits both commercial and non-commercial use.