msmarco-bert-co-condensor

sentence-transformers

Introduction

The msmarco-bert-co-condensor model is a port of Luyu/co-condenser-marco-retriever to the sentence-transformers framework. It maps sentences and paragraphs to a 768-dimensional dense vector space and is optimized for semantic search. The underlying approach is described in the paper "Unsupervised Corpus Aware Language Model Pre-training for Dense Passage Retrieval".

Architecture

The model uses a BERT-based architecture wrapped in the SentenceTransformer framework. It consists of two components: a Transformer module with a maximum sequence length of 256 tokens, followed by a Pooling module that takes the CLS token's output as the sentence embedding. The configuration does not lowercase the input text.
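To make the pooling step concrete, here is a minimal sketch of CLS pooling on toy data in plain Python. The helper name `cls_pool` and the toy vectors are hypothetical, chosen only for illustration. The real model performs this step on tensors inside the SentenceTransformer pipeline and produces 768-dimensional vectors.

```python
def cls_pool(token_embeddings):
    """Return the first token's vector for each sequence (hypothetical helper).

    token_embeddings: list of sequences, each a list of per-token vectors.
    With BERT-style tokenization, position 0 holds the [CLS] token.
    """
    return [sequence[0] for sequence in token_embeddings]

# Toy batch: 2 sequences, 3 tokens each, 4 dimensions (the real model uses 768).
batch = [
    [[0.1, 0.2, 0.3, 0.4], [1.0, 1.0, 1.0, 1.0], [2.0, 2.0, 2.0, 2.0]],
    [[0.5, 0.6, 0.7, 0.8], [3.0, 3.0, 3.0, 3.0], [4.0, 4.0, 4.0, 4.0]],
]

# One vector per input sequence, taken from token position 0.
sentence_embeddings = cls_pool(batch)
print(sentence_embeddings)
```

Unlike mean pooling, this discards every token vector except the first, so the quality of the embedding depends on the model having been trained to summarize the input in the CLS position.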

Training

The model is trained on the MS MARCO dataset, a retrieval task in which the model must find the passages relevant to a given query. The training data also included document titles taken from a different benchmark, which introduces information leakage: the model can learn to associate the presence of a title with relevance. The reported performance scores may therefore be inflated.

Guide: Running Locally

To run the model locally, follow these steps:

  1. Install Sentence Transformers:

    pip install -U sentence-transformers
    
  2. Load and Use the Model:

    from sentence_transformers import SentenceTransformer, util
    
    model = SentenceTransformer('sentence-transformers/msmarco-bert-co-condensor')
    query = "How many people live in London?"
    docs = ["Around 9 Million people live in London", "London is known for its financial district"]
    
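    # Encode the query and the candidate documents
    query_emb = model.encode(query)
    doc_emb = model.encode(docs)
    # Dot-product score between the query and each document
    scores = util.dot_score(query_emb, doc_emb)[0].cpu().tolist()
    
    # Sort the documents by decreasing score
    doc_score_pairs = sorted(zip(docs, scores), key=lambda x: x[1], reverse=True)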
    for doc, score in doc_score_pairs:
        print(score, doc)
    
  3. Using Hugging Face Transformers:

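    from transformers import AutoTokenizer, AutoModel
    import torch
    
    tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/msmarco-bert-co-condensor")
    model = AutoModel.from_pretrained("sentence-transformers/msmarco-bert-co-condensor")
    
    # Same example inputs as in step 2, so this snippet runs on its own
    query = "How many people live in London?"
    docs = ["Around 9 Million people live in London", "London is known for its financial district"]
    
    def encode(texts):
        # Pad and truncate so the batch forms a single tensor
        encoded_input = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
        # Inference only, so skip gradient tracking
        with torch.no_grad():
            model_output = model(**encoded_input, return_dict=True)
        # CLS pooling: use the first token's hidden state as the embedding
        return model_output.last_hidden_state[:,0]
    
    query_emb = encode(query)
    doc_emb = encode(docs)
    # Dot-product scores between the query and every document
    scores = torch.mm(query_emb, doc_emb.transpose(0, 1))[0].cpu().tolist()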
    
    doc_score_pairs = sorted(zip(docs, scores), key=lambda x: x[1], reverse=True)
    for doc, score in doc_score_pairs:
        print(score, doc)
    
  4. Cloud GPUs: For large-scale or high-performance requirements, consider using cloud GPUs from providers like AWS, GCP, or Azure for faster processing.
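The scoring and ranking step shared by the snippets in steps 2 and 3 reduces to a dot product between the query embedding and each document embedding, followed by a sort. The sketch below reproduces that logic in plain Python on toy 3-dimensional vectors, so it runs without downloading the model. The helper names `dot` and `rank_by_dot_score` and the vectors are hypothetical, for illustration only.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def rank_by_dot_score(query_vec, doc_vecs, docs):
    """Score each document against the query and sort by decreasing score."""
    scores = [dot(query_vec, doc_vec) for doc_vec in doc_vecs]
    return sorted(zip(docs, scores), key=lambda pair: pair[1], reverse=True)

# Toy embeddings standing in for real 768-dimensional model outputs.
query_vec = [1.0, 0.0, 2.0]
doc_vecs = [[0.5, 1.0, 0.0], [1.0, 0.0, 1.0]]
docs = ["doc A", "doc B"]

ranked = rank_by_dot_score(query_vec, doc_vecs, docs)
for doc, score in ranked:
    print(score, doc)
```

Because the score is a raw dot product rather than a cosine similarity, the magnitude of the embeddings affects the ranking as well as their direction.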

License

The msmarco-bert-co-condensor model is licensed under the Apache 2.0 License, allowing for both commercial and non-commercial use.
