sentence-transformers/msmarco-bert-base-dot-v5

Introduction

msmarco-bert-base-dot-v5 is a SentenceTransformers model for semantic search that maps sentences and paragraphs to a 768-dimensional dense vector space. It was trained on 500K (query, answer) pairs from the MS MARCO dataset and scores candidates with the dot product, making it well suited to sentence-similarity and feature-extraction tasks.

Architecture

The model combines a BERT base transformer with a mean pooling layer (see the sketch after this list):

  • Transformer Layer: Utilizes bert-base-uncased with a max sequence length of 512.
  • Pooling Layer: Employs mean pooling of token embeddings to generate sentence embeddings.
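For reference, this composition can be reproduced with the SentenceTransformers modules API. The snippet below is a minimal sketch of the architecture described above, not the released checkpoint itself:

    from sentence_transformers import SentenceTransformer, models

    # bert-base-uncased transformer with a 512-token maximum sequence length
    word_embedding_model = models.Transformer('bert-base-uncased', max_seq_length=512)

    # Mean pooling over token embeddings yields a 768-dimensional sentence embedding
    pooling_model = models.Pooling(
        word_embedding_model.get_word_embedding_dimension(),
        pooling_mode='mean',
    )

    model = SentenceTransformer(modules=[word_embedding_model, pooling_model])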

Training

The model was trained with the following parameters (a hedged reconstruction is sketched after the list):

  • DataLoader: Batch size of 64 over 7858 iterations.
  • Loss Function: MarginMSELoss from SentenceTransformers.
  • Optimizer: AdamW with a learning rate of 1e-5 and weight decay of 0.01.
  • Epochs: Trained over 30 epochs with a warmup step count of 10000.
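The setup can be approximated with the SentenceTransformers training API as sketched below. The example triplet and its margin label are illustrative placeholders: MarginMSELoss expects (query, positive, negative) texts labeled with the score margin of a teacher model, and the starting checkpoint shown here is an assumption consistent with the architecture section.

    from torch.utils.data import DataLoader
    from sentence_transformers import SentenceTransformer, InputExample, losses

    model = SentenceTransformer('bert-base-uncased')  # assumed starting checkpoint

    # Placeholder training data: (query, positive, negative) with a
    # hypothetical teacher score margin as the label
    train_examples = [
        InputExample(
            texts=["how many people live in london",
                   "Around 9 Million people live in London",
                   "London is known for its financial district"],
            label=8.3,  # hypothetical teacher margin
        ),
    ]

    train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=64)
    train_loss = losses.MarginMSELoss(model)

    model.fit(
        train_objectives=[(train_dataloader, train_loss)],
        epochs=30,
        warmup_steps=10000,
        optimizer_params={'lr': 1e-5},
        weight_decay=0.01,
    )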

Guide: Running Locally

To run this model locally, you need Python and the SentenceTransformers library.

Basic Steps

  1. Install SentenceTransformers:

    pip install -U sentence-transformers
    
  2. Load and Use the Model:

    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer('sentence-transformers/msmarco-bert-base-dot-v5')

    query = "How many people live in London?"
    docs = ["Around 9 Million people live in London", "London is known for its financial district"]

    # Encode the query and documents into 768-dimensional embeddings
    query_emb = model.encode(query)
    doc_emb = model.encode(docs)

    # Score each document against the query with the dot product,
    # the similarity function the model was trained for
    scores = util.dot_score(query_emb, doc_emb)[0].cpu().tolist()

    # Rank documents by score, highest first
    doc_score_pairs = sorted(zip(docs, scores), key=lambda x: x[1], reverse=True)
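
    The ranked results can then be inspected:

        for doc, score in doc_score_pairs:
            print(f"{score:.2f}\t{doc}")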
    
  3. Alternative Using the Transformers Library (note that mean pooling must respect the attention mask):

    from transformers import AutoTokenizer, AutoModel
    import torch
    
    tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/msmarco-bert-base-dot-v5")
    model = AutoModel.from_pretrained("sentence-transformers/msmarco-bert-base-dot-v5")
    
    def encode(texts):
        # Tokenize, run BERT, then mean-pool the token embeddings, using the
        # attention mask so padding tokens do not skew the average
        encoded_input = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
        with torch.no_grad():
            model_output = model(**encoded_input, return_dict=True)
        token_embeddings = model_output.last_hidden_state
        mask = encoded_input['attention_mask'].unsqueeze(-1).float()
        return (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)

    query = "How many people live in London?"
    docs = ["Around 9 Million people live in London", "London is known for its financial district"]

    query_emb = encode(query)
    doc_emb = encode(docs)

    # Dot-product scores between the query and each document
    scores = torch.mm(query_emb, doc_emb.transpose(0, 1))[0].cpu().tolist()
    

Cloud GPUs

For faster encoding of large document collections, cloud GPU services such as AWS EC2, Google Cloud Platform, or Azure can be used; the model runs on any CUDA-capable device.
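A minimal sketch of GPU usage, assuming a CUDA device is available; the batch size and document list here are illustrative choices:

    import torch
    from sentence_transformers import SentenceTransformer

    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    model = SentenceTransformer('sentence-transformers/msmarco-bert-base-dot-v5', device=device)

    docs = ["Around 9 Million people live in London", "London is known for its financial district"]

    # Batch-encode on the GPU; batch_size is a tunable, illustrative value
    doc_emb = model.encode(docs, batch_size=128, convert_to_tensor=True, show_progress_bar=True)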

License

The msmarco-bert-base-dot-v5 model is provided under the Apache 2.0 License. You are free to use, modify, and distribute it within the terms of the license.
