msmarco-distilbert-dot-v5

sentence-transformers

Introduction

msmarco-distilbert-dot-v5 is a Sentence-Transformers model that maps sentences and paragraphs to a 768-dimensional dense vector space and is optimized for semantic search. It was trained on 500K (query, answer) pairs from the MS MARCO dataset and, as the "dot" in its name suggests, is intended to be scored with the dot product rather than cosine similarity.
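As a quick sanity check, encoding a single sentence yields one 768-dimensional vector (a minimal sketch, assuming the sentence-transformers package is installed):

    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer('sentence-transformers/msmarco-distilbert-dot-v5')
    embedding = model.encode("How many people live in London?")
    print(embedding.shape)  # (768,)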

Architecture

The architecture combines a transformer model (DistilBertModel) with a pooling layer that applies mean pooling to the token embeddings to produce sentence embeddings. It supports a maximum sequence length of 512 tokens and outputs 768-dimensional embeddings.
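The same two-module layout can be reproduced with the library's building blocks. The sketch below is for illustration only; the released checkpoint already ships with this configuration, and the base checkpoint named here is an assumption:

    from sentence_transformers import SentenceTransformer, models

    # Transformer module producing 768-dimensional token embeddings (assumed base checkpoint)
    word_embedding_model = models.Transformer('distilbert-base-uncased', max_seq_length=512)
    # Mean pooling over the token embeddings yields one 768-dimensional sentence embedding
    pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(),
                                   pooling_mode='mean')
    model = SentenceTransformer(modules=[word_embedding_model, pooling_model])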

Training

Training used a DataLoader with a batch size of 64 and the MarginMSELoss loss function. The model was trained for 30 epochs with the AdamW optimizer at a learning rate of 1e-05, a warm-up phase of 10,000 steps, and a weight decay of 0.01.
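A hedged reconstruction of such a training run with the Sentence-Transformers API is sketched below. The starting checkpoint and the example triple with its margin label are placeholders, not the actual training data; MarginMSELoss expects (query, positive, negative) triples labeled with a margin score, typically obtained from a cross-encoder:

    from torch.utils.data import DataLoader
    from sentence_transformers import SentenceTransformer, InputExample, losses

    model = SentenceTransformer('distilbert-base-uncased')  # assumed starting checkpoint

    # Placeholder training example: texts=[query, positive passage, negative passage],
    # label = margin score (e.g. from a cross-encoder), as required by MarginMSELoss
    train_examples = [
        InputExample(
            texts=["how many people live in london",
                   "Around 9 Million people live in London.",
                   "London is known for its financial district."],
            label=4.5,
        ),
    ]
    train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=64)
    train_loss = losses.MarginMSELoss(model)

    model.fit(
        train_objectives=[(train_dataloader, train_loss)],
        epochs=30,
        warmup_steps=10000,
        optimizer_params={'lr': 1e-05},
        weight_decay=0.01,
    )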

Guide: Running Locally

To use the msmarco-distilbert-dot-v5 model locally, follow these steps:

  1. Install the Sentence-Transformers library:

    pip install -U sentence-transformers
    
  2. Use the model with Sentence-Transformers:

    from sentence_transformers import SentenceTransformer, util
    
    model = SentenceTransformer('sentence-transformers/msmarco-distilbert-dot-v5')
    query = "How many people live in London?"
    docs = ["Around 9 Million people live in London", "London is known for its financial district"]
    query_emb = model.encode(query)
    doc_emb = model.encode(docs)
    scores = util.dot_score(query_emb, doc_emb)[0].cpu().tolist()
    
  3. Use the model with Hugging Face Transformers (a dot-product scoring example using this encode helper follows the list):

    from transformers import AutoTokenizer, AutoModel
    import torch
    
    tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/msmarco-distilbert-dot-v5")
    model = AutoModel.from_pretrained("sentence-transformers/msmarco-distilbert-dot-v5")
    
    # Mean pooling: average the token embeddings, weighted by the attention mask
    def mean_pooling(model_output, attention_mask):
        token_embeddings = model_output.last_hidden_state
        input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
        return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

    def encode(texts):
        encoded_input = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
        with torch.no_grad():
            model_output = model(**encoded_input, return_dict=True)
        embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
        return embeddings
    
  4. Consider using cloud GPUs for faster processing, such as those provided by AWS or Google Cloud.
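Because the model was trained for dot-product similarity (hence util.dot_score in step 2), the embeddings produced by the Transformers-only encode helper from step 3 can be scored the same way. A minimal sketch:

    import torch

    query_emb = encode("How many people live in London?")
    doc_emb = encode(["Around 9 Million people live in London",
                      "London is known for its financial district"])

    # Dot-product relevance scores between the query and each document
    scores = torch.mm(query_emb, doc_emb.transpose(0, 1))[0].tolist()
    print(scores)

If a GPU is available, SentenceTransformer will use it automatically; with the plain Transformers code, move the model and the tokenized inputs to the GPU (for example with .to('cuda')) before encoding.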

License

The msmarco-distilbert-dot-v5 model is released under the Apache 2.0 license. However, it was trained on the MS MARCO dataset, which has its own terms and conditions.
