sentence-transformers/msmarco-distilbert-dot-v5

Introduction
MSMARCO-DISTILBERT-DOT-V5 is a Sentence-Transformers model that maps sentences and paragraphs to a 768-dimensional dense vector space and is optimized for semantic search. It was trained on 500K (query, answer) pairs from the MS MARCO dataset.
Architecture
The architecture pairs a transformer model, specifically DistilBertModel, with a pooling layer that applies mean pooling over the token embeddings to produce sentence embeddings. The model supports a maximum sequence length of 512 tokens and outputs 768-dimensional embeddings.
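As a quick check, this module layout and these dimensions can be inspected after loading the model through Sentence-Transformers. A minimal sketch (the printed details depend on your installed sentence-transformers version):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('sentence-transformers/msmarco-distilbert-dot-v5')

# Prints the module stack: a Transformer (DistilBertModel) followed by a Pooling layer
print(model)

# Expected values per the description above: 512 and 768
print(model.max_seq_length)
print(model.get_sentence_embedding_dimension())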
Training
The training process used a DataLoader with a batch size of 64 and the MarginMSELoss loss function. The model was trained for 30 epochs with the AdamW optimizer at a learning rate of 1e-05, a warmup phase of 10,000 steps, and a weight decay of 0.01.
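These hyperparameters map roughly onto the classic sentence-transformers fit API as sketched below. This is not the authors' actual training script: the starting checkpoint, the train_samples format ((query, positive, negative) triples labeled with a cross-encoder margin score, which is what MarginMSELoss expects), and the single toy example are all assumptions for illustration.

import torch
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Assumption: start from a plain DistilBERT checkpoint with mean pooling added automatically
model = SentenceTransformer('distilbert-base-uncased')

# Assumed data format for MarginMSELoss: (query, positive, negative) triples labeled with
# the margin CE(query, pos) - CE(query, neg) from a cross-encoder teacher. Real training
# would use the full MS MARCO training data, not a single example.
train_samples = [
    InputExample(
        texts=["how many people live in london",
               "Around 9 Million people live in London",
               "London is known for its financial district"],
        label=4.3,
    ),
]

train_dataloader = DataLoader(train_samples, shuffle=True, batch_size=64)
train_loss = losses.MarginMSELoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=30,
    warmup_steps=10_000,
    optimizer_class=torch.optim.AdamW,
    optimizer_params={'lr': 1e-05},
    weight_decay=0.01,
)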
Guide: Running Locally
To use the MSMARCO-DISTILBERT-DOT-V5 model locally, follow these steps:
- Install the Sentence-Transformers library:

  pip install -U sentence-transformers
- Use the model with Sentence-Transformers (a ranking example follows after these steps):

  from sentence_transformers import SentenceTransformer, util

  model = SentenceTransformer('sentence-transformers/msmarco-distilbert-dot-v5')

  query = "How many people live in London?"
  docs = ["Around 9 Million people live in London", "London is known for its financial district"]

  query_emb = model.encode(query)
  doc_emb = model.encode(docs)
  scores = util.dot_score(query_emb, doc_emb)[0].cpu().tolist()
- Use the model with Hugging Face Transformers:

  from transformers import AutoTokenizer, AutoModel
  import torch

  tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/msmarco-distilbert-dot-v5")
  model = AutoModel.from_pretrained("sentence-transformers/msmarco-distilbert-dot-v5")

  # Mean pooling: average the token embeddings, ignoring padding tokens
  def mean_pooling(model_output, attention_mask):
      token_embeddings = model_output.last_hidden_state
      input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
      return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

  def encode(texts):
      encoded_input = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
      with torch.no_grad():
          model_output = model(**encoded_input, return_dict=True)
      embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
      return embeddings
- Consider using cloud GPUs for faster processing, such as those provided by AWS or Google Cloud.
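For retrieval, the dot-product scores from the snippets above can be used directly to rank documents. A minimal, self-contained sketch reusing the example query and documents from step 2:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('sentence-transformers/msmarco-distilbert-dot-v5')

query = "How many people live in London?"
docs = ["Around 9 Million people live in London", "London is known for its financial district"]

# Encode and score with the dot product (this model was trained for dot-product similarity)
query_emb = model.encode(query)
doc_emb = model.encode(docs)
scores = util.dot_score(query_emb, doc_emb)[0].cpu().tolist()

# Rank documents by score, highest first
for doc, score in sorted(zip(docs, scores), key=lambda x: x[1], reverse=True):
    print(f"{score:.2f}\t{doc}")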
License
The MSMARCO-DISTILBERT-DOT-V5 model is released under the Apache 2.0 license. However, it was trained on the MS MARCO dataset, which has its own terms and conditions.