sentence-transformers/msmarco-distilbert-dot-v5

Introduction
MSMARCO-DISTILBERT-DOT-V5 is a Sentence-Transformers model that maps sentences and paragraphs to a 768-dimensional dense vector space and is optimized for semantic search. It was trained on 500K (query, answer) pairs from the MS MARCO dataset.
Architecture
The architecture pairs a transformer model, specifically DistilBertModel, with a pooling layer that applies mean pooling over the token embeddings to produce sentence embeddings. The model supports a maximum sequence length of 512 tokens and outputs 768-dimensional embeddings.
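As a quick check, this module layout and these dimensions can be inspected after loading the model through Sentence-Transformers. A minimal sketch (the printed details depend on your installed sentence-transformers version):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('sentence-transformers/msmarco-distilbert-dot-v5')

# Prints the module stack: a Transformer (DistilBertModel) followed by a Pooling layer
print(model)

# Expected values per the description above: 512 and 768
print(model.max_seq_length)
print(model.get_sentence_embedding_dimension())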
Training
The training process used a DataLoader with a batch size of 64 and the MarginMSELoss loss function. The model was trained for 30 epochs with the AdamW optimizer at a learning rate of 1e-05, a warmup phase of 10,000 steps, and a weight decay of 0.01.
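These hyperparameters map roughly onto the classic sentence-transformers fit API as sketched below. This is not the authors' actual training script: the starting checkpoint, the train_samples format ((query, positive, negative) triples labeled with a cross-encoder margin score, which is what MarginMSELoss expects), and the single toy example are all assumptions for illustration.

import torch
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Assumption: start from a plain DistilBERT checkpoint with mean pooling added automatically
model = SentenceTransformer('distilbert-base-uncased')

# Assumed data format for MarginMSELoss: (query, positive, negative) triples labeled with
# the margin CE(query, pos) - CE(query, neg) from a cross-encoder teacher. Real training
# would use the full MS MARCO training data, not a single example.
train_samples = [
    InputExample(
        texts=["how many people live in london",
               "Around 9 Million people live in London",
               "London is known for its financial district"],
        label=4.3,
    ),
]

train_dataloader = DataLoader(train_samples, shuffle=True, batch_size=64)
train_loss = losses.MarginMSELoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=30,
    warmup_steps=10_000,
    optimizer_class=torch.optim.AdamW,
    optimizer_params={'lr': 1e-05},
    weight_decay=0.01,
)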
Guide: Running Locally
To use the MSMARCO-DISTILBERT-DOT-V5 model locally, follow these steps:
- Install the Sentence-Transformers library:

  pip install -U sentence-transformers
- Use the model with Sentence-Transformers (a ranking example follows after these steps):

  from sentence_transformers import SentenceTransformer, util

  model = SentenceTransformer('sentence-transformers/msmarco-distilbert-dot-v5')

  query = "How many people live in London?"
  docs = ["Around 9 Million people live in London", "London is known for its financial district"]

  query_emb = model.encode(query)
  doc_emb = model.encode(docs)
  scores = util.dot_score(query_emb, doc_emb)[0].cpu().tolist()
- Use the model with Hugging Face Transformers:

  from transformers import AutoTokenizer, AutoModel
  import torch

  tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/msmarco-distilbert-dot-v5")
  model = AutoModel.from_pretrained("sentence-transformers/msmarco-distilbert-dot-v5")

  # Mean pooling: average the token embeddings, ignoring padding tokens
  def mean_pooling(model_output, attention_mask):
      token_embeddings = model_output.last_hidden_state
      input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
      return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

  def encode(texts):
      encoded_input = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
      with torch.no_grad():
          model_output = model(**encoded_input, return_dict=True)
      embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
      return embeddings
- Consider using cloud GPUs for faster processing, such as those provided by AWS or Google Cloud.
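For retrieval, the dot-product scores from the snippets above can be used directly to rank documents. A minimal, self-contained sketch reusing the example query and documents from step 2:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('sentence-transformers/msmarco-distilbert-dot-v5')

query = "How many people live in London?"
docs = ["Around 9 Million people live in London", "London is known for its financial district"]

# Encode and score with the dot product (this model was trained for dot-product similarity)
query_emb = model.encode(query)
doc_emb = model.encode(docs)
scores = util.dot_score(query_emb, doc_emb)[0].cpu().tolist()

# Rank documents by score, highest first
for doc, score in sorted(zip(docs, scores), key=lambda x: x[1], reverse=True):
    print(f"{score:.2f}\t{doc}")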
License
The MSMARCO-DISTILBERT-DOT-V5 model is released under the Apache 2.0 license. However, it was trained on the MS MARCO dataset, which has its own terms and conditions.