msmarco bert base dot v5
sentence-transformers
Introduction
msmarco-bert-base-dot-v5 is a SentenceTransformers model designed for semantic search. It maps sentences and paragraphs to a 768-dimensional dense vector space and was trained on 500K query-answer pairs from the MS MARCO dataset. The resulting embeddings can be used for sentence similarity and feature extraction, with relevance scored by dot product.
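Because relevance is scored with an unnormalized dot product, vector length influences the ranking as well as direction. The following is a minimal sketch of that effect, using made-up 3-dimensional vectors in place of real 768-dimensional embeddings; the helper names `dot_score`, `cos_score`, and `rank_by` are illustrative and are not part of any library.

```python
import math

def dot_score(a, b):
    """Unnormalized dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cos_score(a, b):
    """Cosine similarity: dot product divided by both vector lengths."""
    return dot_score(a, b) / (math.sqrt(dot_score(a, a)) * math.sqrt(dot_score(b, b)))

# Toy vectors (assumed values, not real model output).
query = [1.0, 0.0, 0.0]
docs = {
    "doc_a": [0.9, 0.1, 0.0],  # closely aligned with the query, but short
    "doc_b": [3.0, 2.0, 0.0],  # less aligned with the query, but much longer
}

def rank_by(score_fn):
    """Return document names ordered from highest to lowest score."""
    return sorted(docs, key=lambda name: score_fn(query, docs[name]), reverse=True)

print(rank_by(dot_score))  # ['doc_b', 'doc_a']
print(rank_by(cos_score))  # ['doc_a', 'doc_b']
```

The two scoring functions order the same documents differently, which is why a model tuned for dot product is normally paired with dot-product scoring at query time.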
Architecture
The model incorporates a BERT base architecture with a mean pooling layer:
- Transformer Layer: Uses bert-base-uncased with a max sequence length of 512.
- Pooling Layer: Applies mean pooling over the token embeddings to produce sentence embeddings.
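The pooling step can be illustrated without any model at all. The sketch below averages per-token vectors for one sentence and skips padded positions, assuming an attention mask where 1 marks a real token and 0 marks padding; `mean_pool` is a hypothetical helper and the numbers are invented.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose mask entry is 1; ignore padding."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            total = [t + v for t, v in zip(total, vec)]
            count += 1
    # max(..., 1) guards against division by zero for an all-padding input.
    return [t / max(count, 1) for t in total]

# Three token positions with 2-dimensional vectors; the last one is padding.
tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]

sentence_embedding = mean_pool(tokens, mask)
print(sentence_embedding)  # [2.0, 3.0]
```

The padded vector does not affect the result, so a sentence gets the same embedding regardless of how much padding its batch required.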
Training
The model is trained using the following parameters:
- DataLoader: Batch size of 64, with 7858 batches per epoch.
- Loss Function: MarginMSELoss from SentenceTransformers.
- Optimizer: AdamW with a learning rate of 1e-5 and weight decay of 0.01.
- Epochs: 30 epochs, with 10000 warmup steps.
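To make the loss function concrete, here is a minimal sketch of the margin-MSE idea in plain Python: for each (query, positive, negative) triplet, the difference between the model's two dot-product scores is pushed toward a target margin, which is typically supplied by a stronger teacher model. This is an illustration of the objective, not the SentenceTransformers implementation; `margin_mse` is a hypothetical name and all vectors and teacher margins are invented.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def margin_mse(queries, positives, negatives, teacher_margins):
    """Mean squared error between student margins and teacher margins."""
    errors = []
    for q, pos, neg, target in zip(queries, positives, negatives, teacher_margins):
        # Student margin: how much higher the positive scores than the negative.
        student_margin = dot(q, pos) - dot(q, neg)
        errors.append((student_margin - target) ** 2)
    return sum(errors) / len(errors)

# Two toy triplets with 2-dimensional embeddings (assumed values).
queries = [[1.0, 0.0], [0.0, 1.0]]
positives = [[2.0, 0.0], [0.0, 3.0]]
negatives = [[0.5, 1.0], [1.0, 1.0]]
teacher_margins = [1.5, 1.0]

loss = margin_mse(queries, positives, negatives, teacher_margins)
print(loss)  # 0.5
```

The first triplet already matches its teacher margin and contributes no error; the second overshoots by 1.0, giving a mean squared error of 0.5.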
Guide: Running Locally
To run this model locally, you need Python and the SentenceTransformers library.
Basic Steps
- Install SentenceTransformers:
    pip install -U sentence-transformers
- Load and Use the Model:
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer('sentence-transformers/msmarco-bert-base-dot-v5')

    query = "How many people live in London?"
    docs = ["Around 9 Million people live in London", "London is known for its financial district"]

    # Encode the query and the documents
    query_emb = model.encode(query)
    doc_emb = model.encode(docs)

    # Dot-product score between the query and each document
    scores = util.dot_score(query_emb, doc_emb)[0].cpu().tolist()

    # Sort documents by descending score and print them
    doc_score_pairs = sorted(zip(docs, scores), key=lambda x: x[1], reverse=True)
    for doc, score in doc_score_pairs:
        print(score, doc)
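- Alternative using Transformers Library:
    from transformers import AutoTokenizer, AutoModel
    import torch

    tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/msmarco-bert-base-dot-v5")
    model = AutoModel.from_pretrained("sentence-transformers/msmarco-bert-base-dot-v5")

    def encode(texts):
        encoded_input = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
        with torch.no_grad():
            model_output = model(**encoded_input, return_dict=True)
        # Mean pooling over real tokens only: mask out padding before averaging
        token_embeddings = model_output.last_hidden_state
        mask = encoded_input['attention_mask'].unsqueeze(-1).float()
        return (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)

    query = "How many people live in London?"
    docs = ["Around 9 Million people live in London", "London is known for its financial district"]

    query_emb = encode(query)
    doc_emb = encode(docs)

    # Dot-product score between the query and each document
    scores = torch.mm(query_emb, doc_emb.transpose(0, 1))[0].cpu().tolist()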
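Both routes end with a list of documents and a parallel list of scores. As a small follow-up, the sketch below shows one way to keep only the best matches; `top_k` is a hypothetical helper, and the scores are placeholder values rather than real model output.

```python
def top_k(docs, scores, k=3):
    """Return the k highest-scoring (doc, score) pairs, best first."""
    if len(docs) != len(scores):
        raise ValueError("docs and scores must have the same length")
    ranked = sorted(zip(docs, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

docs = [
    "London is known for its financial district",
    "Around 9 Million people live in London",
]
scores = [8.25, 12.5]  # placeholder values, not real model output

best = top_k(docs, scores, k=1)
for doc, score in best:
    print(f"{score:.2f}\t{doc}")
```

With a real corpus, `scores` would come from the dot-product step in either snippet, and `k` would be chosen to match how many results the application displays.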
Cloud GPUs
Encoding large document collections is typically much faster on a GPU. If no local GPU is available, cloud GPU instances from providers such as AWS EC2, Google Cloud Platform, or Azure are an option.
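On a GPU instance, the model has to be placed on the GPU to benefit from it. The following sketch picks a device and falls back to the CPU when PyTorch or CUDA is unavailable; `pick_device` and `load_model` are illustrative helper names, while `torch.cuda.is_available()` and the `device` argument of `SentenceTransformer` are existing library features.

```python
def pick_device():
    """Return "cuda" when PyTorch reports a usable GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed, so only the CPU can be reported.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"

def load_model(device):
    """Load the model on the given device (downloads weights on first use)."""
    # Imported lazily so that pick_device() can be used on its own.
    from sentence_transformers import SentenceTransformer
    return SentenceTransformer("sentence-transformers/msmarco-bert-base-dot-v5", device=device)

device = pick_device()
print(device)
```

A typical follow-up would be `model = load_model(device)` and then `model.encode(docs, batch_size=64)`; the batch size shown here is only an example and may need tuning to the available GPU memory.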
License
The MSMARCO-BERT-BASE-DOT-V5 model is provided under the Apache 2.0 License. You are free to use, modify, and distribute it within the terms of the license.