msmarco bert base dot v5 LLM Model

Introduction

msmarco-bert-base-dot-v5 is a SentenceTransformers model designed for semantic search. It maps sentences to a 768-dimensional vector space and was trained on 500K query-answer pairs from the MS MARCO dataset. The resulting embeddings can be used for sentence similarity scoring and feature extraction.

Architecture

The model consists of a BERT base transformer followed by a mean pooling layer:

Transformer Layer: Utilizes bert-base-uncased with a max sequence length of 512.
Pooling Layer: Employs mean pooling of token embeddings to generate sentence embeddings.
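
To make the pooling step concrete, here is a minimal sketch of mean pooling in plain Python. The function name `mean_pool`, the toy 3-dimensional vectors, and the mask layout (1 for a real token, 0 for padding) are illustrative assumptions; the real model averages 768-dimensional BERT token embeddings.

```python
from typing import List


def mean_pool(token_embeddings: List[List[float]], attention_mask: List[int]) -> List[float]:
    """Average the token vectors whose mask entry is 1 (padding is skipped)."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                total[i] += value
    # Guard against an all-padding input to avoid dividing by zero
    count = max(count, 1)
    return [value / count for value in total]


# Three token vectors; the last one is padding and must not affect the result
tokens = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0], [0.0, 0.0, 0.0]]
mask = [1, 1, 0]
sentence_embedding = mean_pool(tokens, mask)
print(sentence_embedding)  # [2.0, 3.0, 4.0]
```

Skipping padded positions matters when sentences of different lengths are batched together, since padding vectors would otherwise pull the average away from the real tokens.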

Training

The model is trained using the following parameters:

DataLoader: Batch size of 64 over 7858 iterations (64 × 7858 = 502,912 examples, consistent with the roughly 500K training pairs).
Loss Function: MarginMSELoss from SentenceTransformers.
Optimizer: AdamW with a learning rate of 1e-5 and weight decay of 0.01.
Epochs: 30 epochs, with 10,000 warmup steps.
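
As a rough illustration of the loss named in the list above, the following sketch computes a margin-MSE objective in plain Python: the student's score margin between a positive and a negative passage is regressed onto a teacher's margin. The function names, the toy 2-dimensional vectors, and the teacher margins are hypothetical; actual training used the `MarginMSELoss` class from SentenceTransformers on real embeddings.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def margin_mse(queries, positives, negatives, teacher_margins):
    """Mean squared error between student and teacher score margins."""
    errors = []
    for q, p, n, target in zip(queries, positives, negatives, teacher_margins):
        # Student margin: how much higher the positive scores than the negative
        student_margin = dot(q, p) - dot(q, n)
        errors.append((student_margin - target) ** 2)
    return sum(errors) / len(errors)


# Toy batch of two (query, positive, negative) triplets
queries = [[1.0, 0.0], [0.0, 1.0]]
positives = [[2.0, 0.0], [0.0, 3.0]]
negatives = [[1.0, 1.0], [1.0, 1.0]]
teacher_margins = [2.0, 2.0]  # hypothetical teacher score differences

loss = margin_mse(queries, positives, negatives, teacher_margins)
print(loss)  # 0.5
```

In the toy batch the first triplet has a student margin of 1.0 against a target of 2.0, and the second matches its target exactly, so the mean squared error is 0.5.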

Guide: Running Locally

To run this model locally, you need Python and the SentenceTransformers library.

Basic Steps

Install SentenceTransformers:
```
pip install -U sentence-transformers
```

Load and Use the Model:

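```python
from sentence_transformers import SentenceTransformer, util

# Load the model from the Hugging Face Hub
model = SentenceTransformer('sentence-transformers/msmarco-bert-base-dot-v5')

query = "How many people live in London?"
docs = ["Around 9 Million people live in London", "London is known for its financial district"]

# Encode the query and the documents into 768-dimensional vectors
query_emb = model.encode(query)
doc_emb = model.encode(docs)

# Dot-product score between the query and every document
scores = util.dot_score(query_emb, doc_emb)[0].cpu().tolist()

# Rank documents by decreasing score and print them
doc_score_pairs = sorted(zip(docs, scores), key=lambda x: x[1], reverse=True)
for doc, score in doc_score_pairs:
    print(score, doc)
```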
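
The `util.dot_score` call above ranks documents by dot product rather than cosine similarity, so vector length contributes to the score. The sketch below illustrates the difference with toy 2-dimensional vectors; the helper names and values are made up for illustration and are not part of the SentenceTransformers API.

```python
import math


def dot_score(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cos_sim(a, b):
    """Cosine similarity: the dot product of the length-normalized vectors."""
    return dot_score(a, b) / (math.sqrt(dot_score(a, a)) * math.sqrt(dot_score(b, b)))


query = [1.0, 2.0]
doc_a = [1.0, 2.0]  # same direction as the query
doc_b = [3.0, 6.0]  # same direction, three times longer

# The dot product rewards the longer vector
print(dot_score(query, doc_a), dot_score(query, doc_b))  # 5.0 15.0

# Cosine similarity ignores length, so both documents tie
print(math.isclose(cos_sim(query, doc_a), cos_sim(query, doc_b)))  # True
```

For this reason, scores from this model should be compared with the dot product, as in the example above, and not swapped for cosine similarity without checking the effect on ranking.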

Alternative using Transformers Library:

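```python
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/msmarco-bert-base-dot-v5")
model = AutoModel.from_pretrained("sentence-transformers/msmarco-bert-base-dot-v5")

query = "How many people live in London?"
docs = ["Around 9 Million people live in London", "London is known for its financial district"]

def encode(texts):
    # Tokenize with padding so sentences of different lengths share one batch
    encoded_input = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
    with torch.no_grad():
        model_output = model(**encoded_input, return_dict=True)
    # Mean pooling over real tokens only: mask out padding before averaging
    mask = encoded_input['attention_mask'].unsqueeze(-1).float()
    summed = (model_output.last_hidden_state * mask).sum(dim=1)
    return summed / mask.sum(dim=1).clamp(min=1e-9)

query_emb = encode(query)
doc_emb = encode(docs)

# Dot-product score between the query and every document
scores = torch.mm(query_emb, doc_emb.transpose(0, 1))[0].cpu().tolist()
```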

Cloud GPUs

Encoding large document collections is considerably faster on a GPU. If no local GPU is available, cloud GPU instances from providers such as AWS EC2, Google Cloud Platform, or Azure are an option.
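
On a GPU machine, the model can be placed on the GPU when it is loaded. The following is a minimal sketch, assuming PyTorch with CUDA support is installed on the instance; the helper names `pick_device` and `encode_corpus` and the batch size of 64 are illustrative choices, not recommendations from the model authors.

```python
def pick_device() -> str:
    """Return "cuda" if PyTorch can see a CUDA GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # PyTorch is normally installed together with sentence-transformers
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


def encode_corpus(docs, batch_size: int = 64):
    """Encode documents on the selected device (batch_size is an example value)."""
    # Imported here so that pick_device() can be tried without loading the library
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(
        "sentence-transformers/msmarco-bert-base-dot-v5",
        device=pick_device(),
    )
    return model.encode(docs, batch_size=batch_size, show_progress_bar=True)


print(pick_device())
```

Calling `encode_corpus(docs)` downloads the model on first use and returns one embedding per document. Larger batch sizes generally improve GPU throughput until memory runs out, so the value is worth tuning per instance type.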

License

The MSMARCO-BERT-BASE-DOT-V5 model is provided under the Apache 2.0 License. You are free to use, modify, and distribute it within the terms of the license.
