msmarco-distilbert-base-tas-b
sentence-transformers

Introduction
The MSMARCO-DistilBERT-Base-TAS-B model is a version of the DistilBERT TAS-B model adapted for the sentence-transformers library. It is designed to map sentences and paragraphs to a 768-dimensional dense vector space, optimized for semantic search tasks.
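A quick way to see the output dimensionality (a minimal sketch; the sentence is an arbitrary example):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('sentence-transformers/msmarco-distilbert-base-tas-b')
embedding = model.encode("London is the capital of England")  # arbitrary example sentence
print(embedding.shape)  # (768,)
```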
Architecture
The model architecture includes:
- A Transformer module wrapping DistilBERT, with a maximum sequence length of 512 tokens and no lower-casing.
- A pooling layer that takes the CLS token as the sentence embedding, with a word embedding dimension of 768 (see the construction sketch after this list).
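This two-module stack can also be assembled explicitly from the library's building blocks. A minimal sketch, for illustration only, since the published checkpoint already ships this configuration pre-assembled:

```python
from sentence_transformers import SentenceTransformer, models

# Transformer module: DistilBERT weights with a 512-token window
word_embedding_model = models.Transformer(
    'sentence-transformers/msmarco-distilbert-base-tas-b',
    max_seq_length=512,
)

# Pooling module: use the CLS token as the 768-dim sentence embedding
pooling_model = models.Pooling(
    word_embedding_model.get_word_embedding_dimension(),  # 768
    pooling_mode_cls_token=True,
    pooling_mode_mean_tokens=False,
)

model = SentenceTransformer(modules=[word_embedding_model, pooling_model])
```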
Training
The model was trained on the MS MARCO passage dataset using Balanced Topic Aware Sampling (TAS-B) and fine-tuned for semantic search, enabling it to produce meaningful sentence embeddings for similarity and retrieval tasks.
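The full TAS-B recipe (balanced topic-aware batch sampling with teacher score distillation) is beyond the scope of this card, but the sentence-transformers library exposes the underlying Margin-MSE distillation loss. A rough, hypothetical sketch of such fine-tuning on a (query, positive, negative) triple, where the label is an invented teacher score margin:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer('sentence-transformers/msmarco-distilbert-base-tas-b')

# One (query, positive, negative) triple; the label is the teacher's
# score margin between positive and negative (value invented here)
train_examples = [
    InputExample(
        texts=["How many people live in London?",
               "Around 9 Million people live in London",
               "London is known for its financial district"],
        label=4.2,  # assumed margin from a cross-encoder teacher
    ),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=1)
train_loss = losses.MarginMSELoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=0)
```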
Guide: Running Locally
To use this model, you can follow these steps:
- Install dependencies:

```bash
pip install -U sentence-transformers
```
- Load and use the model:

```python
from sentence_transformers import SentenceTransformer, util

# Load the model
model = SentenceTransformer('sentence-transformers/msmarco-distilbert-base-tas-b')

# Encode the query and documents
query = "How many people live in London?"
docs = ["Around 9 Million people live in London",
        "London is known for its financial district"]
query_emb = model.encode(query)
doc_emb = model.encode(docs)

# Compute dot-product similarity scores
scores = util.dot_score(query_emb, doc_emb)[0].cpu().tolist()

# Output results, best match first
for doc, score in sorted(zip(docs, scores), key=lambda x: x[1], reverse=True):
    print(score, doc)
```
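For more than a handful of documents, the manual scoring loop can be replaced by the library's built-in retrieval helper. A short sketch using `util.semantic_search` with dot-product scoring, reusing the query and documents from the snippet above:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('sentence-transformers/msmarco-distilbert-base-tas-b')

query = "How many people live in London?"
docs = ["Around 9 Million people live in London",
        "London is known for its financial district"]

# Encode to tensors so the helper can batch and score efficiently
query_emb = model.encode(query, convert_to_tensor=True)
doc_emb = model.encode(docs, convert_to_tensor=True)

# Retrieve the top-k documents by dot-product score
hits = util.semantic_search(query_emb, doc_emb, top_k=2, score_function=util.dot_score)[0]
for hit in hits:
    print(hit['score'], docs[hit['corpus_id']])
```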
- Alternative method using Hugging Face Transformers:

```python
from transformers import AutoTokenizer, AutoModel
import torch

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/msmarco-distilbert-base-tas-b")
model = AutoModel.from_pretrained("sentence-transformers/msmarco-distilbert-base-tas-b")

# Encode text: CLS pooling keeps the embedding of the first token
def encode(texts):
    encoded_input = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
    with torch.no_grad():
        model_output = model(**encoded_input, return_dict=True)
    return model_output.last_hidden_state[:, 0]

# Query and documents (as in the previous example)
query = "How many people live in London?"
docs = ["Around 9 Million people live in London",
        "London is known for its financial district"]

# Calculate dot-product scores
query_emb = encode(query)
doc_emb = encode(docs)
scores = torch.mm(query_emb, doc_emb.transpose(0, 1))[0].cpu().tolist()

# Output results, best match first
for doc, score in sorted(zip(docs, scores), key=lambda x: x[1], reverse=True):
    print(score, doc)
```
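Both snippets should yield the same ranking: the sentence-transformers pipeline for this model also pools by taking the CLS token, which is exactly what the manual `encode` function above does.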
- Cloud GPUs: To improve performance, consider using cloud-based GPU services such as AWS EC2, Google Cloud, or Azure (see the device sketch below).
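If a GPU is available, the model can be placed on it directly through the sentence-transformers API; a minimal sketch (the device string assumes a CUDA-capable machine):

```python
from sentence_transformers import SentenceTransformer

# Load the model onto the GPU; batch encoding is substantially faster there
model = SentenceTransformer('sentence-transformers/msmarco-distilbert-base-tas-b',
                            device='cuda')  # assumes a CUDA-capable GPU
embeddings = model.encode(["Around 9 Million people live in London"], batch_size=32)
```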
License
This model is released under the Apache-2.0 license; refer to the full license text for details.