msmarco-distilbert-base-tas-b

sentence-transformers

Introduction

msmarco-distilbert-base-tas-b is a version of the DistilBERT TAS-B model adapted for the sentence-transformers library. It maps sentences and paragraphs to a 768-dimensional dense vector space and is optimized for semantic search.

Architecture

The model architecture includes:

  • A Transformer module based on DistilBERT, with a maximum sequence length of 512 tokens and lower-casing disabled.
  • A pooling layer that takes the output vector of the CLS token (the first token) as the sentence embedding, with an embedding dimension of 768.
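
To make the pooling step concrete, here is a minimal pure-Python sketch of CLS pooling on a toy batch. The helper name `cls_pool` and the 4-dimensional values are made up for illustration (the real model produces 768-dimensional vectors); this is not the library's implementation.

```python
def cls_pool(token_embeddings):
    """Return the first token's vector for each sequence in the batch.

    token_embeddings is a nested list shaped [batch][tokens][dimensions].
    """
    return [sequence[0] for sequence in token_embeddings]


# Toy batch: 2 sequences, 3 tokens each, 4-dimensional vectors (made-up values).
batch = [
    [[0.1, 0.2, 0.3, 0.4], [1.0, 1.0, 1.0, 1.0], [2.0, 2.0, 2.0, 2.0]],
    [[0.5, 0.6, 0.7, 0.8], [3.0, 3.0, 3.0, 3.0], [4.0, 4.0, 4.0, 4.0]],
]

# Each sentence embedding is simply the vector at token position 0.
sentence_embeddings = cls_pool(batch)
print(sentence_embeddings)
```

The Hugging Face Transformers example in the guide performs the same selection on a tensor with `last_hidden_state[:,0]`.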

Training

The model was trained on the MS MARCO dataset for semantic search: given a query, it produces an embedding that can be compared against passage embeddings to rank passages by relevance. The usage examples in this guide compare embeddings with the dot product.

Guide: Running Locally

To run the model locally, follow these steps:

  1. Install Dependencies:

    pip install -U sentence-transformers
    
  2. Load and Use the Model:

    from sentence_transformers import SentenceTransformer, util
    
    # Load the model
    model = SentenceTransformer('sentence-transformers/msmarco-distilbert-base-tas-b')
    
    # Encode sentences
    query = "How many people live in London?"
    docs = ["Around 9 Million people live in London", "London is known for its financial district"]
    query_emb = model.encode(query)
    doc_emb = model.encode(docs)
    
    # Compute similarity scores
    scores = util.dot_score(query_emb, doc_emb)[0].cpu().tolist()
    
    # Output results
    for doc, score in sorted(zip(docs, scores), key=lambda x: x[1], reverse=True):
        print(score, doc)
    
  3. Alternative Method Using Hugging Face Transformers:

    from transformers import AutoTokenizer, AutoModel
    import torch
    
    # Load tokenizer and model
    tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/msmarco-distilbert-base-tas-b")
    model = AutoModel.from_pretrained("sentence-transformers/msmarco-distilbert-base-tas-b")
    
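    # Encode text: CLS pooling takes the first token's hidden state
    def encode(texts):
        encoded_input = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
        with torch.no_grad():
            model_output = model(**encoded_input, return_dict=True)
        return model_output.last_hidden_state[:, 0]
    
    # Same query and documents as in step 2
    query = "How many people live in London?"
    docs = ["Around 9 Million people live in London", "London is known for its financial district"]
    
    # Calculate dot-product scores between the query and each document
    query_emb = encode(query)
    doc_emb = encode(docs)
    scores = torch.mm(query_emb, doc_emb.transpose(0, 1))[0].cpu().tolist()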
    
    # Output results
    for doc, score in sorted(zip(docs, scores), key=lambda x: x[1], reverse=True):
        print(score, doc)
    
  4. Cloud GPUs: Encoding large document collections is faster on a GPU. If no local GPU is available, consider a cloud GPU service such as AWS EC2, Google Cloud, or Azure.
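
The scoring in steps 2 and 3 comes down to a dot product between the query embedding and each document embedding, followed by a descending sort. The pure-Python sketch below shows that arithmetic on made-up 3-dimensional vectors. The helper names `dot` and `rank` are hypothetical and not part of sentence-transformers; the real embeddings have 768 dimensions.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def rank(query_emb, doc_embs, docs):
    """Score every document against the query and sort by score, highest first."""
    scored = [(dot(query_emb, emb), doc) for emb, doc in zip(doc_embs, docs)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Made-up embeddings for illustration only.
query_emb = [1.0, 0.0, 2.0]
doc_embs = [[0.5, 1.0, 0.0], [1.0, 0.0, 1.0]]
docs = ["doc A", "doc B"]

ranked = rank(query_emb, doc_embs, docs)
for score, doc in ranked:
    print(score, doc)
```

Here "doc B" ranks first because its dot product with the query (1.0 + 2.0 = 3.0) is larger than that of "doc A" (0.5).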

License

This model is licensed under the Apache-2.0 license. For more details, refer to the license documentation.
