Introduction

The sentence-transformers/all-mpnet-base-v1 model maps sentences and paragraphs to a 768-dimensional dense vector space. The resulting embeddings can be used for clustering, semantic search, and sentence similarity. The model is usable with several frameworks, including PyTorch and ONNX.
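
To illustrate how vectors in such a space can support semantic search, the following minimal sketch ranks a toy corpus by cosine similarity to a query vector. The 3-dimensional vectors and the helper names cosine_similarity and rank_by_similarity are illustrative assumptions; they stand in for real 768-dimensional embeddings produced by the model and are not part of any library.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, corpus_vecs):
    """Return (index, score) pairs sorted from most to least similar."""
    scores = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(corpus_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy 3-dimensional stand-ins for the model's 768-dimensional embeddings.
corpus = [
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.2],
    [0.7, 0.3, 0.1],
]
query = [1.0, 0.0, 0.0]

ranking = rank_by_similarity(query, corpus)
print([i for i, _ in ranking])  # prints [0, 2, 1]
```

With real embeddings, the same ranking step would be applied to the vectors returned by the model for the query and for each corpus sentence.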

Architecture

The model builds on the pretrained microsoft/mpnet-base model, fine-tuned with a contrastive learning objective on a dataset of more than 1 billion sentence pairs. It encodes each sentence into a vector that captures its semantic content, which supports tasks such as information retrieval and clustering.

Training

Pre-training

The base model, microsoft/mpnet-base, was pre-trained on diverse datasets. Details of the pre-training process are in its model card on Hugging Face.

Fine-tuning

The model was fine-tuned with a contrastive learning objective on a dataset of over 1 billion sentence pairs. Fine-tuning used 7 TPU v3-8 units, a batch size of 512, and a learning rate of 2e-5. The training data was drawn from multiple datasets, whose configuration is detailed in a data_config.json file.
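
As a rough illustration of what a contrastive objective over sentence pairs can look like, the sketch below implements one common formulation: each anchor is scored against every positive in the batch by cosine similarity, and a cross-entropy loss rewards the true pair. This is an assumed formulation for illustration, not a reproduction of the actual training code. The function name in_batch_contrastive_loss, the scale value of 20.0, and the 2-dimensional toy vectors are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def in_batch_contrastive_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy over rows of the scaled similarity matrix.

    Row i scores anchors[i] against every positive in the batch; the
    correct "class" for row i is column i (its true pair). `scale` is an
    assumed temperature-like factor, not a value taken from the source.
    """
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine_similarity(anchor, p) for p in positives]
        # Numerically stable log-sum-exp over the row.
        m = max(logits)
        log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)


# Toy batch of two pairs (hypothetical 2-dimensional "embeddings").
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]  # each positive is close to its own anchor
swapped = [[0.1, 0.9], [0.9, 0.1]]  # each positive is close to the wrong anchor

aligned_loss = in_batch_contrastive_loss(anchors, aligned)
swapped_loss = in_batch_contrastive_loss(anchors, swapped)
print(aligned_loss < swapped_loss)  # prints True
```

The loss is near zero when every anchor is most similar to its own pair and grows when another sentence in the batch scores higher, which is the behavior a contrastive objective is meant to encourage.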

Guide: Running Locally

Basic Steps

  1. Install Dependencies:

    pip install -U sentence-transformers
    
  2. Usage with sentence-transformers:

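    from sentence_transformers import SentenceTransformer

    sentences = ["This is an example sentence", "Each sentence is converted"]

    # Load the model from the Hugging Face Hub (downloaded on first use)
    model = SentenceTransformer('sentence-transformers/all-mpnet-base-v1')

    # Encode the sentences; each one becomes a 768-dimensional vector
    embeddings = model.encode(sentences)
    print(embeddings)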
    
  3. Usage with transformers (without sentence-transformers; pooling is applied manually):

    from transformers import AutoTokenizer, AutoModel
    import torch
    import torch.nn.functional as F

    # Mean pooling: average token embeddings, using the attention mask
    # so that padding tokens do not contribute to the average
    def mean_pooling(model_output, attention_mask):
        token_embeddings = model_output[0]  # first element holds all token embeddings
        input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
        return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

    sentences = ['This is an example sentence', 'Each sentence is converted']

    # Load the tokenizer and model from the Hugging Face Hub
    tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-mpnet-base-v1')
    model = AutoModel.from_pretrained('sentence-transformers/all-mpnet-base-v1')

    # Tokenize the sentences into a padded batch of PyTorch tensors
    encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

    # Compute token embeddings without tracking gradients
    with torch.no_grad():
        model_output = model(**encoded_input)

    # Pool token embeddings into one vector per sentence
    sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

    # L2-normalize so each embedding has unit length
    sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)
    print("Sentence embeddings:")
    print(sentence_embeddings)
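
To make the pooling and normalization in step 3 concrete, here is a small dependency-free sketch of the same two operations on plain Python lists. The names mean_pool and l2_normalize and the toy 2-dimensional token vectors are assumptions for illustration only; the real code operates on batched PyTorch tensors.

```python
import math


def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose mask is 1; padding (mask 0) is ignored."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            summed = [s + v for s, v in zip(summed, vec)]
    # Guard against division by zero, mirroring torch.clamp(..., min=1e-9).
    count = max(sum(attention_mask), 1e-9)
    return [s / count for s in summed]


def l2_normalize(vec, eps=1e-12):
    """Scale a vector to unit Euclidean length."""
    norm = max(math.sqrt(sum(v * v for v in vec)), eps)
    return [v / norm for v in vec]


# One toy "sentence": two real tokens followed by one padding token.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = mean_pool(tokens, mask)
unit = l2_normalize(pooled)
print(pooled)  # prints [2.0, 3.0]
```

The padding vector is excluded from the average because its mask entry is 0, and after normalization the dot product of two embeddings equals their cosine similarity.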
    

Suggestion: Cloud GPUs

If no local GPU is available, consider a cloud platform such as AWS, GCP, or Azure for GPU access to accelerate model inference and training.

License

This model is licensed under the Apache 2.0 License.
