Introduction

The all-MiniLM-L6-v2 model is part of the Sentence Transformers project. It maps sentences and paragraphs to a 384-dimensional dense vector space and is suited to tasks such as clustering and semantic search, where texts are compared by the similarity of their vectors.
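
To illustrate how semantic search works once texts have been mapped to vectors, here is a minimal framework-free sketch. It ranks a small corpus by cosine similarity to a query. The helper names (`cosine_similarity`, `rank_by_similarity`) and the 3-dimensional toy vectors are assumptions made for illustration; real embeddings from this model would have 384 dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, corpus_vecs):
    """Return (index, score) pairs sorted from most to least similar."""
    scores = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(corpus_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy 3-dimensional vectors standing in for real 384-dimensional embeddings.
corpus = [[1.0, 0.0, 0.0], [0.7, 0.7, 0.0], [0.0, 0.0, 1.0]]
query = [0.9, 0.1, 0.0]

ranking = rank_by_similarity(query, corpus)
for index, score in ranking:
    print(index, round(score, 3))
```

The corpus entry that points in nearly the same direction as the query is ranked first, and the orthogonal one last.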

Architecture

The model is based on the MiniLM architecture. As the name of its base checkpoint (MiniLM-L6-H384-uncased) indicates, it has 6 transformer layers and a hidden size of 384, so each sentence or short paragraph is encoded as a 384-dimensional dense vector. The compact size keeps encoding fast and the embeddings cheap to store and compare.

Training

Pre-training

The model was initialized from the pre-trained nreimers/MiniLM-L6-H384-uncased checkpoint. All further training, on a dataset of over 1 billion sentence pairs with a self-supervised contrastive learning objective, belongs to the fine-tuning stage.

Fine-tuning

The model was fine-tuned with a contrastive objective: cosine similarities are computed between all possible sentence pairs in a batch, and a cross-entropy loss is applied against the true pairs. Training ran on a TPU v3-8 for 100k steps with a batch size of 1024, using the AdamW optimizer with a learning rate of 2e-5.
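
The contrastive objective can be sketched in plain Python as follows. This is an illustrative approximation, not the project's training code: the function name `in_batch_contrastive_loss`, the one-directional form of the loss, and the `scale` value of 20 are assumptions. Each anchor sentence is scored against every candidate in the batch, and cross-entropy treats the true partner (same index) as the correct class.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def in_batch_contrastive_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where the correct class for row i is column i.

    `scale` multiplies the cosine similarities before the softmax;
    the value 20.0 is an assumption chosen for illustration.
    """
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine_similarity(anchor, p) for p in positives]
        # Numerically stable log-sum-exp over all candidates in the batch.
        top = max(logits)
        log_sum_exp = top + math.log(sum(math.exp(l - top) for l in logits))
        # Cross-entropy for the true pair: -log softmax(logits)[i].
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)


anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # true partners are the most similar
swapped = [[0.1, 0.9], [0.9, 0.1]]   # true partners are the least similar

loss_aligned = in_batch_contrastive_loss(anchors, aligned)
loss_swapped = in_batch_contrastive_loss(anchors, swapped)
print(loss_aligned, loss_swapped)
```

The loss is close to zero when each anchor is most similar to its own partner and grows when another sentence in the batch scores higher, which is what pushes true pairs together and other pairs apart.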

Training Data

The fine-tuning data combined multiple datasets, including Reddit comments, S2ORC citation pairs, WikiAnswers, PAQ, and others, for a total of over 1.17 billion sentence pairs.

Guide: Running Locally

Basic Steps

  1. Installation:

    • Install the Sentence Transformers library:
      pip install -U sentence-transformers
      
  2. Usage:

    • Use the model with Sentence Transformers:

      from sentence_transformers import SentenceTransformer
      model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
      sentences = ["This is an example sentence", "Each sentence is converted"]
      embeddings = model.encode(sentences)
      print(embeddings)
      
    • Alternatively, use Hugging Face Transformers:

      from transformers import AutoTokenizer, AutoModel
      import torch.nn.functional as F
      tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')
      model = AutoModel.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')
      # Next: tokenize the sentences, run the model, mean-pool the token
      # embeddings using the attention mask, then L2-normalize with F.normalize.
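
When the model is loaded through Hugging Face Transformers, it returns one vector per token, so a pooling step is needed to obtain a single sentence embedding. The sketch below shows the arithmetic of attention-mask-aware mean pooling followed by L2 normalization for one sentence, using plain Python lists. The function names (`mean_pool`, `l2_normalize`) and the toy values are assumptions for illustration only.

```python
import math


def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors of one sentence, skipping padded positions."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    count = 0
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for j in range(dim):
                summed[j] += vec[j]
    count = max(count, 1)  # guard against a row that is all padding
    return [s / count for s in summed]


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


# Two real tokens plus one padding token (mask 0) that must be ignored.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = mean_pool(tokens, mask)
embedding = l2_normalize(pooled)
print(pooled)
print(embedding)
```

With PyTorch tensors the same computation is usually written as a masked sum of `last_hidden_state` over the sequence dimension divided by the sum of `attention_mask`, followed by `F.normalize(..., p=2, dim=1)`.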
      

Cloud GPUs

The model runs on a CPU, but encoding large datasets is considerably faster on a GPU. If no local GPU is available, consider cloud GPU instances such as those offered by AWS, Google Cloud, or Azure.
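
Whether the code runs locally or on a cloud instance, it helps to select the device at runtime rather than hard-coding it. A minimal sketch, assuming PyTorch is the backend; the helper name `pick_device` is hypothetical:

```python
def pick_device():
    """Return "cuda" when PyTorch reports a usable CUDA GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed, so fall back to the CPU label.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(device)
```

The returned string can be passed as the `device` argument when constructing `SentenceTransformer`. For large datasets, the `batch_size` argument of `encode` is also worth tuning to the available GPU memory.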

License

The all-MiniLM-L6-v2 model is licensed under the Apache-2.0 License.
