Introduction

The all-MiniLM-L12-v1 model by Sentence-Transformers maps sentences and paragraphs to a 384-dimensional dense vector space, making it suitable for tasks such as clustering and semantic search. It is distributed through the sentence-transformers library and can also be used directly with Hugging Face Transformers.
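
As a quick illustration of semantic search (a minimal sketch; installation is covered in the guide below, and the query and corpus sentences are made up), the model's embeddings can be ranked by cosine similarity using the cos_sim utility from sentence-transformers:

    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer('sentence-transformers/all-MiniLM-L12-v1')

    # Illustrative corpus and query
    corpus = ["A man is eating food.", "A monkey is playing drums.", "The weather is nice today."]
    query = "Someone is having a meal."

    # Encode both, then rank corpus sentences by cosine similarity to the query
    corpus_embeddings = model.encode(corpus)
    query_embedding = model.encode(query)
    scores = util.cos_sim(query_embedding, corpus_embeddings)[0]
    best = scores.argmax().item()
    print(corpus[best], scores[best].item())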

Architecture

The model is based on the microsoft/MiniLM-L12-H384-uncased architecture: L12 denotes 12 transformer layers and H384 the 384-dimensional hidden size, which is also the dimensionality of the output embeddings. The checkpoint has been fine-tuned with a self-supervised contrastive learning objective so that the cosine similarity between sentence embeddings reflects semantic similarity.

Training

Pre-Training

The model starts from the pre-trained microsoft/MiniLM-L12-H384-uncased checkpoint released by Microsoft; the sentence-level training described below happens during fine-tuning.

Fine-Tuning

Fine-tuning used a contrastive learning objective on over 1 billion sentence pairs, maximizing the cosine similarity between true pairs relative to the other pairs in the batch. Training ran on a TPU v3-8 with a batch size of 1024, a learning rate of 2e-5, and the AdamW optimizer; input sequences were truncated to 128 tokens.
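
Concretely, the objective can be sketched as a cross-entropy loss over in-batch cosine similarities: each sentence should be closest to its true partner among all candidates in the batch. A minimal PyTorch sketch (the contrastive_loss name and the scale value are illustrative, not the exact training code):

    import torch
    import torch.nn.functional as F

    def contrastive_loss(anchor_emb, positive_emb, scale=20.0):
        # L2-normalize so dot products equal cosine similarities
        a = F.normalize(anchor_emb, p=2, dim=1)
        b = F.normalize(positive_emb, p=2, dim=1)
        scores = a @ b.T * scale  # (batch, batch) similarity matrix
        # Row i's true pair is on the diagonal; every other sentence in the
        # batch serves as a negative example
        labels = torch.arange(scores.size(0), device=scores.device)
        return F.cross_entropy(scores, labels)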

Training Data

The model was trained on a diverse collection of datasets, including Reddit comments, S2ORC citation pairs, WikiAnswers, and others, totaling over 1 billion sentence pairs.

Guide: Running Locally

To run the model locally, follow these steps:

  1. Install Dependencies:

    • For Sentence-Transformers:
      pip install -U sentence-transformers
      
    • For Hugging Face Transformers:
      pip install torch transformers
      
  2. Using Sentence-Transformers:

    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer('sentence-transformers/all-MiniLM-L12-v1')
    sentences = ["This is an example sentence", "Each sentence is converted"]
    embeddings = model.encode(sentences)
    print(embeddings)
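
    The output is a NumPy array of shape (2, 384): one 384-dimensional vector per input sentence.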
    
  3. Using Hugging Face Transformers:

    from transformers import AutoTokenizer, AutoModel
    import torch
    import torch.nn.functional as F
    
    def mean_pooling(model_output, attention_mask):
        # Average token embeddings, ignoring padding via the attention mask
        token_embeddings = model_output[0]  # first element holds all token embeddings
        input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
        return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
    
    tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-MiniLM-L12-v1')
    model = AutoModel.from_pretrained('sentence-transformers/all-MiniLM-L12-v1')
    
    sentences = ["This is an example sentence", "Each sentence is converted"]
    encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
    with torch.no_grad():
        model_output = model(**encoded_input)
    
    # Pool token embeddings into one vector per sentence, then L2-normalize
    sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
    sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)
    print(sentence_embeddings)
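
    Up to numerical precision, these embeddings match the output of the sentence-transformers snippet above, since that library applies the same mean pooling and normalization steps internally.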
    
  4. Cloud GPUs: Encoding works on CPU, but a GPU substantially speeds up embedding large corpora. Cloud providers such as AWS, GCP, or Azure offer suitable GPU instances, as sketched below.
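
    As a minimal sketch (assuming a CUDA-enabled PyTorch install; the batch size is illustrative), sentence-transformers places the model on a GPU via the device argument:

    from sentence_transformers import SentenceTransformer

    # device='cuda' moves the model to the GPU; encoding batches run there too
    model = SentenceTransformer('sentence-transformers/all-MiniLM-L12-v1', device='cuda')
    sentences = ["This is an example sentence", "Each sentence is converted"]
    embeddings = model.encode(sentences, batch_size=256, show_progress_bar=True)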

License

The model is licensed under the Apache 2.0 License, allowing for both personal and commercial use.
