all-MiniLM-L12-v1
sentence-transformers
Introduction
The all-MiniLM-L12-v1 model by Sentence-Transformers is designed to map sentences and paragraphs to a 384-dimensional dense vector space, making it suitable for tasks such as clustering and semantic search. It is implemented using the sentence-transformers library and can be easily integrated with Hugging Face Transformers.
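To make the semantic-search use case concrete, here is a minimal, dependency-free sketch of ranking documents by cosine similarity. The 3-dimensional vectors are toy stand-ins for the model's 384-dimensional embeddings, and the helper names `cosine_similarity` and `rank_by_similarity` are hypothetical, chosen for illustration only.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)

def rank_by_similarity(query_vec, corpus_vecs):
    """Return (index, score) pairs sorted from most to least similar."""
    scores = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(corpus_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Toy 3-dimensional stand-ins for real 384-dimensional sentence embeddings.
corpus = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
]
query = [0.9, 0.1, 0.0]

ranking = rank_by_similarity(query, corpus)
best_index, best_score = ranking[0]
print(best_index, round(best_score, 3))
```

In practice the vectors would come from the model's encoder, and a vectorized library would usually replace the pure-Python loops for large corpora.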
Architecture
The model is based on microsoft/MiniLM-L12-H384-uncased, fine-tuned with a self-supervised contrastive learning objective. It encodes input text into sentence embeddings that capture semantic information.
Training
Pre-Training
The model uses the pretrained microsoft/MiniLM-L12-H384-uncased checkpoint as its base. Details of the pre-training procedure are documented with that base model.
Fine-Tuning
Fine-tuning used a contrastive learning objective on over 1 billion sentence pairs, with the goal of maximizing the cosine similarity between true pairs. Training ran on a TPU v3-8 with a batch size of 1024, using the AdamW optimizer with a learning rate of 2e-5. Input sequences were limited to 128 tokens.
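One common way to realize such an objective is an in-batch contrastive loss: each sentence is scored against every candidate partner in the batch by cosine similarity, and cross-entropy rewards the true pair. The sketch below assumes that formulation. The text above does not spell out the exact loss, and the `scale` multiplier and the function name are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def in_batch_contrastive_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where the correct partner of anchors[i] is positives[i].

    `scale` is a hypothetical temperature-like multiplier, not a value from the text.
    """
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine_similarity(anchor, p) for p in positives]
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(logits)
        log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)

anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]  # true pairs point the same way
swapped = [[0.1, 0.9], [0.9, 0.1]]  # true pairs point in different directions

loss_aligned = in_batch_contrastive_loss(anchors, aligned)
loss_swapped = in_batch_contrastive_loss(anchors, swapped)
print(loss_aligned < loss_swapped)
```

The loss is lower when true pairs are the most similar entries in their rows, which is the behavior the objective is meant to encourage.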
Training Data
The model was trained on a diverse set of datasets, including Reddit comments, S2ORC citation pairs, WikiAnswers, and others, totaling over 1 billion sentence pairs.
Guide: Running Locally
To run the model locally, follow these steps:
- Install Dependencies:
  - For Sentence-Transformers:
    pip install -U sentence-transformers
  - For Hugging Face Transformers:
    pip install torch transformers
- Using Sentence-Transformers:

    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer('sentence-transformers/all-MiniLM-L12-v1')
    sentences = ["This is an example sentence", "Each sentence is converted"]

    embeddings = model.encode(sentences)
    print(embeddings)
- Using Hugging Face Transformers:

    from transformers import AutoTokenizer, AutoModel
    import torch
    import torch.nn.functional as F

    def mean_pooling(model_output, attention_mask):
        # The first element of model_output holds the per-token embeddings.
        token_embeddings = model_output[0]
        input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
        # Average over real tokens only; the clamp avoids division by zero.
        return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

    tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-MiniLM-L12-v1')
    model = AutoModel.from_pretrained('sentence-transformers/all-MiniLM-L12-v1')
    sentences = ["This is an example sentence", "Each sentence is converted"]

    # Tokenize the batch, padding to the longest sentence.
    encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

    # Compute token embeddings without tracking gradients.
    with torch.no_grad():
        model_output = model(**encoded_input)

    # Pool to one vector per sentence, then L2-normalize.
    sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
    sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)
    print(sentence_embeddings)
- Cloud GPUs: For enhanced performance, consider using cloud GPUs from providers like AWS, GCP, or Azure, which can handle larger datasets and provide faster computation speeds.
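The mean_pooling step in the Hugging Face Transformers example averages token embeddings while ignoring padding positions, and the result is then L2-normalized. As a rough illustration of that arithmetic for a single sentence, here is a dependency-free sketch. The names `masked_mean_pool` and `l2_normalize` are hypothetical, and the 2-dimensional vectors are toy values rather than real model output.

```python
import math

def masked_mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose mask entry is 1; padding (mask 0) is ignored."""
    dim = len(token_embeddings[0])
    totals = [0.0] * dim
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            for j, value in enumerate(vec):
                totals[j] += value
    # Lower bound on the count, analogous to the clamp in the tensor version.
    count = max(sum(attention_mask), 1e-9)
    return [t / count for t in totals]

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

# Two real tokens followed by one padding token with arbitrary values.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = masked_mean_pool(tokens, mask)
unit = l2_normalize(pooled)
print(pooled)  # [2.0, 3.0]
```

Because the padding vector is masked out, its large values do not affect the pooled result.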
License
The model is licensed under the Apache 2.0 License, allowing for both personal and commercial use.