all-MiniLM-L6-v2
sentence-transformers
Introduction
The all-MiniLM-L6-v2 model is part of the Sentence Transformers project, mapping sentences and paragraphs to a 384-dimensional dense vector space. It is optimized for tasks like clustering and semantic search.
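To make the semantic search use case concrete, here is a minimal sketch of ranking documents by cosine similarity to a query. The three-dimensional vectors are made-up stand-ins for real 384-dimensional embeddings, and `cosine_similarity` and `rank_by_similarity` are illustrative helper names, not part of any library.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, corpus):
    """Return (index, score) pairs sorted from most to least similar."""
    scores = [(i, cosine_similarity(query, doc)) for i, doc in enumerate(corpus)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for model output (assumption: real embeddings
# from this model would have 384 dimensions, not 3).
corpus_embeddings = [
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.1],
]
query_embedding = [1.0, 0.0, 0.0]

ranking = rank_by_similarity(query_embedding, corpus_embeddings)
print(ranking)
```

With real embeddings, the same ranking step would apply unchanged; only the vectors would come from the model instead of being written by hand.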
Architecture
The model is based on the MiniLM architecture and encodes sentences and short paragraphs into dense vectors. Each embedding has 384 dimensions, which keeps storage and similarity computation inexpensive across natural language processing tasks.
Training
Pre-training
The model was initialized from the nreimers/MiniLM-L6-H384-uncased pre-trained model. It was then trained further on over 1 billion sentence pairs using a self-supervised contrastive learning objective, as described under Fine-tuning.
Fine-tuning
The model was fine-tuned with a contrastive learning approach: cosine similarities are computed between sentence pairs, and a cross-entropy loss is applied to those similarities. Training ran on a TPU v3-8 for 100k steps with a batch size of 1024, using the AdamW optimizer with a learning rate of 2e-5.
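The objective described above can be sketched in plain Python. This is an illustration under stated assumptions, not the project's training code: it assumes in-batch negatives (every other sentence in the batch serves as a negative for a given anchor), and the `scale` multiplier and the function name `contrastive_loss` are hypothetical choices made for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def contrastive_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy over a batch; row i's correct class is column i.

    anchors[i] and positives[i] form a true pair. Every positives[j] with
    j != i acts as an in-batch negative for anchors[i] (an assumption).
    `scale` is an assumed multiplier, not a value taken from the text.
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine_similarity(anchor, p) for p in positives]
        # Numerically stable log-sum-exp, then negative log-softmax of the true pair.
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(sum(math.exp(z - max_logit) for z in logits))
        total += log_sum_exp - logits[i]
    return total / len(anchors)


# Toy 2-D embeddings (assumption: real ones would be 384-dimensional).
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]  # each positive is close to its anchor
swapped = [[0.1, 0.9], [0.9, 0.1]]  # positives paired with the wrong anchor

loss_aligned = contrastive_loss(anchors, aligned)
loss_swapped = contrastive_loss(anchors, swapped)
print(loss_aligned, loss_swapped)
```

The loss is small when each anchor is most similar to its own pair and large when it is more similar to another sentence in the batch, which is the signal that pulls true pairs together during training.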
Training Data
The fine-tuning data consisted of a diverse set of datasets, including Reddit comments, S2ORC citation pairs, WikiAnswers, PAQ, and others, totaling over 1.17 billion sentence pairs.
Guide: Running Locally
Basic Steps
- Installation:
  - Install the Sentence Transformers library:
        pip install -U sentence-transformers
- Usage:
  - Use the model with Sentence Transformers:
        from sentence_transformers import SentenceTransformer

        model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
        sentences = ["This is an example sentence", "Each sentence is converted"]
        embeddings = model.encode(sentences)
        print(embeddings)
  - Alternatively, use Hugging Face Transformers:
        from transformers import AutoTokenizer, AutoModel
        import torch.nn.functional as F

        tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')
        model = AutoModel.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')
        # Tokenize, encode and process as shown in the detailed usage instructions.
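The Transformers route returns one vector per token, so a pooling step is needed to get a single sentence embedding. The sketch below assumes the processing is masked mean pooling followed by L2 normalization; the text above does not spell this out, so treat it as an assumption. It uses plain lists so the arithmetic is visible, and `mean_pool` and `l2_normalize` are illustrative names, not library functions.

```python
import math


def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, skipping positions where the mask is 0 (padding)."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            for k in range(dim):
                summed[k] += vector[k]
    count = max(sum(attention_mask), 1)  # guard against an all-padding input
    return [value / count for value in summed]


def l2_normalize(vector):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm > 0 else list(vector)


# Toy token vectors for one sentence (assumption: real ones are 384-dimensional).
token_embeddings = [
    [1.0, 3.0],
    [3.0, 1.0],
    [100.0, 100.0],  # padding position, excluded by the mask
]
attention_mask = [1, 1, 0]

sentence_embedding = l2_normalize(mean_pool(token_embeddings, attention_mask))
print(sentence_embedding)
```

With real model output, the same arithmetic would run on tensors: the token embeddings would come from the model's forward pass and the mask from the tokenizer output.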
Cloud GPUs
For optimal performance, especially with large datasets, consider using cloud-based GPUs like those available from AWS, Google Cloud, or Azure.
License
The all-MiniLM-L6-v2 model is licensed under the Apache-2.0 License.