nomic-embed-text-v1.5
nomic-ai

Introduction
nomic-embed-text-v1.5 is a text embedding model designed for a variety of NLP tasks. Its embedding space can be aligned with vision models, enabling multimodal applications. The model employs Matryoshka Representation Learning, which gives flexible embedding dimensionality: embeddings can be truncated to smaller sizes with minimal loss in quality.
Architecture
The model is based on a long-context BERT architecture with a 2048-token base context that extends to sequences of up to 8192 tokens. Matryoshka training lets its 768-dimensional embeddings be truncated (for example to 512, 256, 128, or 64 dimensions) with only a small drop in performance.
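Because of the Matryoshka objective, embeddings can be truncated after encoding rather than retrained at each size. A minimal sketch, assuming the `truncate_dim` option available in recent sentence-transformers releases (the 256-dimension target is an arbitrary choice):

```python
from sentence_transformers import SentenceTransformer

# Keep only the first 256 of the full 768 Matryoshka dimensions.
# truncate_dim requires sentence-transformers >= 2.7.
model = SentenceTransformer(
    "nomic-ai/nomic-embed-text-v1.5",
    trust_remote_code=True,
    truncate_dim=256,
)

embeddings = model.encode(["search_query: What is TSNE?"])
print(embeddings.shape)  # (1, 256) instead of (1, 768)
```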
Training
The training process involves a multi-stage pipeline starting with a long-context BERT model. Initially, it undergoes an unsupervised contrastive training phase using weakly related text pairs from various sources such as StackExchange and Quora. This is followed by a fine-tuning stage utilizing high-quality labeled datasets, focusing on data curation and hard-example mining.
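For intuition, the contrastive phase pulls matched text pairs together and pushes apart the other pairs in the batch. Below is a minimal PyTorch sketch of an InfoNCE-style in-batch loss; the temperature value and exact formulation are illustrative assumptions, not the recipe used for this model:

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(query_emb: torch.Tensor,
                              doc_emb: torch.Tensor,
                              temperature: float = 0.05) -> torch.Tensor:
    """Each query's positive is the document at the same batch index;
    every other document in the batch serves as a negative."""
    query_emb = F.normalize(query_emb, dim=-1)
    doc_emb = F.normalize(doc_emb, dim=-1)
    logits = query_emb @ doc_emb.T / temperature       # (batch, batch) similarities
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)
```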
Guide: Running Locally
- Install Dependencies:
  - Ensure Python and the necessary libraries are installed.
  - Use `pip` to install the SentenceTransformers and Transformers libraries (see the command below).
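For example, both libraries can be installed in one step (package names as published on PyPI):

```bash
pip install sentence-transformers transformers
```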
- Load the Model:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)
```
- Prepare Data:
  - Format input text with the appropriate task instruction prefix, such as `search_document:` or `search_query:` (see the example below).
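For instance, documents and queries each take a different prefix (the document text here is illustrative):

```python
documents = ["search_document: t-SNE is a dimensionality reduction technique."]
queries = ["search_query: What is TSNE?"]
```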
- Generate Embeddings:

```python
sentences = ["search_query: What is TSNE?"]
embeddings = model.encode(sentences)
```
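The resulting embeddings can then be compared, for example with the cosine-similarity helper from sentence-transformers (the document text is illustrative):

```python
from sentence_transformers import util

query_emb = model.encode(["search_query: What is TSNE?"])
doc_emb = model.encode(["search_document: t-SNE is a dimensionality reduction technique."])
print(util.cos_sim(query_emb, doc_emb))  # 1x1 tensor of similarity scores
```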
- Run on Cloud GPUs:
  - For better performance and faster processing, consider using cloud services such as AWS, Google Cloud, or Azure with GPU support (see the snippet below).
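On a GPU-backed instance, the model can be placed on the GPU explicitly; the `device` argument is standard sentence-transformers usage:

```python
from sentence_transformers import SentenceTransformer

# Load the model directly onto a CUDA device (requires an available GPU).
model = SentenceTransformer(
    "nomic-ai/nomic-embed-text-v1.5",
    trust_remote_code=True,
    device="cuda",
)
```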
License
The model is licensed under Apache 2.0, allowing for both commercial and non-commercial use with proper attribution.