nomic-embed-text-v1.5

nomic-ai

Introduction

Nomic-Embed-Text-V1.5 is a text embedding model designed for NLP tasks such as retrieval, classification, and clustering. Its embedding space can be aligned with Nomic's vision models, enabling multimodal (text-image) embeddings. The model employs Matryoshka Representation Learning, which allows embeddings to be truncated to smaller dimensionalities with a graceful trade-off in quality.
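The practical effect of Matryoshka training can be sketched as follows: the leading components of a full-size embedding are kept and the result is renormalized for cosine similarity. This is a minimal numpy illustration (the 768-dimensional vector below is random, standing in for a real model output; the official pipeline may apply an additional normalization step before slicing):

```python
import numpy as np

def truncate_embedding(embedding: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` components of a Matryoshka embedding
    and renormalize to unit length for cosine similarity."""
    truncated = embedding[:dim]
    return truncated / np.linalg.norm(truncated)

# Placeholder for a real 768-dim model embedding.
full = np.random.default_rng(0).standard_normal(768)
full /= np.linalg.norm(full)

small = truncate_embedding(full, 256)
print(small.shape)            # (256,)
print(np.linalg.norm(small))  # unit length
```

Because the training objective concentrates information in the leading dimensions, the truncated vector remains a usable embedding rather than an arbitrary slice.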

Architecture

The model is based on a long-context BERT architecture capable of handling sequences of up to 8192 tokens. Thanks to Matryoshka training, the output dimensionality can be reduced below the full 768 dimensions while retaining most of the embedding quality.

Training

The training process involves a multi-stage pipeline starting with a long-context BERT model. Initially, it undergoes an unsupervised contrastive training phase using weakly related text pairs from various sources such as StackExchange and Quora. This is followed by a fine-tuning stage utilizing high-quality labeled datasets, focusing on data curation and hard-example mining.
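The contrastive stages optimize an InfoNCE-style objective with in-batch negatives: each query is pulled toward its paired document and pushed away from the other documents in the batch. A schematic numpy version (the temperature value and batch contents are illustrative, not the actual training configuration):

```python
import numpy as np

def info_nce_loss(queries: np.ndarray, docs: np.ndarray, temperature: float = 0.05) -> float:
    """In-batch contrastive loss: row i of `queries` is paired with row i of
    `docs`; every other row serves as a negative. Inputs are L2-normalized."""
    logits = queries @ docs.T / temperature       # (B, B) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_probs.diagonal().mean())    # cross-entropy, labels 0..B-1

rng = np.random.default_rng(0)
q = rng.standard_normal((4, 8))
q /= np.linalg.norm(q, axis=1, keepdims=True)
# Documents close to their paired queries yield a low loss.
d = q + 0.01 * rng.standard_normal((4, 8))
d /= np.linalg.norm(d, axis=1, keepdims=True)
print(info_nce_loss(q, d))
```

Minimizing this loss drives paired texts together in embedding space, which is what makes the resulting embeddings useful for similarity search.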

Guide: Running Locally

  1. Install Dependencies:

    • Ensure Python and necessary libraries are installed.
    • Use pip to install the sentence-transformers package (e.g., pip install sentence-transformers), which also provides the underlying transformers library.
  2. Load the Model:

    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)
    
  3. Prepare Data:

    • Prefix each input with the task instruction the model was trained with: search_document: for passages to be indexed and search_query: for queries (classification: and clustering: are also supported for those tasks).
  4. Generate Embeddings:

    sentences = ["search_query: What is TSNE?"]
    embeddings = model.encode(sentences)
    
  5. Run on Cloud GPUs:

    • For better performance and faster processing, consider using cloud services like AWS, Google Cloud, or Azure with GPU support.
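Under the hood, model.encode runs the transformer and then pools the token embeddings into a single sentence vector; for this model the pooling is mean pooling over non-padding tokens followed by L2 normalization. The pooling step can be sketched in isolation (the token embeddings below are random placeholders for real model outputs):

```python
import numpy as np

def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token embeddings, ignoring padding positions, then L2-normalize."""
    mask = attention_mask[..., None].astype(float)   # (B, T, 1)
    summed = (token_embeddings * mask).sum(axis=1)   # (B, D)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)   # avoid divide-by-zero
    pooled = summed / counts
    return pooled / np.linalg.norm(pooled, axis=1, keepdims=True)

rng = np.random.default_rng(0)
tokens = rng.standard_normal((2, 5, 8))              # batch of 2, 5 tokens, dim 8
mask = np.array([[1, 1, 1, 0, 0],                    # first sequence is padded
                 [1, 1, 1, 1, 1]])
sentence_embeddings = mean_pool(tokens, mask)
print(sentence_embeddings.shape)                     # (2, 8)
```

Because padding positions are masked out of the average, the embedding of a short text is unaffected by how much padding the batch adds.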

License

The model is licensed under Apache 2.0, allowing for both commercial and non-commercial use with proper attribution.
