nomic-embed-vision-v1.5

nomic-ai

Introduction

The nomic-embed-vision-v1.5 model is a high-performing vision embedding model designed to operate within the same embedding space as nomic-embed-text-v1.5. It is part of the Nomic Embed series, which has expanded to become multimodal, allowing text and image data to be embedded and compared directly.
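
Because the vision and text models share one embedding space, text-to-image retrieval reduces to a dot product of L2-normalized vectors. A self-contained sketch of that comparison, using random stand-in vectors where the real embeddings (queries from nomic-embed-text-v1.5, images from nomic-embed-vision-v1.5) would go:

```python
import torch
import torch.nn.functional as F

# Stand-in embeddings: in practice these come from the text and
# vision models; 768 is used here as an illustrative dimension.
torch.manual_seed(0)
text_emb = F.normalize(torch.randn(2, 768), p=2, dim=1)  # 2 query embeddings
img_embs = F.normalize(torch.randn(5, 768), p=2, dim=1)  # 5 image embeddings

# Cosine similarity of unit vectors is a plain matrix product.
sims = text_emb @ img_embs.T   # shape (2, 5): one row of scores per query
best = sims.argmax(dim=1)      # index of the best-matching image per query
print(sims.shape, best.tolist())
```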

Architecture

The model is based on a transformer architecture and is distributed for use with the transformers library as well as in ONNX format. It is specifically optimized for image feature extraction tasks. The embeddings it produces can be used for various applications, including multimodal retrieval scenarios.

Training

The training process aligns the vision embeddings with the text embeddings using a method similar to LiT (Locked-image Tuning, arXiv:2111.07991), but with the roles reversed: the text embedder remains locked during alignment while the vision tower is trained. The training code is available in the Contrastors repository.
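
The alignment objective can be sketched as a symmetric InfoNCE contrastive loss in which gradients are stopped at the text tower. This is an illustrative sketch of the general LiT-style recipe, not the exact loss from the Contrastors repository; the function name and temperature value are assumptions:

```python
import torch
import torch.nn.functional as F

def lit_alignment_loss(img_emb, txt_emb, temperature=0.07):
    """Contrastive (InfoNCE) loss aligning a trainable vision tower to a
    locked text tower, in the spirit of LiT (arXiv:2111.07991).

    img_emb, txt_emb: (batch, dim) embeddings of matching image/text pairs.
    """
    img = F.normalize(img_emb, p=2, dim=1)
    txt = F.normalize(txt_emb, p=2, dim=1).detach()  # text tower stays frozen
    logits = img @ txt.T / temperature               # (batch, batch) similarities
    targets = torch.arange(img.size(0))              # matching pairs on the diagonal
    # Symmetric cross-entropy over image->text and text->image directions.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2
```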

Guide: Running Locally

To run the nomic-embed-vision-v1.5 model locally, follow these steps:

  1. Install Required Libraries:

    • Ensure you have torch, transformers, pillow (which provides PIL), and requests installed. Use pip for installation:
      pip install torch transformers pillow requests
      
  2. Load the Model and Processor:

    • Use the transformers library to load the model and image processor:
      from transformers import AutoImageProcessor, AutoModel
      processor = AutoImageProcessor.from_pretrained("nomic-ai/nomic-embed-vision-v1.5")
      vision_model = AutoModel.from_pretrained("nomic-ai/nomic-embed-vision-v1.5", trust_remote_code=True)
      
  3. Process an Image:

    • Download an image and process it to obtain embeddings:
      from PIL import Image
      import requests
      url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
      image = Image.open(requests.get(url, stream=True).raw)
      inputs = processor(image, return_tensors="pt")
      
  4. Generate Embeddings:

    • Compute and normalize embeddings:
      import torch.nn.functional as F
      img_emb = vision_model(**inputs).last_hidden_state
      img_embeddings = F.normalize(img_emb[:, 0], p=2, dim=1)
      
  5. Suggested Cloud GPUs:

    • To enhance performance, consider using cloud GPUs available on platforms like AWS, Google Cloud, or Azure.
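
Putting steps 1–4 together, the guide above amounts to the following runnable script (it downloads the model from the Hugging Face Hub and fetches a sample COCO image, so network access is required):

```python
import requests
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

# Load the image processor and vision model (step 2).
processor = AutoImageProcessor.from_pretrained("nomic-ai/nomic-embed-vision-v1.5")
vision_model = AutoModel.from_pretrained(
    "nomic-ai/nomic-embed-vision-v1.5", trust_remote_code=True
)
vision_model.eval()

# Download and preprocess a sample image (step 3).
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(image, return_tensors="pt")

# Compute embeddings: take the first (CLS) token and L2-normalize (step 4).
with torch.no_grad():
    img_emb = vision_model(**inputs).last_hidden_state
img_embeddings = F.normalize(img_emb[:, 0], p=2, dim=1)
print(img_embeddings.shape)  # (1, embedding_dim)
```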

License

The nomic-embed-vision-v1.5 model is released under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license. This license allows for sharing and adaptation for non-commercial purposes, provided appropriate credit is given.
