Introduction

The CLIP-ViT-B-32 model is distributed through the sentence-transformers library and maps both text and images into a shared vector space. This makes it useful for applications such as image search, zero-shot image classification, image clustering, and image deduplication; a small image-search sketch follows below.
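As a concrete illustration of the image-search use case, here is a minimal sketch that encodes a small image collection once and then ranks it against a free-text query with util.semantic_search. The file names and the query are placeholders, not assets that ship with the model.

    from sentence_transformers import SentenceTransformer, util
    from PIL import Image

    # Load the shared text/image encoder
    model = SentenceTransformer('clip-ViT-B-32')

    # Placeholder image files standing in for a real collection
    image_paths = ['beach.jpg', 'city_street.jpg', 'forest_trail.jpg']
    img_embs = model.encode([Image.open(p) for p in image_paths])

    # Encode a free-text query into the same space and rank the images by cosine similarity
    query_emb = model.encode('a sunny day at the seaside')
    hits = util.semantic_search(query_emb, img_embs, top_k=3)[0]

    for hit in hits:
        print(image_paths[hit['corpus_id']], round(hit['score'], 3))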

Architecture

CLIP-ViT-B-32 is a dual-encoder model: a Vision Transformer (ViT-B/32) encodes images while a Transformer text encoder handles text, and both project into the same embedding space so that cross-modal similarity can be computed directly. It is based on OpenAI's CLIP model, introduced in the paper "Learning Transferable Visual Models From Natural Language Supervision" and presented in the blog post "CLIP: Connecting Text and Images".
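A practical consequence of this design is that image and text embeddings have the same dimensionality and can be compared directly. The short check below assumes the 512-dimensional output of ViT-B/32 CLIP models; the image file name is a placeholder.

    from sentence_transformers import SentenceTransformer
    from PIL import Image

    model = SentenceTransformer('clip-ViT-B-32')

    # Both encoders project into the shared space, so the vectors match in size
    text_vec = model.encode('a photo of a dog')
    img_vec = model.encode(Image.open('example.jpg'))  # placeholder file name

    print(text_vec.shape, img_vec.shape)  # expected: (512,) (512,)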

Training

The model card does not restate the training setup, but CLIP models are trained with a contrastive objective on large collections of image–text pairs: matching pairs are pulled together in the shared embedding space while mismatched pairs are pushed apart. This alignment is what allows the model to perform tasks like zero-shot classification effectively.
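Zero-shot classification follows directly from that alignment: class names are written as short text prompts and the image is assigned to the most similar prompt. A minimal sketch, with illustrative prompts and a placeholder file name (the similarity scaling before the softmax is a common choice with CLIP, not something prescribed by this model card):

    import torch
    from sentence_transformers import SentenceTransformer, util
    from PIL import Image

    model = SentenceTransformer('clip-ViT-B-32')

    # Illustrative class prompts; any label set works the same way
    labels = ['a photo of a dog', 'a photo of a cat', 'a photo of a car']
    label_embs = model.encode(labels)
    img_emb = model.encode(Image.open('example.jpg'))  # placeholder file name

    # Cosine similarity between the image and each prompt, turned into a distribution
    scores = util.cos_sim(img_emb, label_embs)[0]
    probs = torch.softmax(scores * 100, dim=0)
    print(labels[int(probs.argmax())], [round(p, 3) for p in probs.tolist()])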

Guide: Running Locally

  1. Install Dependencies: Ensure Python is installed and use pip to install the sentence-transformers library:
    pip install sentence-transformers
    
  2. Load the Model: Use the following Python code to load and run the model:
    from sentence_transformers import SentenceTransformer, util
    from PIL import Image
    
    # Load CLIP model
    model = SentenceTransformer('clip-ViT-B-32')
    
    # Encode an image
    img_emb = model.encode(Image.open('two_dogs_in_snow.jpg'))
    
    # Encode text descriptions
    text_emb = model.encode(['Two dogs in the snow', 'A cat on a table', 'A picture of London at night'])
    
    # Compute cosine similarities
    cos_scores = util.cos_sim(img_emb, text_emb)
    print(cos_scores)
    
  3. Consider Cloud GPUs: For faster processing, especially with large datasets or intensive tasks, consider using cloud-based GPU services such as AWS EC2, Google Cloud, or Azure; a short GPU-oriented sketch follows this list.
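When a GPU is available, whether local or rented in the cloud, the model can be placed on it explicitly and images encoded in batches. The sketch below assumes a CUDA device and uses placeholder file names; omit the device argument to fall back to the default device.

    from sentence_transformers import SentenceTransformer
    from PIL import Image

    # Assumes a CUDA-capable GPU; use device='cpu' (or omit the argument) otherwise
    model = SentenceTransformer('clip-ViT-B-32', device='cuda')

    image_paths = ['img_001.jpg', 'img_002.jpg', 'img_003.jpg']  # placeholders
    images = [Image.open(p) for p in image_paths]

    # Batched encoding keeps the GPU busy; convert_to_tensor keeps results on the device
    embeddings = model.encode(images, batch_size=32, convert_to_tensor=True,
                              show_progress_bar=True)
    print(embeddings.shape)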

License

The code and models are provided under the Apache 2.0 License, which permits both personal and commercial use as long as the license terms, including notice preservation, are followed. For more details, refer to the licensing information on the Hugging Face model page.
