Introduction

CLIP-ViT-B-16 is an image-and-text model that maps images and text into a shared vector space. It is distributed through the Sentence Transformers library and is intended for sentence similarity and feature extraction tasks.

Architecture

CLIP-ViT-B-16 is built on OpenAI's CLIP architecture, which pairs a ViT-B/16 Vision Transformer image encoder with a Transformer text encoder and trains them to map both modalities into a shared vector space. Because image and text embeddings are directly comparable, the model supports applications such as image search, zero-shot image classification, clustering, and image deduplication.
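
As a minimal sketch of zero-shot image classification with this shared embedding space (the image path and the label list below are placeholders chosen for illustration), an image is scored against candidate captions and the highest cosine similarity is taken as the prediction:

    from sentence_transformers import SentenceTransformer, util
    from PIL import Image

    # Load the CLIP model
    model = SentenceTransformer('clip-ViT-B-16')

    # Candidate labels phrased as short captions (placeholders)
    labels = ['a photo of a dog', 'a photo of a cat', 'a photo of a car']

    # Embed the image and the label texts into the same vector space
    img_emb = model.encode(Image.open('example.jpg'))  # placeholder image path
    label_emb = model.encode(labels)

    # Cosine similarity between the image and each label; the best match is the prediction
    scores = util.cos_sim(img_emb, label_emb)
    print(labels[int(scores.argmax())])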

Training

CLIP models are trained by OpenAI with a contrastive objective on a large collection of image-text pairs, which is what gives them strong zero-shot capabilities. In zero-shot classification on the ImageNet validation set, CLIP-ViT-B-16 reaches a top-1 accuracy of 68.1%. A multilingual CLIP variant is also available in Sentence Transformers, with a text encoder covering more than 50 languages.
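
As an illustrative sketch of the multilingual setup (this assumes the clip-ViT-B-32-multilingual-v1 text encoder from the SBERT documentation, which is aligned with the clip-ViT-B-32 image encoder rather than with ViT-B-16; the image path is a placeholder), text in several languages can be compared against the same image embedding:

    from sentence_transformers import SentenceTransformer, util
    from PIL import Image

    # Image encoder and the multilingual text encoder aligned with it (assumed model names)
    img_model = SentenceTransformer('clip-ViT-B-32')
    text_model = SentenceTransformer('clip-ViT-B-32-multilingual-v1')

    # Embed one image and the same caption in three languages
    img_emb = img_model.encode(Image.open('two_dogs_in_snow.jpg'))  # placeholder image path
    text_emb = text_model.encode(['Two dogs in the snow',
                                  'Zwei Hunde im Schnee',
                                  'Deux chiens dans la neige'])

    # All three captions should score similarly against the image
    print(util.cos_sim(img_emb, text_emb))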

Guide: Running Locally

To use the CLIP-ViT-B-16 model locally, follow these steps:

  1. Install the Sentence Transformers library:
    pip install sentence-transformers
    
  2. Load the model and encode images and text:
    from sentence_transformers import SentenceTransformer, util
    from PIL import Image
    
    # Load CLIP model
    model = SentenceTransformer('clip-ViT-B-16')
    
    # Encode an image
    img_emb = model.encode(Image.open('two_dogs_in_snow.jpg'))
    
    # Encode text descriptions
    text_emb = model.encode(['Two dogs in the snow', 'A cat on a table', 'A picture of London at night'])
    
    # Compute cosine similarities
    cos_scores = util.cos_sim(img_emb, text_emb)
    print(cos_scores)
    
  3. Explore additional examples: the SBERT documentation covers further use cases such as image search and zero-shot image classification; a minimal image-search sketch follows below.
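
As a hedged sketch of text-to-image search with the same API (the image file names and the query string below are placeholders), a small image corpus is embedded once and then queried with free text via util.semantic_search:

    from sentence_transformers import SentenceTransformer, util
    from PIL import Image

    # Load the CLIP model
    model = SentenceTransformer('clip-ViT-B-16')

    # Embed a small image corpus up front (placeholder file names)
    image_paths = ['dog.jpg', 'cat.jpg', 'city.jpg']
    corpus_emb = model.encode([Image.open(p) for p in image_paths], convert_to_tensor=True)

    # Embed a free-text query and retrieve the closest images by cosine similarity
    query_emb = model.encode('a dog playing in the snow', convert_to_tensor=True)
    hits = util.semantic_search(query_emb, corpus_emb, top_k=2)[0]
    for hit in hits:
        print(image_paths[hit['corpus_id']], hit['score'])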

Note

For high-performance needs, consider using cloud GPUs from providers like AWS, Google Cloud, or Azure.

License

The use of CLIP-ViT-B-16 is subject to the license stated on its Hugging Face model card and to the license of the Sentence Transformers library. Users should review these terms and ensure compliance before deploying the model.