Introduction

CLIP-ViT-L-14 is a model that maps text and images into a shared vector space. It is distributed through the Sentence Transformers library, which also provides utilities for sentence similarity and feature extraction. Because images and texts live in the same embedding space, the model supports applications such as image search, zero-shot image classification, image clustering, and image deduplication.

Architecture

The CLIP-ViT-L-14 model combines a Vision Transformer (ViT) image encoder with the CLIP (Contrastive Language–Image Pretraining) training objective, so that images and texts are embedded into a shared vector space. Because both modalities land in the same space, an image and a caption can be compared directly, for example via cosine similarity.
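Comparing an image to a text in this shared space boils down to the cosine similarity of their embedding vectors (in Sentence Transformers, `util.cos_sim` does this). A minimal sketch of that computation, using hand-made toy vectors in place of actual CLIP embeddings:

```python
import numpy as np

def cos_sim(a, b):
    """Cosine similarity between two vectors: dot product over norms."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy "embeddings" standing in for model.encode(...) outputs.
img_vec = [0.6, 0.8, 0.0]
txt_match = [0.6, 0.8, 0.0]   # same direction  -> similarity 1.0
txt_other = [0.0, 0.0, 1.0]   # orthogonal      -> similarity 0.0

print(cos_sim(img_vec, txt_match))  # 1.0
print(cos_sim(img_vec, txt_other))  # 0.0
```

A higher score means the text is a better description of the image; real CLIP embeddings have hundreds of dimensions, but the comparison is identical.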

Training

The model achieves a zero-shot top-1 accuracy of 75.4% on the ImageNet validation set. Smaller variants, clip-ViT-B-32 and clip-ViT-B-16, trade some accuracy for faster inference, and a multilingual version supports 50+ languages, enabling cross-lingual text-to-image search.

Guide: Running Locally

To use the CLIP-ViT-L-14 model locally, follow these steps:

  1. Install the Sentence Transformers library:

    pip install sentence-transformers
    
  2. Load and use the model in Python:

    from sentence_transformers import SentenceTransformer, util
    from PIL import Image
    
    # Load CLIP model
    model = SentenceTransformer('clip-ViT-L-14')
    
    # Encode an image
    img_emb = model.encode(Image.open('two_dogs_in_snow.jpg'))
    
    # Encode text descriptions
    text_emb = model.encode(['Two dogs in the snow', 'A cat on a table', 'A picture of London at night'])
    
    # Compute cosine similarities
    cos_scores = util.cos_sim(img_emb, text_emb)
    print(cos_scores)
    
  3. Optional: use a cloud GPU. For large datasets or batch encoding, cloud-based GPU services such as AWS, Google Cloud, or Azure speed up inference considerably.
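Zero-shot image classification builds directly on the similarity scores computed above: encode one candidate caption per class, then treat a softmax over the image-text cosine similarities as class probabilities. A sketch of that final step, with made-up similarity scores standing in for the output of `util.cos_sim(img_emb, text_emb)`:

```python
import numpy as np

def zero_shot_probs(cos_scores, temperature=100.0):
    """Turn image-text cosine similarities into class probabilities.

    CLIP-style zero-shot classification scales the similarities by a
    temperature (100 is a common choice matching CLIP's logit scale)
    and applies a softmax.
    """
    logits = temperature * np.asarray(cos_scores, dtype=float)
    exp = np.exp(logits - logits.max())  # subtract max for numerical stability
    return exp / exp.sum()

# Hypothetical cosine scores for one image against three captions,
# e.g. 'Two dogs in the snow', 'A cat on a table', 'London at night'.
scores = [0.28, 0.21, 0.05]
probs = zero_shot_probs(scores)
print(probs.argmax())  # index of the best-matching caption -> 0
```

In real use, `scores` would come from encoding the image and the candidate captions with the model above; the predicted class is simply the caption with the highest probability.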

License

The usage and distribution of the CLIP-ViT-L-14 model are subject to the terms and conditions specified in its license. Refer to the Sentence Transformers repository for detailed licensing information.
