clip-ViT-L-14

sentence-transformers

Introduction
CLIP-ViT-L-14 is an Image & Text model that maps text and images to a shared vector space. It is part of the Sentence Transformers library, which is used for tasks like sentence similarity and feature extraction. The model is designed to support applications such as image search, zero-shot image classification, image clustering, and image deduplication.
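As an illustration of the image-search use case, the following is a minimal sketch that encodes a small image collection and retrieves the closest match for a text query with util.semantic_search. The file names and the query are placeholders; substitute your own images.

from sentence_transformers import SentenceTransformer, util
from PIL import Image

# Hypothetical image files; replace with paths to your own collection
image_paths = ['beach.jpg', 'city_street.jpg', 'forest_trail.jpg']

model = SentenceTransformer('clip-ViT-L-14')

# Encode the image collection once and keep the embeddings as the search corpus
img_embeddings = model.encode([Image.open(p) for p in image_paths], convert_to_tensor=True)

# Encode a free-form text query into the same vector space
query_embedding = model.encode('a sunny day at the seaside', convert_to_tensor=True)

# Retrieve the most similar images by cosine similarity
hits = util.semantic_search(query_embedding, img_embeddings, top_k=2)[0]
for hit in hits:
    print(image_paths[hit['corpus_id']], round(hit['score'], 3))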
Architecture
CLIP-ViT-L-14 pairs a Vision Transformer (ViT-L/14) image encoder with a Transformer text encoder, trained with CLIP (Contrastive Language–Image Pre-training) so that matching images and captions map to nearby points in a shared vector space. Because both modalities live in the same space, visual and textual data can be compared and combined directly, for example with cosine similarity.
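A quick way to see the shared space in practice is to check that an image embedding and a text embedding come out with the same dimensionality. A minimal sketch; the image file name is a placeholder:

from sentence_transformers import SentenceTransformer
from PIL import Image

model = SentenceTransformer('clip-ViT-L-14')

# Both encoders project into the same joint embedding space
img_emb = model.encode(Image.open('example.jpg'))   # hypothetical image file
txt_emb = model.encode('a short caption')

print(img_emb.shape, txt_emb.shape)  # same dimensionality, so the vectors are directly comparable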
Training
The model is trained for joint image–text tasks and reports a zero-shot top-1 accuracy of 75.4% on the ImageNet validation set. Smaller variants, clip-ViT-B-32 and clip-ViT-B-16, trade some zero-shot accuracy for speed, and a multilingual version supports text in 50+ languages for broader linguistic coverage.
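Zero-shot classification with CLIP amounts to encoding the image and one prompt per candidate label, then picking the label whose embedding is most similar to the image. A minimal sketch; the label set and image file are placeholders:

from sentence_transformers import SentenceTransformer, util
from PIL import Image

model = SentenceTransformer('clip-ViT-L-14')

labels = ['dog', 'cat', 'car']                      # hypothetical label set
prompts = [f'a photo of a {label}' for label in labels]

img_emb = model.encode(Image.open('photo.jpg'))     # hypothetical image file
label_embs = model.encode(prompts)

# The label with the highest cosine similarity to the image is the prediction
scores = util.cos_sim(img_emb, label_embs)[0]
print(labels[int(scores.argmax())])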
Guide: Running Locally
To use the CLIP-ViT-L-14 model locally, follow these steps:
- Install the Sentence Transformers library:

pip install sentence-transformers
- Load and use the model in Python:

from sentence_transformers import SentenceTransformer, util
from PIL import Image

# Load CLIP model
model = SentenceTransformer('clip-ViT-L-14')

# Encode an image
img_emb = model.encode(Image.open('two_dogs_in_snow.jpg'))

# Encode text descriptions
text_emb = model.encode(['Two dogs in the snow', 'A cat on a table', 'A picture of London at night'])

# Compute cosine similarities
cos_scores = util.cos_sim(img_emb, text_emb)
print(cos_scores)
- Consider using cloud GPUs: for better performance, especially with large image collections or heavy workloads, cloud GPU services such as AWS, Google Cloud, or Azure can speed up encoding considerably; see the batch-encoding sketch after this list.
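A rough sketch of batch-encoding a larger image collection on a GPU; the paths and batch size are placeholders, and on a CPU-only machine you can simply drop the device argument:

from sentence_transformers import SentenceTransformer
from PIL import Image

# Hypothetical list of image paths to index
image_paths = ['img_0001.jpg', 'img_0002.jpg', 'img_0003.jpg']

# Load the model on the GPU (use device='cpu' if no GPU is available)
model = SentenceTransformer('clip-ViT-L-14', device='cuda')

# Encode in batches; convert_to_tensor keeps the embeddings on the GPU for fast similarity search
img_embeddings = model.encode(
    [Image.open(p) for p in image_paths],
    batch_size=32,
    convert_to_tensor=True,
    show_progress_bar=True,
)
print(img_embeddings.shape)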
License
The usage and distribution of the CLIP-ViT-L-14 model are subject to the terms and conditions specified in its license. Refer to the Sentence Transformers repository for detailed licensing information.