clip-ViT-B-16
sentence-transformers
Introduction
CLIP-ViT-B-16 is an image-and-text model that maps text and images into a shared vector space. It is distributed through the Sentence Transformers library and is designed for tasks such as sentence similarity and feature extraction.
Architecture
CLIP-ViT-B-16 is built on the CLIP architecture, which integrates vision and language by mapping both modalities into a shared vector space. This allows for various applications such as image search, zero-shot image classification, clustering, and deduplication.
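As a concrete illustration of zero-shot classification in this shared space, the sketch below compares one image against a handful of candidate captions and keeps the best match; the file name photo.jpg and the label texts are placeholders chosen for the example, not part of the original model card.

from sentence_transformers import SentenceTransformer, util
from PIL import Image

# Zero-shot classification sketch; image path and candidate labels are placeholders
model = SentenceTransformer('clip-ViT-B-16')
labels = ['a photo of a dog', 'a photo of a cat', 'a photo of a city at night']
label_emb = model.encode(labels)
img_emb = model.encode(Image.open('photo.jpg'))

# The label with the highest cosine similarity to the image is the prediction
scores = util.cos_sim(img_emb, label_emb)
best = int(scores.argmax())
print(labels[best], float(scores[0][best]))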
Training
CLIP models, including CLIP-ViT-B-16, perform well on zero-shot tasks: the model reaches 68.1% top-1 accuracy on the ImageNet validation set. A multilingual version is also available, supporting more than 50 languages.
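For cross-lingual text, the pairing commonly described in the Sentence Transformers documentation is the multilingual text encoder clip-ViT-B-32-multilingual-v1 together with the clip-ViT-B-32 image model (note: it is aligned to ViT-B-32, not B-16). A minimal sketch under that assumption:

from sentence_transformers import SentenceTransformer, util
from PIL import Image

# Assumption: the multilingual text encoder is aligned with the ViT-B-32 image model
img_model = SentenceTransformer('clip-ViT-B-32')
text_model = SentenceTransformer('clip-ViT-B-32-multilingual-v1')

img_emb = img_model.encode(Image.open('two_dogs_in_snow.jpg'))

# The same caption in English, German, and Spanish
texts = ['Two dogs in the snow', 'Zwei Hunde im Schnee', 'Dos perros en la nieve']
text_emb = text_model.encode(texts)
print(util.cos_sim(img_emb, text_emb))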
Guide: Running Locally
To use the CLIP-ViT-B-16 model locally, follow these steps:
- Install the Sentence Transformers library:
pip install sentence-transformers
- Load the model and encode images and text:
from sentence_transformers import SentenceTransformer, util
from PIL import Image

# Load the CLIP model
model = SentenceTransformer('clip-ViT-B-16')

# Encode an image
img_emb = model.encode(Image.open('two_dogs_in_snow.jpg'))

# Encode text descriptions
text_emb = model.encode(['Two dogs in the snow', 'A cat on a table', 'A picture of London at night'])

# Compute cosine similarities between the image and each text
cos_scores = util.cos_sim(img_emb, text_emb)
print(cos_scores)
- Explore additional examples: see the SBERT documentation for more use cases such as image search and classification; a minimal image-search sketch follows below.
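For reference, the sketch below embeds a small image collection once and retrieves the closest image for a text query with the library's semantic_search utility. The image file names and the query are placeholders for this example.

from sentence_transformers import SentenceTransformer, util
from PIL import Image

model = SentenceTransformer('clip-ViT-B-16')

# Placeholder image files standing in for a real image collection
image_paths = ['beach.jpg', 'city.jpg', 'forest.jpg']
img_embs = model.encode([Image.open(p) for p in image_paths])

# Embed the text query in the same space and return the best-matching image
query_emb = model.encode('a sunny beach with palm trees')
hits = util.semantic_search(query_emb, img_embs, top_k=1)[0]
print(image_paths[hits[0]['corpus_id']], hits[0]['score'])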
Note
For high-performance needs, consider using cloud GPUs from providers like AWS, Google Cloud, or Azure.
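If a local or cloud GPU is available, the model can be placed on it through the device argument and inputs can be encoded in batches; the device string and batch size below are example settings, not requirements.

from sentence_transformers import SentenceTransformer

# Example settings: run on a CUDA GPU and encode in batches of 32
model = SentenceTransformer('clip-ViT-B-16', device='cuda')
emb = model.encode(['Two dogs in the snow', 'A cat on a table'], batch_size=32)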
License
The use of CLIP-ViT-B-16 is subject to the licensing terms provided by Hugging Face and the Sentence Transformers library. Users should ensure compliance with these licenses when deploying and using the model.