openai/clip-vit-base-patch32
Introduction
The CLIP model, developed by OpenAI, was built to study what contributes to robustness in computer vision tasks and to test whether models can generalize to arbitrary image classification tasks in a zero-shot manner. It is primarily intended for research purposes, helping researchers understand model capabilities, biases, and constraints.
Architecture
CLIP uses a ViT-B/32 Transformer as its image encoder and a masked self-attention Transformer as its text encoder. The two encoders are trained with a contrastive loss to maximize the similarity of matching (image, text) pairs. This repository hosts the Vision Transformer variant.
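As a rough illustration of this dual-encoder design (a sketch, not code from the model card), the two encoders can be called separately through `get_image_features` and `get_text_features` in transformers, and the resulting embeddings compared with cosine similarity, which is the quantity the contrastive loss maximizes for matching pairs:

```python
import torch
import requests
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
texts = ["a photo of a cat", "a photo of a dog"]

with torch.no_grad():
    # Each modality is encoded independently into the shared embedding space.
    image_emb = model.get_image_features(**processor(images=image, return_tensors="pt"))
    text_emb = model.get_text_features(**processor(text=texts, padding=True, return_tensors="pt"))

# Cosine similarity between L2-normalized embeddings is what the
# contrastive objective maximizes for matching (image, text) pairs.
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
print(image_emb @ text_emb.T)  # shape (1, 2): one score per candidate caption
```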
Training
The model was trained on publicly available image-caption data, gathered by crawling the internet and combining pre-existing datasets such as YFCC100M. As a result, the data is more representative of internet-connected populations, which skew toward more developed nations. The dataset itself has not been released and is not intended for commercial use.
Guide: Running Locally
- Setup: Install the necessary packages, including transformers, PIL, and requests.
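For example, assuming a pip-based environment (PIL is installed via the `pillow` package, and `torch` is required as the model backend):

```
pip install transformers pillow requests torch
```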
- Load Model and Processor:
```python
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
```
- Process Input:
```python
from PIL import Image
import requests

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(
    text=["a photo of a cat", "a photo of a dog"],
    images=image,
    return_tensors="pt",
    padding=True,
)
```
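The processor combines the tokenized prompts and the preprocessed image into a single batch. As a quick sanity check (the exact keys and the 224x224 input resolution come from the processor configuration for ViT-B/32, so treat the shapes as indicative):

```python
print(inputs.keys())                  # input_ids, attention_mask, pixel_values
print(inputs["pixel_values"].shape)   # e.g. torch.Size([1, 3, 224, 224])
```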
- Run Model:
```python
outputs = model(**inputs)
logits_per_image = outputs.logits_per_image  # image-text similarity scores
probs = logits_per_image.softmax(dim=1)      # probabilities over the candidate texts
```
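To read the result, pair each candidate prompt with its probability. A minimal follow-up, assuming the two prompts used in the previous step:

```python
labels = ["a photo of a cat", "a photo of a dog"]
for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```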
- Cloud GPUs: For faster inference, consider using cloud GPU services such as AWS, GCP, or Azure.
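If a GPU is available (locally or on one of the cloud services above), the model and inputs can be moved onto it before inference. A minimal sketch assuming a CUDA device; `torch.no_grad()` disables gradient tracking for faster, lower-memory inference:

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
inputs = {k: v.to(device) for k, v in inputs.items()}

with torch.no_grad():
    outputs = model(**inputs)
    probs = outputs.logits_per_image.softmax(dim=1)
```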
License
The model is intended for research purposes only, not for any deployed or commercial use. It should not be applied to surveillance or facial recognition tasks, and its use should be limited to English-language use cases.