clip-vit-base-patch16 (openai)
Introduction
The CLIP model, developed by OpenAI, aims to enhance the robustness of computer vision models and to explore their generalization to zero-shot image classification tasks. It is a research artifact intended to help researchers understand these capabilities and to support interdisciplinary study of the potential impacts of such models.
Architecture
CLIP uses a ViT-B/16 Transformer architecture as its image encoder and a masked self-attention Transformer as its text encoder. The two encoders are trained with a contrastive loss that maximizes the similarity of matched image-text pairs relative to mismatched ones. This repository hosts the Vision Transformer (ViT-B/16) variant.
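As a minimal sketch of what this contrastive setup produces, the two encoders can be invoked separately through the Transformers get_image_features and get_text_features helpers, and their outputs compared by cosine similarity in the shared embedding space (the COCO image URL mirrors the example in the guide below):
import torch
from PIL import Image
import requests
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")

# Example image (two cats from COCO) and candidate captions
image = Image.open(requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)
image_inputs = processor(images=image, return_tensors="pt")
text_inputs = processor(text=["a photo of a cat", "a photo of a dog"], return_tensors="pt", padding=True)

# Each encoder maps its modality into the same embedding space
with torch.no_grad():
    image_embeds = model.get_image_features(**image_inputs)
    text_embeds = model.get_text_features(**text_inputs)

# After L2-normalization, cosine similarity is a plain dot product
image_embeds = image_embeds / image_embeds.norm(dim=-1, keepdim=True)
text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)
similarity = image_embeds @ text_embeds.T  # higher for the better-matching caption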
Training
CLIP was trained on publicly available image-caption data, gathered through a combination of web crawling and established datasets such as YFCC100M. The data predominantly represents internet-active demographics, skewing toward users in developed nations and toward younger, male users. The dataset was assembled to study robustness and generalizability in computer vision tasks; it is not intended to support commercial or deployed models.
Guide: Running Locally
To use CLIP with the Transformers library:
from PIL import Image
import requests
from transformers import CLIPProcessor, CLIPModel

# Load the pretrained model and its paired processor
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")

# Download an example image (two cats from COCO)
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Tokenize the candidate captions and preprocess the image in one call
inputs = processor(text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True)

outputs = model(**inputs)
logits_per_image = outputs.logits_per_image  # image-text similarity scores
probs = logits_per_image.softmax(dim=1)  # zero-shot label probabilities
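The resulting probabilities pair up with the text prompts in order, so a quick check (with this COCO image of two cats, the cat caption should dominate) is:
labels = ["a photo of a cat", "a photo of a dog"]
for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")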
Suggested Cloud GPUs
If local hardware is limited, consider cloud platforms such as AWS, Google Cloud, or Azure, which offer GPU instances well suited to deep learning experimentation.
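Assuming the variables from the guide above and a CUDA-capable instance, the same example runs on GPU after moving the model and inputs to the device; a minimal sketch:
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

# Move every input tensor to the same device as the model
inputs = {k: v.to(device) for k, v in inputs.items()}
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=1)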
License
The CLIP model and associated resources are intended for research purposes. Users should thoroughly test and validate any derivative within its specific deployment context before use. The dataset is not released for commercial use.