clip-vit-base-patch16

openai

Introduction

The CLIP model, developed by OpenAI, was created to study the robustness of computer vision models and to explore their ability to generalize to arbitrary image classification tasks in a zero-shot manner. It is intended as a research tool to help researchers better understand such models and to support interdisciplinary studies of AI's potential impacts.

Architecture

CLIP uses a ViT-B/16 Transformer architecture for image encoding and a masked self-attention Transformer for text encoding. The encoders are trained to maximize the similarity of image-text pairs using contrastive loss. This repository features the Vision Transformer variant.
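
For intuition, the two-encoder design can be illustrated by embedding an image and a set of captions separately and comparing the normalized embeddings with dot products, which is the quantity the contrastive objective pushes up for matching pairs. The sketch below is a minimal illustration using the get_image_features and get_text_features helpers from the Transformers CLIP implementation; the model's learned temperature (logit_scale) is omitted, so the values are raw cosine similarities rather than the logits returned by a full forward pass.

import torch
import requests
from PIL import Image
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")

image = Image.open(requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)
captions = ["a photo of a cat", "a photo of a dog"]

with torch.no_grad():
    # Each modality is encoded by its own Transformer, then projected into a shared embedding space
    image_emb = model.get_image_features(**processor(images=image, return_tensors="pt"))
    text_emb = model.get_text_features(**processor(text=captions, return_tensors="pt", padding=True))

# L2-normalize and take dot products: higher cosine similarity means a better image-text match
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
similarity = image_emb @ text_emb.T  # shape (1, 2): one row per image, one column per caption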

Training

CLIP was trained on publicly available image-caption data, gathered through a combination of web crawling and established datasets such as YFCC100M. The data predominantly represents internet-active demographics, potentially skewing towards developed nations and younger, male users. It was collected to help build robust, generalizable computer vision models; neither the dataset nor the model is intended for commercial or deployed use.

Guide: Running Locally

To use CLIP with the Transformers library:

from PIL import Image
import requests
from transformers import CLIPProcessor, CLIPModel

# Load the pretrained model and its matching processor
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")

# Download a sample image from the COCO validation set
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Encode the candidate captions and the image together, then score them
inputs = processor(text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
logits_per_image = outputs.logits_per_image  # image-text similarity scores
probs = logits_per_image.softmax(dim=1)      # label probabilities for the image
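
Here, probs contains one probability per candidate caption, in the order the captions were passed to the processor. As a lighter-weight alternative, the same checkpoint can be used through the zero-shot image classification pipeline; this is a minimal sketch assuming a Transformers release that includes the zero-shot-image-classification pipeline.

from transformers import pipeline

classifier = pipeline("zero-shot-image-classification", model="openai/clip-vit-base-patch16")
results = classifier(
    "http://images.cocodataset.org/val2017/000000039769.jpg",
    candidate_labels=["a photo of a cat", "a photo of a dog"],
)
print(results)  # list of {"label", "score"} dicts, sorted by score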

Suggested Cloud GPUs

While the code above runs on CPU, inference is considerably faster on a GPU. For experimentation at scale, consider cloud platforms such as AWS, Google Cloud, or Azure, which offer GPU instances suited to deep learning workloads.

License

The CLIP model and associated resources are intended for research purposes. Users should ensure thorough testing and validation within specific contexts before deploying any derivatives. The dataset is not released for commercial use.
