clip-vit-base-patch32

openai

Introduction

The CLIP model, developed by OpenAI, is designed to enhance robustness in computer vision tasks and generalize to arbitrary image classification tasks in a zero-shot manner. It is primarily intended for research purposes, helping researchers understand model capabilities, biases, and constraints.
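
For a quick sense of what zero-shot classification looks like in practice, here is a minimal sketch (not from the model card) that assumes a recent transformers release providing the zero-shot-image-classification pipeline; the image path is a placeholder.

    from transformers import pipeline

    # Build a zero-shot image classification pipeline backed by CLIP
    classifier = pipeline("zero-shot-image-classification", model="openai/clip-vit-base-patch32")
    # "example.jpg" is a placeholder path; candidate labels are free-form text
    results = classifier("example.jpg", candidate_labels=["a photo of a cat", "a photo of a dog"])
    print(results)  # list of {"score": ..., "label": ...} entries, sorted by score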

Architecture

CLIP employs a ViT-B/32 Transformer as the image encoder and a masked self-attention Transformer as the text encoder. The two encoders are trained to maximize the similarity of matching (image, text) pairs via a contrastive loss. This repository hosts the Vision Transformer (ViT-B/32) variant.
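
The two encoders can also be queried separately. The following is a minimal sketch, not taken from the model card, that computes image and text embeddings with get_image_features / get_text_features and compares them with cosine similarity; the image path is a placeholder.

    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    image = Image.open("example.jpg")  # placeholder path
    text_inputs = processor(text=["a photo of a cat"], return_tensors="pt", padding=True)
    image_inputs = processor(images=image, return_tensors="pt")

    with torch.no_grad():
        text_emb = model.get_text_features(**text_inputs)    # projected text-encoder output
        image_emb = model.get_image_features(**image_inputs) # projected image-encoder output

    # Normalize and take the dot product: cosine similarity between image and text
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    similarity = (image_emb @ text_emb.T).item()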

Training

The model was trained on publicly available image-caption data gathered by crawling the internet and drawn from pre-existing datasets such as YFCC100M. This data is more representative of people and societies most connected to the internet, which tend to skew towards more developed nations. The training dataset has not been released for commercial use.

Guide: Running Locally

  1. Setup: Install the necessary Python packages: transformers, torch, Pillow (PIL), and requests.
  2. Load Model and Processor:
    from transformers import CLIPProcessor, CLIPModel
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
    
  3. Process Input:
    from PIL import Image
    import requests
    # Download an example image from the COCO validation set
    url = "http://images.cocodataset.org/val2017/000000039769.jpg"
    image = Image.open(requests.get(url, stream=True).raw)
    # Prepare paired text and image inputs as PyTorch tensors
    inputs = processor(text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True)
    
  4. Run Model:
    outputs = model(**inputs)
    logits_per_image = outputs.logits_per_image  # image-text similarity scores
    probs = logits_per_image.softmax(dim=1)  # softmax over the labels gives per-label probabilities
    
  5. Cloud GPUs: For faster inference, consider using cloud GPU services such as AWS, GCP, or Azure; a sketch of moving the model and inputs onto a GPU follows this list.
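
The following is a minimal sketch assuming a CUDA-capable GPU and the model, processor, and inputs defined in the steps above; it falls back to CPU when no GPU is available.

    import torch

    # Move the model and input tensors to the GPU if one is available
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device)
    inputs = {k: v.to(device) for k, v in inputs.items()}

    with torch.no_grad():
        outputs = model(**inputs)
    probs = outputs.logits_per_image.softmax(dim=1)
    print(probs)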

License

The model is intended for research purposes and not for any deployed or commercial use. It should not be applied to surveillance or facial recognition tasks, and its application should be limited to English language use cases.
