CLIP ViT-Large-Patch14 Model Documentation

Introduction

The CLIP (Contrastive Language-Image Pre-training) model, developed by OpenAI, was built to study robustness in computer vision and to test how well models generalize to arbitrary image classification tasks in a zero-shot manner. It is intended primarily as a research tool for the AI research community to understand and explore these capabilities.

Architecture

CLIP uses a ViT-L/14 Transformer architecture as its image encoder and a masked self-attention Transformer as its text encoder. The two encoders are trained with a contrastive loss that maximizes the cosine similarity of matching image-text pairs.
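
As a rough illustration of how the two encoders interact, the sketch below reuses the image URL and captions from the guide further down and assumes the Hugging Face Transformers implementation of CLIP: each encoder produces an embedding, and the embeddings are compared with a temperature-scaled cosine similarity, mirroring the contrastive training objective.

    import torch
    import requests
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

    url = "http://images.cocodataset.org/val2017/000000039769.jpg"
    image = Image.open(requests.get(url, stream=True).raw)
    captions = ["a photo of a cat", "a photo of a dog"]
    inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)

    with torch.no_grad():
        # ViT-L/14 image encoder -> one embedding per image
        image_embeds = model.get_image_features(pixel_values=inputs["pixel_values"])
        # masked self-attention Transformer text encoder -> one embedding per caption
        text_embeds = model.get_text_features(
            input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
        )
        # L2-normalise and compare via a temperature-scaled dot product
        image_embeds = image_embeds / image_embeds.norm(dim=-1, keepdim=True)
        text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)
        logits_per_image = model.logit_scale.exp() * (image_embeds @ text_embeds.t())

    print(logits_per_image.softmax(dim=-1))  # probabilities over the two captions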

Training

The model was trained on publicly available image-caption datasets obtained from the internet, including YFCC100M. The training data reflects demographics that are more connected to the internet, often skewing towards younger, male users from developed nations. The dataset was not intended for commercial use, and content was filtered to exclude excessively violent and adult images.

Guide: Running Locally

Steps

  1. Install Dependencies: Ensure you have Python and PyTorch installed, then use pip to install the Hugging Face Transformers library along with Pillow and requests, which the steps below use to load an example image.

    pip install transformers pillow requests
    
  2. Load the Model: Use the Hugging Face Transformers library to load the pre-trained CLIP model.

    from transformers import CLIPProcessor, CLIPModel
    model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")
    
  3. Process Input: Prepare an image and text for the model.

    from PIL import Image
    import requests
    
    url = "http://images.cocodataset.org/val2017/000000039769.jpg"
    image = Image.open(requests.get(url, stream=True).raw)
    
    inputs = processor(text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True)
    
  4. Run Inference: Forward the inputs through the model and turn the image-text similarity logits into per-label probabilities. A short continuation that maps these probabilities back to the candidate labels is sketched after these steps.

    outputs = model(**inputs)
    logits_per_image = outputs.logits_per_image  # image-text similarity scores
    probs = logits_per_image.softmax(dim=1)  # probabilities over the candidate labels
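
The following is one possible continuation of the steps above; it assumes the `probs` tensor from step 4 and repeats the candidate captions passed to the processor in step 3, printing a probability per caption and the most likely match.

    # Map each probability back to its caption (batch size is 1 here)
    labels = ["a photo of a cat", "a photo of a dog"]
    for label, p in zip(labels, probs[0].tolist()):
        print(f"{label}: {p:.3f}")
    print("best match:", labels[probs.argmax(dim=1).item()])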
    

Cloud GPUs

For faster inference and training, consider using cloud GPU services such as AWS, Google Cloud, or Azure.
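
If a CUDA-capable GPU is available, a minimal sketch of moving the pipeline onto it might look like the following; it assumes the `model`, `processor`, and `inputs` objects created in the guide above.

    import torch

    # Assumes `model` and `inputs` from the guide above
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device)
    inputs = {k: v.to(device) for k, v in inputs.items()}

    with torch.no_grad():
        outputs = model(**inputs)
    probs = outputs.logits_per_image.softmax(dim=1)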

License

The CLIP model and its documentation are subject to OpenAI's terms and conditions. The dataset used for training is not intended for commercial use and was gathered under specific guidelines to filter inappropriate content.
