CLIP-VIT-LARGE-PATCH14-336

Introduction

The CLIP-VIT-LARGE-PATCH14-336 model is OpenAI's CLIP (Contrastive Language-Image Pre-training) model with a ViT-Large vision encoder, intended for zero-shot image classification. It is compatible with both PyTorch and TensorFlow and integrates with the Hugging Face Transformers library.

Architecture

The model's architecture follows the Vision Transformer (ViT) approach: images are split into 14x14-pixel patches, and the 336 in the model name refers to the 336x336 input resolution. CLIP pairs this vision encoder with a text encoder, which is what enables joint image and text understanding. Beyond these basics, specific architectural details are not provided in the documentation.
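
A quick way to confirm these values without downloading the full weights is to load only the model configuration. This is a minimal sketch assuming the Hugging Face Transformers library is installed; the attribute names come from the standard CLIP configuration classes:

    from transformers import CLIPConfig

    # Fetches only config.json, not the model weights
    config = CLIPConfig.from_pretrained("openai/clip-vit-large-patch14-336")

    # Vision-encoder details such as patch size and input resolution
    print(config.vision_config.patch_size)   # patch size in pixels (14)
    print(config.vision_config.image_size)   # input resolution (336)
    print(config.vision_config.hidden_size)  # ViT-Large hidden width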

Training

The model was trained from scratch, although the dataset used remains unspecified. Training hyperparameters include:

  • Optimizer: None specified
  • Training precision: float32

The training utilized the following framework versions:

  • Transformers 4.21.3
  • TensorFlow 2.8.2
  • Tokenizers 0.12.1
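
To reproduce the reported environment, those versions can be pinned at install time. This is only a sketch; the TensorFlow pin is needed only for the TensorFlow workflow, and newer library versions generally work as well:

    pip install transformers==4.21.3 tokenizers==0.12.1 tensorflow==2.8.2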

Guide: Running Locally

  1. Installation: Ensure you have Python installed along with the required libraries. For the PyTorch workflow, install the Hugging Face Transformers library together with PyTorch and Pillow via pip:

    pip install transformers torch pillow
    
  2. Usage: Load the model using the Transformers library:

    from transformers import CLIPProcessor, CLIPModel
    
    model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14-336")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14-336")
    
  3. Inference: Prepare an image and a set of candidate text labels with the processor, then run the model to score each label against the image for zero-shot classification; a worked example is shown after this list.

  4. Cloud GPUs: For enhanced performance and speed, consider using cloud GPU services such as AWS EC2, Google Cloud Platform, or Azure.
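
The following is a minimal sketch of the inference step, assuming PyTorch and Pillow are installed; the image URL and the candidate labels are placeholders chosen for illustration:

    import requests
    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14-336")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14-336")

    # Example image; replace the URL or open a local file with Image.open(path)
    url = "http://images.cocodataset.org/val2017/000000039769.jpg"
    image = Image.open(requests.get(url, stream=True).raw)

    # Candidate labels for zero-shot classification (placeholders)
    labels = ["a photo of a cat", "a photo of a dog"]

    # The processor tokenizes the text and resizes/normalizes the image
    inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)

    # logits_per_image holds image-text similarity scores; softmax turns them into probabilities
    probs = outputs.logits_per_image.softmax(dim=1)
    for label, prob in zip(labels, probs[0].tolist()):
        print(f"{label}: {prob:.3f}")

The label with the highest probability is the model's zero-shot prediction for the image.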

License

The license for this model is not explicitly stated here. Users should consult the Hugging Face model card or contact the model provider for licensing details.
