CLIP ViT-Large Patch14-336 (openai/clip-vit-large-patch14-336)
Introduction
The openai/clip-vit-large-patch14-336 model is designed for zero-shot image classification. It is compatible with both PyTorch and TensorFlow and integrates with the Hugging Face Transformers library.
Architecture
The model follows the CLIP architecture: a Vision Transformer (ViT) image encoder with a 14x14 pixel patch size is paired with a Transformer text encoder, making the model suitable for joint image and text understanding. The 336 in the name refers to the 336x336 pixel input resolution. Beyond this, specific architectural details are not provided in the documentation.
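One way to confirm these values is to inspect the model configuration without downloading the full weights. The following is a minimal sketch using the CLIPConfig class from the Transformers library; the printed fields reflect attribute names in current Transformers releases:
from transformers import CLIPConfig

# Load only the configuration file (no model weights are downloaded)
config = CLIPConfig.from_pretrained("openai/clip-vit-large-patch14-336")
vision = config.vision_config
text = config.text_config

# Patch size and input resolution of the ViT image encoder
print("patch size:", vision.patch_size)        # expected: 14
print("image size:", vision.image_size)        # expected: 336
print("vision hidden size:", vision.hidden_size)
print("vision layers:", vision.num_hidden_layers)

# Text encoder width and shared projection dimension
print("text hidden size:", text.hidden_size)
print("projection dim:", config.projection_dim)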
Training
The model was trained from scratch, although the dataset used remains unspecified. Training hyperparameters include:
- Optimizer: None specified
- Training precision: float32
The training utilized the following framework versions:
- Transformers 4.21.3
- TensorFlow 2.8.2
- Tokenizers 0.12.1
Guide: Running Locally
- Installation: Ensure you have Python and the necessary libraries installed. You can install the Hugging Face Transformers library via pip:
pip install transformers
- Usage: Load the model and processor using the Transformers library:
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14-336")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14-336")
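Since the model is also usable from TensorFlow (see the Introduction and the framework versions above), a minimal sketch of loading the TensorFlow variant looks like this; it assumes TensorFlow weights are available in the repository, otherwise from_pt=True can be passed to convert the PyTorch checkpoint:
from transformers import TFCLIPModel, CLIPProcessor

# TensorFlow variant of the same checkpoint
model = TFCLIPModel.from_pretrained("openai/clip-vit-large-patch14-336")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14-336")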
- Inference: Prepare an image and a set of candidate text labels, encode both with the processor, and compare the resulting image and text embeddings to perform zero-shot classification, as in the sketch below.
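The following is a minimal sketch of zero-shot classification in PyTorch; the image path and the candidate labels are illustrative assumptions, not values from the model card:
from PIL import Image
import torch
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14-336")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14-336")

# Hypothetical local image and candidate labels, for illustration only
image = Image.open("example.jpg")
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

# The processor resizes the image to the model's input resolution and tokenizes the prompts
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; softmax turns them into label probabilities
probs = outputs.logits_per_image.softmax(dim=-1)
for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")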
- Cloud GPUs: For enhanced performance and speed, consider using cloud GPU services such as AWS EC2, Google Cloud Platform, or Azure, and move the model onto the GPU as in the sketch below.
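As a minimal sketch of GPU usage (assuming a CUDA-capable device and the PyTorch backend), the model and its inputs can be placed on the GPU before running inference:
import torch
from transformers import CLIPProcessor, CLIPModel

device = "cuda" if torch.cuda.is_available() else "cpu"

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14-336").to(device)
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14-336")

# Any tensors produced by the processor must be moved to the same device before
# calling the model, e.g. inputs = {k: v.to(device) for k, v in inputs.items()}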
License
The license information for this model is not explicitly provided here. Users should consult the Hugging Face model card or contact the model provider for licensing details.