OpenAI CLIP ViT-Large-Patch14 Model Documentation
Introduction
The CLIP (Contrastive Language-Image Pretraining) model was developed by OpenAI to study what makes computer vision models robust and to test how well they generalize to arbitrary image classification tasks in a zero-shot manner, that is, without task-specific training. It is primarily a research tool for exploring these capabilities within the AI research community.
Architecture
CLIP uses a ViT-L/14 Vision Transformer as its image encoder and a masked self-attention Transformer as its text encoder. The two encoders are trained jointly with a contrastive loss that maximizes the similarity of matching image-text pairs relative to the mismatched pairs in the same batch.
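To make the contrastive objective concrete, here is a minimal pure-Python sketch of one common formulation: a symmetric cross-entropy over the matrix of image-text cosine similarities. The source only says "contrastive loss", so the symmetric form, the fixed temperature of 0.07, the helper names, and the toy 2-D embeddings are all assumptions chosen for illustration, not the actual training code.

```python
import math

def log_softmax(row):
    """Numerically stable log-softmax of a list of floats."""
    m = max(row)
    log_sum = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_sum for x in row]

def normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric cross-entropy over an N x N image-text similarity matrix.

    Row i of image_embs and row i of text_embs are assumed to be a matching pair.
    The temperature value is an illustrative assumption.
    """
    images = [normalize(v) for v in image_embs]
    texts = [normalize(v) for v in text_embs]
    n = len(images)
    # logits[i][j] = cosine similarity of image i and text j, divided by the temperature
    logits = [[sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
              for img in images]
    # Image-to-text direction: each image should pick out its own caption.
    loss_i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image direction: each caption should pick out its own image.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_t2i = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return (loss_i2t + loss_t2i) / 2

# Toy 2-D embeddings (hypothetical values, not model outputs).
aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.1], [0.1, 1.0]])
swapped = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.1, 1.0], [1.0, 0.1]])
print(aligned < swapped)  # matching pairs on the diagonal give the lower loss
```

The loss is low when each image is most similar to its own caption and high when the captions are swapped, which is the behavior the training objective rewards.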
Training
The model was trained on publicly available image-caption data gathered from the internet, including existing datasets such as YFCC100M. Because the data comes from the internet, it over-represents the people and societies most connected to it, skewing toward younger, male users from developed nations. The dataset was not intended for commercial use, and content was filtered to exclude excessively violent and adult images.
Guide: Running Locally
Steps
- Install Dependencies: Ensure you have Python and PyTorch installed, then use pip to install the Hugging Face Transformers library. The later steps also import Pillow and requests, so install them as well.
    pip install transformers pillow requests
- Load the Model: Use the Hugging Face Transformers library to load the pre-trained CLIP model and its matching processor.
    from transformers import CLIPProcessor, CLIPModel

    # Weights are downloaded from the Hugging Face Hub on first use
    model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")
- Process Input: Prepare an image and the candidate text labels for the model.
    from PIL import Image
    import requests

    # Fetch a sample image from the COCO validation set
    url = "http://images.cocodataset.org/val2017/000000039769.jpg"
    image = Image.open(requests.get(url, stream=True).raw)

    # Tokenize the labels and preprocess the image into PyTorch tensors
    inputs = processor(text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True)
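- Run Inference: Run the model and convert its similarity scores into label probabilities.
    outputs = model(**inputs)
    logits_per_image = outputs.logits_per_image  # image-text similarity scores
    probs = logits_per_image.softmax(dim=1)  # softmax over the labels gives probabilities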
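The last step produces one probability per candidate label. As a rough illustration of how those scores map back to a prediction, the sketch below repeats the softmax in plain Python and picks the highest-scoring label. The logit values are hypothetical placeholders, not real model output, and the variable names `example_logits` and `best` are introduced here only for the example.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

labels = ["a photo of a cat", "a photo of a dog"]
# Hypothetical logits for one image; real values would come from logits_per_image[0].
example_logits = [18.9, 11.7]

probs = softmax(example_logits)
# Index of the label with the highest probability
best = max(range(len(labels)), key=lambda i: probs[i])
for label, p in zip(labels, probs):
    print(f"{label}: {p:.4f}")
print("prediction:", labels[best])
```

With the real tensor from the inference step, something like `probs[0].tolist()` should give an equivalent list of per-label probabilities for the first image.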
Cloud GPUs
For faster inference and training, consider using cloud GPU services such as AWS, Google Cloud, or Azure.
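Whichever provider is used, the code has to place the model on the GPU explicitly. A minimal sketch of device selection follows; the helper name `pick_device` is hypothetical, and the function falls back to the CPU when PyTorch or a CUDA device is not available.

```python
def pick_device():
    """Return "cuda" if PyTorch can see a CUDA GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        return "cpu"  # PyTorch is not installed, so there is nothing to accelerate
    return "cuda" if torch.cuda.is_available() else "cpu"

device = pick_device()
print(device)
```

Assuming the `model` and `inputs` objects from the local guide, calling `model.to(device)` and `inputs.to(device)` before inference should move both to the selected device.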
License
The CLIP model and its documentation are subject to OpenAI's terms and conditions. The dataset used for training is not intended for commercial use and was gathered under specific guidelines to filter inappropriate content.