vit_large_patch14_clip_224.openai_ft_in12k_in1k
Introduction
The Vision Transformer (ViT) vit_large_patch14_clip_224.openai_ft_in12k_in1k is an image classification model. It was pretrained on WIT-400M image-text pairs with OpenAI's CLIP, then fine-tuned on ImageNet-12k followed by ImageNet-1k. The model is distributed through the PyTorch Image Models (timm) library and serves both as an image classifier and as a feature backbone.
Architecture
- Model Type: Image classification / feature backbone
- Parameters: 304.2 million
- GMACs: 77.8
- Activations: 57.1 million
- Image Size: 224 x 224
- Datasets: WIT-400M (CLIP pretraining), ImageNet-12k and ImageNet-1k (fine-tuning)
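The parameter count listed above can be sanity-checked directly with timm. The snippet below is a minimal sketch (assuming a recent timm release): it instantiates the architecture without downloading weights and counts the parameters, which should come out at roughly 304 million.

  import timm

  # Build the architecture only; pretrained=False avoids downloading the weights.
  model = timm.create_model('vit_large_patch14_clip_224.openai_ft_in12k_in1k', pretrained=False)

  # Count all parameters; expected to be roughly 304 million for this ViT-Large.
  n_params = sum(p.numel() for p in model.parameters())
  print(f'{n_params / 1e6:.1f}M parameters')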
Training
The model uses a plain transformer architecture applied to 14x14-pixel image patches, and its image encoder benefits from the natural language supervision of CLIP pretraining. As the related research papers discuss, accuracy improves predictably as model capacity and pretraining data are scaled up, which motivates this large patch-14 variant.
Guide: Running Locally
Basic Steps
- Install Required Libraries:
  pip install timm
- Load and Preprocess Image:
  from urllib.request import urlopen
  from PIL import Image
  import timm

  img = Image.open(urlopen('URL_TO_IMAGE'))
- Load the Model:
  model = timm.create_model('vit_large_patch14_clip_224.openai_ft_in12k_in1k', pretrained=True)
  model = model.eval()
- Apply Transformations:
  data_config = timm.data.resolve_model_data_config(model)
  transforms = timm.data.create_transform(**data_config, is_training=False)
- Perform Inference:
  output = model(transforms(img).unsqueeze(0))
- Retrieve Top 5 Classifications:
  import torch

  top5_probabilities, top5_class_indices = torch.topk(output.softmax(dim=1) * 100, k=5)
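Because the Architecture section also lists this checkpoint as a feature backbone, the same weights can produce image embeddings instead of class logits. The sketch below is a minimal, hedged example: passing num_classes=0 to timm.create_model removes the classifier head, so the forward pass returns pooled features (a 1024-dimensional vector for this ViT-Large). Replace 'URL_TO_IMAGE' with a real image URL.

  from urllib.request import urlopen
  from PIL import Image
  import timm
  import torch

  img = Image.open(urlopen('URL_TO_IMAGE'))

  # num_classes=0 strips the classification head; forward() then returns pooled embeddings.
  model = timm.create_model(
      'vit_large_patch14_clip_224.openai_ft_in12k_in1k',
      pretrained=True,
      num_classes=0,
  )
  model = model.eval()

  data_config = timm.data.resolve_model_data_config(model)
  transforms = timm.data.create_transform(**data_config, is_training=False)

  with torch.no_grad():
      embedding = model(transforms(img).unsqueeze(0))  # shape (1, 1024) for ViT-Large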
Cloud GPUs
Consider using cloud GPU services such as AWS EC2, Google Cloud, or Azure for faster processing, especially for large-scale inference tasks.
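On such an instance, inference works exactly as in the guide above, except that the model and the input batch are moved to the GPU. The following is a rough sketch assuming a CUDA-enabled PyTorch build; it falls back to CPU when no GPU is available.

  from urllib.request import urlopen
  from PIL import Image
  import timm
  import torch

  device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

  model = timm.create_model('vit_large_patch14_clip_224.openai_ft_in12k_in1k', pretrained=True)
  model = model.eval().to(device)

  data_config = timm.data.resolve_model_data_config(model)
  transforms = timm.data.create_transform(**data_config, is_training=False)

  img = Image.open(urlopen('URL_TO_IMAGE'))
  with torch.no_grad():
      output = model(transforms(img).unsqueeze(0).to(device))  # logits computed on the GPU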
License
This model is licensed under the Apache-2.0 License.