vit_huge_patch14_clip_336.laion2b_ft_in12k_in1k
Introduction
The vit_huge_patch14_clip_336.laion2b_ft_in12k_in1k model is a Vision Transformer (ViT) designed for image classification. It was pretrained with OpenCLIP on LAION-2B and then fine-tuned on the ImageNet-12k and ImageNet-1k datasets. The model is part of the timm library and serves as a high-performing image classifier and feature backbone.
Architecture
- Model Type: Image classification / feature backbone
- Parameters: 632.5 million
- GMACs: 363.7
- Activations: 213.4 million
- Image Size: 336 x 336
- Datasets: ImageNet-1k, LAION-2B, ImageNet-12k
Key papers associated with this model include OpenCLIP, LAION-5B, and the original Vision Transformer paper on image recognition at scale.
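As a quick sanity check on the numbers above, the following sketch (an illustration only, assuming a recent timm release) builds the architecture without downloading weights and reports its parameter count and expected input size:

```python
import timm

# Build the architecture only; pretrained=False skips the weight download
model = timm.create_model('vit_huge_patch14_clip_336.laion2b_ft_in12k_in1k', pretrained=False)

# Should report roughly 632.5 million parameters
num_params = sum(p.numel() for p in model.parameters())
print(f'parameters: {num_params / 1e6:.1f}M')

# The resolved data config reports the expected 336 x 336 input size
print(timm.data.resolve_model_data_config(model)['input_size'])  # (3, 336, 336)
```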
Training
The model is first pretrained on the large-scale LAION-2B dataset with a contrastive language-image (CLIP) objective using OpenCLIP, then fine-tuned on ImageNet-12k and subsequently on ImageNet-1k. This multi-stage training improves its ability to generalize across varied image classification tasks.
Guide: Running Locally
Basic Steps
- Install Dependencies: Ensure you have Python and PyTorch installed.
- Install the timm Library: pip install timm
- Load the Model:

```python
import timm

model = timm.create_model('vit_huge_patch14_clip_336.laion2b_ft_in12k_in1k', pretrained=True)
model = model.eval()
```

- Prepare the Input: Use PIL to open and preprocess images.
- Run Inference: Transform images and pass them through the model to obtain predictions or embeddings (an end-to-end sketch follows this list).
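Putting the steps together, here is a minimal end-to-end sketch. It assumes a recent timm release with the resolve_model_data_config / create_transform helpers, and uses a placeholder image path (example.jpg) that you would replace with your own file:

```python
import torch
import timm
from PIL import Image

# Load the pretrained classifier and switch to evaluation mode
model = timm.create_model('vit_huge_patch14_clip_336.laion2b_ft_in12k_in1k', pretrained=True)
model = model.eval()

# Build the model-specific preprocessing pipeline (336x336 input, CLIP mean/std)
data_config = timm.data.resolve_model_data_config(model)
transform = timm.data.create_transform(**data_config, is_training=False)

# 'example.jpg' is a placeholder; substitute any RGB image
img = Image.open('example.jpg').convert('RGB')

with torch.no_grad():
    logits = model(transform(img).unsqueeze(0))  # shape: (1, 1000)

# Top-5 ImageNet-1k class probabilities and indices
top5_prob, top5_idx = torch.topk(logits.softmax(dim=1), k=5)
print(top5_prob, top5_idx)
```

To use the model as a feature backbone instead, pass num_classes=0 to timm.create_model; the forward pass then returns pooled image embeddings rather than ImageNet-1k logits.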
Cloud GPUs
For optimal performance, especially with large models like ViT, consider using cloud-based GPU services such as AWS, GCP, or Azure.
License
This model is licensed under the Apache-2.0 License, which allows for both personal and commercial use with attribution.