vit_huge_patch14_clip_336.laion2b_ft_in12k_in1k

timm

Introduction

The vit_huge_patch14_clip_336.laion2b_ft_in12k_in1k model is a Vision Transformer (ViT) for image classification. It was pretrained with OpenCLIP on LAION-2B and then fine-tuned on ImageNet-12k followed by ImageNet-1k. The model is available through the timm library and can be used both as an image classifier and as a feature backbone.

Architecture

  • Model Type: Image classification / feature backbone
  • Parameters: 632.5 million
  • GMACs: 363.7
  • Activations: 213.4 million
  • Image Size: 336 x 336
  • Datasets: ImageNet-1k, LAION-2B, ImageNet-12k

Key papers associated with this model include OpenCLIP, LAION-5B, and "An Image is Worth 16x16 Words", the original Vision Transformer paper.
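
The specifications above can be checked directly from the instantiated model. The snippet below is a minimal sketch, assuming a recent timm release (0.8+) that provides timm.data.resolve_model_data_config and that the pretrained weights can be downloaded:

    import timm

    model = timm.create_model(
        'vit_huge_patch14_clip_336.laion2b_ft_in12k_in1k',
        pretrained=True,
    )

    # Total parameter count (roughly 632.5 million)
    num_params = sum(p.numel() for p in model.parameters())
    print(f'Parameters: {num_params / 1e6:.1f}M')

    # Default preprocessing config, including the 336 x 336 input size
    data_config = timm.data.resolve_model_data_config(model)
    print(data_config['input_size'])  # (3, 336, 336)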

Training

The ViT model is pretrained on a large-scale dataset (LAION-2B) using a contrastive language-image learning approach, then fine-tuned on ImageNet-12k and ImageNet-1k. This multi-stage training enhances its ability to generalize across varied image classification tasks.

Guide: Running Locally

Basic Steps

  1. Install Dependencies: Ensure you have Python and PyTorch installed.
  2. Install timm Library:
    pip install timm
    
  3. Load the Model:
    import timm
    model = timm.create_model('vit_huge_patch14_clip_336.laion2b_ft_in12k_in1k', pretrained=True)
    model = model.eval()
    
  4. Prepare the Input: Use PIL to open an image and apply the model's preprocessing transforms.
  5. Run Inference: Pass the transformed image through the model to obtain class predictions or embeddings; a complete example follows this list.
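
Putting the steps together, the sketch below follows the usual timm inference pattern; the image path is a placeholder, and the forward_features / forward_head calls used for embeddings assume a recent timm release:

    import timm
    import torch
    from PIL import Image

    # Placeholder path: replace with an image of your own
    img = Image.open('example.jpg').convert('RGB')

    model = timm.create_model(
        'vit_huge_patch14_clip_336.laion2b_ft_in12k_in1k',
        pretrained=True,
    ).eval()

    # Build the preprocessing pipeline from the model's pretrained config (336 x 336 input)
    data_config = timm.data.resolve_model_data_config(model)
    transform = timm.data.create_transform(**data_config, is_training=False)

    with torch.no_grad():
        x = transform(img).unsqueeze(0)  # shape: (1, 3, 336, 336)

        # Classification: top-5 ImageNet-1k class probabilities
        logits = model(x)
        top5_prob, top5_idx = torch.topk(logits.softmax(dim=1), k=5)

        # Embeddings: unpooled token features and the pooled pre-logits vector
        features = model.forward_features(x)
        embedding = model.forward_head(features, pre_logits=True)

    print(top5_prob, top5_idx)
    print(embedding.shape)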

Cloud GPUs

For optimal performance, especially with large models like ViT, consider using cloud-based GPU services such as AWS, GCP, or Azure.
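
With a GPU available, the model can be moved to the device and run in half precision to reduce latency and memory use. This is a minimal sketch, assuming a CUDA-capable machine and that float16 inference is acceptable for your use case:

    import timm
    import torch

    device = 'cuda' if torch.cuda.is_available() else 'cpu'

    model = timm.create_model(
        'vit_huge_patch14_clip_336.laion2b_ft_in12k_in1k',
        pretrained=True,
    ).eval().to(device)

    # Dummy batch standing in for preprocessed images: (batch, channels, 336, 336)
    x = torch.randn(4, 3, 336, 336, device=device)

    with torch.no_grad():
        if device == 'cuda':
            # Mixed-precision inference on GPU
            with torch.autocast('cuda', dtype=torch.float16):
                logits = model(x)
        else:
            logits = model(x)

    print(logits.shape)  # (4, 1000)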

License

This model is licensed under the Apache-2.0 License, which allows for both personal and commercial use with attribution.
