vit_large_patch14_clip_224.openai_ft_in12k_in1k

Introduction

The Vision Transformer (ViT) vit_large_patch14_clip_224.openai_ft_in12k_in1k is an image classification model. It was pretrained on WIT-400M image-text pairs using OpenAI's CLIP and then fine-tuned on ImageNet-12k followed by ImageNet-1k. The model is distributed through the PyTorch Image Models (timm) library and can be used for image classification or as a feature backbone.

Architecture

  • Model Type: Image classification / feature backbone
  • Parameters: 304.2 million
  • GMACs: 77.8
  • Activations: 57.1 million
  • Image Size: 224 x 224
  • Datasets: WIT-400M (CLIP pretraining), ImageNet-12k and ImageNet-1k (fine-tuning)
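
As a quick sanity check, the parameter count above can be reproduced directly with timm. A minimal sketch (pretrained=False builds the architecture without downloading weights):

    import timm
    
    # Build the architecture only and count its parameters
    model = timm.create_model('vit_large_patch14_clip_224.openai_ft_in12k_in1k', pretrained=False)
    n_params = sum(p.numel() for p in model.parameters())
    print(f'Parameters: {n_params / 1e6:.1f}M')  # expected: ~304.2M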

Training

The model uses a transformer architecture adapted to image recognition, leveraging natural language supervision from CLIP: contrastive pretraining on WIT-400M image-text pairs is followed by supervised fine-tuning on ImageNet-12k and then ImageNet-1k. The training recipe and scaling behavior are detailed in the associated CLIP and ViT research papers.

Guide: Running Locally

Basic Steps

  1. Install Required Libraries:

    pip install timm
    
  2. Import Libraries and Load an Image:

    from urllib.request import urlopen
    from PIL import Image
    import timm
    
    img = Image.open(urlopen('URL_TO_IMAGE'))  # replace URL_TO_IMAGE with the URL of a sample image
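    # If needed, force 3-channel RGB input (some images load as grayscale or RGBA):
    # img = img.convert('RGB')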
    
  3. Load the Model:

    model = timm.create_model('vit_large_patch14_clip_224.openai_ft_in12k_in1k', pretrained=True)
    model = model.eval()
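    # To use the model as a feature backbone instead, num_classes=0 removes
    # the classification head so the forward pass returns pooled embeddings:
    # model = timm.create_model('vit_large_patch14_clip_224.openai_ft_in12k_in1k',
    #                           pretrained=True, num_classes=0)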
    
  4. Apply Transformations:

    data_config = timm.data.resolve_model_data_config(model)
    transforms = timm.data.create_transform(**data_config, is_training=False)
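    # data_config holds the model's preprocessing settings (input size,
    # interpolation, normalization mean/std), so the transform matches training.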
    
  5. Perform Inference:

    output = model(transforms(img).unsqueeze(0))  # unsqueeze(0) adds the batch dimension
    
  6. Retrieve Top 5 Classifications:

    import torch
    
    # softmax converts logits to probabilities; * 100 expresses them as percentages
    top5_probabilities, top5_class_indices = torch.topk(output.softmax(dim=1) * 100, k=5)
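
To map the class indices to human-readable labels, look them up in an ImageNet-1k label file. A minimal sketch; the URL below is an assumed (commonly used) copy of the label list from the pytorch/hub repository, not part of the original guide:

    from urllib.request import urlopen
    
    # Assumed label source: ImageNet-1k class names hosted in pytorch/hub
    url = 'https://raw.githubusercontent.com/pytorch/hub/master/imagenet_classes.txt'
    classes = urlopen(url).read().decode('utf-8').splitlines()
    
    for prob, idx in zip(top5_probabilities[0], top5_class_indices[0]):
        print(f'{classes[idx.item()]}: {prob.item():.2f}%')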
    

Cloud GPUs

Consider using cloud GPU services such as AWS EC2, Google Cloud, or Azure for faster processing, especially for large-scale inference tasks.
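
On a GPU instance, the steps above are unchanged apart from device placement. A minimal sketch, reusing model, transforms, and img from the guide:

    import torch
    
    # Use a GPU when available, otherwise fall back to CPU
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    model = model.to(device)
    
    # Disable gradient tracking for inference to save memory and time
    with torch.no_grad():
        output = model(transforms(img).unsqueeze(0).to(device))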

License

This model is licensed under the Apache-2.0 License.
