Vision Transformer (ViT-Base-Patch32-224-IN21K)

Introduction

The Vision Transformer (ViT) model is a transformer encoder model designed for image recognition, pre-trained on the ImageNet-21k dataset, which contains 14 million images across 21,843 classes. It was introduced in the paper "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" by Dosovitskiy et al. The model processes images as sequences of fixed-size patches and uses a BERT-like architecture.

Architecture

ViT uses a transformer encoder architecture: the image is split into fixed-size patches, each patch is linearly embedded, position embeddings are added, and the resulting sequence is fed through standard transformer encoder layers. A [CLS] token is prepended to the sequence for classification tasks. The model's pre-trained pooler can be used for downstream tasks, such as image classification, by placing a linear layer on top of the [CLS] token's last hidden state.
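As a concrete illustration of those dimensions: for this checkpoint, the 224x224 input and 32x32 patches give (224/32)^2 = 49 patches, so the encoder sees a sequence of 50 tokens (the patches plus the [CLS] token), each a 768-dimensional embedding (the ViT-Base hidden size). A minimal sketch of that arithmetic, assuming those ViT-Base defaults:

    # Patch arithmetic for google/vit-base-patch32-224-in21k
    # (assumes ViT-Base defaults: image size 224, patch size 32, hidden size 768).
    image_size = 224
    patch_size = 32
    hidden_size = 768

    num_patches = (image_size // patch_size) ** 2            # 49 patches
    seq_len = num_patches + 1                                 # + [CLS] token -> 50
    expected_hidden_state_shape = (1, seq_len, hidden_size)   # one image -> (1, 50, 768)
    print(num_patches, seq_len, expected_hidden_state_shape)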

Training

The ViT model was pre-trained on the ImageNet-21k dataset on TPUv3 hardware (8 cores) with a batch size of 4096 and a learning-rate warmup of 10k steps. Images were resized to 224x224 resolution and normalized across the RGB channels with a mean of (0.5, 0.5, 0.5) and a standard deviation of (0.5, 0.5, 0.5). Gradient clipping at a global norm of 1 was applied during training. Evaluation results indicated improved performance with increased resolution and model size.
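For intuition, the resize-and-normalize step described above corresponds roughly to the following torchvision transform. This is only a sketch of the preprocessing, not the training pipeline; in practice, ViTImageProcessor from the transformers library applies the equivalent defaults for this checkpoint.

    from torchvision import transforms

    # Resize to 224x224, scale pixels to [0, 1], then normalize each RGB
    # channel with mean 0.5 and std 0.5, mapping values into [-1, 1].
    preprocess = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
        transforms.Normalize(mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5)),
    ])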

Guide: Running Locally

To run the ViT model locally using PyTorch, follow these steps:

  1. Install the required libraries:

    pip install transformers
    pip install torch
    pip install pillow
    pip install requests
    
  2. Load and process an image (a sketch of extracting a pooled image feature follows this list):

    from transformers import ViTImageProcessor, ViTModel
    from PIL import Image
    import requests

    # Download a sample image from the COCO validation set
    url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
    image = Image.open(requests.get(url, stream=True).raw)

    # Load the image processor and the pre-trained ViT encoder
    processor = ViTImageProcessor.from_pretrained('google/vit-base-patch32-224-in21k')
    model = ViTModel.from_pretrained('google/vit-base-patch32-224-in21k')

    # Preprocess the image and run a forward pass
    inputs = processor(images=image, return_tensors="pt")
    outputs = model(**inputs)
    last_hidden_state = outputs.last_hidden_state  # shape: (1, 50, 768)
    
  3. For enhanced performance, consider using cloud GPUs such as those offered by AWS, Google Cloud, or Azure.
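The hidden states produced by this pre-trained encoder are typically used as image features for downstream tasks. The sketch below continues from the snippet in step 2 and shows two illustrative ways to obtain a pooled 768-dimensional feature vector; neither choice is prescribed by the model card.

    import torch

    # Continuing from step 2: outputs.last_hidden_state has shape (1, 50, 768)
    # (the [CLS] token plus 49 patch tokens, each 768-dimensional).
    with torch.no_grad():
        outputs = model(**inputs)

    # Option 1: take the [CLS] token's final hidden state as the image feature.
    cls_feature = outputs.last_hidden_state[:, 0]          # shape: (1, 768)

    # Option 2: use the pre-trained pooler mentioned in the Architecture
    # section (a linear layer with tanh over the [CLS] hidden state).
    pooled_feature = outputs.pooler_output                 # shape: (1, 768)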

License

The ViT model is licensed under the Apache-2.0 License.
