Vision Transformer (ViT-Base-Patch32-224-IN21K)

Introduction

The Vision Transformer (ViT) model is a transformer encoder model designed for image recognition, pre-trained on the ImageNet-21k dataset, which contains 14 million images across 21,843 classes. It was introduced in the paper "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" by Dosovitskiy et al. The model processes images as sequences of fixed-size patches and uses a BERT-like architecture.

Architecture

ViT uses a transformer encoder architecture: the image is split into fixed-size patches, each patch is linearly embedded, position embeddings are added, and the resulting sequence is fed through standard transformer encoder layers. A [CLS] token is prepended to the sequence for classification tasks. The model's pre-trained pooler can be used for downstream tasks, such as image classification, by placing a linear layer on top of the [CLS] token's last hidden state.
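As a concrete illustration of those dimensions: for this checkpoint, the 224x224 input and 32x32 patches give (224/32)^2 = 49 patches, so the encoder sees a sequence of 50 tokens (the patches plus the [CLS] token), each a 768-dimensional embedding (the ViT-Base hidden size). A minimal sketch of that arithmetic, assuming those ViT-Base defaults:

    # Patch arithmetic for google/vit-base-patch32-224-in21k
    # (assumes ViT-Base defaults: image size 224, patch size 32, hidden size 768).
    image_size = 224
    patch_size = 32
    hidden_size = 768

    num_patches = (image_size // patch_size) ** 2            # 49 patches
    seq_len = num_patches + 1                                 # + [CLS] token -> 50
    expected_hidden_state_shape = (1, seq_len, hidden_size)   # one image -> (1, 50, 768)
    print(num_patches, seq_len, expected_hidden_state_shape)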

Training

The ViT model was pre-trained on the ImageNet-21k dataset on TPUv3 hardware (8 cores) with a batch size of 4096 and a learning-rate warmup of 10k steps. Images were resized to 224x224 resolution and normalized across the RGB channels with a mean of (0.5, 0.5, 0.5) and a standard deviation of (0.5, 0.5, 0.5). Gradient clipping at a global norm of 1 was applied during training. Evaluation results indicated improved performance with increased resolution and model size.
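For intuition, the resize-and-normalize step described above corresponds roughly to the following torchvision transform. This is only a sketch of the preprocessing, not the training pipeline; in practice, ViTImageProcessor from the transformers library applies the equivalent defaults for this checkpoint.

    from torchvision import transforms

    # Resize to 224x224, scale pixels to [0, 1], then normalize each RGB
    # channel with mean 0.5 and std 0.5, mapping values into [-1, 1].
    preprocess = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
        transforms.Normalize(mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5)),
    ])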

Guide: Running Locally

To run the ViT model locally using PyTorch, follow these steps:

  1. Install the required libraries:

    pip install transformers
    pip install torch
    pip install pillow
    pip install requests
    
  2. Load and process an image (a sketch of extracting a pooled image feature follows this list):

    from transformers import ViTImageProcessor, ViTModel
    from PIL import Image
    import requests

    # Download a sample image from the COCO validation set
    url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
    image = Image.open(requests.get(url, stream=True).raw)

    # Load the image processor and the pre-trained ViT encoder
    processor = ViTImageProcessor.from_pretrained('google/vit-base-patch32-224-in21k')
    model = ViTModel.from_pretrained('google/vit-base-patch32-224-in21k')

    # Preprocess the image and run a forward pass
    inputs = processor(images=image, return_tensors="pt")
    outputs = model(**inputs)
    last_hidden_state = outputs.last_hidden_state  # shape: (1, 50, 768)
    
  3. For enhanced performance, consider using cloud GPUs such as those offered by AWS, Google Cloud, or Azure.
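The hidden states produced by this pre-trained encoder are typically used as image features for downstream tasks. The sketch below continues from the snippet in step 2 and shows two illustrative ways to obtain a pooled 768-dimensional feature vector; neither choice is prescribed by the model card.

    import torch

    # Continuing from step 2: outputs.last_hidden_state has shape (1, 50, 768)
    # (the [CLS] token plus 49 patch tokens, each 768-dimensional).
    with torch.no_grad():
        outputs = model(**inputs)

    # Option 1: take the [CLS] token's final hidden state as the image feature.
    cls_feature = outputs.last_hidden_state[:, 0]          # shape: (1, 768)

    # Option 2: use the pre-trained pooler mentioned in the Architecture
    # section (a linear layer with tanh over the [CLS] hidden state).
    pooled_feature = outputs.pooler_output                 # shape: (1, 768)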

License

The ViT model is licensed under the Apache-2.0 License.
