google/vit-base-patch32-224-in21k
Vision Transformer (ViT-Base-Patch32-224-IN21K)
Introduction
The Vision Transformer (ViT) is a transformer encoder model for image recognition, pre-trained on the ImageNet-21k dataset, which contains 14 million images across 21,843 classes. It was introduced in the paper "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" by Dosovitskiy et al. The model processes images as sequences of fixed-size patches and uses a BERT-like architecture.
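As a quick worked example of the patching scheme for this patch-32 variant: at 224x224 resolution, each image is split into (224 / 32)^2 = 7 x 7 = 49 patches, so the encoder processes a sequence of 49 patch embeddings plus the [CLS] token, i.e. 50 tokens in total.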
Architecture
ViT uses a transformer encoder architecture: images are divided into patches, each patch is linearly embedded, and the resulting sequence is processed by a stack of transformer encoder layers. A [CLS] token is added to the sequence for classification tasks. The model also includes a pre-trained pooler that can be used for downstream tasks; for image classification, a linear layer is typically placed on top of the [CLS] token's last hidden state.
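As an illustrative sketch of this setup (not the official fine-tuning code), a classification head could be attached to the [CLS] token's final hidden state as follows; num_labels is a hypothetical placeholder:

  import torch.nn as nn
  from transformers import ViTModel

  class ViTClassifier(nn.Module):
      """Minimal sketch: linear classification head on top of the [CLS] token."""
      def __init__(self, num_labels=10):  # num_labels is a placeholder value
          super().__init__()
          self.vit = ViTModel.from_pretrained('google/vit-base-patch32-224-in21k')
          self.classifier = nn.Linear(self.vit.config.hidden_size, num_labels)

      def forward(self, pixel_values):
          outputs = self.vit(pixel_values=pixel_values)
          cls_embedding = outputs.last_hidden_state[:, 0]  # [CLS] token hidden state
          return self.classifier(cls_embedding)

In practice, the transformers library also provides ViTForImageClassification, which adds this kind of linear head for you.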
Training
The ViT model was pre-trained on the ImageNet-21k dataset on TPUv3 hardware (8 cores), with a batch size of 4096 and a learning rate warmup of 10k steps. Images were resized to 224x224 resolution and normalized across the RGB channels with a mean of (0.5, 0.5, 0.5) and a standard deviation of (0.5, 0.5, 0.5). Gradient clipping at a global norm of 1 was applied during training. Evaluation results indicated improved performance with increased resolution and model size.
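A minimal sketch of the preprocessing described above (resize to 224x224, then normalize with per-channel mean and standard deviation of 0.5), written here with torchvision transforms as an illustrative stand-in for the original training pipeline:

  from torchvision import transforms

  # Mirrors the described preprocessing: 224x224 resolution, mean/std of 0.5 per channel
  preprocess = transforms.Compose([
      transforms.Resize((224, 224)),
      transforms.ToTensor(),                      # scales pixel values to [0, 1]
      transforms.Normalize(mean=(0.5, 0.5, 0.5),  # shifts values to roughly [-1, 1]
                           std=(0.5, 0.5, 0.5)),
  ])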
Guide: Running Locally
To run the ViT model locally using PyTorch, follow these steps:
- Install the required libraries:

  pip install transformers
  pip install torch
  pip install pillow
- Load and process an image:

  from transformers import ViTImageProcessor, ViTModel
  from PIL import Image
  import requests

  # Download a sample image from the COCO validation set
  url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
  image = Image.open(requests.get(url, stream=True).raw)

  # Load the image processor and the pre-trained ViT encoder
  processor = ViTImageProcessor.from_pretrained('google/vit-base-patch32-224-in21k')
  model = ViTModel.from_pretrained('google/vit-base-patch32-224-in21k')

  # Preprocess the image and run a forward pass
  inputs = processor(images=image, return_tensors="pt")
  outputs = model(**inputs)
  last_hidden_state = outputs.last_hidden_state
- For enhanced performance, consider using cloud GPUs such as those offered by AWS, Google Cloud, or Azure.
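When a GPU is available, the model and inputs can be moved onto it; a minimal sketch that builds on the snippet above:

  import torch

  # Use a CUDA device if one is available, otherwise fall back to CPU
  device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
  model = model.to(device)
  inputs = {k: v.to(device) for k, v in inputs.items()}
  outputs = model(**inputs)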
License
The ViT model is licensed under the Apache-2.0 License.