ViT Large Patch16 224 (ImageNet-21k)
Introduction
This checkpoint is the large variant of the Vision Transformer (ViT), a model designed for image recognition tasks. Pre-trained on the ImageNet-21k dataset, it processes images as sequences of patches, leveraging a transformer encoder to extract features for classification and other vision tasks.
Architecture
The ViT model is a transformer encoder, similar to BERT, that operates on sequences of 16x16 image patches, each linearly embedded before processing. A [CLS] token is prepended for classification purposes, and absolute position embeddings are added to preserve spatial information. The model includes a pre-trained pooler but no fine-tuned heads, making it suitable for feature extraction and adaptation to downstream tasks. A quick sanity check of the resulting token sequence is sketched below.
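As a back-of-the-envelope illustration of how the patches become a token sequence, the following sketch computes the sequence length for this checkpoint (standard ViT-Large values; the variable names are illustrative, not any library's API):

```python
# Token count for ViT-Large at 224x224 resolution with 16x16 patches.
image_size = 224              # input resolution (pixels per side)
patch_size = 16               # side length of each square patch
hidden_size = 1024            # ViT-Large embedding dimension

num_patches = (image_size // patch_size) ** 2   # 14 * 14 = 196
seq_len = num_patches + 1                       # + [CLS] token -> 197

print(f"{num_patches} patches, sequence length {seq_len}, width {hidden_size}")
# -> 196 patches, sequence length 197, width 1024
```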
Training
Training Data
ViT was pre-trained on the ImageNet-21k dataset, comprising 14 million images across 21,843 classes.
Training Procedure
The model was trained on TPUv3 hardware with a batch size of 4096 and a learning rate warmup of 10k steps. Images were preprocessed by resizing to 224x224 resolution and normalizing the RGB channels. Gradient clipping at a global norm of 1 was found beneficial during training. Evaluation on image classification benchmarks showed better performance at higher resolutions and with larger model sizes.
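The described preprocessing can be reproduced outside the transformers pipeline; here is a minimal torchvision sketch, assuming the per-channel mean and standard deviation of 0.5 that ViTImageProcessor uses by default for this checkpoint:

```python
# Sketch of the described preprocessing: resize to 224x224, convert to a
# tensor, and normalize the RGB channels (0.5 mean/std is an assumption
# matching the processor defaults for this checkpoint).
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),                       # HWC uint8 -> CHW float in [0, 1]
    transforms.Normalize(mean=[0.5, 0.5, 0.5],   # per-channel normalization
                         std=[0.5, 0.5, 0.5]),
])
```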
Guide: Running Locally
To run the ViT model locally:
- Install Dependencies: Ensure you have the `transformers` library (with a PyTorch backend) and `PIL` for image processing, e.g. `pip install transformers torch Pillow requests`.
- Load the Model:
```python
from transformers import ViTImageProcessor, ViTModel
from PIL import Image
import requests

# Download an example image from the COCO validation set
url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)

# Load the processor and the pre-trained model
processor = ViTImageProcessor.from_pretrained('google/vit-large-patch16-224-in21k')
model = ViTModel.from_pretrained('google/vit-large-patch16-224-in21k')

# Preprocess the image and run a forward pass to extract features
inputs = processor(images=image, return_tensors="pt")
outputs = model(**inputs)
last_hidden_state = outputs.last_hidden_state
```
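As a quick sanity check (not part of the original snippet), the hidden states for a single 224x224 image should contain 197 tokens of width 1024: 196 patch embeddings plus the [CLS] token:

```python
print(last_hidden_state.shape)  # expected: torch.Size([1, 197, 1024])
```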
- Execution Environment: Running on a local machine with a powerful CPU or GPU is recommended. For larger models or datasets, consider using cloud GPU services like AWS, Google Cloud, or Azure for better performance.
License
The Vision Transformer model is released under the Apache-2.0 License, allowing for broad use in both commercial and non-commercial applications.