google/vit-huge-patch14-224-in21k

Introduction
The Vision Transformer (ViT) is a model developed by Google for image recognition, built on a transformer encoder architecture; this checkpoint is the huge variant with a 14x14 patch size. It was pre-trained on the ImageNet-21k dataset, which contains 14 million images across 21,843 classes. The model processes images as sequences of fixed-size patches and is designed to provide robust feature extraction for downstream tasks such as image classification.
Architecture
ViT employs a transformer encoder, similar to BERT, tailored for image processing. Images are divided into fixed-size 14x14 patches (a 16x16 grid at the 224x224 training resolution), which are linearly embedded and supplemented with absolute position embeddings and a classification token ([CLS]) for classification tasks. The model does not include any fine-tuned heads, but it retains the pre-trained pooler, which can be used for downstream feature extraction.
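To make the patch-embedding step concrete, here is a minimal PyTorch sketch of how an image becomes the token sequence fed to the encoder. The variable names and the strided nn.Conv2d projection are illustrative assumptions rather than the library's exact internals, although a strided convolution is the standard way to implement a linear patch embedding:

import torch
import torch.nn as nn

image_size, patch_size, hidden_size = 224, 14, 1280    # ViT-Huge/14 dimensions
num_patches = (image_size // patch_size) ** 2          # 16 x 16 = 256 patches

# Linear patch embedding, implemented as a strided convolution
patch_embed = nn.Conv2d(3, hidden_size, kernel_size=patch_size, stride=patch_size)
cls_token = torch.zeros(1, 1, hidden_size)             # [CLS] token (learnable in the real model)
pos_embed = torch.zeros(1, num_patches + 1, hidden_size)  # absolute position embeddings (also learnable)

pixels = torch.randn(1, 3, image_size, image_size)     # a dummy RGB image batch
x = patch_embed(pixels)                                # (1, 1280, 16, 16)
x = x.flatten(2).transpose(1, 2)                       # (1, 256, 1280) patch tokens
x = torch.cat([cls_token, x], dim=1)                   # prepend [CLS] -> (1, 257, 1280)
x = x + pos_embed                                      # add absolute position embeddings
print(x.shape)                                         # torch.Size([1, 257, 1280])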
Training
The model was pre-trained on the ImageNet-21k dataset using TPUv3 hardware with 8 cores. Training used a batch size of 4096, learning-rate warmup over 10,000 steps, and gradient clipping at a global norm of 1. Images were resized to a resolution of 224x224 and normalized across the RGB channels with a mean and standard deviation of 0.5 per channel. The authors' evaluation indicates that larger model sizes and a higher resolution (384x384) during fine-tuning improve downstream performance.
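As a minimal sketch of that preprocessing (assuming torchvision is installed; the per-channel mean and standard deviation of 0.5 are the values published for this checkpoint, and example.jpg is a placeholder path):

from PIL import Image
from torchvision import transforms

# Resize to 224x224, convert to a tensor in [0, 1], then normalize
# each RGB channel with mean 0.5 and standard deviation 0.5.
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
])

image = Image.open("example.jpg").convert("RGB")
pixel_values = preprocess(image).unsqueeze(0)  # (1, 3, 224, 224) batch for the model

In practice, the feature extractor shown in the guide below performs these steps automatically.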
Guide: Running Locally
To run the ViT model locally, follow these steps:
- Install the Transformers library:

pip install transformers
- Load the model and feature extractor:

from transformers import ViTFeatureExtractor, ViTModel
from PIL import Image
import requests

# Load an example image from the COCO validation set
url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)

# Load the pre-trained model and feature extractor
feature_extractor = ViTFeatureExtractor.from_pretrained('google/vit-huge-patch14-224-in21k')
model = ViTModel.from_pretrained('google/vit-huge-patch14-224-in21k')

# Preprocess the image and run a forward pass
inputs = feature_extractor(images=image, return_tensors="pt")
outputs = model(**inputs)
last_hidden_states = outputs.last_hidden_state
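For this checkpoint, last_hidden_states has shape (1, 257, 1280): one [CLS] token plus 256 patch tokens, each a 1280-dimensional ViT-Huge feature. Note that recent versions of Transformers deprecate ViTFeatureExtractor in favor of ViTImageProcessor, which works as a drop-in replacement in the snippet above.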
- Cloud GPUs: For enhanced performance, consider cloud services such as AWS EC2, Google Cloud Platform, or Azure, which provide GPU instances; a sketch for moving the model onto a GPU follows below.
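As a minimal sketch (assuming PyTorch with CUDA support is installed), the model and inputs from the previous step can be moved onto a GPU like this:

import torch

# Pick a GPU if one is available, otherwise fall back to the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

# Move every input tensor to the same device as the model
inputs = {k: v.to(device) for k, v in inputs.items()}

with torch.no_grad():  # inference only, no gradients needed
    outputs = model(**inputs)
last_hidden_states = outputs.last_hidden_state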
License
The Vision Transformer model is released under the Apache 2.0 license.