google/vit-base-patch16-224
Vision Transformer (ViT) Base Model
Introduction
The Vision Transformer (ViT) applies a BERT-like transformer encoder to image classification, taking a sequence of fixed-size image patches as input. This checkpoint is pretrained on the ImageNet-21k dataset and fine-tuned on ImageNet 2012.
Architecture
ViT treats images as sequences of patches (16x16 pixels each) instead of pixels. Each patch is linearly embedded, and a classification token ([CLS]) is prepended to the sequence. The model uses absolute position embeddings and a transformer encoder to process these sequences, enabling it to learn representations useful for various image classification tasks.
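Concretely, a 224x224 image split into 16x16 patches yields (224/16)^2 = 196 patch tokens; with the prepended [CLS] token, the encoder sees a sequence of 197 positions. The following is a minimal sketch to confirm this, assuming the Hugging Face transformers ViTModel class (the encoder backbone without the classification head):

```python
import torch
from transformers import ViTModel

# Load only the encoder backbone (no classification head).
model = ViTModel.from_pretrained('google/vit-base-patch16-224')

# A dummy batch with one 224x224 RGB image (real inputs would be
# preprocessed with ViTImageProcessor first).
pixel_values = torch.randn(1, 3, 224, 224)

with torch.no_grad():
    outputs = model(pixel_values=pixel_values)

# (224 / 16)^2 = 196 patch tokens + 1 [CLS] token = 197 positions,
# each a 768-dimensional embedding for the base model.
print(outputs.last_hidden_state.shape)  # torch.Size([1, 197, 768])
```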
Training
Training Data
- Pretraining Dataset: ImageNet-21k with 14 million images and 21,843 classes.
- Fine-tuning Dataset: ImageNet (1 million images, 1,000 classes).
Training Procedure
- Hardware: Trained on TPUv3 with 8 cores.
- Batch Size: 4096.
- Learning Rate: Warmup over 10,000 steps.
- Gradient Clipping: For ImageNet, gradient clipping was additionally applied at a global norm of 1 (an illustrative sketch of the warmup and clipping setup follows this list).
- Resolution: 224x224 during training; fine-tuning at a higher resolution (384x384) yields better results.
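The hyperparameters above come from the original TPU training setup. The sketch below is only an illustrative PyTorch rendering of how a linear warmup schedule and global-norm gradient clipping are typically wired together; the placeholder model, optimizer choice, and learning rate are assumptions, not the authors' actual training code.

```python
import torch
from torch.optim.lr_scheduler import LambdaLR

# Placeholder model and assumed optimizer, purely for illustration.
model = torch.nn.Linear(768, 1000)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-3)

warmup_steps = 10_000  # linear warmup over 10k steps, as listed above

# Scale the learning rate linearly from ~0 up to its full value over warmup_steps.
scheduler = LambdaLR(optimizer, lr_lambda=lambda step: min(1.0, (step + 1) / warmup_steps))

def training_step(batch_inputs, batch_targets):
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(batch_inputs), batch_targets)
    loss.backward()
    # Clip gradients to a global norm of 1, as described above.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    scheduler.step()
    return loss.item()

# One step with dummy data, just to show the wiring.
training_step(torch.randn(8, 768), torch.randint(0, 1000, (8,)))
```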
Guide: Running Locally
- Setup Environment: Ensure you have Python and the `transformers` library installed (for example, `pip install transformers pillow`).
- Load Model and Processor:
```python
from transformers import ViTImageProcessor, ViTForImageClassification
from PIL import Image
import requests

# Download a sample image from the COCO validation set.
url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)

# Load the image processor and the fine-tuned classification model.
processor = ViTImageProcessor.from_pretrained('google/vit-base-patch16-224')
model = ViTForImageClassification.from_pretrained('google/vit-base-patch16-224')

# Preprocess the image and run a forward pass.
inputs = processor(images=image, return_tensors="pt")
outputs = model(**inputs)

# The predicted class is the index of the highest logit over the 1,000 ImageNet classes.
predicted_class_idx = outputs.logits.argmax(-1).item()
print("Predicted class:", model.config.id2label[predicted_class_idx])
```
- Cloud GPUs: For faster inference or fine-tuning, consider cloud GPU instances from providers such as AWS, GCP, or Azure; a device-placement sketch follows below.
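As a follow-up to the inference example above, here is a hedged sketch of running the same pipeline on a GPU when one is available; the device-selection logic is an illustration and not part of the original model card:

```python
import torch
import requests
from PIL import Image
from transformers import ViTImageProcessor, ViTForImageClassification

# Pick a GPU if one is available, otherwise fall back to CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = ViTImageProcessor.from_pretrained('google/vit-base-patch16-224')
model = ViTForImageClassification.from_pretrained('google/vit-base-patch16-224').to(device)

url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)

# Move the preprocessed tensors to the same device as the model.
inputs = processor(images=image, return_tensors="pt").to(device)

with torch.no_grad():
    outputs = model(**inputs)
print("Predicted class:", model.config.id2label[outputs.logits.argmax(-1).item()])
```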
License
This model is licensed under the Apache 2.0 License, allowing for both academic and commercial use.