vit large patch16 224
google
Introduction
The Vision Transformer (ViT) is a transformer-based model for image classification; this checkpoint is the large variant. It was pre-trained on the ImageNet-21k dataset and fine-tuned on ImageNet 2012. The model treats an image as a sequence of patches, which lets it apply the standard transformer architecture to image recognition.
Architecture
ViT uses a transformer encoder, similar to BERT, adapted for image data. Each image is divided into fixed-size patches of 16x16 pixels, which are linearly embedded and processed as a sequence. A special [CLS] token is prepended for classification, and absolute position embeddings are added before the sequence is fed into the transformer layers.
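The patch arithmetic can be made concrete with a small sketch. It assumes a 224x224 input (the resolution given under Training) and non-overlapping 16x16 patches. The helper name `patchify` is hypothetical, and it operates on a plain nested list with a single channel rather than on a real image tensor.

```python
def patchify(image, patch_size=16):
    """Split an H x W image (nested lists) into non-overlapping square patches.

    Returns a list of flattened patches in row-major order.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                image[row][col]
                for row in range(top, top + patch_size)
                for col in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches


# A dummy single-channel 224x224 image whose pixel value encodes its position.
image = [[row * 224 + col for col in range(224)] for row in range(224)]
patches = patchify(image)

num_patches = len(patches)         # (224 // 16) ** 2
sequence_length = num_patches + 1  # one extra position for the [CLS] token
print(num_patches, sequence_length)
```

Under these assumptions the encoder sees 196 patch tokens plus the [CLS] token, for a sequence length of 197. With three color channels, each patch would hold 16 * 16 * 3 = 768 values before it is embedded.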
Training
ViT was pre-trained on the ImageNet-21k dataset, which contains 14 million images and 21,843 classes. It was then fine-tuned on the ImageNet 2012 dataset, which contains 1 million images and 1,000 classes. Preprocessing consists of resizing images to 224x224 and normalizing them. Training was conducted on TPUv3 hardware with a batch size of 4096, and gradient clipping was applied.
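As a rough illustration of the normalization step, the sketch below rescales 8-bit pixel values to [0, 1] and then standardizes each RGB channel. The per-channel mean and standard deviation of 0.5 are an assumption, not something stated in this text; the values a given checkpoint really uses are recorded in its preprocessor configuration. `normalize_pixel` is a hypothetical helper, and resizing is left out.

```python
# Assumed per-channel statistics (R, G, B); check the checkpoint's
# preprocessor configuration for the values it really uses.
MEAN = (0.5, 0.5, 0.5)
STD = (0.5, 0.5, 0.5)


def normalize_pixel(rgb, mean=MEAN, std=STD):
    """Map one 8-bit RGB pixel to normalized floats, channel by channel."""
    return tuple((value / 255.0 - m) / s for value, m, s in zip(rgb, mean, std))


black = normalize_pixel((0, 0, 0))
white = normalize_pixel((255, 255, 255))
print(black, white)
```

With these assumed statistics, normalized inputs fall in the range [-1, 1]: a black pixel maps to -1.0 in every channel and a white pixel to 1.0.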
Guide: Running Locally
- Install Dependencies:
  - Install the transformers library and other dependencies such as PIL (distributed as the Pillow package) for image handling.
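- Load the Model:

    from transformers import ViTFeatureExtractor, ViTForImageClassification
    from PIL import Image
    import requests

    # Download a sample image from the COCO validation set.
    url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
    image = Image.open(requests.get(url, stream=True).raw)

    # Load the preprocessor and the fine-tuned classifier.
    # Newer transformers releases provide ViTImageProcessor as the
    # replacement for ViTFeatureExtractor.
    feature_extractor = ViTFeatureExtractor.from_pretrained('google/vit-large-patch16-224')
    model = ViTForImageClassification.from_pretrained('google/vit-large-patch16-224')

    # Resize and normalize the image, then run a forward pass.
    inputs = feature_extractor(images=image, return_tensors="pt")
    outputs = model(**inputs)
    logits = outputs.logits

    # The model predicts one of the 1,000 ImageNet classes.
    predicted_class_idx = logits.argmax(-1).item()
    print("Predicted class:", model.config.id2label[predicted_class_idx])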
- Consider Cloud GPUs:
  For efficient processing, consider using cloud services such as AWS, Google Cloud, or Azure, which offer GPU instances.
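The loading example in this guide prints only the single most likely class. The sketch below shows one way to turn raw logits into top-k probabilities with a softmax. It uses plain Python lists so it runs without downloading the model. The function name `top_k` and the sample logits are made up for illustration; with the real model, the scores would come from `outputs.logits` converted to a list.

```python
import math


def top_k(logits, k=5):
    """Return the k highest-scoring (index, probability) pairs via softmax."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(i, probs[i]) for i in ranked[:k]]


sample_logits = [1.0, 3.0, 0.5, 2.0]  # made-up scores for four classes
results = top_k(sample_logits, k=2)
for index, prob in results:
    print(index, round(prob, 3))
```

Each returned index can be looked up in `model.config.id2label`, as the loading example does for the top prediction.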
License
The ViT model is licensed under the Apache-2.0 License, which permits commercial use, modification, distribution, and private use.