Data-Efficient Image Transformer (DeiT-Small-Patch16-224)
Introduction
This checkpoint is the small variant of the Data-efficient Image Transformer (DeiT). It was pre-trained and fine-tuned on the ImageNet-1k dataset (about 1.3 million training images across 1,000 classes) at a resolution of 224x224 pixels. DeiT was introduced in the paper "Training data-efficient image transformers & distillation through attention" by Touvron et al. It is a Vision Transformer (ViT) trained with a more data-efficient recipe, and it is suited to image classification.
Architecture
DeiT is a Vision Transformer (ViT): a transformer encoder model, similar to BERT, that is pre-trained and fine-tuned on a large collection of images. Each image is divided into fixed-size patches of 16x16 pixels, and each patch is linearly embedded. A [CLS] token is prepended to the patch sequence for classification, and absolute position embeddings are added before the sequence passes through the Transformer encoder layers. The model learns an internal representation of images that can be reused for downstream tasks; for image classification, a linear layer is placed on top of the final [CLS] token.
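The patch arithmetic described above can be checked with a few lines of Python. This is only an illustrative sketch: the function names are made up for this example, and it assumes square images, non-overlapping patches, three color channels, and a single [CLS] token.

```python
def sequence_length(image_size: int = 224, patch_size: int = 16) -> int:
    """Number of tokens the encoder sees: one per patch plus the [CLS] token."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size   # 224 // 16 = 14
    num_patches = patches_per_side ** 2           # 14 * 14 = 196
    return num_patches + 1                        # +1 for [CLS]


def patch_vector_size(patch_size: int = 16, channels: int = 3) -> int:
    """Length of one flattened RGB patch before the linear embedding."""
    return patch_size * patch_size * channels     # 16 * 16 * 3 = 768


print(sequence_length())      # 197
print(patch_vector_size())    # 768
print(sequence_length(384))   # 577
```

The last call shows why a higher input resolution costs more compute: a 384x384 image yields almost three times as many tokens as a 224x224 image.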
Training
Training Data
The model was pre-trained on the ImageNet-1k dataset.
Training Procedure
- Preprocessing: Images are resized to 256x256, center-cropped to 224x224, and normalized with ImageNet's mean and standard deviation.
- Pretraining: Conducted on a single 8-GPU node over 3 days, with a training resolution of 224. Hyperparameters like batch size and learning rate are detailed in the original paper.
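The preprocessing step can be sketched in plain Python to make the arithmetic explicit. The helper names below are hypothetical, and the mean and standard deviation are the commonly used ImageNet channel statistics, assumed here rather than taken from this model card. In practice the feature extractor loaded with the model performs these steps for you.

```python
# Commonly used ImageNet channel statistics (an assumption of this sketch).
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def center_crop_box(width: int, height: int, crop: int = 224):
    """Return (left, top, right, bottom) of a centered crop x crop window."""
    left = (width - crop) // 2
    top = (height - crop) // 2
    return (left, top, left + crop, top + crop)


def normalize_pixel(rgb):
    """Scale an 8-bit RGB pixel to [0, 1], then standardize each channel."""
    return tuple(
        (value / 255.0 - mean) / std
        for value, mean, std in zip(rgb, IMAGENET_MEAN, IMAGENET_STD)
    )


# A 256x256 image is cropped to its central 224x224 region.
print(center_crop_box(256, 256))         # (16, 16, 240, 240)
print(normalize_pixel((128, 128, 128)))  # a mid-gray pixel after normalization
```

The box tuple follows the (left, top, right, bottom) convention that Pillow's `Image.crop` accepts, so it could be passed to that method directly.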
Evaluation Results
- DeiT-small achieves an ImageNet top-1 accuracy of 79.9% and top-5 accuracy of 95.0%. The model consists of 22 million parameters.
- Fine-tuning to a higher resolution (384x384) can improve accuracy.
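For readers unfamiliar with the metrics, top-1 and top-5 accuracy can be computed as in the sketch below. The function name and the toy scores are invented for illustration; this is not the evaluation code behind the numbers reported above.

```python
def top_k_accuracy(logits, labels, k=1):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    hits = 0
    for scores, label in zip(logits, labels):
        # Class indices sorted from highest to lowest score.
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


# Toy scores for 3 samples over 4 classes (made up for illustration).
toy_logits = [
    [0.1, 2.0, 0.3, 0.5],   # best class: 1
    [1.5, 0.2, 1.4, 0.1],   # best class: 0, runner-up: 2
    [0.3, 0.1, 0.2, 0.9],   # best class: 3, runner-up: 0
]
toy_labels = [1, 2, 0]

print(top_k_accuracy(toy_logits, toy_labels, k=1))  # only the first sample is a hit
print(top_k_accuracy(toy_logits, toy_labels, k=2))  # all three samples are hits
```

Top-5 accuracy is the same computation with `k=5`, which is why it is always at least as high as top-1.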
Guide: Running Locally
- Requirements:
  - Install the necessary libraries: transformers, Pillow, torch.
  - Optionally, use a cloud GPU (for example on AWS EC2, Google Cloud, or Azure) for better performance.
- Setup:
    from transformers import AutoFeatureExtractor, ViTForImageClassification
    from PIL import Image
    import requests

    # Download a sample image from the COCO validation set.
    url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
    image = Image.open(requests.get(url, stream=True).raw)

    # Load the preprocessing pipeline and the classification model.
    feature_extractor = AutoFeatureExtractor.from_pretrained('facebook/deit-small-patch16-224')
    model = ViTForImageClassification.from_pretrained('facebook/deit-small-patch16-224')

    # Preprocess the image and return PyTorch tensors.
    inputs = feature_extractor(images=image, return_tensors="pt")
    outputs = model(**inputs)
    logits = outputs.logits

    # The model predicts one of the 1,000 ImageNet classes.
    predicted_class_idx = logits.argmax(-1).item()
    print("Predicted class:", model.config.id2label[predicted_class_idx])
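The setup code reports only the single best class. If you also want probabilities for the runner-up classes, one possible approach is to apply a softmax to the logits and keep the highest-scoring entries. The sketch below uses plain Python with made-up logits and labels so that it runs without downloading the model; `softmax` and `top_k` are hypothetical helper names, not part of transformers.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities, shifting by the max for stability."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(logits, id2label, k=5):
    """Return the k most likely (label, probability) pairs, best first."""
    probs = softmax(logits)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    return [(id2label[i], probs[i]) for i in order]


# Made-up scores and labels, standing in for real model output.
toy_logits = [2.0, 0.5, 1.0]
toy_id2label = {0: "tabby cat", 1: "remote control", 2: "sofa"}

for label, prob in top_k(toy_logits, toy_id2label, k=2):
    print(f"{label}: {prob:.3f}")
```

With the real model, the same helper could be called as `top_k(logits[0].tolist(), model.config.id2label)`, assuming `logits` and `model` are the variables from the setup code.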
License
The model is licensed under the Apache 2.0 License.