Data-Efficient Image Transformer (DeiT-Small-Patch16-224)

Introduction

The Data-efficient Image Transformer (DeiT) is a small-sized model pre-trained and fine-tuned on ImageNet-1k (roughly 1.3 million images across 1,000 classes) at a resolution of 224x224 pixels. It was introduced in the paper "Training data-efficient image transformers & distillation through attention" by Touvron et al. The model is a Vision Transformer (ViT) trained more efficiently than the original ViT, making it suitable for image classification without large-scale external pre-training data.

Architecture

DeiT is a Vision Transformer (ViT): a transformer encoder model, similar in design to BERT, pre-trained and fine-tuned on a large collection of images. Images are presented to the model as a sequence of fixed-size patches (16x16 pixels), which are linearly embedded. A [CLS] token is prepended to the sequence for classification, and absolute position embeddings are added before the sequence is passed through the Transformer encoder layers. The model thereby learns an internal representation of images that can be used for downstream tasks: for image classification, a linear layer is placed on top of the final hidden state of the [CLS] token.
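
To make the patch arithmetic concrete: a 224x224 image with 16x16 patches yields 14x14 = 196 patch tokens, plus the [CLS] token, for a sequence length of 197. A minimal sketch verifying this with the backbone alone (ViTModel loads the encoder and discards the classifier head):

    from transformers import ViTModel
    import torch

    model = ViTModel.from_pretrained('facebook/deit-small-patch16-224')
    pixel_values = torch.zeros(1, 3, 224, 224)  # dummy batch of one image
    with torch.no_grad():
        outputs = model(pixel_values)
    # 1 batch, 196 patches + 1 [CLS] token, hidden size 384 for DeiT-small
    print(outputs.last_hidden_state.shape)  # torch.Size([1, 197, 384])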

Training

Training Data

The model was pre-trained on the ImageNet-1k dataset.

Training Procedure

  • Preprocessing: Images are resized to 256x256, center-cropped to 224x224, and normalized with ImageNet's mean and standard deviation (see the sketch after this list).
  • Pretraining: Conducted on a single node with 8 GPUs over 3 days, at a training resolution of 224x224. Hyperparameters such as batch size and learning rate are detailed in the original paper.
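
The same preprocessing can be reproduced with torchvision transforms; a minimal sketch, assuming the standard ImageNet statistics (mean 0.485/0.456/0.406, std 0.229/0.224/0.225) used by the DeiT feature extractor:

    from torchvision import transforms

    # Mirrors the documented pipeline: resize to 256x256, center-crop to
    # 224x224, convert to a tensor, and normalize per channel.
    preprocess = transforms.Compose([
        transforms.Resize((256, 256)),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225]),
    ])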

Evaluation Results

  • DeiT-small achieves 79.9% top-1 and 95.0% top-5 accuracy on ImageNet, with 22 million parameters (see the sketch after this list for what top-1 and top-5 mean in practice).
  • Fine-tuning at a higher resolution (384x384) can further improve accuracy.
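
To make the metrics concrete, a minimal, hypothetical sketch of how top-1 and top-5 predictions are read off a model's logits (the random tensor is a stand-in for real model output):

    import torch

    logits = torch.randn(1, 1000)  # stand-in for a (batch, 1000) logit tensor

    top1 = logits.argmax(-1)               # single best class (top-1)
    top5 = logits.topk(5, dim=-1).indices  # five best classes (top-5)
    # A prediction counts as top-5 correct if the true label appears
    # anywhere among these five indices.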

Guide: Running Locally

  1. Requirements:

    • Install the necessary libraries: transformers, Pillow, and torch.
    • Consider a cloud GPU provider such as AWS EC2, Google Cloud, or Azure for better performance.
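    • A minimal install, assuming pip and a standard Python environment:

      pip install transformers pillow torch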
  2. Setup:

    from transformers import AutoFeatureExtractor, ViTForImageClassification
    from PIL import Image
    import requests
    import torch
    
    # Load a test image (two cats on a couch, from the COCO val2017 set)
    url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
    image = Image.open(requests.get(url, stream=True).raw)
    
    # Load the preprocessor and the fine-tuned classification model
    feature_extractor = AutoFeatureExtractor.from_pretrained('facebook/deit-small-patch16-224')
    model = ViTForImageClassification.from_pretrained('facebook/deit-small-patch16-224')
    
    # Preprocess the image and run a forward pass (no gradients needed for inference)
    inputs = feature_extractor(images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    
    # The class with the highest logit is the predicted ImageNet-1k label
    logits = outputs.logits
    predicted_class_idx = logits.argmax(-1).item()
    print("Predicted class:", model.config.id2label[predicted_class_idx])
    

License

The model is licensed under the Apache 2.0 License.
