facebook/deit-base-patch16-224

Introduction

The Data-efficient Image Transformer (DeiT) is a vision transformer model pre-trained and fine-tuned on the ImageNet-1k dataset. It was introduced in the paper "Training data-efficient image transformers & distillation through attention" by Touvron et al. Unlike traditional convolutional neural networks, DeiT uses a transformer architecture for image classification tasks.

Architecture

DeiT uses a Vision Transformer (ViT) architecture, which is a transformer encoder model, similar to BERT, tailored for image data. Images are divided into fixed-size patches (16x16 pixels), embedded linearly, and fed into the transformer layers. A [CLS] token is added for classification tasks, and absolute position embeddings are included before processing through the transformer encoder. The final model representation is typically obtained from the last hidden state of the [CLS] token.
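The patch arithmetic above can be sketched in plain Python: a 224x224 input cut into 16x16 patches yields a 14x14 grid, and adding the [CLS] token gives the transformer's sequence length. This is a minimal illustration of the numbers involved, not library code; the hidden size of 768 refers to the base-sized model.

```python
# Sketch: how a 224x224 image becomes a transformer token sequence.
# Values follow the architecture described above; names are illustrative.
image_size = 224
patch_size = 16
hidden_size = 768  # embedding dimension of the base-sized model

patches_per_side = image_size // patch_size  # 224 / 16 = 14
num_patches = patches_per_side ** 2          # 14 * 14 = 196
seq_len = num_patches + 1                    # +1 for the [CLS] token

print(patches_per_side, num_patches, seq_len)  # 14 196 197
```

Note that each flattened RGB patch holds 16 * 16 * 3 = 768 values, which the patch embedding projects linearly into the model's hidden dimension.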

Training

The DeiT model was trained on the ImageNet-1k dataset, which consists of roughly one million images across 1,000 classes. Preprocessing involves resizing images to 256x256, center-cropping to 224x224, and normalizing with ImageNet's per-channel mean and standard deviation. Training ran on a single 8-GPU node for three days at a resolution of 224x224, with the full hyperparameter settings detailed in the original paper.
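The normalization step can be sketched in plain Python. The mean/std values below are the standard ImageNet statistics commonly used for this preprocessing; the exact values applied by the model's preprocessor should be checked against its configuration.

```python
# Per-channel normalization, assuming the standard ImageNet statistics
# (verify against the model's preprocessor config).
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def normalize_pixel(rgb):
    """Scale an 8-bit RGB pixel to [0, 1], then standardize per channel."""
    return tuple(
        (value / 255.0 - mean) / std
        for value, mean, std in zip(rgb, IMAGENET_MEAN, IMAGENET_STD)
    )

# A mid-gray pixel lands near zero in every channel after standardization
print(normalize_pixel((128, 128, 128)))
```

In practice the feature extractor applies this to every pixel of the 224x224 crop, producing the tensor the model consumes.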

Guide: Running Locally

  1. Installation: Ensure you have Python installed, then install Hugging Face's transformers library together with PyTorch, Pillow, and requests, which the example below relies on.

    pip install transformers torch pillow requests
    
  2. Load Model: Use the following code to load and use the model for image classification.

    from transformers import AutoFeatureExtractor, ViTForImageClassification
    from PIL import Image
    import requests
    
    # Load a sample image from the COCO validation set
    url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
    image = Image.open(requests.get(url, stream=True).raw)
    
    # The feature extractor handles resizing, cropping, and normalization
    feature_extractor = AutoFeatureExtractor.from_pretrained('facebook/deit-base-patch16-224')
    model = ViTForImageClassification.from_pretrained('facebook/deit-base-patch16-224')
    
    # Forward pass, then pick the highest-scoring ImageNet class
    inputs = feature_extractor(images=image, return_tensors="pt")
    outputs = model(**inputs)
    logits = outputs.logits
    predicted_class_idx = logits.argmax(-1).item()
    print("Predicted class:", model.config.id2label[predicted_class_idx])
    
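The `argmax` in the snippet above picks only the single most likely label. To turn raw logits into class probabilities, a softmax can be applied first; here is a minimal plain-Python sketch of that step, using made-up logit values for illustration.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    # Subtract the max before exponentiating for numerical stability
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for three classes
probs = softmax([2.0, 1.0, 0.1])
print(probs)  # largest logit gets the largest probability
```

With the real model, the same idea is usually expressed as `logits.softmax(-1)` on the output tensor, e.g. to report top-5 predictions with their confidences.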
  3. Environment: A GPU is recommended for optimal performance when running locally. Consider using cloud GPU services like AWS, Google Cloud, or Azure.

License

The DeiT model is released under the Apache-2.0 license, which permits both personal and commercial use provided the license and notice requirements are met.
