Introduction

The Vision Transformer (ViT) model trained with the DINO method was introduced in the paper "Emerging Properties in Self-Supervised Vision Transformers" by Mathilde Caron et al. The model is pre-trained with self-supervised learning on ImageNet-1k and is particularly useful for image feature extraction.

Architecture

The ViT model is a transformer encoder, similar to BERT, that processes an image as a sequence of fixed-size patches. Each image is divided into 16x16-pixel patches, which are linearly embedded. A classification token ([CLS]) is prepended to the sequence to serve as a whole-image representation, absolute position embeddings are added, and the result is fed through the transformer encoder layers. The checkpoint does not include any fine-tuned head, so the model yields general-purpose image representations.
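
As a quick illustration of the patching arithmetic, the sketch below computes the token sequence length; the 224x224 input resolution is the standard default for this checkpoint and is an assumption here:

# Sketch: how a 224x224 image becomes a token sequence for ViT-S/16.
# Assumes the default 224x224 input; other resolutions change the counts.
image_size = 224
patch_size = 16

patches_per_side = image_size // patch_size   # 224 / 16 = 14
num_patches = patches_per_side ** 2           # 14 * 14 = 196
sequence_length = num_patches + 1             # +1 for the [CLS] token = 197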

Training

The model is pre-trained on ImageNet-1k in a self-supervised manner, meaning it learns to represent images without labeled data. The resulting features can be reused for various downstream tasks; for example, a linear layer placed on top of the [CLS] token turns the model into an image classifier.
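
A minimal sketch of that setup in PyTorch follows; the linear head and num_labels are illustrative assumptions, since the checkpoint itself ships without any fine-tuned head:

import torch
import torch.nn as nn
from transformers import ViTModel

num_labels = 10  # hypothetical number of classes, for illustration only

backbone = ViTModel.from_pretrained('facebook/dino-vits16')
head = nn.Linear(backbone.config.hidden_size, num_labels)  # 384 for ViT-S/16

pixel_values = torch.randn(1, 3, 224, 224)   # dummy batch of one image
outputs = backbone(pixel_values=pixel_values)
cls_token = outputs.last_hidden_state[:, 0]  # [CLS] is the first token
logits = head(cls_token)                     # shape: (1, num_labels)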

Guide: Running Locally

To use the ViT model locally:

  1. Install the transformers library from Hugging Face.
  2. Load and preprocess an image using PIL and requests.
  3. Use ViTImageProcessor and ViTModel from the transformers library to process the image.
  4. Extract features or use the model for classification, as shown in the example below.

from transformers import ViTImageProcessor, ViTModel
from PIL import Image
import requests

# Download a sample image from the COCO validation set
url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)

# Load the image processor and the DINO ViT-S/16 backbone
processor = ViTImageProcessor.from_pretrained('facebook/dino-vits16')
model = ViTModel.from_pretrained('facebook/dino-vits16')

# Preprocess the image (resize, normalize) and run a forward pass
inputs = processor(images=image, return_tensors="pt")
outputs = model(**inputs)
last_hidden_states = outputs.last_hidden_state
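
From there, the [CLS] embedding serves as a single feature vector per image; the shapes in the comments assume the checkpoint's default 224x224 preprocessing:

print(last_hidden_states.shape)           # expected: torch.Size([1, 197, 384])
cls_embedding = last_hidden_states[:, 0]  # per-image feature, shape (1, 384)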

For improved performance, consider using cloud GPUs from providers like AWS, Google Cloud, or Azure.
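
On any such GPU, standard PyTorch device placement applies. A minimal sketch, reusing the model and inputs from the example above:

import torch

# Move the model and inputs to the GPU if one is available
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
inputs = {k: v.to(device) for k, v in inputs.items()}

with torch.no_grad():   # inference only, no gradients needed
    outputs = model(**inputs)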

License

The model is released under the Apache-2.0 license, allowing for both commercial and non-commercial use with proper attribution.
