facebook/dino-vits16
Introduction
The Vision Transformer (ViT) model, trained with the DINO method (self-distillation with no labels), was introduced in the paper "Emerging Properties in Self-Supervised Vision Transformers" by Mathilde Caron et al. This checkpoint is a small ViT (ViT-S/16) pre-trained in a self-supervised manner on ImageNet-1k, and it is particularly useful for image feature extraction.
Architecture
The ViT model is a transformer encoder, similar to BERT, that processes images as sequences of fixed-size patches. Each image is divided into 16x16 pixel patches, which are linearly embedded. A classification token ([CLS]) is prepended to the sequence, and absolute position embeddings are added before the sequence is fed into the transformer encoder layers. The checkpoint includes no fine-tuned heads; pre-training alone equips it with general-purpose image representations.
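As a quick illustration of the patch layout, the sketch below computes the resulting token sequence length; the 224x224 input resolution is an assumption based on the standard ViT setup, not stated above.
image_size = 224  # assumed standard ViT input resolution
patch_size = 16
num_patches = (image_size // patch_size) ** 2  # 14 * 14 = 196 patches
sequence_length = num_patches + 1              # plus the [CLS] token = 197 tokens
print(num_patches, sequence_length)            # 196 197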
Training
The model is trained on ImageNet-1k in a self-supervised manner, meaning it learns to represent images without labeled data. This pre-training lets the model extract features that transfer to various downstream tasks, for example by adding a linear layer on top of the [CLS] token for classification, as sketched below.
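A minimal sketch of such a linear probe follows; the class count and dummy input are hypothetical placeholders, and 384 is the ViT-S hidden size.
import torch
from transformers import ViTModel

backbone = ViTModel.from_pretrained('facebook/dino-vits16')
num_classes = 10  # hypothetical placeholder for a downstream task
head = torch.nn.Linear(backbone.config.hidden_size, num_classes)  # hidden_size == 384 for ViT-S

pixel_values = torch.randn(1, 3, 224, 224)  # dummy batch for illustration
with torch.no_grad():
    cls_token = backbone(pixel_values).last_hidden_state[:, 0]  # (1, 384) [CLS] embedding
logits = head(cls_token)  # (1, num_classes)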
Guide: Running Locally
To use the ViT model locally:
- Install the transformers library from Hugging Face.
- Load an image using PIL and requests.
- Use ViTImageProcessor and ViTModel from the transformers library to preprocess the image and run the model.
- Extract features or use the model for classification.
from transformers import ViTImageProcessor, ViTModel
from PIL import Image
import requests

# Download a sample image from the COCO validation set
url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)

# Load the DINO-pretrained image processor and backbone
processor = ViTImageProcessor.from_pretrained('facebook/dino-vits16')
model = ViTModel.from_pretrained('facebook/dino-vits16')

# Preprocess the image and run a forward pass
inputs = processor(images=image, return_tensors="pt")
outputs = model(**inputs)
last_hidden_states = outputs.last_hidden_state  # (batch, sequence_length, hidden_size)
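The first token of last_hidden_states is the [CLS] embedding, which can serve as a compact global image descriptor:
cls_embedding = last_hidden_states[:, 0]  # shape (1, 384) for ViT-S/16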
For faster inference, consider using cloud GPUs from providers like AWS, Google Cloud, or Azure.
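If a GPU is available, a minimal sketch of running on it (reusing model and inputs from the snippet above):
import torch

# Move the model and inputs to the GPU when one is present
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
inputs = {k: v.to(device) for k, v in inputs.items()}
with torch.no_grad():
    outputs = model(**inputs)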
License
The model is released under the Apache-2.0 license, allowing for both commercial and non-commercial use with proper attribution.