facebook/dinov2-with-registers-giant
Introduction
The Vision Transformer (ViT) model referred to as DINOv2 with registers is a transformer encoder model designed for image feature extraction. It was introduced in the paper "Vision Transformers Need Registers" by Darcet et al. and is available on the Hugging Face Hub. The model addresses artifacts in attention maps by incorporating additional "register" tokens during pre-training; their outputs are discarded afterward, resulting in improved performance and more interpretable attention maps.
Architecture
The model leverages the Vision Transformer (ViT) architecture, which is a BERT-like transformer encoder model. During pre-training, new tokens called "register" tokens are used to improve attention map quality and model performance. The model is pre-trained to learn inner representations of images, which can be used for feature extraction and serve as input for downstream tasks, such as image classification through a linear layer.
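The mechanism can be illustrated with a minimal sketch. This is not the actual DINOv2 implementation: the dimensions, depth, and the `ViTWithRegisters` class are invented here purely to show how learnable register tokens are concatenated with the [CLS] and patch tokens before the encoder and then dropped from the output.

```python
import torch
import torch.nn as nn

class ViTWithRegisters(nn.Module):
    """Toy sketch of a ViT encoder with register tokens (not DINOv2 itself)."""

    def __init__(self, dim=64, num_registers=4, depth=2, heads=4):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        # Register tokens: extra learnable inputs that can absorb global
        # computation during attention; their outputs are discarded.
        self.registers = nn.Parameter(torch.zeros(1, num_registers, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.num_registers = num_registers

    def forward(self, patch_tokens):
        b = patch_tokens.size(0)
        cls = self.cls_token.expand(b, -1, -1)
        reg = self.registers.expand(b, -1, -1)
        x = torch.cat([cls, reg, patch_tokens], dim=1)
        x = self.encoder(x)
        # Keep [CLS] + patch tokens; drop the register outputs.
        return torch.cat([x[:, :1], x[:, 1 + self.num_registers:]], dim=1)
```

Because the registers exist only as extra sequence positions inside the encoder, the model's output interface is unchanged: callers still see one [CLS] token followed by the patch tokens.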
Training
The DINOv2 model with registers is pre-trained without any fine-tuned heads, focusing on self-supervised image feature extraction. This pre-training allows it to learn meaningful features from images without requiring labeled data, which can be later utilized in tasks where labeled datasets are available by adding a linear classifier.
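The linear-probe setup mentioned above can be sketched as follows. The feature tensor and labels below are synthetic stand-ins (the real features would come from the frozen backbone), and the hidden size of 1536 for the giant variant is an assumption stated for illustration; only the linear layer is trained.

```python
import torch
import torch.nn as nn

# Stand-in for pooled DINOv2 embeddings of 32 images (hidden size 1536
# assumed for the giant model) and made-up labels over 10 classes.
features = torch.randn(32, 1536)
labels = torch.randint(0, 10, (32,))

probe = nn.Linear(1536, 10)  # the only trainable component
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for _ in range(5):  # a few illustrative optimization steps
    optimizer.zero_grad()
    loss = loss_fn(probe(features), labels)
    loss.backward()
    optimizer.step()
```

In practice the backbone stays frozen (no gradients flow into it), so probing is cheap even for the giant model.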
Guide: Running Locally
To use the model locally, follow these steps:
- Install the transformers library from Hugging Face (pip install transformers).
- Use the following Python code to load the model and processor for image feature extraction:
```python
from transformers import AutoImageProcessor, AutoModel
from PIL import Image
import requests

url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)

processor = AutoImageProcessor.from_pretrained('facebook/dinov2-with-registers-giant')
model = AutoModel.from_pretrained('facebook/dinov2-with-registers-giant')

inputs = processor(images=image, return_tensors="pt")
outputs = model(**inputs)
last_hidden_states = outputs.last_hidden_state
```
- Consider using cloud GPU services such as AWS, GCP, or Azure for efficient processing, especially for large datasets or models.
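Once you have a hidden-state tensor, it can be split into its token groups. The tensor below is a random stand-in with an assumed layout (1 [CLS] token, 4 register tokens, then a 16x16 grid of patch tokens at hidden size 1536); verify these counts against the actual model config before relying on them.

```python
import torch

# Hypothetical stand-in for last_hidden_state: 1 [CLS] + 4 register
# + 256 patch tokens at hidden size 1536 (assumed layout, not read
# from the real checkpoint).
hidden = torch.randn(1, 1 + 4 + 256, 1536)

cls_token = hidden[:, 0]          # global image embedding
register_tokens = hidden[:, 1:5]  # typically discarded downstream
patch_tokens = hidden[:, 5:]      # spatial features, e.g. for dense tasks

# Restore the 2D spatial layout of the patch tokens (16x16 grid).
patch_grid = patch_tokens.reshape(1, 16, 16, 1536)
```

The [CLS] embedding is a common choice for image-level tasks, while the reshaped patch grid is useful for dense prediction such as segmentation or depth estimation.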
License
The model is released under the Apache 2.0 License, which allows for both commercial and non-commercial use, modification, distribution, and private use.