DINOv2 with Registers (Giant)

Introduction
The Vision Transformer (ViT) model known as DINOv2 with Registers is a transformer encoder designed for image feature extraction. It was introduced in the paper "Vision Transformers Need Registers" by Darcet et al. and is available on the Hugging Face Hub as facebook/dinov2-with-registers-giant. The model addresses artifacts in attention maps by adding "register" tokens during pre-training; these tokens are discarded afterward, yielding improved performance and more interpretable attention maps.

Architecture
The model uses the Vision Transformer (ViT) architecture, a BERT-like transformer encoder. During pre-training, additional "register" tokens are appended to the input sequence to improve attention-map quality and model performance. The model learns inner representations of images that can be used for feature extraction; for downstream tasks such as image classification, a linear layer can be placed on top of these features.
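
As a quick check of this setup, the register count and embedding width can be read from the checkpoint configuration. This is a minimal sketch assuming the DINOv2-with-registers configuration in recent transformers releases exposes the num_register_tokens and hidden_size fields:

    from transformers import AutoConfig
    
    # Inspect the checkpoint configuration (field names assume the
    # DINOv2-with-registers implementation in recent transformers versions)
    config = AutoConfig.from_pretrained('facebook/dinov2-with-registers-giant')
    print(config.num_register_tokens)  # register tokens added during pre-training
    print(config.hidden_size)          # embedding dimension of the giant variant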

Training
The DINOv2 with Registers model is pre-trained without any fine-tuned heads, focusing on self-supervised image feature extraction. This pre-training lets it learn meaningful features from images without requiring labeled data; when a labeled dataset is available, those features can be reused by training a linear classifier on top of the frozen backbone, as sketched below.
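
A minimal sketch of that linear-probe workflow, assuming PyTorch; the num_classes value and the images passed to extract_features are placeholders for your own dataset:

    import torch
    from transformers import AutoImageProcessor, AutoModel
    
    processor = AutoImageProcessor.from_pretrained('facebook/dinov2-with-registers-giant')
    backbone = AutoModel.from_pretrained('facebook/dinov2-with-registers-giant')
    backbone.eval()  # keep the backbone frozen; only the linear head is trained
    
    num_classes = 10  # placeholder: set to the number of labels in your task
    head = torch.nn.Linear(backbone.config.hidden_size, num_classes)
    
    def extract_features(images):
        # Use the CLS token (position 0) as a global image embedding
        inputs = processor(images=images, return_tensors="pt")
        with torch.no_grad():
            outputs = backbone(**inputs)
        return outputs.last_hidden_state[:, 0]
    
    # Example usage: logits = head(extract_features(batch_of_pil_images))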

Guide: Running Locally
To use the model locally, follow these steps:

  1. Install the transformers library from Hugging Face, along with PyTorch, Pillow, and requests, which the example below relies on:
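
    pip install transformers torch pillow requests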

  2. Use the following Python code to load the model and processor for image feature extraction:

    from transformers import AutoImageProcessor, AutoModel
    from PIL import Image
    import requests
    import torch
    
    # Load a sample image from the COCO validation set
    url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
    image = Image.open(requests.get(url, stream=True).raw)
    
    # Load the image processor and the pre-trained backbone
    processor = AutoImageProcessor.from_pretrained('facebook/dinov2-with-registers-giant')
    model = AutoModel.from_pretrained('facebook/dinov2-with-registers-giant')
    
    # Preprocess the image and run a forward pass; gradients are not
    # needed for feature extraction
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    last_hidden_states = outputs.last_hidden_state
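    
    # The output holds one embedding per token: CLS, register, and patch
    # tokens (token ordering per the transformers implementation is an
    # assumption here, not part of the original snippet)
    print(last_hidden_states.shape)  # (batch_size, num_tokens, hidden_size)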
    
  3. Consider using cloud GPU services such as AWS, GCP, or Azure for efficient processing, especially for large datasets or for a model of this size; a minimal device-placement sketch follows.
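
     As a sketch (assuming PyTorch and a CUDA-capable GPU), the model and inputs from step 2 can be moved to the GPU:

    import torch
    
    # Run the forward pass on the GPU when one is available
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model.to(device)
    inputs = {k: v.to(device) for k, v in inputs.items()}
    with torch.no_grad():
        outputs = model(**inputs)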

License
The model is released under the Apache 2.0 License, which allows for both commercial and non-commercial use, modification, distribution, and private use.
