DINOv2-Large (facebook/dinov2-large)

Introduction

The DINOv2-Large model is a Vision Transformer (ViT) trained using the DINOv2 method for learning robust visual features without supervision. It offers a pretrained transformer encoder useful for feature extraction tasks in computer vision.

Architecture

The Vision Transformer (ViT) is a BERT-like transformer encoder that processes an image as a sequence of fixed-size patches, each of which is linearly embedded. A [CLS] token is prepended to the sequence for use in classification tasks, and absolute position embeddings are added before the sequence is passed through the transformer layers. The model does not include any fine-tuned heads; it is intended as a backbone for extracting features for downstream tasks.
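
To make the patch pipeline concrete, here is a minimal sketch of the ViT front end, assuming DINOv2-Large's configuration of 14x14 patches, a 1024-dimensional hidden size, and a 224x224 input; the weights below are randomly initialized placeholders, not the pretrained ones:

    import torch
    import torch.nn as nn
    
    # DINOv2-Large configuration: 14x14 patches, 1024-dimensional embeddings.
    patch_size, hidden_size = 14, 1024
    image = torch.randn(1, 3, 224, 224)  # (batch, channels, height, width)
    
    # Linearly embed each patch with a strided convolution:
    # 224 / 14 = 16, so the image becomes a 16x16 grid -> 256 patch tokens.
    patch_embed = nn.Conv2d(3, hidden_size, kernel_size=patch_size, stride=patch_size)
    patches = patch_embed(image).flatten(2).transpose(1, 2)  # (1, 256, 1024)
    
    # Prepend a learnable [CLS] token and add absolute position embeddings.
    cls_token = nn.Parameter(torch.zeros(1, 1, hidden_size))
    pos_embed = nn.Parameter(torch.zeros(1, 257, hidden_size))
    tokens = torch.cat([cls_token.expand(1, -1, -1), patches], dim=1) + pos_embed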

Training

The model is pretrained in a self-supervised manner on a large collection of images. Through this pretraining it learns an inner representation of images that can be leveraged to extract features for downstream tasks. For classification, a linear layer can be placed on top of the [CLS] token's last hidden state, which serves as a representation of the entire image.
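
As a concrete sketch of such a linear probe (the 10-class head and the random input tensor are hypothetical, untrained placeholders):

    import torch
    import torch.nn as nn
    from transformers import AutoModel
    
    model = AutoModel.from_pretrained('facebook/dinov2-large')
    classifier = nn.Linear(model.config.hidden_size, 10)  # hypothetical 10-class head
    
    pixel_values = torch.randn(1, 3, 224, 224)  # stand-in for a processed image
    with torch.no_grad():
        outputs = model(pixel_values=pixel_values)
    cls_embedding = outputs.last_hidden_state[:, 0]  # [CLS] token: (1, 1024)
    logits = classifier(cls_embedding)               # (1, 10)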

Guide: Running Locally

To run the DINOv2-Large model locally:

  1. Install Dependencies: Ensure the transformers library is installed, along with torch, Pillow, and requests, which the example below uses:
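    pip install transformers torch pillow requests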
  2. Load the Model and Extract Features:
    from transformers import AutoImageProcessor, AutoModel
    from PIL import Image
    import requests
    
    # Download an example image from the COCO dataset.
    url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
    image = Image.open(requests.get(url, stream=True).raw)
    
    # The processor resizes, crops, and normalizes the image to the model's
    # expected input; the model is the bare encoder without any task head.
    processor = AutoImageProcessor.from_pretrained('facebook/dinov2-large')
    model = AutoModel.from_pretrained('facebook/dinov2-large')
    
    inputs = processor(images=image, return_tensors="pt")
    outputs = model(**inputs)
    # Shape (1, 1 + num_patches, 1024): the [CLS] token at index 0,
    # followed by one 1024-dimensional embedding per image patch.
    last_hidden_states = outputs.last_hidden_state
    
  3. Utilize Cloud GPUs: For more intensive tasks, consider cloud GPU services such as AWS EC2, Google Cloud, or Azure. Moving inference to a GPU is a small change, as sketched below.
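
Assuming a CUDA-capable machine and the variables from step 2 (model, inputs), a minimal sketch of GPU inference:

    import torch
    
    # Place the model and the processed inputs on the same device.
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = model.to(device)
    inputs = {k: v.to(device) for k, v in inputs.items()}
    
    with torch.no_grad():  # inference only; no gradients needed
        outputs = model(**inputs)
    last_hidden_states = outputs.last_hidden_state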

License

The DINOv2-Large model is licensed under the Apache 2.0 License.
