DINOv2-Large (facebook/dinov2-large)

Introduction

The DINOv2-Large model is a Vision Transformer (ViT) trained using the DINOv2 method for learning robust visual features without supervision. It offers a pretrained transformer encoder useful for feature extraction tasks in computer vision.

Architecture

The Vision Transformer (ViT) is a BERT-like transformer encoder that processes an image as a sequence of fixed-size patches, each of which is linearly embedded. A [CLS] token is prepended to the sequence for use in classification tasks, and absolute position embeddings are added before the sequence is passed through the transformer layers. The model does not include any fine-tuned heads; it is intended as a backbone for extracting features for downstream tasks.
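
To make the patch pipeline concrete, here is a minimal sketch of the ViT front end, assuming DINOv2-Large's configuration of 14x14 patches, a 1024-dimensional hidden size, and a 224x224 input; the weights below are randomly initialized placeholders, not the pretrained ones:

    import torch
    import torch.nn as nn
    
    # DINOv2-Large configuration: 14x14 patches, 1024-dimensional embeddings.
    patch_size, hidden_size = 14, 1024
    image = torch.randn(1, 3, 224, 224)  # (batch, channels, height, width)
    
    # Linearly embed each patch with a strided convolution:
    # 224 / 14 = 16, so the image becomes a 16x16 grid -> 256 patch tokens.
    patch_embed = nn.Conv2d(3, hidden_size, kernel_size=patch_size, stride=patch_size)
    patches = patch_embed(image).flatten(2).transpose(1, 2)  # (1, 256, 1024)
    
    # Prepend a learnable [CLS] token and add absolute position embeddings.
    cls_token = nn.Parameter(torch.zeros(1, 1, hidden_size))
    pos_embed = nn.Parameter(torch.zeros(1, 257, hidden_size))
    tokens = torch.cat([cls_token.expand(1, -1, -1), patches], dim=1) + pos_embed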

Training

The model is pretrained in a self-supervised manner on a large collection of images. Through this pretraining it learns an inner representation of images that can be leveraged to extract features for downstream tasks. For classification, a linear layer can be placed on top of the [CLS] token's last hidden state, which serves as a representation of the entire image.
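
As a concrete sketch of such a linear probe (the 10-class head and the random input tensor are hypothetical, untrained placeholders):

    import torch
    import torch.nn as nn
    from transformers import AutoModel
    
    model = AutoModel.from_pretrained('facebook/dinov2-large')
    classifier = nn.Linear(model.config.hidden_size, 10)  # hypothetical 10-class head
    
    pixel_values = torch.randn(1, 3, 224, 224)  # stand-in for a processed image
    with torch.no_grad():
        outputs = model(pixel_values=pixel_values)
    cls_embedding = outputs.last_hidden_state[:, 0]  # [CLS] token: (1, 1024)
    logits = classifier(cls_embedding)               # (1, 10)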

Guide: Running Locally

To run the DINOv2-Large model locally:

  1. Install Dependencies: Ensure the transformers library is installed, along with torch, Pillow, and requests, which the example below uses:
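    pip install transformers torch pillow requests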
  2. Load the Model and Extract Features:
    from transformers import AutoImageProcessor, AutoModel
    from PIL import Image
    import requests
    
    # Download an example image from the COCO dataset.
    url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
    image = Image.open(requests.get(url, stream=True).raw)
    
    # The processor resizes, crops, and normalizes the image to the model's
    # expected input; the model is the bare encoder without any task head.
    processor = AutoImageProcessor.from_pretrained('facebook/dinov2-large')
    model = AutoModel.from_pretrained('facebook/dinov2-large')
    
    inputs = processor(images=image, return_tensors="pt")
    outputs = model(**inputs)
    # Shape (1, 1 + num_patches, 1024): the [CLS] token at index 0,
    # followed by one 1024-dimensional embedding per image patch.
    last_hidden_states = outputs.last_hidden_state
    
  3. Utilize Cloud GPUs: For more intensive tasks, consider cloud GPU services such as AWS EC2, Google Cloud, or Azure. Moving inference to a GPU is a small change, as sketched below.
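
Assuming a CUDA-capable machine and the variables from step 2 (model, inputs), a minimal sketch of GPU inference:

    import torch
    
    # Place the model and the processed inputs on the same device.
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = model.to(device)
    inputs = {k: v.to(device) for k, v in inputs.items()}
    
    with torch.no_grad():  # inference only; no gradients needed
        outputs = model(**inputs)
    last_hidden_states = outputs.last_hidden_state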

License

The DINOv2-Large model is licensed under the Apache 2.0 License.
