DINOv2-Large (facebook/dinov2-large)
Introduction
The DINOv2-Large model is a Vision Transformer (ViT) trained with the DINOv2 method for learning robust visual features without supervision. It provides a pretrained transformer encoder that serves as a general-purpose feature extractor for computer vision tasks.
Architecture
The Vision Transformer (ViT) is a BERT-like transformer encoder that processes an image by splitting it into fixed-size patches (14×14 pixels for this checkpoint), linearly embedding each patch, and prepending a [CLS] token for classification tasks. Absolute position embeddings are added before the sequence is passed through the transformer layers. The model ships without any fine-tuned heads, but its features can be used directly for downstream tasks.
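To make that token layout concrete, here is a minimal sketch that prints where the [CLS] token and the patch tokens sit in the model's output. It assumes the processor's default 224×224 input resolution and this checkpoint's 14×14 patch size; the variable names are illustrative, not part of the library API:

```python
import torch
from transformers import AutoModel

# A minimal sketch of the output layout, assuming a 224x224 input
# and this checkpoint's 14x14 patch size.
model = AutoModel.from_pretrained('facebook/dinov2-large')

dummy = torch.zeros(1, 3, 224, 224)              # blank RGB image tensor
with torch.no_grad():
    outputs = model(pixel_values=dummy)

# 224 / 14 = 16 patches per side -> 256 patch tokens, plus one [CLS] token.
print(outputs.last_hidden_state.shape)           # torch.Size([1, 257, 1024])
cls_embedding = outputs.last_hidden_state[:, 0]      # one [CLS] vector per image
patch_embeddings = outputs.last_hidden_state[:, 1:]  # one vector per patch
```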
Training
The model is pretrained in a self-supervised manner on a large collection of images. This pretraining lets the model learn rich inner representations of images, which can be leveraged to extract features for various tasks. For classification, a linear layer can be placed on top of the [CLS] token's last hidden state, which represents the entire image, as sketched below.
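The following is a minimal linear-probe sketch of that setup, with the backbone kept frozen. The 10-class head and the predict helper are hypothetical stand-ins for whatever downstream task you have:

```python
import torch
from torch import nn
from transformers import AutoModel

# Sketch of a linear probe on the frozen DINOv2-Large backbone,
# assuming a hypothetical 10-class classification task.
backbone = AutoModel.from_pretrained('facebook/dinov2-large')
for p in backbone.parameters():
    p.requires_grad = False

classifier = nn.Linear(backbone.config.hidden_size, 10)  # 1024 -> num_classes

def predict(pixel_values: torch.Tensor) -> torch.Tensor:
    # Use the [CLS] token's last hidden state as the image representation.
    with torch.no_grad():
        features = backbone(pixel_values=pixel_values).last_hidden_state[:, 0]
    return classifier(features)  # logits over the 10 hypothetical classes

logits = predict(torch.zeros(1, 3, 224, 224))
print(logits.shape)  # torch.Size([1, 10])
```

Only the classifier's parameters are trainable here, so the probe can be fit with a standard cross-entropy loss while the pretrained features stay fixed.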
Guide: Running Locally
To run the DINOv2-Large model locally:
- Install the Transformers library: ensure the transformers library is installed (pip install transformers).
- Load the model:
```python
from transformers import AutoImageProcessor, AutoModel
from PIL import Image
import requests

# Load a sample image from the COCO validation set
url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)

# Load the processor and the pretrained DINOv2-Large backbone
processor = AutoImageProcessor.from_pretrained('facebook/dinov2-large')
model = AutoModel.from_pretrained('facebook/dinov2-large')

# Preprocess the image and extract features
inputs = processor(images=image, return_tensors="pt")
outputs = model(**inputs)
last_hidden_states = outputs.last_hidden_state
```
- Utilize cloud GPUs: for more intensive workloads, consider cloud GPU services such as AWS EC2, Google Cloud, or Azure; moving the model to a GPU takes only a few extra lines, as sketched below.
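A minimal sketch of running the same extraction on a GPU, assuming CUDA is available (it falls back to CPU otherwise):

```python
import torch
import requests
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

# Pick a device: CUDA if available, otherwise CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)

processor = AutoImageProcessor.from_pretrained('facebook/dinov2-large')
model = AutoModel.from_pretrained('facebook/dinov2-large').to(device)

# Move the preprocessed inputs to the same device as the model.
inputs = processor(images=image, return_tensors="pt").to(device)
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.device)
```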
License
The DINOv2-Large model is licensed under the Apache 2.0 License.