Grounding DINO Tiny
IDEA-Research

Introduction
The Grounding DINO model (tiny variant) is an open-set object detector. It couples a text encoder with a traditional closed-set object detection model, enabling zero-shot object detection: objects described in free-form text can be found in images without any labeled training examples for those categories. The model achieves 52.5 average precision (AP) on the COCO zero-shot benchmark.
Architecture
Grounding DINO combines a DINO-based object detection framework with grounded pre-training. This integration lets the model detect arbitrary objects specified by textual descriptions, rather than being limited to the fixed set of categories defined by a labeled dataset.
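A quick way to see these two halves is to inspect the model configuration. The sketch below assumes the attribute names exposed by the transformers GroundingDinoConfig (backbone_config for the vision backbone, text_config for the text encoder); exact names may differ across library versions:

from transformers import AutoConfig

# Inspection sketch: attribute names assumed from transformers'
# GroundingDinoConfig and may vary by version.
config = AutoConfig.from_pretrained("IDEA-Research/grounding-dino-tiny")
print(config.backbone_config.model_type)  # vision backbone, e.g. "swin"
print(config.text_config.model_type)      # text encoder, e.g. "bert"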
Training
The model is trained using both visual and textual data. Grounded pre-training enables the model to understand and detect objects based on natural language descriptions, enhancing its versatility and effectiveness in open-set scenarios.
Guide: Running Locally
To run the model locally for zero-shot object detection, follow these steps:
- Install necessary libraries:
pip install torch transformers pillow requests
- Import required modules in your Python script:
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection
- Define the model and processor:
model_id = "IDEA-Research/grounding-dino-tiny"
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForZeroShotObjectDetection.from_pretrained(model_id).to(device)
- Load and process an image:
image_url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(image_url, stream=True).raw)

# Text queries should be lowercase and end with a dot.
text = "a cat. a remote control."

inputs = processor(images=image, text=text, return_tensors="pt").to(device)
- Run inference and process results:
with torch.no_grad():
    outputs = model(**inputs)

results = processor.post_process_grounded_object_detection(
    outputs,
    inputs.input_ids,
    box_threshold=0.4,
    text_threshold=0.3,
    target_sizes=[image.size[::-1]]
)
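To read the detections: results is a list with one entry per image, and each entry is a dict with scores, labels, and boxes (field names follow the transformers post-processing output and may vary slightly between versions). A minimal sketch for printing them:

for result in results:
    for score, label, box in zip(result["scores"], result["labels"], result["boxes"]):
        # label is the matched text phrase; box is (xmin, ymin, xmax, ymax) in pixels
        print(f"{label}: {score.item():.2f} at {[round(v, 1) for v in box.tolist()]}")

Raising box_threshold trades recall for precision; lower it if expected objects are being missed.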
For best performance, run the model on a GPU, either locally or through a cloud service such as AWS EC2, Google Cloud Platform, or Azure.
License
The Grounding DINO model is released under the Apache 2.0 license, allowing for both personal and commercial use with proper attribution.