Grounding DINO Tiny
IDEA-Research

Introduction
The Grounding DINO model (tiny variant) is an open-set object detector. It couples a text encoder with a traditional closed-set object detection model, enabling zero-shot object detection: objects described in free-form text can be found in images without any labeled training examples for those categories. The model achieves 52.5 average precision (AP) on the COCO zero-shot benchmark.
Architecture
Grounding DINO combines a DINO-based object detection framework with grounded pre-training. This integration lets the model detect arbitrary objects specified by textual descriptions, rather than being limited to the fixed set of categories defined by a labeled dataset.
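A quick way to see these two halves is to inspect the model configuration. The sketch below assumes the attribute names exposed by the transformers GroundingDinoConfig (backbone_config for the vision backbone, text_config for the text encoder); exact names may differ across library versions:

from transformers import AutoConfig

# Inspection sketch: attribute names assumed from transformers'
# GroundingDinoConfig and may vary by version.
config = AutoConfig.from_pretrained("IDEA-Research/grounding-dino-tiny")
print(config.backbone_config.model_type)  # vision backbone, e.g. "swin"
print(config.text_config.model_type)      # text encoder, e.g. "bert"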
Training
The model is trained using both visual and textual data. Grounded pre-training enables the model to understand and detect objects based on natural language descriptions, enhancing its versatility and effectiveness in open-set scenarios.
Guide: Running Locally
To run the model locally for zero-shot object detection, follow these steps:
- Install necessary libraries:
pip install torch transformers pillow requests
- Import required modules in your Python script:
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection
- Define the model and processor:
model_id = "IDEA-Research/grounding-dino-tiny"
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForZeroShotObjectDetection.from_pretrained(model_id).to(device)
- Load and process an image:
image_url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(image_url, stream=True).raw)

# Text queries should be lowercase and end with a dot.
text = "a cat. a remote control."

inputs = processor(images=image, text=text, return_tensors="pt").to(device)
- Run inference and process results:
with torch.no_grad():
    outputs = model(**inputs)

results = processor.post_process_grounded_object_detection(
    outputs,
    inputs.input_ids,
    box_threshold=0.4,
    text_threshold=0.3,
    target_sizes=[image.size[::-1]]
)
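To read the detections: results is a list with one entry per image, and each entry is a dict with scores, labels, and boxes (field names follow the transformers post-processing output and may vary slightly between versions). A minimal sketch for printing them:

for result in results:
    for score, label, box in zip(result["scores"], result["labels"], result["boxes"]):
        # label is the matched text phrase; box is (xmin, ymin, xmax, ymax) in pixels
        print(f"{label}: {score.item():.2f} at {[round(v, 1) for v in box.tolist()]}")

Raising box_threshold trades recall for precision; lower it if expected objects are being missed.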
For best performance, run the model on a GPU, either locally or through a cloud service such as AWS EC2, Google Cloud Platform, or Azure.
License
The Grounding DINO model is released under the Apache 2.0 license, allowing for both personal and commercial use with proper attribution.