Grounding DINO Tiny

IDEA-Research

Introduction

The Grounding DINO model (tiny variant) is an open-set object detector. It pairs a text encoder with a traditional closed-set object detection model, enabling zero-shot object detection: objects described by free-text prompts can be localized without category-specific training labels. The model performs strongly, reaching 52.5 average precision (AP) on the COCO zero-shot transfer benchmark.

Architecture

Grounding DINO combines a DINO-based (DETR-style) object detection framework with grounded pre-training. The fused text and image streams let arbitrary textual phrases serve as detection targets, extending the model beyond closed-set detectors tied to a fixed label set.
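As a rough illustration of this two-stream design, the Hugging Face Transformers configuration for the checkpoint exposes separate vision-backbone and text-backbone sub-configs. A minimal inspection sketch follows; the exact attribute names reflect recent Transformers releases and should be treated as an assumption:

    from transformers import AutoConfig

    config = AutoConfig.from_pretrained("IDEA-Research/grounding-dino-tiny")
    print(type(config.backbone_config).__name__)  # Swin-based vision backbone config
    print(type(config.text_config).__name__)      # BERT-style text backbone config
    print(config.num_queries)                     # object queries used by the DINO decoder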

Training

The model is trained on paired visual and textual data. Grounded pre-training teaches it to associate image regions with natural-language phrases, so at inference it can detect objects it was never explicitly labeled with, which is what makes it effective in open-set scenarios.
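In practice, detection targets are therefore supplied as free-form text at inference time. With the Hugging Face checkpoint, multiple phrases are conventionally joined into a single lowercase prompt, each phrase ending with a period; a small sketch of that convention:

    # Build a Grounding DINO text prompt from a list of phrases
    labels = ["a cat", "a remote control"]
    text = ". ".join(labels) + "."  # -> "a cat. a remote control."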

Guide: Running Locally

To run the model locally for zero-shot object detection, follow these steps:

  1. Install necessary libraries:
    pip install torch transformers pillow requests
    
  2. Import required modules in your Python script:
    import requests
    import torch
    from PIL import Image
    from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection
    
  3. Define the model and processor:
    model_id = "IDEA-Research/grounding-dino-tiny"
    device = "cuda" if torch.cuda.is_available() else "cpu"
    
    processor = AutoProcessor.from_pretrained(model_id)
    model = AutoModelForZeroShotObjectDetection.from_pretrained(model_id).to(device)
    
  4. Load and process an image:
    image_url = "http://images.cocodataset.org/val2017/000000039769.jpg"
    image = Image.open(requests.get(image_url, stream=True).raw)
    # Queries must be lowercase, with each phrase ending in a period
    text = "a cat. a remote control."
    
    inputs = processor(images=image, text=text, return_tensors="pt").to(device)
    
  5. Run inference and process results (a snippet for printing the detections follows this list):
    with torch.no_grad():
        outputs = model(**inputs)
    
    # target_sizes flips PIL's (width, height) to the (height, width) expected
    # by the post-processor, so boxes are rescaled to original image coordinates
    results = processor.post_process_grounded_object_detection(
        outputs,
        inputs.input_ids,
        box_threshold=0.4,
        text_threshold=0.3,
        target_sizes=[image.size[::-1]]
    )
    

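The post-processor returns one dictionary per image; in recent Transformers releases it contains "scores", "labels", and "boxes" (in (x_min, y_min, x_max, y_max) pixel coordinates). A minimal sketch for printing the detections, under that assumption:

    result = results[0]  # first (and only) image in the batch
    for score, label, box in zip(result["scores"], result["labels"], result["boxes"]):
        box = [round(coord, 2) for coord in box.tolist()]
        print(f"Detected '{label}' with confidence {score:.3f} at {box}")
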
For best performance, run the model on a GPU; cloud options include AWS EC2 GPU instances, Google Cloud Platform, and Azure.
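
On a CUDA device, one further option is mixed-precision inference via PyTorch autocast, which can reduce memory use and latency. This is a sketch, assuming device == "cuda" and that your GPU supports float16:

    # Mixed-precision inference on a CUDA device
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        with torch.no_grad():
            outputs = model(**inputs)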

License

The Grounding DINO model is released under the Apache 2.0 license, which permits both personal and commercial use provided the license terms, including preservation of copyright notices, are followed.
