OWLv2 Large Patch14 Ensemble

by Google

Introduction

OWLv2 is a zero-shot, text-conditioned model for open-vocabulary object detection. It was introduced in the paper "Scaling Open-Vocabulary Object Detection" by Matthias Minderer, Alexey Gritsenko, and Neil Houlsby. Given an image and one or more free-text queries, the model detects the objects those queries describe.

Architecture

OWLv2 uses a CLIP backbone with a ViT-L/14 Transformer as the image encoder and a masked self-attention Transformer as the text encoder. Both encoders are trained with a contrastive loss to maximize the similarity of matching image-text pairs. For detection, the CLIP backbone is trained from scratch and then fine-tuned end-to-end together with added classification and box prediction heads, using a bipartite matching loss.
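
The backbone hyperparameters described above can be inspected through the checkpoint's configuration (a minimal sketch using the transformers Owlv2Config API; the printed values depend on the checkpoint):

    from transformers import Owlv2Config

    config = Owlv2Config.from_pretrained("google/owlv2-large-patch14-ensemble")
    print(config.vision_config.patch_size)   # ViT patch size (the "patch14" in the model name)
    print(config.vision_config.hidden_size)  # width of the ViT-L image encoder
    print(config.text_config.hidden_size)    # width of the text encoder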

Training

The backbone was trained on publicly available image-caption data, combining web-crawled image-text pairs with pre-existing datasets such as YFCC100M. The detection capabilities were then fine-tuned on standard object detection datasets such as COCO and OpenImages. Training therefore proceeds in two stages: contrastive pretraining that optimizes the similarity of image-text pairs, followed by detection fine-tuning of the classification and box prediction heads.
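
For intuition, the contrastive pretraining objective can be sketched as a symmetric cross-entropy over cosine similarities of matched image-text pairs (an illustrative PyTorch sketch, not the actual training code; the temperature value is an assumption):

    import torch
    import torch.nn.functional as F

    def clip_contrastive_loss(image_embeds, text_embeds, temperature=0.07):
        # normalize so the dot product becomes cosine similarity
        image_embeds = F.normalize(image_embeds, dim=-1)
        text_embeds = F.normalize(text_embeds, dim=-1)
        logits = image_embeds @ text_embeds.t() / temperature
        # matching image-text pairs lie on the diagonal
        targets = torch.arange(logits.size(0), device=logits.device)
        # symmetric over image-to-text and text-to-image directions
        return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2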

Guide: Running Locally

  1. Install Dependencies: Ensure Python is installed, then use pip to install PyTorch, the transformers library, and the supporting packages used in the snippets below:

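    # PyTorch, Transformers, plus Pillow and requests used in the snippets below
    pip install torch transformers pillow requests
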
  2. Load the Model:

    from transformers import Owlv2Processor, Owlv2ForObjectDetection

    # the processor handles image preprocessing and text tokenization
    processor = Owlv2Processor.from_pretrained("google/owlv2-large-patch14-ensemble")
    model = Owlv2ForObjectDetection.from_pretrained("google/owlv2-large-patch14-ensemble")
    
  3. Input Preparation: Use an image URL and text queries to prepare inputs.

    import requests
    from PIL import Image

    # sample COCO image (two cats on a couch)
    url = "http://images.cocodataset.org/val2017/000000039769.jpg"
    image = Image.open(requests.get(url, stream=True).raw)

    # one list of text queries per image in the batch
    texts = [["a photo of a cat", "a photo of a dog"]]
    inputs = processor(text=texts, images=image, return_tensors="pt")
    
  4. Inference: Run the model to get predictions.

    import torch

    # inference only, so disable gradient tracking
    with torch.no_grad():
        outputs = model(**inputs)
    
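     The raw outputs expose per-proposal class logits and normalized box coordinates (attribute names follow the transformers Owlv2ForObjectDetection output class; an optional sanity check):

    # outputs.logits: (batch, num_proposals, num_queries) class logits
    # outputs.pred_boxes: (batch, num_proposals, 4) normalized box coordinates
    print(outputs.logits.shape, outputs.pred_boxes.shape)
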
  5. Post-Processing: Convert outputs to a readable format.

    results = processor.post_process_object_detection(
        outputs=outputs,
        target_sizes=torch.Tensor([image.size[::-1]]),  # (height, width) of the original image
        threshold=0.1)  # keep detections scoring above 0.1
    
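     To display the detections, iterate over the post-processed results; label indices point back into the texts list (a typical usage sketch):

    i = 0  # index of the image in the batch (only one image here)
    text = texts[i]
    boxes, scores, labels = results[i]["boxes"], results[i]["scores"], results[i]["labels"]
    for box, score, label in zip(boxes, scores, labels):
        box = [round(coord, 2) for coord in box.tolist()]
        print(f"Detected {text[label]} with confidence {round(score.item(), 3)} at location {box}")
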

Suggested Cloud GPU Providers: For faster inference with this large checkpoint, consider a GPU instance from a cloud platform such as AWS, Google Cloud, or Azure.
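
If a GPU is available, move the model and inputs onto it before running inference (a minimal sketch; the rest of the guide is unchanged):

    import torch

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device)
    # every input tensor must live on the same device as the model
    inputs = {k: v.to(device) for k, v in inputs.items()}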

License

OWLv2 is released under the Apache 2.0 License, allowing use, modification, and distribution with proper attribution.
