owlv2 large patch14 ensemble
google
Introduction
OWLv2 is a zero-shot, text-conditioned object detection model designed for open-vocabulary object detection. It was introduced in the paper "Scaling Open-Vocabulary Object Detection" by Matthias Minderer, Alexey Gritsenko, and Neil Houlsby. The model enables users to query an image with one or more text queries to detect objects.
Architecture
OWLv2 uses a CLIP backbone, with a ViT-L/14 Transformer as the image encoder and a masked self-attention Transformer as the text encoder. The two encoders are trained with a contrastive loss to maximize the similarity of matching image-text pairs. For detection, the CLIP model is first trained from scratch and then fine-tuned together with added classification and box prediction heads, using a bipartite matching loss.
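The contrastive objective mentioned above can be illustrated with a small, dependency-free sketch. It is not the training code of the model: the function names, the toy 2-dimensional embeddings, and the temperature value of 0.07 are assumptions chosen only for illustration. The loss is low when each image embedding is most similar to its own caption and high otherwise.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def cross_entropy(logits, target_index):
    """Negative log-softmax of the target entry, computed stably."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum_exp - logits[target_index]

def contrastive_loss(image_embeds, text_embeds, temperature=0.07):
    """Symmetric contrastive loss; pair k is (image_embeds[k], text_embeds[k]).

    The temperature is an illustrative assumption, not a value from the source.
    """
    n = len(image_embeds)
    logits = [
        [cosine_similarity(img, txt) / temperature for txt in text_embeds]
        for img in image_embeds
    ]
    # Image-to-text direction: each row should peak on the diagonal.
    image_to_text = sum(cross_entropy(logits[k], k) for k in range(n)) / n
    # Text-to-image direction: each column should peak on the diagonal.
    columns = [[logits[r][c] for r in range(n)] for c in range(n)]
    text_to_image = sum(cross_entropy(columns[k], k) for k in range(n)) / n
    return (image_to_text + text_to_image) / 2

# Toy embeddings (hypothetical values).
images = [[1.0, 0.0], [0.0, 1.0]]
matched_texts = [[1.0, 0.0], [0.0, 1.0]]
swapped_texts = [[0.0, 1.0], [1.0, 0.0]]

matched_loss = contrastive_loss(images, matched_texts)
swapped_loss = contrastive_loss(images, swapped_texts)
```

With these toy inputs, `matched_loss` is close to zero while `swapped_loss` is large, which is the behavior the objective rewards during training.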
Training
The model's CLIP backbone was trained on publicly available image-caption data, combining web-crawled data with pre-existing image datasets such as YFCC100M. The detection heads, together with the backbone, were then fine-tuned on object detection datasets such as COCO and OpenImages. Training therefore first optimizes the similarity between image-text pairs and then refines detection through classification and box prediction.
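The detection fine-tuning relies on a bipartite matching loss, which first pairs each ground-truth object with exactly one prediction before any loss is computed. The following is a minimal sketch of that matching step only, under stated assumptions: the cost (L1 box distance minus class probability) is a simplification, the dictionary layout and the example values are hypothetical, and the brute-force search is only practical for tiny inputs. Real implementations typically use the Hungarian algorithm.

```python
from itertools import permutations

def l1_distance(box_a, box_b):
    """Sum of absolute coordinate differences between two boxes."""
    return sum(abs(a - b) for a, b in zip(box_a, box_b))

def match_cost(prediction, target):
    """Lower is better: close boxes and a high probability for the true class."""
    class_prob = prediction["probs"][target["label"]]
    return l1_distance(prediction["box"], target["box"]) - class_prob

def bipartite_match(predictions, targets):
    """Return (assignment, cost); assignment[t] is the prediction index for target t.

    Assumes len(predictions) >= len(targets). Exhaustive search, for illustration only.
    """
    best_assignment, best_cost = None, float("inf")
    for candidate in permutations(range(len(predictions)), len(targets)):
        cost = sum(
            match_cost(predictions[p], targets[t]) for t, p in enumerate(candidate)
        )
        if cost < best_cost:
            best_assignment, best_cost = candidate, cost
    return best_assignment, best_cost

# Hypothetical predictions and targets; boxes are (cx, cy, w, h) in [0, 1].
predictions = [
    {"box": [0.5, 0.5, 0.2, 0.2], "probs": [0.1, 0.9]},
    {"box": [0.1, 0.1, 0.1, 0.1], "probs": [0.8, 0.2]},
    {"box": [0.9, 0.9, 0.1, 0.1], "probs": [0.5, 0.5]},
]
targets = [
    {"box": [0.1, 0.1, 0.1, 0.1], "label": 0},
    {"box": [0.5, 0.5, 0.2, 0.2], "label": 1},
]

assignment, total_cost = bipartite_match(predictions, targets)
```

Here target 0 is paired with prediction 1 and target 1 with prediction 0; the third prediction stays unmatched and would be treated as background.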
Guide: Running Locally
- Install Dependencies: Ensure you have Python and PyTorch installed. Use pip to install the transformers library. The example below also imports requests and PIL (Pillow).

- Load the Model:

    from transformers import Owlv2Processor, Owlv2ForObjectDetection

    processor = Owlv2Processor.from_pretrained("google/owlv2-large-patch14-ensemble")
    model = Owlv2ForObjectDetection.from_pretrained("google/owlv2-large-patch14-ensemble")

- Input Preparation: Use an image URL and text queries to prepare inputs.

    import requests
    from PIL import Image

    # Download the example image
    url = "http://images.cocodataset.org/val2017/000000039769.jpg"
    image = Image.open(requests.get(url, stream=True).raw)

    # One list of text queries per image
    texts = [["a photo of a cat", "a photo of a dog"]]
    inputs = processor(text=texts, images=image, return_tensors="pt")

- Inference: Run the model to get predictions.

    import torch

    # Gradients are not needed for inference
    with torch.no_grad():
        outputs = model(**inputs)

- Post-Processing: Convert outputs to a readable format.

    # target_sizes expects (height, width); PIL's image.size is (width, height), so reverse it
    target_sizes = torch.Tensor([image.size[::-1]])
    results = processor.post_process_object_detection(outputs=outputs, target_sizes=target_sizes, threshold=0.1)
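The post-processed results are a list with one dictionary per image, holding "boxes", "scores", and "labels". A minimal sketch of turning one such dictionary into readable lines follows. The helper names are hypothetical, and the `example_result` values are placeholders that stand in for real model output so the snippet runs on its own.

```python
def to_list(values):
    """Accept tensors (anything with .tolist()) as well as plain Python lists."""
    return values.tolist() if hasattr(values, "tolist") else list(values)

def summarize_detections(result, queries):
    """Format one image's detections; labels index into that image's text queries."""
    lines = []
    for box, score, label in zip(
        to_list(result["boxes"]), to_list(result["scores"]), to_list(result["labels"])
    ):
        rounded_box = [round(coord, 2) for coord in box]  # (xmin, ymin, xmax, ymax)
        lines.append(
            f"Detected {queries[label]} with confidence {round(score, 3)} "
            f"at location {rounded_box}"
        )
    return lines

# Placeholder values only, NOT real model output.
example_result = {
    "boxes": [[10.0, 20.0, 110.0, 220.0]],
    "scores": [0.61],
    "labels": [0],
}
example_queries = ["a photo of a cat", "a photo of a dog"]
summary = summarize_detections(example_result, example_queries)

# With the guide's variables, the call would be:
# for line in summarize_detections(results[0], texts[0]):
#     print(line)
```

Each label is an index into the list of queries given for that image, so `texts[0]` is the matching query list for `results[0]`.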
Suggested Cloud GPU Providers: Consider using cloud platforms like AWS, Google Cloud, or Azure for GPU support to enhance performance during inference.
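When a GPU is available, the model and its inputs need to be placed on the same device. One possible way to do this is sketched below. The helper names `pick_device` and `move_to_device` are hypothetical, and the fallback to "cpu" when PyTorch is missing exists only so the snippet runs anywhere.

```python
def pick_device():
    """Return "cuda" if PyTorch sees a GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"

def move_to_device(batch, device):
    """Move every value that supports .to(device); leave other values unchanged."""
    return {
        key: value.to(device) if hasattr(value, "to") else value
        for key, value in batch.items()
    }

device = pick_device()

# With the guide's variables, usage would look like:
# model = model.to(device)
# inputs = move_to_device(inputs, device)
# with torch.no_grad():
#     outputs = model(**inputs)
```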
License
OWLv2 is released under the Apache 2.0 License, allowing use, modification, and distribution with proper attribution.