DPT-Hybrid-MiDaS
Introduction
The DPT-Hybrid-MiDaS model is a Dense Prediction Transformer designed for monocular depth estimation. Developed by Intel and introduced in the paper "Vision Transformers for Dense Prediction," it leverages a Vision Transformer (ViT) backbone with additional components for depth estimation tasks.
Architecture
DPT-Hybrid-MiDaS employs a ViT-hybrid architecture, in which a convolutional (ResNet) stem produces the patch embeddings fed into a Vision Transformer, topped by neck and head layers for dense prediction. It diverges from the standard DPT by using this ViT-hybrid backbone and tapping some of its intermediate activations for the decoder, letting the model combine convolutional and transformer features for accurate depth estimation.
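To see how the checkpoint encodes this hybrid design, you can inspect its configuration with `transformers` (a minimal sketch; attribute names such as `is_hybrid` and `backbone_out_indices` reflect recent `transformers` versions and may differ in older releases):

```python
from transformers import DPTConfig

# Load the configuration shipped with the checkpoint
config = DPTConfig.from_pretrained("Intel/dpt-hybrid-midas")

print(config.is_hybrid)             # True: the ViT-hybrid backbone is enabled
print(config.backbone_out_indices)  # backbone stages whose activations feed the neck
```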
Training
The model is trained on the MIX 6 dataset, which comprises approximately 1.4 million images. Training starts from ImageNet-pretrained weights; images are resized so that the longer side is 384 pixels, and random horizontal flips are applied for data augmentation. This regimen targets zero-shot transfer, i.e. strong depth estimates on datasets never seen during training.
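The resizing and flipping described above can be illustrated with plain Pillow (a hypothetical helper for intuition only, not the authors' actual training pipeline):

```python
import random
from PIL import Image

def resize_and_flip(image: Image.Image, longer_side: int = 384) -> Image.Image:
    # Scale so the longer side equals `longer_side`, keeping the aspect ratio
    w, h = image.size
    scale = longer_side / max(w, h)
    image = image.resize((round(w * scale), round(h * scale)), Image.BICUBIC)
    # Random horizontal flip, applied with probability 0.5
    if random.random() < 0.5:
        image = image.transpose(Image.FLIP_LEFT_RIGHT)
    return image
```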
Guide: Running Locally
To run the DPT-Hybrid-MiDaS model locally, follow these steps:
- Install Dependencies: Ensure `transformers`, `torch`, `requests`, and `PIL` (Pillow) are installed in your Python environment.
- Load Model and Processor:
```python
from transformers import DPTImageProcessor, DPTForDepthEstimation

image_processor = DPTImageProcessor.from_pretrained("Intel/dpt-hybrid-midas")
model = DPTForDepthEstimation.from_pretrained("Intel/dpt-hybrid-midas", low_cpu_mem_usage=True)
```
- Prepare Input Image:
```python
from PIL import Image
import requests

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = image_processor(images=image, return_tensors="pt")
```
- Perform Inference:
```python
import torch

with torch.no_grad():
    outputs = model(**inputs)
    predicted_depth = outputs.predicted_depth
```
- Visualize Output:
```python
import numpy as np

# Upsample the predicted depth to the input image's resolution
prediction = torch.nn.functional.interpolate(
    predicted_depth.unsqueeze(1),
    size=image.size[::-1],
    mode="bicubic",
    align_corners=False,
)

# Normalize to an 8-bit grayscale image for display
output = prediction.squeeze().cpu().numpy()
formatted = (output * 255 / np.max(output)).astype("uint8")
depth = Image.fromarray(formatted)
depth.show()
```
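Alternatively, the whole workflow above can be condensed with the `transformers` pipeline API (a brief sketch; the exact keys of the returned dictionary may vary across versions):

```python
from transformers import pipeline

# High-level one-liner covering preprocessing, inference, and postprocessing
depth_estimator = pipeline("depth-estimation", model="Intel/dpt-hybrid-midas")
result = depth_estimator("http://images.cocodataset.org/val2017/000000039769.jpg")
result["depth"].show()  # PIL image of the predicted depth map
```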
For enhanced performance, consider using a cloud GPU service such as AWS EC2, Google Cloud, or Azure.
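On such an instance, moving the model and inputs to the GPU is a small change (a sketch assuming the variables from the steps above are still in scope):

```python
import torch

# Use the GPU when one is available, e.g. on a cloud instance
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
inputs = {k: v.to(device) for k, v in inputs.items()}

with torch.no_grad():
    predicted_depth = model(**inputs).predicted_depth
```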
License
The DPT-Hybrid-MiDaS model is released under the Apache 2.0 license, which permits both personal and commercial use as long as its terms, including retention of the license and copyright notices, are followed.