DPT-Hybrid-MiDaS

Intel

Introduction

The DPT-Hybrid-MiDaS model is a Dense Prediction Transformer (DPT) for monocular depth estimation. Developed by Intel and introduced in the paper "Vision Transformers for Dense Prediction" (Ranftl et al., 2021), it couples a Vision Transformer (ViT) backbone with additional neck and head components tailored to dense depth prediction.

Architecture

DPT-Hybrid-MiDaS combines a ViT-hybrid backbone, a Vision Transformer preceded by a convolutional stem, with a neck that fuses multi-scale features and a head that produces the depth map. It diverges from the standard DPT, which uses a plain ViT backbone, by taking some activations directly from intermediate stages of the hybrid backbone, so fine-grained convolutional features complement the transformer's global context for accurate depth estimation.
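
To see this structure programmatically, one can inspect the model's configuration and top-level modules. A minimal sketch; the is_hybrid and backbone_out_indices fields are assumed from recent transformers versions of DPTConfig:

    from transformers import DPTConfig, DPTForDepthEstimation
    
    # The configuration records the hybrid backbone and which of its stages
    # feed the decoder (field names assumed from recent transformers releases).
    config = DPTConfig.from_pretrained("Intel/dpt-hybrid-midas")
    print(config.is_hybrid)             # True for the hybrid variant
    print(config.backbone_out_indices)  # backbone stages routed to the neck
    
    # The top-level modules mirror the backbone/neck/head split.
    model = DPTForDepthEstimation.from_pretrained("Intel/dpt-hybrid-midas")
    for name, _ in model.named_children():
        print(name)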

Training

The model is trained on the MIX 6 dataset, which comprises approximately 1.4 million images. Training starts from ImageNet-pretrained weights; images are resized so that the longer side is 384 pixels, and random horizontal flips are applied for data augmentation. This recipe emphasizes zero-shot transfer, so the model generalizes to depth-estimation datasets it was never trained on.
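
As an illustration of that preprocessing, and not the original training code, the sketch below resizes an image so its longer side is 384 pixels and applies a random horizontal flip; the helper names are hypothetical:

    import random
    from PIL import Image
    
    def resize_longer_side(image, target=384):
        # Scale so the longer side equals `target`, preserving aspect ratio.
        w, h = image.size
        scale = target / max(w, h)
        return image.resize((round(w * scale), round(h * scale)), Image.BICUBIC)
    
    def augment(image, flip_prob=0.5):
        # Random horizontal flip, as used for data augmentation during training.
        if random.random() < flip_prob:
            image = image.transpose(Image.FLIP_LEFT_RIGHT)
        return resize_longer_side(image)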

Guide: Running Locally

To run the DPT-Hybrid-MiDaS model locally, follow these steps:

  1. Install Dependencies: Ensure transformers, torch, and Pillow (PIL) are installed in your Python environment, for example via pip install transformers torch Pillow.
  2. Load Model and Processor:
    from transformers import DPTImageProcessor, DPTForDepthEstimation
    
    image_processor = DPTImageProcessor.from_pretrained("Intel/dpt-hybrid-midas")
    model = DPTForDepthEstimation.from_pretrained("Intel/dpt-hybrid-midas", low_cpu_mem_usage=True)
    
  3. Prepare Input Image:
    from PIL import Image
    import requests
    
    url = "http://images.cocodataset.org/val2017/000000039769.jpg"
    image = Image.open(requests.get(url, stream=True).raw)
    inputs = image_processor(images=image, return_tensors="pt")
    
  4. Perform Inference:
    import torch
    
    with torch.no_grad():
        outputs = model(**inputs)
        predicted_depth = outputs.predicted_depth
    
  5. Visualize Output:
    import numpy as np
    import torch
    
    # Upscale the predicted depth map to the original image resolution;
    # PIL stores size as (width, height), so it is reversed to (height, width).
    prediction = torch.nn.functional.interpolate(
        predicted_depth.unsqueeze(1),
        size=image.size[::-1],
        mode="bicubic",
        align_corners=False,
    )
    
    # Normalize to the 0-255 range and render as an 8-bit grayscale image.
    output = prediction.squeeze().cpu().numpy()
    formatted = (output * 255 / np.max(output)).astype("uint8")
    depth = Image.fromarray(formatted)
    depth.show()
    
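As a shortcut, recent versions of transformers also provide a depth-estimation pipeline that wraps the preprocessing, inference, and resizing steps above; a minimal sketch:

    from transformers import pipeline
    
    # The pipeline bundles preprocessing, inference, and interpolation.
    pipe = pipeline("depth-estimation", model="Intel/dpt-hybrid-midas")
    result = pipe(image)    # accepts a PIL image, file path, or URL
    result["depth"].show()  # the depth map as a PIL image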

For enhanced performance, consider using a cloud GPU service such as AWS EC2, Google Cloud, or Azure.
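
When a GPU is available, locally or through such a service, moving the model and inputs onto it speeds up inference; a minimal sketch reusing the model and inputs from the steps above:

    import torch
    
    # Select a CUDA device when available, otherwise fall back to CPU.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model.to(device)
    inputs = {k: v.to(device) for k, v in inputs.items()}
    
    with torch.no_grad():
        predicted_depth = model(**inputs).predicted_depth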

License

The DPT-Hybrid-MiDaS model is released under the Apache 2.0 license, permitting both personal and commercial use, provided the license text and attribution notices are retained.
