DeepLabV3 with MobileViT Small (apple/deeplabv3-mobilevit-small)

Introduction

The MobileViT + DeepLabV3 model is a lightweight, mobile-friendly vision transformer designed for semantic segmentation. It combines a MobileViT backbone with a DeepLabV3 segmentation head, is pre-trained on ImageNet-1k, and is fine-tuned on the PASCAL VOC dataset. The model targets tasks that require efficient, low-latency image processing, leveraging MobileViT's architecture, which mixes local convolutional processing with global processing through transformers.

Architecture

MobileViT is a lightweight convolutional neural network that combines MobileNetV2-style layers with a new block in which the local processing of convolutions is replaced by global processing with transformers. Inside a MobileViT block, feature maps are flattened into patches, passed through transformer layers, and then unflattened back into feature maps. Because the block operates on feature maps rather than a fixed token grid, it can be placed anywhere inside a CNN and does not require positional embeddings.
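
To make the flatten-transform-unflatten idea concrete, here is a minimal, self-contained PyTorch sketch of a MobileViT-style block. The patch size, channel count, and transformer depth are illustrative assumptions, not the exact values used in apple/deeplabv3-mobilevit-small.

    import torch
    import torch.nn as nn
    
    class ToyMobileViTBlock(nn.Module):
        """Flatten 2x2 patches into tokens, apply a transformer, fold back to a feature map."""
        def __init__(self, channels=64, patch=2, depth=2, heads=4):
            super().__init__()
            self.patch = patch
            layer = nn.TransformerEncoderLayer(d_model=channels, nhead=heads, batch_first=True)
            self.transformer = nn.TransformerEncoder(layer, num_layers=depth)
    
        def forward(self, x):                       # x: (B, C, H, W), H and W divisible by patch
            b, c, h, w = x.shape
            p = self.patch
            # Flatten non-overlapping p x p patches into token sequences.
            x = x.reshape(b, c, h // p, p, w // p, p)
            x = x.permute(0, 3, 5, 2, 4, 1)         # (B, p, p, H/p, W/p, C)
            x = x.reshape(b * p * p, (h // p) * (w // p), c)
            # Global self-attention among tokens that share a within-patch position.
            x = self.transformer(x)
            # Unflatten the tokens back into a feature map of the original spatial size.
            x = x.reshape(b, p, p, h // p, w // p, c)
            x = x.permute(0, 5, 3, 1, 4, 2).reshape(b, c, h, w)
            return x
    
    feature_map = torch.randn(1, 64, 32, 32)
    print(ToyMobileViTBlock()(feature_map).shape)   # torch.Size([1, 64, 32, 32])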

Training

The MobileViT backbone is first trained from scratch on the ImageNet-1k dataset, using techniques such as label smoothing, cosine learning rate annealing, and multi-scale sampling. The network is then fine-tuned on the PASCAL VOC dataset to obtain the DeepLabV3 segmentation model. Training was performed on multiple NVIDIA GPUs.
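
As a quick illustration of two of these ingredients, the sketch below wires label-smoothed cross-entropy and cosine learning-rate annealing together in plain PyTorch. The model, batch, and hyperparameter values are placeholders, not the settings used for this checkpoint.

    import torch
    import torch.nn as nn
    
    model = nn.Linear(10, 21)                       # stand-in for the real network (21 PASCAL VOC classes)
    criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=1e-4)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=300)
    
    for epoch in range(300):
        logits = model(torch.randn(8, 10))          # dummy batch
        loss = criterion(logits, torch.randint(0, 21, (8,)))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        scheduler.step()                            # decay the learning rate along a cosine curve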

Guide: Running Locally

  1. Install Dependencies: Ensure you have PyTorch and the transformers library installed, along with Pillow and requests for the image-loading example below.

    pip install torch transformers pillow requests
    
  2. Load the Model:

    from transformers import MobileViTFeatureExtractor, MobileViTForSemanticSegmentation
    from PIL import Image
    import requests
    
    # Download an example image.
    url = "http://images.cocodataset.org/val2017/000000039769.jpg"
    image = Image.open(requests.get(url, stream=True).raw)
    
    # Load the preprocessor and the pre-trained segmentation model.
    feature_extractor = MobileViTFeatureExtractor.from_pretrained("apple/deeplabv3-mobilevit-small")
    model = MobileViTForSemanticSegmentation.from_pretrained("apple/deeplabv3-mobilevit-small")
    
    # Preprocess the image and run a forward pass.
    inputs = feature_extractor(images=image, return_tensors="pt")
    outputs = model(**inputs)
    
    # logits has shape (batch_size, num_labels, height, width) at a reduced resolution;
    # argmax over the class dimension gives a segmentation mask.
    logits = outputs.logits
    predicted_mask = logits.argmax(1).squeeze(0)
    
  3. Inference: Run the model to predict a segmentation mask for the input image. The logits come out at a reduced spatial resolution, so upsample them to the input size before visualizing the per-pixel classes (see the sketch after this list).

  4. Cloud GPU Suggestion: For optimal performance, consider using cloud GPU services such as AWS EC2 with NVIDIA GPUs or Google Cloud's AI Platform.
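
As referenced in step 3, the logits from step 2 can be upsampled to the original image size to obtain a full-resolution mask. This is a minimal sketch: it reuses the image and logits variables from step 2, and bilinear interpolation is a common choice rather than something mandated by the model.

    import torch
    
    # Upsample the low-resolution logits to the input image size.
    upsampled = torch.nn.functional.interpolate(
        logits,
        size=image.size[::-1],                  # PIL gives (width, height); PyTorch expects (height, width)
        mode="bilinear",
        align_corners=False,
    )
    # Per-pixel class ids over the PASCAL VOC label set.
    full_res_mask = upsampled.argmax(1).squeeze(0)
    print(full_res_mask.shape)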

License

The model uses Apple's sample code license, as detailed in the MobileViT GitHub repository.
