deeplabv3 mobilevit small
apple
Introduction
The MobileViT + DeepLabV3 model is a lightweight, mobile-friendly vision transformer for semantic segmentation. It pairs a MobileViT backbone with a DeepLabV3 head, and the released checkpoint is pre-trained on the PASCAL VOC dataset. The model suits applications that need low-latency image processing, because MobileViT combines local processing in convolutions with global processing through transformers.
Architecture
MobileViT is a convolutional neural network that combines MobileNetV2-style layers with a new block that replaces the local processing in convolutions with global processing using transformers. Inside this block, the feature map is split into patches, the patches are flattened and passed through transformer layers, and the result is folded back into a feature map. MobileViT blocks can therefore be placed anywhere inside a CNN, and the model does not require positional embeddings.
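The flatten, transform, and reconstruct steps described above can be sketched in PyTorch. This is a simplified illustration, not the reference implementation: the helper names `unfold_patches` and `fold_patches`, the tensor sizes, and the use of a stock `nn.TransformerEncoderLayer` are all assumptions made for the example, and the real block also wraps these steps with convolutions.

```python
import torch
import torch.nn as nn


def unfold_patches(x: torch.Tensor, ph: int, pw: int) -> torch.Tensor:
    """(B, C, H, W) -> (B * ph * pw, num_patches, C).

    Each sequence gathers the pixel at one fixed position inside every
    patch, so attention runs across patches (hypothetical helper).
    """
    b, c, h, w = x.shape
    nh, nw = h // ph, w // pw
    x = x.reshape(b, c, nh, ph, nw, pw)
    x = x.permute(0, 3, 5, 2, 4, 1)  # (B, ph, pw, nh, nw, C)
    return x.reshape(b * ph * pw, nh * nw, c)


def fold_patches(seq: torch.Tensor, b: int, c: int, h: int, w: int,
                 ph: int, pw: int) -> torch.Tensor:
    """Inverse of unfold_patches: sequences back to a (B, C, H, W) map."""
    nh, nw = h // ph, w // pw
    x = seq.reshape(b, ph, pw, nh, nw, c)
    x = x.permute(0, 5, 3, 1, 4, 2)  # (B, C, nh, ph, nw, pw)
    return x.reshape(b, c, h, w)


# Toy feature map; all sizes here are illustrative assumptions.
x = torch.randn(1, 16, 8, 8)
layer = nn.TransformerEncoderLayer(
    d_model=16, nhead=2, dim_feedforward=32, batch_first=True
)
layer.eval()  # disable dropout for a deterministic forward pass

with torch.no_grad():
    seq = unfold_patches(x, ph=2, pw=2)   # flatten patches into sequences
    seq = layer(seq)                      # global processing across patches
    y = fold_patches(seq, 1, 16, 8, 8, ph=2, pw=2)  # rebuild the feature map

print(y.shape)  # same shape as x
```

Because the fold step restores each value to its original spatial location, the output can feed directly into the next convolutional layer.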
Training
The MobileViT backbone is first trained from scratch on the ImageNet-1k dataset, using label smoothing, cosine learning rate annealing, and multi-scale sampling. It is then fine-tuned on the PASCAL VOC dataset for semantic segmentation. Both stages were run on multiple NVIDIA GPUs.
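Two of the techniques named above can be written down in a few lines. The sketch below shows a cosine learning rate schedule and label-smoothed targets in plain Python. The function names and every numeric value (learning rate, step counts, smoothing factor) are illustrative assumptions, not the hyperparameters used to train this model.

```python
import math


def cosine_lr(step: int, total_steps: int, lr_max: float, lr_min: float = 0.0) -> float:
    """Cosine annealing: decay from lr_max to lr_min over total_steps."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


def smooth_labels(target: int, num_classes: int, epsilon: float) -> list[float]:
    """Label smoothing: mix the one-hot target with a uniform distribution."""
    off = epsilon / num_classes
    return [off + (1.0 - epsilon) if k == target else off for k in range(num_classes)]


# Illustrative values only.
for step in (0, 50, 100):
    print(step, cosine_lr(step, total_steps=100, lr_max=0.1))

print(smooth_labels(target=2, num_classes=5, epsilon=0.1))
```

In a PyTorch training loop, the same ideas are available as `torch.optim.lr_scheduler.CosineAnnealingLR` and the `label_smoothing` argument of `torch.nn.CrossEntropyLoss`.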
Guide: Running Locally
- Install Dependencies: Ensure you have PyTorch and the transformers library installed. The example below also uses Pillow and requests to fetch and open an image.

      pip install torch transformers pillow requests
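- Load the Model:

      from transformers import MobileViTFeatureExtractor, MobileViTForSemanticSegmentation
      from PIL import Image
      import requests

      # Download a sample image from the COCO validation set.
      url = "http://images.cocodataset.org/val2017/000000039769.jpg"
      image = Image.open(requests.get(url, stream=True).raw)

      # Load the preprocessor and model from the Hugging Face Hub.
      # Recent transformers releases deprecate MobileViTFeatureExtractor
      # in favor of MobileViTImageProcessor.
      feature_extractor = MobileViTFeatureExtractor.from_pretrained("apple/deeplabv3-mobilevit-small")
      model = MobileViTForSemanticSegmentation.from_pretrained("apple/deeplabv3-mobilevit-small")

      # Preprocess the image and run a forward pass.
      inputs = feature_extractor(images=image, return_tensors="pt")
      outputs = model(**inputs)

      # Argmax over the class dimension gives one class index per location.
      logits = outputs.logits
      predicted_mask = logits.argmax(1).squeeze(0)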
- Inference: Use the model to predict segmentations for input images.
- Cloud GPU Suggestion: For faster inference on larger workloads, consider using cloud GPU services such as AWS EC2 with NVIDIA GPUs or Google Cloud's AI Platform.
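As a follow-up to the inference step in the guide above, the logits a segmentation head produces are often lower resolution than the input image, so a common post-processing step is to upsample them before taking the argmax. The sketch below assumes that is the case here. The helper name `logits_to_mask`, the random stand-in logits, and the tensor sizes are all assumptions for illustration.

```python
import torch
import torch.nn.functional as F


def logits_to_mask(logits: torch.Tensor, target_size: tuple[int, int]) -> torch.Tensor:
    """Upsample (1, num_classes, h, w) logits to (height, width) and take the argmax.

    Hypothetical helper: returns a (height, width) tensor of class indices.
    """
    upsampled = F.interpolate(
        logits, size=target_size, mode="bilinear", align_corners=False
    )
    return upsampled.argmax(dim=1).squeeze(0)


# Stand-in for outputs.logits; the shape is an assumption for the example.
fake_logits = torch.randn(1, 21, 32, 32)
mask = logits_to_mask(fake_logits, target_size=(480, 640))
print(mask.shape, mask.dtype)
```

With a real image, `target_size` would be `image.size[::-1]`, since PIL reports size as (width, height). Class indices in the mask can usually be mapped to names through `model.config.id2label`.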
License
The model uses Apple's sample code license, as detailed in the MobileViT GitHub repository.