MobileViT Small (apple/mobilevit-small)

Introduction

MobileViT is a lightweight, low-latency vision transformer for image classification, introduced by Sachin Mehta and Mohammad Rastegari in the paper "MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer" and pre-trained on the ImageNet-1k dataset. The model combines MobileNetV2-style layers with a transformer block to capture global information efficiently.

Architecture

MobileViT pairs the efficiency of MobileNetV2-style convolutional layers with a transformer block that processes image data as flattened patches, similar to the Vision Transformer (ViT). Local features from the convolutions are combined with the global representations computed by the transformer, which keeps the model both mobile-friendly and general-purpose. Because the MobileViT block does not require positional embeddings, it can be inserted flexibly into CNNs.
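
As a quick way to see these architectural choices concretely, the checkpoint's configuration can be loaded and inspected with the Transformers library (a minimal sketch; MobileViTConfig is the configuration class shipped with the library, and printing it lists fields such as the per-stage channel widths and transformer hidden sizes):

from transformers import MobileViTConfig

# Load the configuration of the small checkpoint to inspect its architecture.
config = MobileViTConfig.from_pretrained("apple/mobilevit-small")

# Printing the config shows, among other fields, the channel widths of the
# MobileNetV2-style stages and the hidden sizes of the transformer blocks.
print(config)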

Training

The model was pre-trained on the ImageNet-1k dataset, which contains roughly 1.28 million training images across 1,000 classes. Training ran for 300 epochs on 8 NVIDIA GPUs and used a multi-scale sampler together with data augmentation such as random cropping and horizontal flipping. Optimization relied on learning rate warmup followed by cosine annealing, label smoothing, and L2 weight decay.
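
The warmup-plus-cosine schedule can be illustrated in a few lines of Python (a hedged sketch: the warmup length, peak learning rate, and minimum learning rate below are illustrative placeholders, not the exact values used to train this checkpoint):

import math

def lr_at_epoch(epoch, total_epochs=300, warmup_epochs=5,
                base_lr=2e-3, min_lr=2e-4):
    # Linear warmup to the peak rate, then cosine annealing back down.
    if epoch < warmup_epochs:
        return min_lr + (base_lr - min_lr) * epoch / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Sample the schedule at the start, the end of warmup, the midpoint, and the final epoch.
for epoch in (0, 5, 150, 299):
    print(epoch, round(lr_at_epoch(epoch), 6))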

Guide: Running Locally

To run the MobileViT model locally:

  1. Install the necessary libraries: transformers, torch, Pillow (PIL), and requests.
  2. Load an image from a URL or a local file.
  3. Use MobileViTFeatureExtractor and MobileViTForImageClassification from the Hugging Face Transformers library to preprocess the image and obtain predictions (in newer Transformers releases, MobileViTImageProcessor replaces the deprecated MobileViTFeatureExtractor).

Example code snippet:

from transformers import MobileViTFeatureExtractor, MobileViTForImageClassification
from PIL import Image
import requests

# Download an example image from the COCO validation set.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Load the preprocessor and the pre-trained classification model.
feature_extractor = MobileViTFeatureExtractor.from_pretrained("apple/mobilevit-small")
model = MobileViTForImageClassification.from_pretrained("apple/mobilevit-small")

# Resize, rescale, and batch the image, then run a forward pass.
inputs = feature_extractor(images=image, return_tensors="pt")
outputs = model(**inputs)
logits = outputs.logits

# The model predicts one of the 1,000 ImageNet classes.
predicted_class_idx = logits.argmax(-1).item()
print("Predicted class:", model.config.id2label[predicted_class_idx])

For lower latency or larger batches, run inference on a GPU, either locally or through a cloud provider such as AWS, Google Cloud, or Azure.
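
If a GPU is available, the same snippet can be moved onto it with standard PyTorch calls (a minimal sketch reusing the model and inputs variables from the example above):

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

# Move the preprocessed tensors to the same device and disable gradient
# tracking for inference.
inputs = {k: v.to(device) for k, v in inputs.items()}
with torch.no_grad():
    logits = model(**inputs).logits

print("Predicted class:", model.config.id2label[logits.argmax(-1).item()])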

License

The MobileViT model is released under the Apple sample code license. More information can be found in the Apple ML-CVNets GitHub repository.
