MobileViT Small (apple/mobilevit-small)
Introduction
MobileViT is a lightweight, low-latency vision transformer designed for image classification. It was developed by Sachin Mehta and Mohammad Rastegari and pre-trained on the ImageNet-1k dataset. The model combines MobileNetV2-style layers with transformer blocks to process global information efficiently.
Architecture
MobileViT integrates the efficiency of MobileNetV2 layers with a transformer block that processes image data as flattened patches, similar to the Vision Transformer (ViT). This architecture enables the model to be versatile and efficient, suitable for mobile and general-purpose applications. It does not require positional embeddings, allowing flexible integration into CNNs.
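The patch-based transformer step described above can be sketched in PyTorch. This is a simplified, hypothetical illustration of the idea (unfold a feature map into non-overlapping patches, apply self-attention across patch positions, fold back), not the official ML-CVNets implementation; the dimensions and layer counts are arbitrary assumptions.

```python
import torch
import torch.nn as nn

class MobileViTBlockSketch(nn.Module):
    """Illustrative sketch of a MobileViT-style block (not the official code)."""

    def __init__(self, dim=64, patch=2, heads=4):
        super().__init__()
        self.patch = patch
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, x):  # x: (B, C, H, W), with H and W divisible by patch
        b, c, h, w = x.shape
        p = self.patch
        # Unfold the feature map into p*p pixel groups, each attending over
        # all (H/p)*(W/p) spatial patch positions -- no positional embeddings.
        x = x.reshape(b, c, h // p, p, w // p, p)
        x = x.permute(0, 3, 5, 2, 4, 1).reshape(b * p * p, (h // p) * (w // p), c)
        x = self.transformer(x)  # global self-attention over patch positions
        # Fold the sequence back into the original (B, C, H, W) layout
        x = x.reshape(b, p, p, h // p, w // p, c).permute(0, 5, 3, 1, 4, 2)
        return x.reshape(b, c, h, w)

block = MobileViTBlockSketch()
out = block(torch.randn(1, 64, 8, 8))
print(out.shape)  # torch.Size([1, 64, 8, 8])
```

Because attention runs over patch positions rather than learned positional embeddings, the same block accepts any input whose sides are divisible by the patch size, which is what allows the multi-scale training mentioned below.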
Training
The model was pre-trained on the ImageNet-1k dataset, which includes 1 million images across 1,000 classes. The training process involved multi-scale sampling, data augmentation techniques like random cropping and horizontal flipping, and was performed over 300 epochs using 8 NVIDIA GPUs. The model employed strategies like learning rate warmup, cosine annealing, label smoothing, and L2 weight decay.
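The warmup-plus-cosine-annealing schedule mentioned above can be sketched as follows. The specific values (peak learning rate, warmup length, floor) are illustrative assumptions, not the exact hyperparameters used to train MobileViT.

```python
import math

def lr_at_epoch(epoch, total_epochs=300, warmup_epochs=5,
                base_lr=0.002, min_lr=0.0002):
    """Illustrative warmup + cosine-annealing schedule (values are assumptions)."""
    if epoch < warmup_epochs:
        # Linear warmup from min_lr up to base_lr
        return min_lr + (base_lr - min_lr) * epoch / warmup_epochs
    # Cosine annealing from base_lr back down to min_lr
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))

print(lr_at_epoch(0))    # starts at min_lr
print(lr_at_epoch(5))    # reaches base_lr when warmup ends
print(lr_at_epoch(300))  # annealed back to min_lr at the final epoch
```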
Guide: Running Locally
To run the MobileViT model locally:
- Install the necessary libraries, including transformers and PIL (Pillow).
- Load an image from a URL or a local source.
- Use MobileViTFeatureExtractor and MobileViTForImageClassification from the Hugging Face Transformers library to process the image and obtain predictions.
Example code snippet:
from transformers import MobileViTFeatureExtractor, MobileViTForImageClassification
from PIL import Image
import requests

# Load a sample image from the COCO validation set
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Load the pre-trained feature extractor and classification model
feature_extractor = MobileViTFeatureExtractor.from_pretrained("apple/mobilevit-small")
model = MobileViTForImageClassification.from_pretrained("apple/mobilevit-small")

# Preprocess the image and run inference
inputs = feature_extractor(images=image, return_tensors="pt")
outputs = model(**inputs)
logits = outputs.logits

# Map the highest-scoring logit to one of the 1,000 ImageNet labels
predicted_class_idx = logits.argmax(-1).item()
print("Predicted class:", model.config.id2label[predicted_class_idx])
For optimal performance, consider using a cloud GPU service such as AWS, Google Cloud, or Azure.
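If class probabilities are needed rather than a single label, the logits can be passed through a softmax. The sketch below uses a dummy logits tensor in place of the model output so it runs without downloading the checkpoint; the class index 281 is an arbitrary stand-in.

```python
import torch

# Dummy logits standing in for `outputs.logits` from the snippet above
# (shape: batch size 1 x 1,000 ImageNet classes).
logits = torch.zeros(1, 1000)
logits[0, 281] = 5.0  # pretend class 281 scored highest

# Softmax converts raw logits into a probability distribution over classes
probs = logits.softmax(-1)
top5 = probs.topk(5, dim=-1)
print("Top class:", top5.indices[0, 0].item())
print("Probability:", round(top5.values[0, 0].item(), 3))
```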
License
The MobileViT model is released under the Apple sample code license. More information can be found in the Apple ML-CVNets GitHub repository.