MobileViT-XX-Small
Introduction
MobileViT-XX-Small is a lightweight, mobile-friendly vision transformer model designed for image classification tasks. It combines MobileNetV2-style layers with a novel transformer block to enhance efficiency and maintain low latency.
Architecture
MobileViT combines standard convolutional neural network (CNN) layers with transformer blocks, replacing the local processing of convolutions with global processing. Inside a MobileViT block, feature maps are unfolded into flattened patches, processed by transformer layers, and then folded back into feature maps. This flexible design allows MobileViT blocks to be placed anywhere in a CNN without requiring positional embeddings.
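The following is a minimal sketch of that unfold-transform-fold pattern in plain PyTorch, intended only to illustrate the idea; the channel count, patch size, head count, and transformer depth are placeholders, not the configuration of MobileViT-XX-Small or of the `transformers` implementation.

```python
import torch
import torch.nn as nn

class MobileViTBlockSketch(nn.Module):
    """Illustrative unfold -> transformer -> fold pattern used by MobileViT blocks.
    All sizes are placeholders; this is not the library's implementation."""

    def __init__(self, channels=64, patch_size=2, num_layers=2, num_heads=4):
        super().__init__()
        self.patch_size = patch_size
        layer = nn.TransformerEncoderLayer(
            d_model=channels, nhead=num_heads, batch_first=True
        )
        self.transformer = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, x):
        # x: (batch, channels, height, width); height and width must be divisible
        # by patch_size for this simplified sketch.
        b, c, h, w = x.shape
        p = self.patch_size
        nh, nw = h // p, w // p

        # Unfold the feature map into nh*nw non-overlapping p x p patches. Pixels at
        # the same position within each patch form one sequence and the patches are
        # the tokens, so attention models inter-patch (global) relationships.
        x = x.reshape(b, c, nh, p, nw, p)
        x = x.permute(0, 3, 5, 2, 4, 1)            # (b, p, p, nh, nw, c)
        x = x.reshape(b * p * p, nh * nw, c)       # (pixel positions, tokens, channels)

        x = self.transformer(x)

        # Fold the tokens back into a (b, c, h, w) feature map so the surrounding
        # convolutional layers can consume it; MobileViT blocks use no positional
        # embeddings, and this fold restores the original spatial arrangement.
        x = x.reshape(b, p, p, nh, nw, c)
        x = x.permute(0, 5, 3, 1, 4, 2)            # (b, c, nh, p, nw, p)
        return x.reshape(b, c, h, w)

# Usage: the block preserves the (batch, channels, height, width) shape.
block = MobileViTBlockSketch()
out = block(torch.randn(1, 64, 32, 32))
```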
Training
- Data: The model is pre-trained on the ImageNet-1k dataset, which contains 1 million images across 1,000 classes.
- Preprocessing: Training data undergoes basic augmentation, including random resized cropping and horizontal flipping. Multi-scale sampling is used so that training resolutions range from 160x160 to 320x320 (see the sketch after this list).
- Pretraining: The model is trained for 300 epochs using 8 NVIDIA GPUs, with an effective batch size of 1024. Techniques such as learning rate warmup, cosine annealing, label smoothing, and L2 weight decay are employed.
- Evaluation: The MobileViT-XXS model achieves 69.0% top-1 accuracy and 88.9% top-5 accuracy on ImageNet.
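The snippet below is a hedged sketch of how such an augmentation pipeline and training schedule could be wired up with PyTorch and torchvision; the size grid, smoothing value, learning rate, weight decay, and step counts are illustrative assumptions, not the original training recipe.

```python
import random
import torch
from torch import nn, optim
from torchvision import transforms

# Illustrative augmentation: random resized crop + horizontal flip, with the training
# resolution sampled from a grid to approximate multi-scale sampling.
CANDIDATE_SIZES = [160, 192, 256, 288, 320]  # assumed grid between 160x160 and 320x320

def train_transform(size: int) -> transforms.Compose:
    return transforms.Compose([
        transforms.RandomResizedCrop(size),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
    ])

size = random.choice(CANDIDATE_SIZES)   # re-sampled periodically during training
augment = train_transform(size)

# Illustrative optimizer and schedule combining the techniques named above: label
# smoothing, L2 weight decay, and linear warmup followed by cosine annealing.
model = nn.Conv2d(3, 16, 3)                                   # stand-in for MobileViT
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)          # smoothing value assumed
optimizer = optim.AdamW(model.parameters(), lr=2e-3, weight_decay=0.01)  # values assumed
warmup_steps, total_steps = 3000, 300 * 1250                  # placeholder step counts
scheduler = optim.lr_scheduler.SequentialLR(
    optimizer,
    [optim.lr_scheduler.LinearLR(optimizer, start_factor=0.01, total_iters=warmup_steps),
     optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=total_steps - warmup_steps)],
    milestones=[warmup_steps],
)
```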
Guide: Running Locally
- Installation: Install the `transformers` library for PyTorch.
- Code Example:
```python
from transformers import MobileViTFeatureExtractor, MobileViTForImageClassification
from PIL import Image
import requests

# Load a sample image from the COCO validation set.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Load the feature extractor and the pre-trained classification model.
feature_extractor = MobileViTFeatureExtractor.from_pretrained("apple/mobilevit-xx-small")
model = MobileViTForImageClassification.from_pretrained("apple/mobilevit-xx-small")

# Preprocess the image, run the forward pass, and report the predicted ImageNet class.
inputs = feature_extractor(images=image, return_tensors="pt")
outputs = model(**inputs)
logits = outputs.logits
predicted_class_idx = logits.argmax(-1).item()
print("Predicted class:", model.config.id2label[predicted_class_idx])
```
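As a follow-up, and to relate the output to the top-5 accuracy quoted above, the same logits can be turned into the five most probable labels. This assumes the `model` and `outputs` objects from the example above are still in scope:

```python
import torch

# Convert the logits to probabilities and print the five most likely ImageNet classes.
probs = outputs.logits.softmax(-1)[0]
top5 = torch.topk(probs, k=5)
for score, idx in zip(top5.values.tolist(), top5.indices.tolist()):
    print(f"{model.config.id2label[idx]}: {score:.3f}")
```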
- Suggested Hardware: While the model is lightweight, cloud GPUs such as those offered by AWS or Google Cloud can accelerate training and inference; a sketch of moving inference onto a GPU follows below.
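A minimal sketch of running the code example on a GPU, assuming a CUDA device and the `model` and `inputs` objects defined above:

```python
import torch

# Move the model and preprocessed inputs to a CUDA device when one is available,
# then rerun the forward pass; the logits have the same shape as in the CPU run.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
inputs = {k: v.to(device) for k, v in inputs.items()}
with torch.no_grad():
    logits = model(**inputs).logits
```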
License
The MobileViT model uses an Apple sample code license. For more details, refer to the license documentation.