MobileViT-XX-Small

Apple

Introduction

MobileViT-XX-Small is a lightweight, mobile-friendly vision transformer model designed for image classification tasks. It combines MobileNetV2-style layers with a novel transformer block to enhance efficiency and maintain low latency.

Architecture

MobileViT integrates traditional convolutional neural networks (CNN) with transformer blocks, replacing local processing in convolutions with global processing. Images are split into flattened patches, processed by the transformer layers, and then reassembled into feature maps. This flexible design allows MobileViT blocks to be integrated into various CNN architectures without requiring positional embeddings.
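The split-and-reassemble step can be sketched with plain tensor operations. This is a minimal illustration of the idea, not the library's implementation; the patch size and feature-map shape below are arbitrary assumptions:

```python
import torch

def unfold_to_patches(x, p=2):
    # x: (B, C, H, W) feature map; H and W assumed divisible by p.
    # Split into non-overlapping p x p patches and flatten them into tokens.
    B, C, H, W = x.shape
    x = x.reshape(B, C, H // p, p, W // p, p)
    x = x.permute(0, 3, 5, 2, 4, 1)            # (B, p, p, H/p, W/p, C)
    return x.reshape(B, p * p, (H // p) * (W // p), C)

def fold_from_patches(x, H, W, p=2):
    # Inverse of unfold_to_patches: reassemble tokens into a feature map.
    B, _, _, C = x.shape
    x = x.reshape(B, p, p, H // p, W // p, C)
    x = x.permute(0, 5, 3, 1, 4, 2)            # (B, C, H/p, p, W/p, p)
    return x.reshape(B, C, H, W)

x = torch.randn(1, 64, 32, 32)
patches = unfold_to_patches(x)   # transformer layers would attend over these tokens
restored = fold_from_patches(patches, 32, 32)
print(patches.shape)             # torch.Size([1, 4, 256, 64])
print(torch.equal(restored, x))  # True: the rearrangement is lossless
```

Because the patches are processed as an unordered set and folded back by position, no learned positional embeddings are needed.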

Training

  • Data: The model is pre-trained on the ImageNet-1k dataset, which contains approximately 1.3 million images across 1,000 classes.
  • Preprocessing: Training data undergoes basic augmentation including random resized cropping and horizontal flipping. Multi-scale sampling is utilized for diverse image sizes ranging from 160x160 to 320x320.
  • Pretraining: The model is trained for 300 epochs using 8 NVIDIA GPUs, with an effective batch size of 1024. Techniques such as learning rate warmup, cosine annealing, label smoothing, and L2 weight decay are employed.
  • Evaluation: The MobileViT-XXS model achieves a top-1 accuracy of 69.0% and a top-5 accuracy of 88.9% on ImageNet.
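The warmup-plus-cosine-annealing schedule mentioned above can be sketched as follows. The learning rates and step counts here are illustrative placeholders, not the exact MobileViT hyperparameters:

```python
import math

def lr_at(step, total_steps, warmup_steps=3000, base_lr=0.002, min_lr=0.0002):
    # Linear warmup from ~0 to base_lr, then cosine decay down to min_lr.
    # (Illustrative values; not the published training configuration.)
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))

for s in (0, 1500, 3000, 50000, 100000):
    print(s, lr_at(s, 100000))
```

The schedule is continuous at the warmup boundary: the warmup ends at base_lr, which is exactly where the cosine curve begins.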

Guide: Running Locally

  1. Installation: Install the transformers library for PyTorch.
  2. Code Example:
    from transformers import MobileViTFeatureExtractor, MobileViTForImageClassification
    from PIL import Image
    import requests
    
    # Load a sample image from the COCO validation set
    url = "http://images.cocodataset.org/val2017/000000039769.jpg"
    image = Image.open(requests.get(url, stream=True).raw)
    
    # Load the preprocessor and the pretrained classification model
    feature_extractor = MobileViTFeatureExtractor.from_pretrained("apple/mobilevit-xx-small")
    model = MobileViTForImageClassification.from_pretrained("apple/mobilevit-xx-small")
    
    # Preprocess the image and run a forward pass
    inputs = feature_extractor(images=image, return_tensors="pt")
    outputs = model(**inputs)
    logits = outputs.logits
    
    # The highest-scoring logit is the predicted ImageNet class
    predicted_class_idx = logits.argmax(-1).item()
    print("Predicted class:", model.config.id2label[predicted_class_idx])
    
  3. Suggested Hardware: While the model is lightweight, leveraging cloud GPUs such as those offered by AWS or Google Cloud can accelerate training and inference tasks.
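As a shorthand for the steps above, the same checkpoint can also be loaded through the transformers pipeline API, which bundles preprocessing, inference, and label mapping (requires network access to download the model on first use):

```python
from transformers import pipeline

# The image-classification pipeline handles preprocessing and postprocessing
classifier = pipeline("image-classification", model="apple/mobilevit-xx-small")

# Accepts a URL, local path, or PIL image; returns labels with scores
preds = classifier("http://images.cocodataset.org/val2017/000000039769.jpg")
print(preds[0]["label"], preds[0]["score"])
```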

License

The MobileViT model is released under an Apple sample code license. For more details, refer to the license documentation.
