google/vit-base-patch16-224

Vision Transformer (ViT) Base Model

Introduction

The Vision Transformer (ViT) applies a BERT-like transformer encoder to image classification: instead of operating on raw pixels, it takes a sequence of fixed-size image patches as input. This checkpoint was pretrained on ImageNet-21k and fine-tuned on ImageNet 2012 (ILSVRC2012) at a resolution of 224x224.

Architecture

ViT treats an image as a sequence of 16x16-pixel patches rather than as individual pixels. Each patch is linearly embedded, a classification token ([CLS]) is prepended to the sequence, and absolute position embeddings are added before the sequence passes through a standard transformer encoder. The final hidden state of the [CLS] token serves as the image representation, on top of which a classification head is placed.
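
For concreteness, a 224x224 image split into 16x16 patches yields 14 x 14 = 196 patches, so the encoder sees a sequence of 197 tokens once the [CLS] token is prepended. The snippet below is a minimal PyTorch sketch of this patchify-and-embed step, not the model's actual implementation; the 768-dimensional embedding size is the ViT-Base setting, and all variable names are illustrative.

    import torch
    import torch.nn as nn

    # ViT-Base settings: 224x224 input, 16x16 patches, 768-dim token embeddings.
    image_size, patch_size, embed_dim = 224, 16, 768
    num_patches = (image_size // patch_size) ** 2                # 14 * 14 = 196

    # The patch embedding can be expressed as a strided convolution: each
    # 16x16x3 patch is flattened and linearly projected to embed_dim dimensions.
    patch_embed = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)

    # Learnable [CLS] token and absolute position embeddings (one per token, including [CLS]).
    cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
    pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))

    pixel_values = torch.randn(1, 3, image_size, image_size)     # dummy batch of one image
    patches = patch_embed(pixel_values)                          # (1, 768, 14, 14)
    tokens = patches.flatten(2).transpose(1, 2)                  # (1, 196, 768)
    tokens = torch.cat([cls_token, tokens], dim=1) + pos_embed   # (1, 197, 768) -> encoder input
    print(tokens.shape)                                          # torch.Size([1, 197, 768])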

Training

Training Data

  • Pretraining Dataset: ImageNet-21k with 14 million images and 21,843 classes.
  • Fine-tuning Dataset: ImageNet (ILSVRC2012) with 1 million images and 1,000 classes.

Training Procedure

  • Hardware: Trained on TPUv3 with 8 cores.
  • Batch Size: 4096.
  • Learning Rate: Linear warmup over the first 10,000 steps.
  • Gradient Clipping: Applied at a global norm of 1 for ImageNet (see the sketch after this list).
  • Resolution: Images are resized to 224x224 for this checkpoint; fine-tuning at a higher resolution (384x384) gives better results.
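
The sketch below illustrates the learning-rate warmup and gradient-clipping settings listed above using PyTorch equivalents. It is not the authors' training code (the original models were trained in JAX/Flax on TPUs); the optimizer, base learning rate, total step count, and stand-in model are placeholders.

    import torch
    from torch.optim.lr_scheduler import LambdaLR

    model = torch.nn.Linear(768, 1000)                          # stand-in for the ViT classifier
    optimizer = torch.optim.Adam(model.parameters(), lr=3e-3)   # placeholder base learning rate

    warmup_steps, total_steps = 10_000, 90_000                  # total_steps is a placeholder

    def lr_lambda(step):
        # Linear warmup to the base LR over the first 10,000 steps, then constant here
        # (the decay schedule after warmup is omitted for brevity).
        return min(1.0, (step + 1) / warmup_steps)

    scheduler = LambdaLR(optimizer, lr_lambda)

    for step in range(total_steps):
        loss = model(torch.randn(8, 768)).mean()                # dummy forward pass and loss
        loss.backward()
        # Gradient clipping at global norm 1, as used for ImageNet.
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
        break                                                   # single illustrative step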

Guide: Running Locally

  1. Setup Environment: Ensure Python is installed along with the transformers, torch, Pillow, and requests packages (e.g. pip install transformers torch pillow requests).
  2. Load Model and Processor:
    from transformers import ViTImageProcessor, ViTForImageClassification
    from PIL import Image
    import requests

    # Download a sample image from the COCO dataset.
    url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
    image = Image.open(requests.get(url, stream=True).raw)

    # Load the image processor and the fine-tuned classification model.
    processor = ViTImageProcessor.from_pretrained('google/vit-base-patch16-224')
    model = ViTForImageClassification.from_pretrained('google/vit-base-patch16-224')

    # Preprocess the image, run a forward pass, and take the highest-scoring
    # logit as the predicted ImageNet class.
    inputs = processor(images=image, return_tensors="pt")
    outputs = model(**inputs)
    predicted_class_idx = outputs.logits.argmax(-1).item()
    print("Predicted class:", model.config.id2label[predicted_class_idx])
    
  3. Cloud GPUs: For faster inference or fine-tuning, consider cloud services such as AWS, GCP, or Azure for GPU access; a minimal device-placement sketch follows this list.
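
On a machine with a GPU, inference works the same way once the model and inputs are moved to the device. The sketch below assumes the processor, model, and image from step 2 are already loaded and that PyTorch was installed with CUDA support.

    import torch

    # Pick a GPU if one is available, otherwise fall back to CPU.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device)

    # Preprocess on CPU, then move the tensors to the same device as the model.
    inputs = processor(images=image, return_tensors="pt").to(device)
    with torch.no_grad():                                # no gradients needed for inference
        logits = model(**inputs).logits
    print("Predicted class:", model.config.id2label[logits.argmax(-1).item()])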

License

This model is licensed under the Apache 2.0 License, allowing for both academic and commercial use.
