google/vit-large-patch16-224-in21k

Introduction

The Vision Transformer (ViT) is a large-sized transformer model for image recognition tasks. Pre-trained on the ImageNet-21k dataset, it treats an image as a sequence of fixed-size patches and uses a standard transformer encoder to extract features for classification and other vision tasks.

Architecture

The ViT model is a transformer encoder similar to BERT. An input image is split into fixed-size 16x16 patches, which are linearly embedded into a token sequence; a [CLS] token is prepended for classification purposes, and absolute position embeddings are added to preserve spatial information. This checkpoint includes a pre-trained pooler but no fine-tuned task heads, making it best suited for feature extraction and adaptation to downstream tasks.
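
As a sanity check on this geometry, the encoder's sequence length follows directly from the patch size: a 224x224 input yields (224/16)^2 = 196 patches plus the [CLS] token. A minimal sketch, assuming the standard ViT-Large hidden size of 1024:

    # Token-sequence geometry for ViT-Large/16 at 224x224 input
    image_size = 224    # input resolution (pixels per side)
    patch_size = 16     # each patch covers 16x16 pixels
    hidden_size = 1024  # standard ViT-Large embedding dimension

    num_patches = (image_size // patch_size) ** 2  # 14 * 14 = 196
    seq_len = num_patches + 1                      # +1 for the [CLS] token = 197

    # The encoder output therefore has shape (batch, 197, 1024)
    print(num_patches, seq_len, hidden_size)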

Training

Training Data

ViT was pre-trained on the ImageNet-21k dataset, comprising 14 million images across 21,843 classes.

Training Procedure

The model was pre-trained on TPUv3 hardware with a batch size of 4096 and a learning-rate warmup of 10k steps. Images were resized to 224x224 resolution and normalized across the RGB channels with mean (0.5, 0.5, 0.5) and standard deviation (0.5, 0.5, 0.5). Gradient clipping at global norm 1 was applied during training and found to improve results. Evaluation on image classification benchmarks showed that higher resolutions and larger model sizes yield better performance.
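
For illustration, the resize-and-normalize step can be reproduced with torchvision transforms. This is a sketch of the preprocessing, not the original training pipeline; "example.jpg" is a placeholder path, and in practice ViTImageProcessor (shown in the guide below) applies equivalent defaults:

    from PIL import Image
    from torchvision import transforms

    # Resize to 224x224 and normalize each RGB channel with mean 0.5, std 0.5,
    # mirroring the preprocessing described above
    preprocess = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),  # HWC uint8 [0, 255] -> CHW float [0.0, 1.0]
        transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
    ])

    # "example.jpg" stands in for any local RGB image
    pixel_values = preprocess(Image.open("example.jpg").convert("RGB"))
    print(pixel_values.shape)  # torch.Size([3, 224, 224])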

Guide: Running Locally

To run the ViT model locally:

  1. Install Dependencies: Ensure you have the transformers library (with PyTorch) and Pillow for image processing, e.g. pip install transformers torch pillow requests.
  2. Load the Model:
    from transformers import ViTImageProcessor, ViTModel
    from PIL import Image
    import requests

    # Download a sample image from the COCO validation set
    url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
    image = Image.open(requests.get(url, stream=True).raw)

    # Load the processor (resize + normalization) and the pre-trained encoder
    processor = ViTImageProcessor.from_pretrained('google/vit-large-patch16-224-in21k')
    model = ViTModel.from_pretrained('google/vit-large-patch16-224-in21k')

    # Preprocess the image into pixel_values and run a forward pass
    inputs = processor(images=image, return_tensors="pt")
    outputs = model(**inputs)
    last_hidden_state = outputs.last_hidden_state  # shape: (1, 197, 1024)
    # (See the feature-extraction sketch after this guide for using these outputs.)
    
  3. Execution Environment: Running on a local machine with a powerful CPU or GPU is recommended. For larger models or datasets, consider using cloud GPU services like AWS, Google Cloud, or Azure for better performance.
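
Because this checkpoint ships without fine-tuned heads, a common pattern is to take the [CLS] embedding from the outputs in step 2 as a fixed image feature and train a small head on top. A minimal sketch; the linear probe and its 10-class output are hypothetical, not part of the released model:

    import torch

    # The [CLS] token sits at position 0 of the sequence: shape (1, 1024)
    cls_feature = last_hidden_state[:, 0]

    # Hypothetical downstream head: an untrained linear probe for 10 classes
    head = torch.nn.Linear(cls_feature.shape[-1], 10)
    logits = head(cls_feature)
    print(logits.shape)  # torch.Size([1, 10])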

License

The Vision Transformer model is released under the Apache-2.0 License, allowing for broad use in both commercial and non-commercial applications.
