vit-huge-patch14-224-in21k

Google

Introduction

The Vision Transformer (ViT) is a model developed by Google for image recognition, built on a transformer encoder architecture. This checkpoint is the huge-sized variant, pre-trained at a resolution of 224x224 on the ImageNet-21k dataset, which comprises 14 million images across 21,843 classes. The model processes images as sequences of fixed-size patches and provides robust feature extraction for downstream tasks such as image classification.

Architecture

ViT employs a transformer encoder, similar to BERT, adapted for image processing. Images are divided into 14x14 patches, linearly embedded, prepended with a classification token ([CLS]), and combined with absolute position embeddings before being passed through the encoder layers. This checkpoint does not include any fine-tuned heads, but it does retain the pre-trained pooler, which can be used for feature extraction in downstream tasks.
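The sequence length seen by the encoder follows directly from the model name: a 224x224 input divided into 14x14 patches gives a 16x16 grid of patches, plus the prepended [CLS] token. The snippet below is a minimal sketch of that arithmetic (pure Python, no dependencies; the numbers come only from the input resolution and patch size stated above):

    # Sequence length seen by the transformer encoder for a single image.
    image_size = 224
    patch_size = 14
    num_patches = (image_size // patch_size) ** 2  # 16 * 16 = 256 patches
    seq_len = num_patches + 1                      # + [CLS] token = 257 positions
    print(num_patches, seq_len)                    # 256 257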

Training

The model was pre-trained on the ImageNet-21k dataset using TPUv3 hardware (8 cores). Training used a batch size of 4096, learning-rate warmup over 10,000 steps, and gradient clipping at a global norm of 1. Images were resized to a resolution of 224x224 and normalized across the RGB channels with mean (0.5, 0.5, 0.5) and standard deviation (0.5, 0.5, 0.5). Evaluation indicates that larger model sizes and a higher resolution (384x384) during fine-tuning improve performance.
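For reference, the following sketch reproduces that resize-and-normalize preprocessing with torchvision (assumed installed). The mean and standard deviation of 0.5 per channel are taken from the description above, and the file name example.jpg is a placeholder; in practice the feature extractor used in the guide below applies these steps for you.

    from PIL import Image
    from torchvision import transforms

    # Resize to 224x224, convert to a float tensor, and normalize each RGB channel.
    preprocess = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),                                # HWC uint8 -> CHW float in [0, 1]
        transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
    ])

    image = Image.open("example.jpg").convert("RGB")          # placeholder image path
    pixel_values = preprocess(image).unsqueeze(0)             # shape: (1, 3, 224, 224)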

Guide: Running Locally

To run the ViT model locally, follow these steps:

  1. Install Dependencies:

    pip install transformers torch Pillow requests
    
  2. Load the Model and Feature Extractor:

    from transformers import ViTFeatureExtractor, ViTModel
    from PIL import Image
    import requests
    
    # Load image
    url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
    image = Image.open(requests.get(url, stream=True).raw)
    
    # Load pre-trained model and feature extractor
    feature_extractor = ViTFeatureExtractor.from_pretrained('google/vit-huge-patch14-224-in21k')
    model = ViTModel.from_pretrained('google/vit-huge-patch14-224-in21k')
    
    # Prepare inputs
    inputs = feature_extractor(images=image, return_tensors="pt")
    outputs = model(**inputs)
    last_hidden_states = outputs.last_hidden_state
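    
    # Sanity check on shapes: 256 patch tokens + 1 [CLS] token = 257 positions.
    # The hidden size of 1280 is the ViT-Huge value reported in the ViT paper
    # (an assumption here, not stated elsewhere in this card).
    print(last_hidden_states.shape)      # torch.Size([1, 257, 1280])
    print(outputs.pooler_output.shape)   # torch.Size([1, 1280]), output of the pre-trained pooler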
    
  3. Cloud GPUs: For enhanced performance, consider cloud services such as AWS EC2, Google Cloud Platform, or Azure, which provide GPU instances; a minimal device-placement sketch follows this list.
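The snippet below is a minimal sketch of running the step 2 code on a GPU, assuming a CUDA-capable machine with PyTorch installed. It reuses the model and inputs objects from step 2 and falls back to CPU when no GPU is present.

    import torch

    # Pick a device and move both the model weights and the inputs onto it.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device)
    inputs = {k: v.to(device) for k, v in inputs.items()}

    # Inference only, so gradients are not needed.
    with torch.no_grad():
        outputs = model(**inputs)
    last_hidden_states = outputs.last_hidden_state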

License

The Vision Transformer model is released under the Apache 2.0 license.
