vit-large-patch16-224

Google

Introduction

The Vision Transformer (ViT) is a large-sized model designed for image classification tasks. It was pre-trained on the ImageNet-21k dataset and fine-tuned on ImageNet 2012. The model treats images as sequences of patches, allowing it to leverage the transformer architecture for image recognition.

Architecture

ViT uses a transformer encoder architecture, similar to BERT, tailored for image data. Images are divided into 16x16 patches, which are embedded and processed as sequences. A special [CLS] token is added for classification tasks, and absolute position embeddings are used before feeding the data into the transformer layers.
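
As a quick sanity check on those dimensions, the short sketch below (an illustration, not part of the original card) computes the patch grid and resulting token sequence length for the 224x224, patch-16 configuration:

    # For a 224x224 image split into non-overlapping 16x16 patches:
    image_size = 224
    patch_size = 16
    patches_per_side = image_size // patch_size   # 14 patches along each side
    num_patches = patches_per_side ** 2           # 196 patch tokens
    sequence_length = num_patches + 1             # +1 for the [CLS] token -> 197
    print(sequence_length)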

Training

ViT was pre-trained on the ImageNet-21k dataset, which comprises 14 million images across 21,843 classes. It was then fine-tuned on the ImageNet 2012 (ILSVRC2012) dataset of 1 million images and 1,000 classes. Preprocessing includes resizing images to 224x224 and normalizing them. Training was conducted on TPUv3 hardware with a batch size of 4096, and gradient clipping was applied.
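
In torchvision terms, that preprocessing looks roughly like the sketch below; the per-channel mean and standard deviation of 0.5 are assumed values here, and the feature extractor used in the guide that follows applies the checkpoint's actual defaults automatically:

    from torchvision import transforms

    # Approximate equivalent of the checkpoint's preprocessing
    # (mean/std of 0.5 per channel are assumed values).
    preprocess = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),                      # pixel values scaled to [0, 1]
        transforms.Normalize(mean=[0.5, 0.5, 0.5],  # then shifted to roughly [-1, 1]
                             std=[0.5, 0.5, 0.5]),
    ])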

Guide: Running Locally

  1. Install Dependencies:

    • Install the transformers library and other dependencies like PIL for image handling.
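
    For example (the exact package set may vary; PyTorch is assumed as the backend):

    pip install transformers torch pillow requests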
  2. Load the Model:

    from transformers import ViTFeatureExtractor, ViTForImageClassification
    from PIL import Image
    import requests

    # Download a sample image (two cats) from the COCO validation set
    url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
    image = Image.open(requests.get(url, stream=True).raw)

    # Load the preprocessor and the fine-tuned classification model
    # (ViTImageProcessor replaces ViTFeatureExtractor in recent transformers releases)
    feature_extractor = ViTFeatureExtractor.from_pretrained('google/vit-large-patch16-224')
    model = ViTForImageClassification.from_pretrained('google/vit-large-patch16-224')

    # Resize/normalize the image and run it through the model
    inputs = feature_extractor(images=image, return_tensors="pt")
    outputs = model(**inputs)
    logits = outputs.logits

    # The highest logit corresponds to one of the 1,000 ImageNet classes
    predicted_class_idx = logits.argmax(-1).item()
    print("Predicted class:", model.config.id2label[predicted_class_idx])
    
  3. Consider Cloud GPUs:
    For faster inference or fine-tuning, cloud services such as AWS, Google Cloud, or Azure offer GPU instances; see the device-placement sketch after this step.
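
    If a GPU is available locally or on one of those cloud instances, inference can be moved onto it with standard PyTorch device handling. A minimal sketch, reusing the model and inputs loaded in step 2 and assuming CUDA is present:

    import torch

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device)                                 # model loaded in step 2
    inputs = {k: v.to(device) for k, v in inputs.items()}    # move the pixel values as well

    with torch.no_grad():                                    # no gradients needed for inference
        logits = model(**inputs).logits
    print("Predicted class:", model.config.id2label[logits.argmax(-1).item()])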

License

The ViT model is licensed under the Apache-2.0 License, which allows for both personal and commercial use, modification, distribution, and private use.
