Introduction

The Vision Transformer (ViT) is a model pretrained with supervised learning on the large ImageNet-21k image dataset at a resolution of 224x224 pixels. Each image is divided into fixed-size patches (16x16 pixels), which are linearly embedded before being fed to the model. A [CLS] token is prepended for classification tasks, and absolute position embeddings are added before the sequence enters the Transformer encoder. The model does not include any fine-tuned heads, as these were zeroed out by Google researchers; it does, however, include a pretrained pooler that can be used for downstream tasks such as image classification.

Architecture

ViT treats images as sequences of patches, embedding them linearly before processing. The model uses a Transformer encoder architecture and adds a [CLS] token for classification. It incorporates absolute position embeddings to maintain the spatial information of patches. The pretrained encoder learns internal image representations, which can be used for tasks such as image classification by adding a linear layer on top of the [CLS] token's final hidden state.
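
As a rough illustration, the PyTorch sketch below mirrors this input pipeline (patch embedding, [CLS] token, position embeddings) with the dimensions described above; the module names and random input are illustrative, not the model's actual implementation.

    import torch
    import torch.nn as nn

    image_size, patch_size, hidden_dim = 224, 16, 768
    num_patches = (image_size // patch_size) ** 2  # 14 x 14 = 196 patches

    # Linear patch embedding, implemented here as a strided convolution (illustrative)
    patch_embed = nn.Conv2d(3, hidden_dim, kernel_size=patch_size, stride=patch_size)
    cls_token = nn.Parameter(torch.zeros(1, 1, hidden_dim))
    pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, hidden_dim))

    pixels = torch.randn(1, 3, image_size, image_size)           # dummy image batch
    patches = patch_embed(pixels).flatten(2).transpose(1, 2)     # (1, 196, 768)
    tokens = torch.cat([cls_token, patches], dim=1) + pos_embed  # (1, 197, 768)
    # `tokens` is the sequence the Transformer encoder consumes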

Training

The model is pretrained on ImageNet-21k, allowing it to learn rich representations of images. These pretrained representations can then be transferred to various downstream tasks by adding a simple linear classification layer on top of the encoder.
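
As a hedged example of this setup, the sketch below attaches a linear classifier to the [CLS] hidden state using the Hugging Face transformers library; the checkpoint name and the 10-class head are placeholders, not part of this repository.

    import torch
    import torch.nn as nn
    from transformers import ViTModel

    # Example backbone: the original Google ImageNet-21k checkpoint (placeholder choice)
    backbone = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")
    classifier = nn.Linear(backbone.config.hidden_size, 10)  # 10 classes, illustrative

    pixel_values = torch.randn(1, 3, 224, 224)               # dummy preprocessed batch
    cls_hidden = backbone(pixel_values=pixel_values).last_hidden_state[:, 0]
    logits = classifier(cls_hidden)                          # shape: (1, 10)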

Guide: Running Locally

  1. Clone the Repository:
    git clone git@hf.co:Genius-Society/ViT
    cd ViT
    
  2. Download the Model:
    from modelscope import snapshot_download
    model_dir = snapshot_download('Genius-Society/ViT')
    
  3. Hardware Recommendation:
    • For faster training and inference, consider using cloud GPU services such as AWS EC2, Google Cloud Platform, or Azure.
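
  4. Run Inference (optional sketch):
    The snippet below is a minimal sketch that assumes the snapshot from step 2 follows the standard Hugging Face ViT layout (model and image-processor configs included); the image path is illustrative.
    from PIL import Image
    from transformers import ViTImageProcessor, ViTModel

    processor = ViTImageProcessor.from_pretrained(model_dir)  # model_dir from step 2
    model = ViTModel.from_pretrained(model_dir)                # assumes an HF-format checkpoint

    image = Image.open("example.jpg")                          # any RGB image (illustrative path)
    inputs = processor(images=image, return_tensors="pt")
    features = model(**inputs).last_hidden_state              # (1, 197, 768) [CLS] + patch features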

License

The Vision Transformer model by Genius-Society is licensed under the MIT License.
