beit-base-patch16-224-pt22k-ft22k

Microsoft

Introduction

BEiT (BERT Pre-Training of Image Transformers) is a Vision Transformer (ViT) model developed by Microsoft. It was pre-trained in a self-supervised manner on the ImageNet-21k dataset, which comprises 14 million images across 21,841 classes, and then fine-tuned on the same dataset at a resolution of 224x224 pixels. During pre-training, the model predicts visual tokens, obtained from the encoder of OpenAI's DALL-E VQ-VAE, for masked image patches.

Architecture

The BEiT model is a transformer encoder model similar to BERT. Images are processed as sequences of fixed-size patches (16x16 resolution), which are linearly embedded. Unlike the original ViT models, BEiT uses relative position embeddings instead of absolute ones, and it performs image classification by placing a classification head on the mean-pooled final hidden states of the patches rather than on the final hidden state of a [CLS] token. Through its pre-training, the model learns a representation of images that is useful for downstream tasks.
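As a quick sanity check on the geometry described above, the patch arithmetic can be worked out directly (the hidden size of 768 is the standard ViT/BEiT-base value and is an assumption, not stated in this card):

```python
# Patch arithmetic for BEiT-base at patch size 16 and resolution 224x224.
image_size = 224
patch_size = 16

patches_per_side = image_size // patch_size  # 224 / 16 = 14
num_patches = patches_per_side ** 2          # 14 * 14 = 196 patches per image

# Hidden dimension of the base model (assumed: standard ViT-base value).
hidden_size = 768

# Each image therefore becomes a sequence of 196 patch embeddings,
# each of dimension 768, before entering the transformer encoder.
print(num_patches, hidden_size)
```

So every 224x224 image is flattened into 196 linearly embedded patches, which is the sequence the encoder operates on.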

Training

Training Data

The model was pre-trained on the ImageNet-21k dataset and fine-tuned on the same dataset, consisting of 14 million images and 21,841 classes.

Training Procedure

Preprocessing

Images are resized to 224x224 resolution and normalized across the RGB channels with a mean of (0.5, 0.5, 0.5) and a standard deviation of (0.5, 0.5, 0.5). Further preprocessing details can be found in the BEiT datasets script.
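A minimal sketch of the per-pixel normalization described above (resizing is omitted; in practice the image processor handles both steps):

```python
# With mean = std = 0.5, a pixel in [0, 255] is scaled to [0, 1]
# and then mapped to the range [-1, 1].
MEAN = 0.5
STD = 0.5

def normalize(pixel: float) -> float:
    """Normalize a single channel value from [0, 255] to [-1, 1]."""
    return (pixel / 255.0 - MEAN) / STD

print(normalize(0), normalize(255))  # endpoints map to -1.0 and 1.0
```

This is exactly what mean/std values of 0.5 achieve: the input range is centered at zero with extremes at -1 and 1.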

Pretraining

For hyperparameters related to pre-training, refer to page 15 of the BEiT paper.

Guide: Running Locally

To run the model locally for image classification, follow these steps:

  1. Install the transformers library.

  2. Use the following Python code to classify an image:

    from transformers import BeitImageProcessor, BeitForImageClassification
    from PIL import Image
    import requests
    import torch
    
    # Load an example image from the COCO validation set.
    url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
    image = Image.open(requests.get(url, stream=True).raw)
    
    # Load the image processor and the fine-tuned model from the Hugging Face Hub.
    processor = BeitImageProcessor.from_pretrained('microsoft/beit-base-patch16-224-pt22k-ft22k')
    model = BeitForImageClassification.from_pretrained('microsoft/beit-base-patch16-224-pt22k-ft22k')
    
    # Preprocess the image and run a forward pass (no gradients needed for inference).
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    logits = outputs.logits
    
    # The model predicts one of the 21,841 ImageNet-21k classes.
    predicted_class_idx = logits.argmax(-1).item()
    print("Predicted class:", model.config.id2label[predicted_class_idx])
    
  3. Consider using cloud-based GPUs such as AWS, GCP, or Azure for efficient processing, especially for larger datasets or models.
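If class probabilities are needed rather than the raw logits from step 2, a softmax can be applied. A minimal pure-Python sketch of that conversion (the example logits are made up for illustration):

```python
import math

def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for three classes; the largest logit yields
# the largest probability, matching argmax on the raw logits.
probs = softmax([2.0, 1.0, 0.1])
print(probs)
```

In practice, `logits.softmax(-1)` on the PyTorch tensor does the same thing; the predicted class index is unchanged either way.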

License

This model is licensed under the Apache 2.0 License.
