microsoft/beit-base-patch16-224-pt22k-ft22k

Introduction
BEiT (BERT Pre-Training of Image Transformers) is a Vision Transformer (ViT) model developed by Microsoft. It is pre-trained in a self-supervised manner on the ImageNet-21k dataset, which comprises 14 million images across 21,841 classes, and then fine-tuned on the same dataset at a resolution of 224x224 pixels. During pre-training, the model learns to predict the visual tokens of masked image patches, where the tokens come from the encoder of OpenAI's DALL-E VQ-VAE.
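As a rough illustration of this objective (a sketch, not part of the model card: it uses the pre-trained-only checkpoint microsoft/beit-base-patch16-224-pt22k and a random patch mask chosen here purely for demonstration), the masked-image-modeling head can be exercised through the transformers library:

```python
from transformers import BeitImageProcessor, BeitForMaskedImageModeling
from PIL import Image
import requests
import torch

url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)

# Pre-trained-only checkpoint (before classification fine-tuning)
processor = BeitImageProcessor.from_pretrained('microsoft/beit-base-patch16-224-pt22k')
model = BeitForMaskedImageModeling.from_pretrained('microsoft/beit-base-patch16-224-pt22k')

pixel_values = processor(images=image, return_tensors="pt").pixel_values

# Randomly mask patch positions; during pre-training the model is asked
# to predict the DALL-E visual token at each masked position
num_patches = (model.config.image_size // model.config.patch_size) ** 2
bool_masked_pos = torch.randint(low=0, high=2, size=(1, num_patches)).bool()

outputs = model(pixel_values, bool_masked_pos=bool_masked_pos)
logits = outputs.logits  # visual-token scores over the DALL-E codebook
```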
Architecture
The BEiT model is a transformer encoder model similar to BERT. Unlike the original ViT models, BEiT uses relative position embeddings instead of absolute ones, and it classifies images by mean-pooling the final hidden states of the patch tokens rather than by placing a linear layer on the [CLS] token. Images are presented to the model as a sequence of fixed-size patches (16x16 resolution), which are linearly embedded. Through this pre-training, the model learns an inner representation of images that is useful for downstream tasks.
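As a quick sanity check on those numbers (an illustrative calculation, not from the model card):

```python
# Patch arithmetic for a 224x224 input with 16x16 patches
image_size, patch_size, channels = 224, 16, 3
num_patches = (image_size // patch_size) ** 2   # 14 * 14 = 196 patch tokens
patch_dim = patch_size * patch_size * channels  # 768 values per flattened patch
print(num_patches, patch_dim)                   # 196 768
```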
Training
Training Data
The model was pre-trained on the ImageNet-21k dataset and fine-tuned on the same dataset, consisting of 14 million images and 21,841 classes.
Training Procedure
Preprocessing
Images are resized to 224x224 resolution and normalized with a mean of (0.5, 0.5, 0.5) and a standard deviation of (0.5, 0.5, 0.5). Details on preprocessing can be found in the BEiT datasets script.
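BeitImageProcessor applies these steps for you; the torchvision pipeline below is only a hedged re-creation of the described preprocessing for clarity:

```python
from torchvision import transforms

# Resize to 224x224, scale pixels to [0, 1], then normalize each channel
# with mean 0.5 and standard deviation 0.5, as described above
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5)),
])
```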
Pretraining
For hyperparameters related to pre-training, refer to page 15 of the BEiT paper (arXiv:2106.08254).
Guide: Running Locally
To run the model locally for image classification, follow these steps:
- Install the transformers library (for example, with pip install transformers).
- Use the following Python code to classify an image:

```python
from transformers import BeitImageProcessor, BeitForImageClassification
from PIL import Image
import requests

# Load a sample image from the COCO validation set
url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)

# Load the processor and the fine-tuned classification checkpoint
processor = BeitImageProcessor.from_pretrained('microsoft/beit-base-patch16-224-pt22k-ft22k')
model = BeitForImageClassification.from_pretrained('microsoft/beit-base-patch16-224-pt22k-ft22k')

# Preprocess the image and run a forward pass
inputs = processor(images=image, return_tensors="pt")
outputs = model(**inputs)
logits = outputs.logits

# The model predicts one of the 21,841 ImageNet-21k classes
predicted_class_idx = logits.argmax(-1).item()
print("Predicted class:", model.config.id2label[predicted_class_idx])
```
- Consider using cloud-based GPUs such as AWS, GCP, or Azure for efficient processing, especially for larger datasets or models; a sketch of GPU inference follows this list.
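If a GPU is available, locally or on one of the providers above, inference can be moved onto it. The snippet below is a minimal sketch that reuses the model, processor, and inputs from the classification code above:

```python
import torch

# Move the model and inputs to a CUDA device when one is available
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
inputs = {k: v.to(device) for k, v in inputs.items()}

with torch.no_grad():
    logits = model(**inputs).logits
print("Predicted class:", model.config.id2label[logits.argmax(-1).item()])
```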
License
This model is licensed under the Apache 2.0 License.