Vision Transformer (ViT-MAE-Large) Model

Introduction

The Vision Transformer (ViT-MAE-Large) is a large-sized ViT model pre-trained with the self-supervised Masked Autoencoder (MAE) method, introduced in the paper "Masked Autoencoders Are Scalable Vision Learners" by Kaiming He et al. The pre-trained encoder is intended to be fine-tuned on downstream tasks such as image classification, and the checkpoint is available on the Hugging Face model hub as facebook/vit-mae-large.

Architecture

The ViT model is a transformer encoder, similar to BERT, that processes an image as a sequence of fixed-size patches. During MAE pre-training, a large fraction (75%) of the patches is randomly masked; the encoder processes the visible patches, and a lightweight decoder reconstructs the masked patches by predicting their raw pixel values.
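
The masking ratio is exposed as a configuration value in the Transformers implementation. The snippet below is a minimal sketch, assuming the ViTMAEConfig class and its mask_ratio attribute; it builds a randomly initialized model with the 75% masking described above.

from transformers import ViTMAEConfig, ViTMAEModel

# mask_ratio controls the fraction of patches hidden during pre-training;
# 0.75 corresponds to the 75% masking described above (assumed default).
config = ViTMAEConfig(mask_ratio=0.75)
model = ViTMAEModel(config)  # randomly initialized encoder, no pre-trained weights
print(config.mask_ratio)     # 0.75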

Training

Pre-training teaches the model an internal representation of images that can be used to extract features for downstream tasks. For instance, a linear classifier can be placed on top of the pre-trained encoder and trained on a labeled dataset to perform image classification.
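
As a hedged illustration of this setup, the sketch below places a linear classifier on top of the pre-trained encoder. The number of labels, the use of the CLS token as the pooled feature, and disabling masking via mask_ratio=0.0 are assumptions for the example rather than part of the official recipe.

import torch
from transformers import ViTMAEModel

# Load the pre-trained encoder; mask_ratio=0.0 keeps all patches visible
# so the encoder can act as a plain feature extractor (assumption).
encoder = ViTMAEModel.from_pretrained("facebook/vit-mae-large", mask_ratio=0.0)

num_labels = 10  # hypothetical number of classes for the downstream task
classifier = torch.nn.Linear(encoder.config.hidden_size, num_labels)

pixel_values = torch.randn(1, 3, 224, 224)  # stand-in for a processed image
features = encoder(pixel_values).last_hidden_state[:, 0]  # CLS token embedding
logits = classifier(features)  # class scores to train against the labeled dataset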

Guide: Running Locally

To use the model locally, you need to install the Hugging Face Transformers library and load the pre-trained model as follows:

from transformers import AutoImageProcessor, ViTMAEForPreTraining
from PIL import Image
import requests

# Download an example image from the COCO validation set
url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)

# Load the image processor and the pre-trained MAE model
processor = AutoImageProcessor.from_pretrained('facebook/vit-mae-large')
model = ViTMAEForPreTraining.from_pretrained('facebook/vit-mae-large')

# Preprocess the image and run a forward pass
inputs = processor(images=image, return_tensors="pt")
outputs = model(**inputs)

loss = outputs.loss                # reconstruction loss on the masked patches
mask = outputs.mask                # binary mask marking which patches were hidden
ids_restore = outputs.ids_restore  # indices that restore the original patch order
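
The returned loss, mask, and ids_restore can be inspected directly. As a hedged follow-up, the lines below continue the snippet above and assume the unpatchify helper exposed by ViTMAEForPreTraining to map the predicted patch pixels back to an image-shaped tensor.

# Continuing the snippet above: inspect the outputs and reconstruct an image.
print(loss.item())        # scalar reconstruction loss over the masked patches
print(mask.shape)         # (1, num_patches); 1 marks a patch that was masked
print(ids_restore.shape)  # (1, num_patches); indices restoring the patch order

# unpatchify (assumed helper on ViTMAEForPreTraining) maps the predicted
# patch pixel values in outputs.logits back to a (1, 3, 224, 224) tensor.
reconstruction = model.unpatchify(outputs.logits)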

Suggested Cloud GPUs

For faster inference and training, consider using cloud GPU services such as AWS EC2, Google Cloud Platform, or Azure, which provide scalable resources optimized for machine learning tasks.
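
If a GPU is available, the model and inputs can be moved onto it before the forward pass. The sketch below continues the snippet above and relies only on standard PyTorch device handling.

import torch

# Move the pre-trained model and the processed inputs onto a GPU when present.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
inputs = {name: tensor.to(device) for name, tensor in inputs.items()}
outputs = model(**inputs)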

License

The ViT-MAE-Large model is released under the Apache License 2.0.
