VideoMAE Large (MCG-NJU)
Introduction
VideoMAE is a model designed for video classification, extending the concept of Masked Autoencoders (MAE) to video processing. It was pre-trained on the Kinetics-400 dataset in a self-supervised manner for 1600 epochs. The model architecture is similar to a Vision Transformer (ViT) with an added decoder for predicting pixel values in masked patches. VideoMAE is primarily intended for fine-tuning on specific video-related tasks.
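Because the checkpoint is intended for fine-tuning, downstream inference typically goes through a classification head. As a minimal sketch, assuming a Kinetics-400 fine-tuned variant such as MCG-NJU/videomae-large-finetuned-kinetics is available (the checkpoint name here is an assumption; substitute whichever fine-tuned VideoMAE classifier you use):
from transformers import VideoMAEImageProcessor, VideoMAEForVideoClassification
import numpy as np
import torch

# Dummy clip of 16 RGB frames; replace with real video frames.
video = list(np.random.randn(16, 3, 224, 224))

# Assumed checkpoint name; any fine-tuned VideoMAE classifier works the same way.
ckpt = "MCG-NJU/videomae-large-finetuned-kinetics"
processor = VideoMAEImageProcessor.from_pretrained(ckpt)
model = VideoMAEForVideoClassification.from_pretrained(ckpt)

inputs = processor(video, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # (1, num_labels); 400 classes for Kinetics-400
print(model.config.id2label[logits.argmax(-1).item()])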
Architecture
VideoMAE processes a video as a sequence of fixed-size patches (16x16 resolution), which are linearly embedded. A classification token ([CLS]) is prepended for classification tasks, and fixed sine/cosine position embeddings are added before the sequence is fed through the Transformer encoder layers. The model thereby learns an internal representation of video data that can be leveraged for downstream tasks by placing a classifier on top of the pre-trained encoder.
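To make the patching concrete, here is the sequence-length arithmetic under the default configuration (224x224 frames, 16x16 patches, tubelet size 2, 16 input frames); this is a sketch derived from the config values, not an official API:
# Tokens per clip under the default VideoMAE configuration.
image_size = 224
patch_size = 16
tubelet_size = 2   # each token spans 2 consecutive frames
num_frames = 16

patches_per_frame = (image_size // patch_size) ** 2  # 14 * 14 = 196
num_tubelets = num_frames // tubelet_size            # 16 // 2 = 8
seq_length = num_tubelets * patches_per_frame        # 8 * 196 = 1568
print(seq_length)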
Training
The model was pre-trained for 1600 epochs on the Kinetics-400 dataset in a self-supervised fashion. Details of the training data, preprocessing, and evaluation results are not yet documented; contributions are welcome.
Guide: Running Locally
To use VideoMAE locally, run the following example, which applies the pre-training objective to a dummy clip:
from transformers import VideoMAEImageProcessor, VideoMAEForPreTraining
import numpy as np
import torch

num_frames = 16
# Dummy clip: 16 RGB frames of size 224x224 (replace with real video frames).
video = list(np.random.randn(num_frames, 3, 224, 224))

processor = VideoMAEImageProcessor.from_pretrained("MCG-NJU/videomae-large")
model = VideoMAEForPreTraining.from_pretrained("MCG-NJU/videomae-large")

# Normalize and batch the frames: shape (1, num_frames, 3, 224, 224).
pixel_values = processor(video, return_tensors="pt").pixel_values

# One token per 16x16 patch per tubelet (tubelet_size consecutive frames).
num_patches_per_frame = (model.config.image_size // model.config.patch_size) ** 2
seq_length = (num_frames // model.config.tubelet_size) * num_patches_per_frame

# Random boolean mask over the token sequence; True marks masked positions.
bool_masked_pos = torch.randint(0, 2, (1, seq_length)).bool()

outputs = model(pixel_values, bool_masked_pos=bool_masked_pos)
loss = outputs.loss
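The returned loss is the mean squared error between the reconstructed and original pixel values of the masked tubelets. Note that the 50/50 random mask above is only for demonstration; the original VideoMAE recipe uses tube masking with a much higher masking ratio (around 90%).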
For efficient processing, consider using cloud GPU services such as AWS EC2, Google Cloud, or Azure.
License
The VideoMAE model is released under the Creative Commons Attribution-NonCommercial 4.0 International License (cc-by-nc-4.0).