VideoMAE Large

MCG-NJU

Introduction

VideoMAE is a model for video classification that extends Masked Autoencoders (MAE) to video. It was pre-trained on the Kinetics-400 dataset in a self-supervised manner for 1600 epochs. The architecture resembles a Vision Transformer (ViT) with an added decoder that predicts pixel values for masked patches. VideoMAE is primarily intended to be fine-tuned on downstream video tasks.
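
This pre-trained checkpoint has no classification head; for end-to-end video classification a fine-tuned checkpoint is used instead. A minimal sketch, assuming the fine-tuned checkpoint MCG-NJU/videomae-base-finetuned-kinetics:

from transformers import VideoMAEImageProcessor, VideoMAEForVideoClassification
import numpy as np
import torch

# 16 random frames standing in for a real video clip
video = list(np.random.randn(16, 3, 224, 224))

processor = VideoMAEImageProcessor.from_pretrained("MCG-NJU/videomae-base-finetuned-kinetics")
model = VideoMAEForVideoClassification.from_pretrained("MCG-NJU/videomae-base-finetuned-kinetics")

inputs = processor(video, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# map the highest-scoring logit to its Kinetics-400 label
print(model.config.id2label[logits.argmax(-1).item()])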

Architecture

VideoMAE processes a video as a sequence of fixed-size patches (16x16 resolution) that are linearly embedded. A classification token ([CLS]) is prepended to the sequence for classification tasks, and fixed sinusoidal position embeddings are added before the sequence is passed through the Transformer encoder layers. The model thereby learns an internal representation of video that can be leveraged for downstream tasks by adding a classifier on top of the pre-trained encoder.
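
Concretely, frames are grouped into tubelets before patching, so the encoder's token count depends on the clip length, patch size, and tubelet size. A worked example, assuming this checkpoint's defaults (224x224 input, 16x16 patches, tubelet size 2, 16-frame clips):

# (224 // 16) ** 2 = 196 patches per frame
patches_per_frame = (224 // 16) ** 2

# 16 frames grouped into tubelets of 2 consecutive frames -> 8 tubelets
num_tubelets = 16 // 2

# total encoder sequence length: 8 * 196 = 1568 tokens
seq_length = num_tubelets * patches_per_frame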

Training

The model was pre-trained for 1600 epochs on the Kinetics-400 dataset in a self-supervised fashion. Details on the training data, preprocessing, and evaluation results are not yet documented; contributions are welcome.

Guide: Running Locally

To run VideoMAE locally, the following example predicts pixel values for randomly masked patches of a (random) video:

from transformers import VideoMAEImageProcessor, VideoMAEForPreTraining
import numpy as np
import torch

# a video clip is passed as a list of frames; here, 16 random frames of shape (channels, height, width)
num_frames = 16
video = list(np.random.randn(num_frames, 3, 224, 224))

processor = VideoMAEImageProcessor.from_pretrained("MCG-NJU/videomae-large")
model = VideoMAEForPreTraining.from_pretrained("MCG-NJU/videomae-large")

# resize and normalize the frames into a tensor of shape (batch, frames, channels, height, width)
pixel_values = processor(video, return_tensors="pt").pixel_values

# one token per 16x16 patch per tubelet: (224 // 16) ** 2 = 196 patches per frame,
# and 16 frames // tubelet size 2 = 8 tubelets, so 8 * 196 = 1568 tokens in total
num_patches_per_frame = (model.config.image_size // model.config.patch_size) ** 2
seq_length = (num_frames // model.config.tubelet_size) * num_patches_per_frame

# randomly mask roughly half of the tokens (illustrative only; see the note below)
bool_masked_pos = torch.randint(0, 2, (1, seq_length)).bool()

outputs = model(pixel_values, bool_masked_pos=bool_masked_pos)
loss = outputs.loss  # reconstruction loss on the masked patches
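
Note that the uniform random mask above is only illustrative. The VideoMAE paper masks at a much higher ratio (around 90%) using tube masking, where the same spatial mask is repeated across all tubelets. A minimal sketch of such a mask, assuming the token sequence is ordered tubelet-first (time-major):

mask_ratio = 0.9
num_masked_per_frame = int(mask_ratio * num_patches_per_frame)

# mask the same spatial positions in every tubelet ("tube" masking)
frame_mask = torch.zeros(num_patches_per_frame, dtype=torch.bool)
frame_mask[torch.randperm(num_patches_per_frame)[:num_masked_per_frame]] = True

num_tubelets = num_frames // model.config.tubelet_size
bool_masked_pos = frame_mask.repeat(num_tubelets).unsqueeze(0)  # shape (1, seq_length)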

For efficient processing, consider using cloud GPU services such as AWS EC2, Google Cloud, or Azure.
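
Whether local or in the cloud, moving the model and inputs onto a GPU follows the usual PyTorch pattern:

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

outputs = model(pixel_values.to(device), bool_masked_pos=bool_masked_pos.to(device))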

License

The VideoMAE model is released under the Creative Commons Attribution-NonCommercial 4.0 International License (cc-by-nc-4.0).
