videomae base
MCG-NJU
Introduction
VideoMAE is a pre-trained model designed for video classification tasks. It extends the concept of Masked Autoencoders (MAE) to videos, leveraging self-supervised learning to efficiently process video data. The model was introduced by Tong et al. in the paper "VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training".
Architecture
VideoMAE uses a Vision Transformer (ViT) encoder, with a decoder added on top to predict pixel values for masked patches. Videos are presented to the model as sequences of fixed-size patches, which are linearly embedded. A classification token ([CLS]) is added to the sequence for classification tasks, and fixed sine/cosine position embeddings are added before the sequence is fed into the Transformer encoder layers. Pre-training in this way lets the model learn an internal representation of videos that can be reused for downstream tasks.
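To make the patch sequence and the fixed position embeddings concrete, here is a minimal NumPy sketch. It assumes the standard Transformer sine/cosine formula and typical base-model settings (16 frames, 224x224 input, 16x16 patches, tubelet size 2, hidden size 768); none of these values is stated in this card, and the function name `sinusoid_position_table` is hypothetical, so treat it as an illustration rather than the model's actual implementation.

```python
import numpy as np

def sinusoid_position_table(num_positions: int, hidden_size: int) -> np.ndarray:
    """Build a fixed (non-learned) sine/cosine position table of shape (N, D)."""
    positions = np.arange(num_positions)[:, None]   # shape (N, 1)
    dims = np.arange(hidden_size)[None, :]          # shape (1, D)
    # Each pair of dimensions (2k, 2k+1) shares one frequency.
    angles = positions / np.power(10000.0, 2 * (dims // 2) / hidden_size)
    table = np.empty((num_positions, hidden_size))
    table[:, 0::2] = np.sin(angles[:, 0::2])        # even dimensions use sine
    table[:, 1::2] = np.cos(angles[:, 1::2])        # odd dimensions use cosine
    return table

# Assumed settings (not given in this card).
num_frames, image_size, patch_size, tubelet_size, hidden_size = 16, 224, 16, 2, 768

# One token per tubelet: temporal slices times spatial patches per frame.
seq_length = (num_frames // tubelet_size) * (image_size // patch_size) ** 2
table = sinusoid_position_table(seq_length, hidden_size)
print(seq_length, table.shape)
```

Because the table depends only on the sequence length and hidden size, it can be computed once and added to the patch embeddings without any trainable parameters.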
Training
The VideoMAE model is pre-trained on the Kinetics-400 dataset for 1600 epochs using a self-supervised approach. The pre-trained weights provide a general-purpose video representation that can then be fine-tuned for specific tasks such as video classification.
Guide: Running Locally
- Installation: Ensure you have Python and PyTorch installed. Install the transformers library from Hugging Face using pip:
    pip install transformers
- Model Loading: Use the following code to load the VideoMAE model and processor:
    from transformers import VideoMAEImageProcessor, VideoMAEForPreTraining

    processor = VideoMAEImageProcessor.from_pretrained("MCG-NJU/videomae-base")
    model = VideoMAEForPreTraining.from_pretrained("MCG-NJU/videomae-base")
- Inference: Prepare your video data and run the model to predict masked patches:
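    import numpy as np
    import torch

    # Dummy clip: num_frames random frames of shape (channels, height, width)
    num_frames = 16
    video = list(np.random.randn(num_frames, 3, 224, 224))

    pixel_values = processor(video, return_tensors="pt").pixel_values

    # One token per tubelet: temporal slices times spatial patches per frame
    num_patches_per_frame = (model.config.image_size // model.config.patch_size) ** 2
    seq_length = (num_frames // model.config.tubelet_size) * num_patches_per_frame

    # Random boolean mask; True marks a patch the model must reconstruct
    bool_masked_pos = torch.randint(0, 2, (1, seq_length)).bool()

    outputs = model(pixel_values, bool_masked_pos=bool_masked_pos)
    loss = outputs.loss  # reconstruction loss on the masked patches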
- Cloud GPUs: For better performance, particularly with large video datasets, consider using cloud-based GPU services such as AWS, Google Cloud, or Azure.
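The inference step above masks each patch independently with probability 0.5. The VideoMAE paper instead describes tube masking with a high masking ratio, where the same spatial patches are hidden in every temporal slice. The sketch below is one way to build such a mask with NumPy. The helper name `tube_mask`, the 0.9 ratio, and the default sizes are assumptions for illustration, not values taken from this card.

```python
import numpy as np

def tube_mask(num_frames=16, tubelet_size=2, image_size=224, patch_size=16,
              mask_ratio=0.9, seed=None):
    """Return a boolean mask of shape (1, seq_length); True marks a masked patch.

    Hypothetical helper: the same spatial positions are masked in every
    temporal slice, so each masked region forms a "tube" through time.
    """
    rng = np.random.default_rng(seed)
    patches_per_frame = (image_size // patch_size) ** 2
    num_temporal = num_frames // tubelet_size
    num_masked = int(mask_ratio * patches_per_frame)

    # Choose which spatial patches to hide, once for the whole clip.
    spatial = np.zeros(patches_per_frame, dtype=bool)
    spatial[rng.choice(patches_per_frame, size=num_masked, replace=False)] = True

    # Repeat the spatial pattern for every temporal slice (time-major order).
    return np.tile(spatial, num_temporal)[None, :]

mask = tube_mask(seed=0)
print(mask.shape, int(mask.sum()))
```

Assuming the model flattens patches in time-major order, the result can be converted with `torch.from_numpy(mask)` and passed as `bool_masked_pos` in place of the random mask.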
License
The model is released under the Creative Commons Attribution-NonCommercial 4.0 International License (cc-by-nc-4.0). This license allows for sharing and adaptation of the model for non-commercial purposes, provided appropriate credit is given.