VideoMAE Base (MCG-NJU)

Introduction

VideoMAE is a pre-trained model intended for video classification tasks. It extends Masked Autoencoders (MAE) to video, learning representations in a self-supervised, data-efficient way by reconstructing masked video patches. The model was introduced by Tong et al. in the paper "VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training".

Architecture

The architecture of VideoMAE resembles a Vision Transformer (ViT) with an added decoder that predicts pixel values for the masked patches. Videos are presented to the model as a sequence of fixed-size patches, which are linearly embedded. A classification token ([CLS]) is added to the sequence to support classification tasks, and fixed sine/cosine position embeddings are added before the sequence is fed to the Transformer encoder layers. This approach lets the model learn an internal representation of videos that can be reused for various downstream tasks.
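
As an illustration of how this patch geometry determines the encoder's input length, the sketch below reads the relevant fields from the transformers VideoMAEConfig (assuming its defaults, which correspond to the base checkpoint):

    from transformers import VideoMAEConfig

    # Defaults: 224x224 frames, 16x16 patches, 16 input frames grouped into
    # tubelets spanning 2 frames each.
    config = VideoMAEConfig()

    patches_per_frame = (config.image_size // config.patch_size) ** 2  # 14 * 14 = 196
    num_tubelets = config.num_frames // config.tubelet_size            # 16 / 2 = 8
    print(patches_per_frame * num_tubelets)                            # 1568 patch tokens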

Training

The VideoMAE model is pre-trained on the Kinetics-400 dataset for 1600 epochs using a self-supervised approach. This pre-training equips the model with a foundational understanding of video content, which can be fine-tuned for specific tasks such as video classification.
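
As a minimal sketch of that fine-tuning path, the pre-trained checkpoint can be loaded into VideoMAEForVideoClassification, which places a freshly initialized classification head on top of the encoder (the two-class label set below is a placeholder):

    from transformers import VideoMAEForVideoClassification

    # The classification head is newly initialized and must be trained on labeled videos.
    model = VideoMAEForVideoClassification.from_pretrained(
        "MCG-NJU/videomae-base",
        num_labels=2,                               # placeholder label count
        id2label={0: "class_a", 1: "class_b"},      # placeholder label names
        label2id={"class_a": 0, "class_b": 1},
    )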

Guide: Running Locally

  1. Installation: Ensure you have Python and PyTorch installed. Install the transformers library from Hugging Face using pip:
    pip install transformers
    
  2. Model Loading: Use the following code to load the VideoMAE model and processor:
    from transformers import VideoMAEImageProcessor, VideoMAEForPreTraining
    
    # The processor resizes and normalizes frames; the model includes the
    # decoder used during masked pre-training.
    processor = VideoMAEImageProcessor.from_pretrained("MCG-NJU/videomae-base")
    model = VideoMAEForPreTraining.from_pretrained("MCG-NJU/videomae-base")
    
  3. Inference: Prepare your video data and run the model to predict masked patches:
    import numpy as np
    import torch
    
    # 16 frames of 224x224 RGB; random values stand in for a real clip
    num_frames = 16
    video = list(np.random.randn(16, 3, 224, 224))
    pixel_values = processor(video, return_tensors="pt").pixel_values
    
    # One token per 16x16 patch per tubelet of 2 frames: (224/16)^2 * (16/2) = 1568
    num_patches_per_frame = (model.config.image_size // model.config.patch_size) ** 2
    seq_length = (num_frames // model.config.tubelet_size) * num_patches_per_frame
    bool_masked_pos = torch.randint(0, 2, (1, seq_length)).bool()
    
    # The loss is the reconstruction error over the masked patches
    outputs = model(pixel_values, bool_masked_pos=bool_masked_pos)
    loss = outputs.loss
    
  4. Cloud GPUs: For better performance, particularly with large video datasets, consider using a cloud-based GPU service such as AWS, Google Cloud, or Azure; a minimal sketch for running the forward pass on a GPU follows this list.
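
A minimal sketch for running the forward pass from step 3 on a GPU, assuming one is available:

    import torch

    # Reuses model, pixel_values and bool_masked_pos from step 3.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device)

    outputs = model(pixel_values.to(device), bool_masked_pos=bool_masked_pos.to(device))
    print(outputs.loss.item())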

License

The model is released under the Creative Commons Attribution-NonCommercial 4.0 International License (cc-by-nc-4.0). This license allows for sharing and adaptation of the model for non-commercial purposes, provided appropriate credit is given.
