VideoMAE Base (MCG-NJU)

Introduction

VideoMAE is a pre-trained model intended for video classification tasks. It extends Masked Autoencoders (MAE) to video, learning representations in a self-supervised, data-efficient way by reconstructing masked video patches. The model was introduced by Tong et al. in the paper "VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training".

Architecture

The architecture of VideoMAE resembles a Vision Transformer (ViT) with an added decoder that predicts pixel values for the masked patches. Videos are presented to the model as a sequence of fixed-size patches, which are linearly embedded. A classification token ([CLS]) is added to the sequence to support classification tasks, and fixed sine/cosine position embeddings are added before the sequence is fed to the Transformer encoder layers. This approach lets the model learn an internal representation of videos that can be reused for various downstream tasks.
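
As an illustration of how this patch geometry determines the encoder's input length, the sketch below reads the relevant fields from the transformers VideoMAEConfig (assuming its defaults, which correspond to the base checkpoint):

    from transformers import VideoMAEConfig

    # Defaults: 224x224 frames, 16x16 patches, 16 input frames grouped into
    # tubelets spanning 2 frames each.
    config = VideoMAEConfig()

    patches_per_frame = (config.image_size // config.patch_size) ** 2  # 14 * 14 = 196
    num_tubelets = config.num_frames // config.tubelet_size            # 16 / 2 = 8
    print(patches_per_frame * num_tubelets)                            # 1568 patch tokens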

Training

The VideoMAE model is pre-trained on the Kinetics-400 dataset for 1600 epochs using a self-supervised approach. This pre-training equips the model with a foundational understanding of video content, which can be fine-tuned for specific tasks such as video classification.
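
As a minimal sketch of that fine-tuning path, the pre-trained checkpoint can be loaded into VideoMAEForVideoClassification, which places a freshly initialized classification head on top of the encoder (the two-class label set below is a placeholder):

    from transformers import VideoMAEForVideoClassification

    # The classification head is newly initialized and must be trained on labeled videos.
    model = VideoMAEForVideoClassification.from_pretrained(
        "MCG-NJU/videomae-base",
        num_labels=2,                               # placeholder label count
        id2label={0: "class_a", 1: "class_b"},      # placeholder label names
        label2id={"class_a": 0, "class_b": 1},
    )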

Guide: Running Locally

  1. Installation: Ensure you have Python and PyTorch installed. Install the transformers library from Hugging Face using pip:
    pip install transformers
    
  2. Model Loading: Use the following code to load the VideoMAE model and processor:
    from transformers import VideoMAEImageProcessor, VideoMAEForPreTraining
    
    # The processor resizes and normalizes frames; the model includes the
    # decoder used during masked pre-training.
    processor = VideoMAEImageProcessor.from_pretrained("MCG-NJU/videomae-base")
    model = VideoMAEForPreTraining.from_pretrained("MCG-NJU/videomae-base")
    
  3. Inference: Prepare your video data and run the model to predict masked patches:
    import numpy as np
    import torch
    
    # 16 frames of 224x224 RGB; random values stand in for a real clip
    num_frames = 16
    video = list(np.random.randn(16, 3, 224, 224))
    pixel_values = processor(video, return_tensors="pt").pixel_values
    
    # One token per 16x16 patch per tubelet of 2 frames: (224/16)^2 * (16/2) = 1568
    num_patches_per_frame = (model.config.image_size // model.config.patch_size) ** 2
    seq_length = (num_frames // model.config.tubelet_size) * num_patches_per_frame
    bool_masked_pos = torch.randint(0, 2, (1, seq_length)).bool()
    
    # The loss is the reconstruction error over the masked patches
    outputs = model(pixel_values, bool_masked_pos=bool_masked_pos)
    loss = outputs.loss
    
  4. Cloud GPUs: For better performance, particularly with large video datasets, consider using a cloud-based GPU service such as AWS, Google Cloud, or Azure; a minimal sketch for running the forward pass on a GPU follows this list.
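
A minimal sketch for running the forward pass from step 3 on a GPU, assuming one is available:

    import torch

    # Reuses model, pixel_values and bool_masked_pos from step 3.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device)

    outputs = model(pixel_values.to(device), bool_masked_pos=bool_masked_pos.to(device))
    print(outputs.loss.item())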

License

The model is released under the Creative Commons Attribution-NonCommercial 4.0 International License (cc-by-nc-4.0). This license allows for sharing and adaptation of the model for non-commercial purposes, provided appropriate credit is given.
