videomae-base-finetuned-kinetics

MCG-NJU

Introduction

VideoMAE is a base-sized video classification model pre-trained for 1600 epochs in a self-supervised manner and fine-tuned in a supervised way on the Kinetics-400 dataset. It applies the Masked Autoencoder (MAE) approach to video in order to learn video representations efficiently. The model was introduced by Tong et al. in the paper "VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training."

Architecture

VideoMAE extends Masked Autoencoders (MAE) to video data. Its architecture resembles a Vision Transformer (ViT) with an additional decoder on top for predicting pixel values of masked patches. Videos are presented to the model as sequences of fixed-size patches (16x16 resolution), which are linearly embedded. A [CLS] token is added to the beginning of the sequence for classification tasks, and sinusoidal position embeddings are added before the sequence is fed to the Transformer encoder layers. Through pre-training, the model learns an inner representation of videos that can then be used to extract features for downstream tasks.
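
As an illustration of using the encoder as a feature extractor, the sketch below loads the self-supervised base checkpoint MCG-NJU/videomae-base with VideoMAEModel and reads out its final hidden states. The dummy input and the quoted output shape are assumptions for the base configuration, not statements from the original card.

from transformers import VideoMAEImageProcessor, VideoMAEModel
import numpy as np
import torch

# Dummy clip: a list of 16 frames, each of shape (3, 224, 224); placeholder values only.
video = list(np.random.randn(16, 3, 224, 224))

# Assumption: the pre-trained (not fine-tuned) checkpoint is used here,
# since feature extraction does not need the classification head.
processor = VideoMAEImageProcessor.from_pretrained("MCG-NJU/videomae-base")
model = VideoMAEModel.from_pretrained("MCG-NJU/videomae-base")

inputs = processor(video, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Per-patch encoder features; for the base configuration this should be roughly
# (1, 1568, 768): 16 frames / tubelet size 2, 14x14 patches per frame, 768 hidden dims.
features = outputs.last_hidden_state
print(features.shape)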

Training

The model was pre-trained on a large dataset using a self-supervised approach and then fine-tuned in a supervised fashion on Kinetics-400 for video classification. The detailed training data, preprocessing, and pretraining procedures are yet to be documented, and contributors are encouraged to provide this information. The model achieves a top-1 accuracy of 80.9% and a top-5 accuracy of 94.7% on the Kinetics-400 test set.
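
For reference, top-1 and top-5 accuracy can be computed from classification logits as in the sketch below; the logits and labels here are hypothetical placeholders, not Kinetics-400 evaluation data.

import torch

# Hypothetical logits for 8 clips over the 400 Kinetics classes, plus ground-truth labels.
logits = torch.randn(8, 400)
labels = torch.randint(0, 400, (8,))

# Top-1: the highest-scoring class must equal the label.
top1 = (logits.argmax(dim=-1) == labels).float().mean()

# Top-5: the label must appear among the 5 highest-scoring classes.
top5_preds = logits.topk(5, dim=-1).indices
top5 = (top5_preds == labels.unsqueeze(-1)).any(dim=-1).float().mean()

print(f"top-1: {top1:.3f}, top-5: {top5:.3f}")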

Guide: Running Locally

To run the VideoMAE model for video classification, follow these steps:

  1. Install Dependencies: Ensure transformers, torch, and numpy are installed in your Python environment.
  2. Load the Model: Use the transformers library to load the pre-trained VideoMAE model and processor.
  3. Prepare Video Input: Convert your video into a format suitable for the model, typically a sequence of 16 frames with dimensions [16, 3, 224, 224] (a frame-loading sketch follows the example code below).
  4. Run Inference: Process the video input through the model to obtain classification logits and determine the predicted class.

Example code:

from transformers import VideoMAEImageProcessor, VideoMAEForVideoClassification
import numpy as np
import torch

# Dummy clip: a list of 16 frames, each of shape (3, 224, 224).
# Replace this with real video frames for actual use.
video = list(np.random.randn(16, 3, 224, 224))

# Load the processor (resizing/normalization) and the fine-tuned classifier.
processor = VideoMAEImageProcessor.from_pretrained("MCG-NJU/videomae-base-finetuned-kinetics")
model = VideoMAEForVideoClassification.from_pretrained("MCG-NJU/videomae-base-finetuned-kinetics")

# Preprocess the frames into a batched pixel_values tensor.
inputs = processor(video, return_tensors="pt")

# Forward pass without gradient tracking; logits covers the 400 Kinetics classes.
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits

# Map the highest-scoring index to its Kinetics-400 label.
predicted_class_idx = logits.argmax(-1).item()
print("Predicted class:", model.config.id2label[predicted_class_idx])

Cloud GPUs: For efficient computation, consider using cloud-based GPU services like AWS, Google Cloud, or Azure.
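
Whether on a cloud instance or a local machine, inference can be moved onto a GPU. This is a minimal sketch that assumes CUDA is available and reuses the model and inputs from the example above.

import torch

# Move the model and the preprocessed inputs to the GPU when one is available.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
inputs = {k: v.to(device) for k, v in inputs.items()}

with torch.no_grad():
    logits = model(**inputs).logits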

License

The model is released under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license. This allows for sharing and adaptation of the model for non-commercial purposes, with appropriate credit given to the original authors.
