videomae base finetuned kinetics
MCG-NJU
Introduction
videomae-base-finetuned-kinetics is a video classification model built on the VideoMAE architecture. It was pre-trained for 1600 epochs in a self-supervised manner and then fine-tuned on the Kinetics-400 dataset. The model uses masked autoencoding (MAE) to learn video representations efficiently. It was introduced by Tong et al. in the paper "VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training."
Architecture
VideoMAE extends Masked Autoencoders (MAE) to video data. Its architecture resembles a Vision Transformer (ViT) with an added decoder that predicts the pixel values of masked patches. Each video is split into a sequence of fixed-size patches (16x16 pixels), which are then linearly embedded. The model also uses a [CLS] token for classification tasks and adds sinusoidal position embeddings before the sequence enters the Transformer encoder layers. Because pre-training yields a general-purpose representation of video, the pre-trained model can also be used to extract features for downstream tasks.
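To make the patch and position-embedding description concrete, here is a minimal sketch in plain Python. It rests on assumptions not stated above: a tubelet size of 2 (each patch spans two consecutive frames), a 16-frame 224x224 clip, and the standard Transformer sine/cosine formula for the position table. The function names are hypothetical helpers, not part of any library.

```python
import math

def num_tokens(num_frames=16, image_size=224, patch_size=16, tubelet_size=2):
    """Count the space-time patches (tokens) produced for one clip."""
    per_frame = (image_size // patch_size) ** 2        # spatial patches per frame
    return (num_frames // tubelet_size) * per_frame    # temporal groups x spatial patches

def sinusoid_table(n_positions, dim):
    """Fixed position table: sine on even channels, cosine on odd channels."""
    table = []
    for pos in range(n_positions):
        row = []
        for i in range(dim):
            angle = pos / (10000 ** (2 * (i // 2) / dim))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

tokens = num_tokens()
table = sinusoid_table(tokens, 8)   # tiny embedding size, for illustration only
print(tokens)         # 1568
print(table[0][:4])   # [0.0, 1.0, 0.0, 1.0]
```

Under these assumptions the encoder sees 8 x 14 x 14 = 1568 tokens per clip, and every token receives a fixed (not learned) position vector.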
Training
The model was pre-trained with a self-supervised objective on a large dataset and then fine-tuned on Kinetics-400 for video classification. The training data, preprocessing, and pre-training procedure are not yet documented in detail, and contributors are encouraged to provide this information. The model achieves a top-1 accuracy of 80.9% and a top-5 accuracy of 94.7% on the Kinetics-400 test set.
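For readers unfamiliar with the metrics, the sketch below shows one common way top-1 and top-5 accuracy are computed from per-class scores. It is an illustration only: the function name is hypothetical, the scores are made-up toy values, and this is not the evaluation code behind the reported numbers.

```python
def top_k_accuracy(logits, labels, k=1):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    hits = 0
    for scores, label in zip(logits, labels):
        # Class indices ordered from highest to lowest score.
        ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)

# Toy scores for 3 clips over 4 classes (made-up numbers, not model output).
toy_logits = [
    [0.1, 2.0, 0.3, 0.5],   # true class 1 is ranked first
    [1.5, 0.2, 1.0, 0.1],   # true class 2 is ranked second
    [0.3, 0.1, 0.2, 0.9],   # true class 1 is ranked last
]
toy_labels = [1, 2, 1]

print(top_k_accuracy(toy_logits, toy_labels, k=1))
print(top_k_accuracy(toy_logits, toy_labels, k=2))
```

Top-5 accuracy is the same computation with k=5, so it can never be lower than top-1 accuracy.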
Guide: Running Locally
To run the VideoMAE model for video classification, follow these steps:
- Install Dependencies: Ensure transformers, torch, and numpy are installed in your Python environment.
- Load the Model: Use the transformers library to load the pre-trained VideoMAE model and processor.
- Prepare Video Input: Convert your video into a format suitable for the model, typically a sequence of frames with dimensions [16, 3, 224, 224].
- Run Inference: Process the video input through the model to obtain classification logits and determine the predicted class.
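The "Prepare Video Input" step usually means choosing 16 frames from a longer video. The sketch below shows one possible way to pick the frame indices by sampling evenly across the clip. The helper name and the even-spacing strategy are assumptions for illustration; the sampling used during the model's training is not documented here, and decoding the frames with a video library is left out.

```python
def sample_frame_indices(total_frames, num_frames=16):
    """Pick num_frames indices spread evenly across a video of total_frames frames."""
    if total_frames <= 0:
        raise ValueError("video has no frames")
    if total_frames <= num_frames:
        # Short video: repeat the last frame to pad up to num_frames.
        return [min(i, total_frames - 1) for i in range(num_frames)]
    # Take the centre frame of each of num_frames equal-length segments.
    segment = total_frames / num_frames
    return [int(segment * i + segment / 2) for i in range(num_frames)]

indices = sample_frame_indices(total_frames=300)
print(indices)
```

The frames at these indices, resized to 224x224 with channels first, would then form the [16, 3, 224, 224] input.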
Example code:
from transformers import VideoMAEImageProcessor, VideoMAEForVideoClassification
import numpy as np
import torch

# Dummy clip: 16 random frames of shape (3, 224, 224); replace with real video frames
video = list(np.random.randn(16, 3, 224, 224))

# Load the processor and the fine-tuned classification model
processor = VideoMAEImageProcessor.from_pretrained("MCG-NJU/videomae-base-finetuned-kinetics")
model = VideoMAEForVideoClassification.from_pretrained("MCG-NJU/videomae-base-finetuned-kinetics")

# Preprocess the frames into model-ready tensors
inputs = processor(video, return_tensors="pt")

# Run inference without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits

# Map the highest-scoring class index to its label
predicted_class_idx = logits.argmax(-1).item()
print("Predicted class:", model.config.id2label[predicted_class_idx])
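The example above reports only the single best class. If you want several ranked candidates with probabilities, a softmax over the logits followed by a sort is one option. The sketch below uses plain Python with made-up scores and labels; the helper names are hypothetical. With the real model you could pass logits[0].tolist() and model.config.id2label instead.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities (shifted by the max for numerical stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_predictions(logits, id2label, k=5):
    """Return the k most probable (label, probability) pairs, best first."""
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(id2label[i], probs[i]) for i in ranked[:k]]

# Made-up scores and labels, for illustration only.
toy_logits = [1.0, 3.0, 0.5, 2.0]
toy_id2label = {0: "archery", 1: "bowling", 2: "cooking", 3: "dancing"}

for label, prob in top_k_predictions(toy_logits, toy_id2label, k=3):
    print(f"{label}: {prob:.3f}")
```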
Cloud GPUs: For efficient computation, consider using cloud-based GPU services like AWS, Google Cloud, or Azure.
License
The model is released under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license. This allows for sharing and adaptation of the model for non-commercial purposes, with appropriate credit given to the original authors.