videomae large finetuned kinetics
MCG-NJU
Introduction
VideoMAE is a video classification model developed by extending Masked Autoencoders (MAE) to video data. It is pre-trained for 1600 epochs in a self-supervised manner and fine-tuned on the Kinetics-400 dataset. The model is designed to perform video classification tasks efficiently by learning robust video representations.
Architecture
VideoMAE uses an architecture similar to the Vision Transformer (ViT). During pre-training, a decoder is placed on top of the encoder to predict the pixel values of masked patches. Videos are presented to the model as a sequence of fixed-size patches (16x16 pixels), which are linearly embedded. A classification token ([CLS]) is added to the beginning of the sequence for classification tasks, and fixed sine/cosine position embeddings are added before the sequence is passed to the Transformer encoder layers.
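To make the patch sequence concrete, the sketch below counts the patch tokens the encoder would see for one clip. The 16 frames at 224x224 match the inference example in this guide. The `tubelet_size=2` default (two consecutive frames grouped into each patch) is an assumption not stated in this card, and `num_video_tokens` is a hypothetical helper name used only for illustration.

```python
def num_video_tokens(num_frames=16, image_size=224, patch_size=16, tubelet_size=2):
    """Count the patch tokens produced for one clip (any [CLS] token excluded)."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    if num_frames % tubelet_size != 0:
        raise ValueError("num_frames must be divisible by tubelet_size")
    # Spatial grid: 224 / 16 = 14 patches per side, so 14 * 14 = 196 per frame group.
    patches_per_frame = (image_size // patch_size) ** 2
    # Temporal grid: assumed tubelets of 2 frames, so 16 / 2 = 8 groups.
    temporal_groups = num_frames // tubelet_size
    return temporal_groups * patches_per_frame


print(num_video_tokens())  # 8 * 196 = 1568
```

Under these assumptions, the sequence length grows linearly with the number of frames and quadratically with the frame side length, which is why clip length and resolution dominate memory use.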
Training
Training Data
The training data section is to be contributed by the community.
Training Procedure
The training procedure includes pre-training in a self-supervised manner and fine-tuning on the Kinetics-400 dataset. Details on preprocessing and pre-training are open for community contribution.
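As a rough illustration of the self-supervised stage, the sketch below builds a "tube" mask: one random spatial mask is drawn and reused for every temporal group, so a masked patch stays hidden across the whole clip. This card does not describe the masking scheme, so the 90% mask ratio, the 8 x 196 token grid, and the helper name `tube_mask` are all assumptions made for illustration, not the authors' training code.

```python
import random


def tube_mask(temporal_groups=8, patches_per_frame=196, mask_ratio=0.9, seed=0):
    """Return a temporal_groups x patches_per_frame grid of booleans (True = masked)."""
    rng = random.Random(seed)
    num_masked = int(mask_ratio * patches_per_frame)
    # Draw the spatial positions to hide once...
    masked_positions = set(rng.sample(range(patches_per_frame), num_masked))
    frame_mask = [i in masked_positions for i in range(patches_per_frame)]
    # ...and repeat the same spatial mask along time, forming "tubes".
    return [list(frame_mask) for _ in range(temporal_groups)]


mask = tube_mask()
visible = sum(not m for row in mask for m in row)
print(visible)  # 8 * (196 - 176) = 160 visible tokens under the assumed ratio
```

Only the visible tokens would be fed to the encoder in such a setup, while the decoder is asked to reconstruct the pixels of the masked ones.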
Evaluation Results
VideoMAE achieves a top-1 accuracy of 84.7% and a top-5 accuracy of 96.5% on the Kinetics-400 test set.
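For readers unfamiliar with the two metrics, the following minimal sketch shows how top-1 and top-5 style accuracy are typically computed from per-class scores. The scores and labels are toy values invented for illustration, and `top_k_accuracy` is a hypothetical helper, not part of any evaluation script referenced by this card.

```python
def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices sorted by descending score, truncated to the best k.
        top_k = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        hits += label in top_k
    return hits / len(labels)


# Toy data: 3 samples, 3 classes.
scores = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2], [0.2, 0.3, 0.5]]
labels = [1, 1, 0]
print(top_k_accuracy(scores, labels, k=1))  # 1 of 3 samples correct
print(top_k_accuracy(scores, labels, k=2))  # 2 of 3 samples correct
```

Top-5 accuracy is always at least as high as top-1, since the top-1 prediction is contained in the top-5 set.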
Guide: Running Locally
To use VideoMAE for video classification, follow these steps:
- Install Required Libraries: Ensure that transformers, torch, and numpy are installed.
- Load the Model and Processor:
    from transformers import VideoMAEImageProcessor, VideoMAEForVideoClassification
    import numpy as np
    import torch

    # Dummy clip: 16 random frames, each shaped (channels, height, width) = (3, 224, 224)
    video = list(np.random.randn(16, 3, 224, 224))

    # Download the preprocessing configuration and fine-tuned weights from the Hub
    processor = VideoMAEImageProcessor.from_pretrained("MCG-NJU/videomae-large-finetuned-kinetics")
    model = VideoMAEForVideoClassification.from_pretrained("MCG-NJU/videomae-large-finetuned-kinetics")
- Prepare Inputs and Run Inference:
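    # Preprocess the frames into a batch of PyTorch tensors
    inputs = processor(video, return_tensors="pt")

    # Inference only, so gradient tracking is disabled
    with torch.no_grad():
        outputs = model(**inputs)
        logits = outputs.logits

    # Map the highest-scoring index to its Kinetics-400 label
    predicted_class_idx = logits.argmax(-1).item()
    print("Predicted class:", model.config.id2label[predicted_class_idx])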
- Cloud GPUs: Consider using cloud services like AWS, GCP, or Azure for GPU resources to handle video data processing more efficiently.
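The steps above use random data and report only the single best class. The sketch below covers two small pieces one might add around them: picking 16 evenly spaced frame indices from a longer video, and turning raw logits into a ranked list of labels with probabilities. Both helpers (`sample_frame_indices`, `top_k_labels`) are hypothetical, uniform sampling is an assumption rather than the authors' documented strategy, and decoding the chosen frames is left to whatever video reader is in use.

```python
import math


def sample_frame_indices(total_frames, num_frames=16):
    """Pick num_frames indices spread evenly over a video of total_frames frames."""
    if total_frames < 1 or num_frames < 1:
        raise ValueError("total_frames and num_frames must be positive")
    if num_frames == 1:
        return [0]
    # Indices repeat when the video is shorter than num_frames.
    step = (total_frames - 1) / (num_frames - 1)
    return [round(i * step) for i in range(num_frames)]


def top_k_labels(logits, id2label, k=5):
    """Softmax the logits and return the k most likely (label, probability) pairs."""
    max_logit = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - max_logit) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    return [(id2label[i], probs[i]) for i in ranked]


indices = sample_frame_indices(total_frames=300)
print(len(indices), indices[0], indices[-1])  # 16 0 299

# Toy logits and labels, invented for illustration.
for label, prob in top_k_labels([2.0, 1.0, 0.1], {0: "a", 1: "b", 2: "c"}, k=2):
    print(f"{label}: {prob:.3f}")
```

With the model from the steps above, `top_k_labels(logits[0].tolist(), model.config.id2label)` would list the five most likely classes for the clip.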
License
The model is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License (cc-by-nc-4.0).