VideoMAE (large-sized model, fine-tuned on Kinetics-400)

Developed by MCG-NJU.

Introduction

VideoMAE is a video classification model that extends Masked Autoencoders (MAE) to video. It was pre-trained for 1600 epochs in a self-supervised manner and then fine-tuned in a supervised way on the Kinetics-400 dataset. Pre-training lets the model learn an internal representation of videos, which can then be used to extract features useful for downstream tasks such as video classification.

Architecture

VideoMAE uses an architecture similar to the Vision Transformer (ViT), with a decoder on top for predicting pixel values of masked patches. Videos are presented to the model as a sequence of fixed-size patches (16x16 resolution), which are linearly embedded. A classification token ([CLS]) is prepended to the sequence for classification tasks, and fixed sinusoidal position embeddings are added before the sequence is fed to the Transformer encoder layers.
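
As a concrete illustration of the masked-autoencoding objective, the sketch below computes the encoder's sequence length from the model configuration and runs a reconstruction forward pass on a dummy clip. It is a minimal sketch, assuming the self-supervised (not fine-tuned) checkpoint MCG-NJU/videomae-large is available on the Hugging Face Hub:

    from transformers import VideoMAEImageProcessor, VideoMAEForPreTraining
    import numpy as np
    import torch
    
    num_frames = 16
    # a dummy video: 16 random frames of shape (3, 224, 224)
    video = list(np.random.randn(16, 3, 224, 224))
    
    processor = VideoMAEImageProcessor.from_pretrained("MCG-NJU/videomae-large")
    model = VideoMAEForPreTraining.from_pretrained("MCG-NJU/videomae-large")
    
    pixel_values = processor(video, return_tensors="pt").pixel_values
    
    # (224 / 16)^2 = 196 spatial patches per frame; frames are grouped into
    # tubelets of tubelet_size frames along time, so 16 frames yield
    # 16 / 2 = 8 temporal slices and 8 * 196 = 1568 patch tokens in total
    num_patches_per_frame = (model.config.image_size // model.config.patch_size) ** 2
    seq_length = (num_frames // model.config.tubelet_size) * num_patches_per_frame
    
    # mask a random subset of patches; the decoder predicts their pixel values
    bool_masked_pos = torch.randint(0, 2, (1, seq_length)).bool()
    
    outputs = model(pixel_values, bool_masked_pos=bool_masked_pos)
    print(outputs.loss)  # reconstruction loss on the masked patches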

Training

Training Data

The training data section is to be contributed by the community.

Training Procedure

The training procedure includes pre-training in a self-supervised manner and fine-tuning on the Kinetics-400 dataset. Details on preprocessing and pre-training are open for community contribution.

Evaluation Results

VideoMAE achieves a top-1 accuracy of 84.7% and a top-5 accuracy of 96.5% on the Kinetics-400 test set.

Guide: Running Locally

To use VideoMAE for video classification, follow these steps:

  1. Install Required Libraries: Ensure that transformers, torch, and numpy are installed.
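    A minimal install command, assuming a standard pip-based environment:
    
    pip install transformers torch numpy
    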
  2. Load the Model and Processor:
    from transformers import VideoMAEImageProcessor, VideoMAEForVideoClassification
    import numpy as np
    import torch
    
    # a dummy video: 16 random frames of shape (3, 224, 224);
    # see the sketch after this list for classifying a real video file
    video = list(np.random.randn(16, 3, 224, 224))
    
    processor = VideoMAEImageProcessor.from_pretrained("MCG-NJU/videomae-large-finetuned-kinetics")
    model = VideoMAEForVideoClassification.from_pretrained("MCG-NJU/videomae-large-finetuned-kinetics")
    
  3. Prepare Inputs and Run Inference:
    # preprocess the frames into pixel_values of shape (1, 16, 3, 224, 224)
    inputs = processor(video, return_tensors="pt")
    
    # run inference without gradient tracking
    with torch.no_grad():
      outputs = model(**inputs)
      logits = outputs.logits  # one score per Kinetics-400 class, shape (1, 400)
    
    # map the highest-scoring class index to its human-readable label
    predicted_class_idx = logits.argmax(-1).item()
    print("Predicted class:", model.config.id2label[predicted_class_idx])
    
  4. Cloud GPUs: Consider using cloud services like AWS, GCP, or Azure for GPU resources to handle video data processing more efficiently.
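
The walkthrough above classifies random noise. To run the same pipeline on a real clip, the sketch below samples 16 evenly spaced frames with the decord library; the file name sample.mp4 is a placeholder assumption, and decord must be installed separately (pip install decord):

    from decord import VideoReader, cpu
    import numpy as np
    import torch
    from transformers import VideoMAEImageProcessor, VideoMAEForVideoClassification
    
    # decode the video and sample 16 evenly spaced frames
    vr = VideoReader("sample.mp4", ctx=cpu(0))
    indices = np.linspace(0, len(vr) - 1, num=16).astype(int)
    frames = list(vr.get_batch(indices).asnumpy())  # 16 arrays of shape (H, W, 3)
    
    processor = VideoMAEImageProcessor.from_pretrained("MCG-NJU/videomae-large-finetuned-kinetics")
    model = VideoMAEForVideoClassification.from_pretrained("MCG-NJU/videomae-large-finetuned-kinetics")
    
    inputs = processor(frames, return_tensors="pt")
    with torch.no_grad():
      logits = model(**inputs).logits
    
    print("Predicted class:", model.config.id2label[logits.argmax(-1).item()])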

License

The model is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License (cc-by-nc-4.0).
