X-CLIP large (patch 14, 16 frames)

Introduction

X-CLIP is a video classification model developed by Microsoft that extends the CLIP architecture to video-language understanding. This large-sized checkpoint was trained in a fully supervised fashion on the Kinetics-400 dataset and processes 16 frames per video at a resolution of 336x336. It reaches top-1 and top-5 accuracies of 87.7% and 97.4%, respectively, on Kinetics-400.
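As a minimal sketch of how the checkpoint can be used with the Hugging Face transformers library, the snippet below scores a 16-frame clip against a few text prompts. The checkpoint identifier and the candidate labels are assumptions, and random frames stand in for a real 336x336 clip.

```python
# Minimal inference sketch: score a clip against text prompts.
# The checkpoint name and labels are assumed; the random frames are a stand-in
# for 16 real frames sampled from a video.
import numpy as np
import torch
from transformers import XCLIPProcessor, XCLIPModel

name = "microsoft/xclip-large-patch14-16-frames"  # assumed Hub identifier
processor = XCLIPProcessor.from_pretrained(name)
model = XCLIPModel.from_pretrained(name)

# 16 RGB frames of 336x336, e.g. sampled uniformly from a clip (dummy data here).
video = list(np.random.randint(0, 256, (16, 336, 336, 3), dtype=np.uint8))
labels = ["playing basketball", "cooking", "playing guitar"]

inputs = processor(text=labels, videos=video, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Video-to-text similarity scores, converted to probabilities over the labels.
probs = outputs.logits_per_video.softmax(dim=1)
print(dict(zip(labels, probs[0].tolist())))
```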

Architecture

X-CLIP is designed as an extension of the CLIP model for general video-language understanding. It is trained contrastively on paired video and text data, which makes it suitable for zero-shot, few-shot, and fully supervised video classification, as well as video-text retrieval.
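Because video and text are embedded into a shared space, retrieval-style use amounts to comparing the two embeddings directly. The sketch below illustrates this with the model's separate feature-extraction methods; the checkpoint name is assumed and the dummy clip takes the place of real frames.

```python
# Sketch of the contrastive setup: embed video and text separately, then
# compare them in the shared projection space (e.g. for video-text retrieval).
import numpy as np
import torch
from transformers import XCLIPProcessor, XCLIPModel

name = "microsoft/xclip-large-patch14-16-frames"  # assumed Hub identifier
processor = XCLIPProcessor.from_pretrained(name)
model = XCLIPModel.from_pretrained(name)

video = list(np.random.randint(0, 256, (16, 336, 336, 3), dtype=np.uint8))  # dummy clip
text_inputs = processor(text=["a person swimming"], return_tensors="pt", padding=True)
video_inputs = processor(videos=video, return_tensors="pt")

with torch.no_grad():
    text_emb = model.get_text_features(**text_inputs)     # shape: (batch, projection_dim)
    video_emb = model.get_video_features(**video_inputs)  # shape: (batch, projection_dim)

# Cosine similarity between the clip and the query text.
print(torch.nn.functional.cosine_similarity(text_emb, video_emb).item())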

Training

The model was trained using the Kinetics-400 dataset. Preprocessing during training involves resizing frames and normalizing them using ImageNet statistics. The training details are available in the X-CLIP repository, which provides scripts for both training and validation preprocessing.
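For reference, a rough per-frame version of the resize-and-normalize step is sketched below. The exact sampling and augmentation pipeline is defined by the repository's scripts; the values here are the standard ImageNet mean and standard deviation, 336 is this checkpoint's input resolution, and the center crop is a typical validation-style choice rather than something stated above. When using the transformers XCLIPProcessor, equivalent preprocessing is applied automatically.

```python
# Rough sketch of per-frame preprocessing: resize to the model resolution and
# normalize with ImageNet statistics. Not the repository's exact pipeline.
import torch
import torchvision.transforms as T

IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

frame_transform = T.Compose([
    T.ToPILImage(),
    T.Resize(336),
    T.CenterCrop(336),                 # validation-style crop (assumption)
    T.ToTensor(),                      # HxWx3 uint8 -> 3xHxW float in [0, 1]
    T.Normalize(IMAGENET_MEAN, IMAGENET_STD),
])

def preprocess_clip(frames):
    """frames: list of HxWx3 uint8 arrays sampled from a video."""
    return torch.stack([frame_transform(f) for f in frames])  # (T, 3, 336, 336)
```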

Guide: Running Locally

  1. Install Dependencies: Ensure you have Python and PyTorch installed. Use pip to install the transformers library from Hugging Face.
  2. Clone the Repository: Clone the X-CLIP GitHub repository for access to code and training scripts.
  3. Prepare Dataset: Download the Kinetics-400 dataset or use a similar video dataset.
  4. Run Preprocessing: Use provided scripts to preprocess the dataset, resizing frames and normalizing them.
  5. Train or Evaluate the Model: Use scripts from the repository to train or evaluate the model on your dataset.

For enhanced performance, consider using cloud GPU services such as AWS, Google Cloud, or Azure.
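As a quick sanity check after the steps above, the sketch below samples 16 evenly spaced frames from a local video file and classifies them against a few candidate labels. The file name and labels are placeholders, and decord is just one convenient frame reader (an assumed extra dependency); any decoding library works.

```python
# Quick local check: read a clip, sample 16 frames, and classify it.
# "video.mp4", the labels, and the use of decord are placeholders/assumptions.
import numpy as np
import torch
from decord import VideoReader
from transformers import XCLIPProcessor, XCLIPModel

name = "microsoft/xclip-large-patch14-16-frames"  # assumed Hub identifier
processor = XCLIPProcessor.from_pretrained(name)
model = XCLIPModel.from_pretrained(name)

vr = VideoReader("video.mp4")
indices = np.linspace(0, len(vr) - 1, num=16).astype(int)  # 16 evenly spaced frames
frames = list(vr.get_batch(indices).asnumpy())             # each frame: HxWx3 uint8

labels = ["riding a bike", "playing piano", "walking the dog"]
inputs = processor(text=labels, videos=frames, return_tensors="pt", padding=True)
with torch.no_grad():
    probs = model(**inputs).logits_per_video.softmax(dim=1)[0]

print(labels[int(probs.argmax())], float(probs.max()))
```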

License

The X-CLIP model is released under the MIT License, allowing for wide use and modification.
