microsoft/xclip-large-patch14-16-frames
Introduction
X-CLIP is a large-sized video classification model developed by Microsoft and trained on the Kinetics-400 dataset. It extends the CLIP architecture to video-language understanding and was trained in a fully supervised fashion, processing 16 frames per video at a resolution of 336x336. On Kinetics-400 it reaches top-1 and top-5 accuracies of 87.7% and 97.4%, respectively.
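For a quick sanity check, the checkpoint can be loaded through the Hugging Face transformers library. The sketch below assumes the Hub checkpoint name microsoft/xclip-large-patch14-16-frames and scores a placeholder 16-frame, 336x336 clip against a few free-text labels:

```python
import numpy as np
import torch
from transformers import XCLIPProcessor, XCLIPModel

# Checkpoint name assumed from the model card title; verify it on the Hugging Face Hub.
model_name = "microsoft/xclip-large-patch14-16-frames"
processor = XCLIPProcessor.from_pretrained(model_name)
model = XCLIPModel.from_pretrained(model_name)

# 16 frames of 336x336 RGB video (random placeholder pixels for illustration).
video = list(np.random.randint(0, 256, (16, 336, 336, 3), dtype=np.uint8))

inputs = processor(
    text=["playing guitar", "riding a bike", "cooking"],
    videos=video,
    return_tensors="pt",
    padding=True,
)

with torch.no_grad():
    outputs = model(**inputs)

# Video-to-text similarity scores; softmax turns them into label probabilities.
probs = outputs.logits_per_video.softmax(dim=1)
print(probs)
```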
Architecture
X-CLIP extends the CLIP model to video-language tasks. It is trained contrastively on paired video and text data, which makes it suitable for zero-shot, few-shot, and fully supervised video classification, as well as video-text retrieval.
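Because the video and text encoders are trained contrastively, they can also be used separately to embed clips and captions into a shared space, which is how video-text retrieval is typically done. A minimal sketch, assuming the same checkpoint name as above and placeholder pixel data:

```python
import numpy as np
import torch
from transformers import XCLIPProcessor, XCLIPModel

model_name = "microsoft/xclip-large-patch14-16-frames"  # assumed checkpoint name
processor = XCLIPProcessor.from_pretrained(model_name)
model = XCLIPModel.from_pretrained(model_name)

video = list(np.random.randint(0, 256, (16, 336, 336, 3), dtype=np.uint8))  # placeholder clip
captions = ["a person playing guitar", "a dog catching a frisbee"]
inputs = processor(text=captions, videos=video, return_tensors="pt", padding=True)

with torch.no_grad():
    video_emb = model.get_video_features(pixel_values=inputs["pixel_values"])
    text_emb = model.get_text_features(
        input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
    )

# Cosine similarity between the clip and each caption drives retrieval ranking.
video_emb = video_emb / video_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
print(video_emb @ text_emb.T)
```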
Training
The model was trained on the Kinetics-400 dataset. During training, frames are resized and normalized using ImageNet statistics; the X-CLIP repository documents the full training setup and provides the training and validation preprocessing scripts.
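The exact pipeline lives in the repository's scripts, but per-frame preprocessing of the kind described here can be sketched with torchvision (the normalization constants below are the standard ImageNet mean/std and should be checked against the repository before use):

```python
from torchvision import transforms

# Sketch of per-frame preprocessing: resize and crop to the 336x336 input
# resolution, then normalize. Mean/std shown are standard ImageNet values;
# confirm the exact constants against the X-CLIP repository scripts.
frame_transform = transforms.Compose([
    transforms.Resize(336),
    transforms.CenterCrop(336),
    transforms.ToTensor(),  # PIL image -> float tensor in [0, 1], CHW layout
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# frames: a list of 16 PIL.Image frames sampled from a clip
# clip = torch.stack([frame_transform(f) for f in frames])  # shape (16, 3, 336, 336)
```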
Guide: Running Locally
- Install Dependencies: Ensure you have Python and PyTorch installed. Use pip to install the transformers library from Hugging Face.
- Clone the Repository: Clone the X-CLIP GitHub repository for access to code and training scripts.
- Prepare Dataset: Download the Kinetics-400 dataset or use a similar video dataset.
- Run Preprocessing: Use provided scripts to preprocess the dataset, resizing frames and normalizing them (a frame-sampling sketch follows this list).
- Train or Evaluate the Model: Use scripts from the repository to train or evaluate the model on your dataset.
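The repository's loaders handle frame sampling, but as a rough illustration of the preprocessing step above, the hypothetical helper below uses PyAV (pip install av) to uniformly sample the 16 frames the model expects from a single clip; the resulting list can be passed straight to the processor shown earlier.

```python
import av
import numpy as np

def sample_frames(video_path: str, num_frames: int = 16):
    """Uniformly sample RGB frames from a video file (PyAV-based sketch)."""
    container = av.open(video_path)
    # `.frames` can be 0 for some containers; this sketch assumes it is populated.
    total = container.streams.video[0].frames
    indices = set(np.linspace(0, total - 1, num_frames).astype(int).tolist())
    frames = []
    for i, frame in enumerate(container.decode(video=0)):
        if i in indices:
            frames.append(frame.to_ndarray(format="rgb24"))
    return frames  # list of (H, W, 3) uint8 arrays, ready for the processor

# Example (hypothetical path): frames = sample_frames("data/kinetics400/clip_0001.mp4")
```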
For enhanced performance, consider using cloud GPU services such as AWS, Google Cloud, or Azure.
License
The X-CLIP model is released under the MIT License, allowing for wide use and modification.