microsoft/xclip-base-patch16
Introduction
X-CLIP is a video classification model that extends CLIP to video-language understanding. It can be used for zero-shot, few-shot, and fully supervised video classification as well as video-text retrieval. This base-sized model uses a patch resolution of 16 and was trained on the Kinetics-400 dataset.
Architecture
The X-CLIP model builds upon the CLIP architecture to handle video data. It is trained contrastively on (video, text) pairs, which lets the same backbone serve zero-shot, few-shot, and fully supervised video classification as well as video-text retrieval. The model processes 8 frames per video at a resolution of 224x224.
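As a shape check, the sketch below feeds a dummy 8-frame clip through the model and extracts its video embedding. It assumes the get_video_features helper exposed by the Transformers X-CLIP implementation; the random pixel values are placeholders only.

import torch
from transformers import XCLIPModel

model = XCLIPModel.from_pretrained("microsoft/xclip-base-patch16")

# Dummy clip: (batch, frames, channels, height, width) = (1, 8, 3, 224, 224)
pixel_values = torch.randn(1, 8, 3, 224, 224)

with torch.no_grad():
    video_embeds = model.get_video_features(pixel_values=pixel_values)

print(video_embeds.shape)  # one embedding per video, matched contrastively against text embeddings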
Training
X-CLIP was trained on the Kinetics-400 dataset, a large-scale video dataset. During preprocessing, the shorter edge of each frame is resized, the frame is center-cropped to 224x224, and pixel values are normalized using the ImageNet mean and standard deviation. On Kinetics-400, the model achieves a top-1 accuracy of 83.8% and a top-5 accuracy of 95.7%.
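For illustration, here is a torchvision sketch of that per-frame pipeline. The resize target of 224 for the shorter edge is an assumption (the exact value is not stated above), the mean/std are the standard ImageNet numbers, and in practice the XCLIPProcessor bundled with the checkpoint applies the correct preprocessing automatically.

from torchvision import transforms

# Hedged sketch of the per-frame preprocessing described above
frame_transform = transforms.Compose([
    transforms.Resize(224),        # resize so the shorter edge is 224 (assumed target size)
    transforms.CenterCrop(224),    # center-crop to 224x224
    transforms.ToTensor(),         # HWC uint8 -> CHW float in [0, 1]
    transforms.Normalize(
        mean=[0.485, 0.456, 0.406],  # ImageNet mean
        std=[0.229, 0.224, 0.225],   # ImageNet std
    ),
])
# Stacking 8 transformed frames gives the (8, 3, 224, 224) clip the model expects per video.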
Guide: Running Locally
- Setup Environment: Ensure you have Python and PyTorch installed. Install the Hugging Face Transformers library.
pip install transformers
- Load Model: Use the Transformers library to load the X-CLIP model.
from transformers import XCLIPModel, XCLIPProcessor

model = XCLIPModel.from_pretrained("microsoft/xclip-base-patch16")
processor = XCLIPProcessor.from_pretrained("microsoft/xclip-base-patch16")
- Prepare Data: Resize and normalize video frames following the preprocessing used in training; the XCLIPProcessor loaded above applies these steps for you.
- Inference: Use the model for video classification or retrieval tasks, as in the sketch after this list.
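A minimal zero-shot classification sketch follows. The label set is hypothetical, and the random frames stand in for 8 frames sampled from a real video (decode your own clip, e.g. with decord or PyAV, and substitute them).

import numpy as np
import torch
from transformers import XCLIPModel, XCLIPProcessor

model = XCLIPModel.from_pretrained("microsoft/xclip-base-patch16")
processor = XCLIPProcessor.from_pretrained("microsoft/xclip-base-patch16")

# 8 frames sampled from a video, each an HxWx3 uint8 array; random placeholders here
frames = [np.random.randint(0, 256, (360, 640, 3), dtype=np.uint8) for _ in range(8)]
candidate_labels = ["playing guitar", "cooking", "dancing"]  # hypothetical labels

inputs = processor(text=candidate_labels, videos=frames, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# logits_per_video holds the video-text similarity scores from the contrastive head
probs = outputs.logits_per_video.softmax(dim=1)
for label, p in zip(candidate_labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")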
Cloud GPUs: Consider using cloud services like AWS, Google Cloud, or Azure for access to GPUs, which can accelerate model inference and training.
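If a GPU is available (locally or on a cloud instance), moving the model and inputs onto it is usually enough to speed up inference. Continuing from the sketch above:

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
inputs = {k: v.to(device) for k, v in inputs.items()}  # reuses the processor output from the sketch above

with torch.no_grad():
    outputs = model(**inputs)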
License
The X-CLIP model is released under the MIT License, which permits reuse, modification, and distribution provided the copyright and license notice are retained.