microsoft/xclip-base-patch32-16-frames
Introduction
The X-CLIP model, developed by Microsoft, is a base-sized model designed for video classification tasks. It extends the CLIP framework to handle general video-language understanding, using a contrastive training approach on video-text pairs.
Architecture
X-CLIP is a minimal extension of CLIP, intended for video-language tasks like zero-shot and few-shot video classification, as well as video-text retrieval. The model processes video inputs using a patch resolution of 32, with 16 frames per video at a resolution of 224x224 pixels.
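To make these hyperparameters concrete, the snippet below inspects the model configuration after downloading it. This is a minimal sketch: the attribute names (patch_size, num_frames, image_size) are assumptions based on the Transformers X-CLIP vision config and should be checked against your installed version.

from transformers import XCLIPConfig

# Load only the configuration (no weights). The attribute names below are
# assumptions based on the Transformers X-CLIP vision config.
config = XCLIPConfig.from_pretrained("microsoft/xclip-base-patch32-16-frames")
print(config.vision_config.patch_size)   # expected: 32
print(config.vision_config.num_frames)   # expected: 16
print(config.vision_config.image_size)   # expected: 224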
Training
The model was trained in a fully supervised manner on Kinetics-400, a widely used video classification benchmark. Preprocessing during training included resizing frames, center cropping to 224x224, and normalizing with the ImageNet mean and standard deviation.
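As a rough illustration of those preprocessing steps, the torchvision pipeline below mirrors them. It is a sketch rather than the exact training code: resizing the shorter side to 256 before the 224x224 center crop is an assumption, and at inference time XCLIPProcessor applies equivalent steps for you.

from torchvision import transforms

# ImageNet channel statistics, as referenced above.
IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]

# Sketch of the described pipeline: resize, center crop to 224x224, normalize.
# Resizing the shorter side to 256 before cropping is an assumption for illustration.
frame_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(IMAGENET_MEAN, IMAGENET_STD),
])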
Guide: Running Locally
To run the X-CLIP model locally, follow these steps:
- Install Dependencies: Ensure that you have Python installed, along with PyTorch and the Hugging Face Transformers library.
  pip install torch transformers
- Download the Model: Use the Hugging Face model hub to download the X-CLIP model; the first from_pretrained call below downloads and caches the weights.
- Load the Model in Your Script:
  from transformers import XCLIPModel, XCLIPProcessor
  model = XCLIPModel.from_pretrained("microsoft/xclip-base-patch32-16-frames")
  processor = XCLIPProcessor.from_pretrained("microsoft/xclip-base-patch32-16-frames")
- Prepare Your Data: Sample 16 frames per video and follow the preprocessing steps outlined in the repository; the processor applies the resizing, cropping, and normalization described above (see the end-to-end sketch after this list).
- Inference: Use the model and processor to run inference on your data (see the end-to-end sketch after this list).
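The following end-to-end sketch ties the steps together. It is illustrative rather than definitive: it assumes the decord library is installed (pip install decord) for frame extraction, uses a hypothetical local file video.mp4 and a hypothetical set of candidate labels, and follows the standard Transformers API for X-CLIP.

import numpy as np
import torch
from decord import VideoReader  # assumption: decord installed for video decoding
from transformers import XCLIPModel, XCLIPProcessor

model_id = "microsoft/xclip-base-patch32-16-frames"
model = XCLIPModel.from_pretrained(model_id)
processor = XCLIPProcessor.from_pretrained(model_id)

# Read the video and sample 16 evenly spaced frames ("video.mp4" is a hypothetical path).
vr = VideoReader("video.mp4")
indices = np.linspace(0, len(vr) - 1, num=16).astype(int)
frames = vr.get_batch(indices).asnumpy()  # (16, H, W, 3) uint8 array

# Candidate labels are hypothetical; X-CLIP scores the video against each text prompt.
labels = ["playing guitar", "riding a bike", "cooking"]
inputs = processor(text=labels, videos=list(frames), return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# logits_per_video has shape (num_videos, num_texts); softmax gives label probabilities.
probs = outputs.logits_per_video.softmax(dim=1)
print({label: float(p) for label, p in zip(labels, probs[0])})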
For optimal performance, especially with large datasets or real-time processing, consider using cloud GPUs such as those provided by AWS, Google Cloud, or Azure.
License
The X-CLIP model is released under the MIT license, allowing for wide usage and modification.