microsoft/xclip-base-patch32-16-frames

Introduction

The X-CLIP model, developed by Microsoft, is a base-sized model designed for video classification tasks. It extends the CLIP framework to handle general video-language understanding, using a contrastive training approach on video-text pairs.

Architecture

X-CLIP is a minimal extension of CLIP intended for video-language tasks such as zero-shot, few-shot, and fully-supervised video classification, as well as video-text retrieval. This checkpoint uses a Vision Transformer with a patch size of 32 and processes 16 frames per video at a resolution of 224x224 pixels.
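These settings can also be read from the checkpoint's configuration. The sketch below assumes the attribute names exposed by the Transformers XCLIPVisionConfig (patch_size, num_frames, image_size); treat it as an illustration rather than a guaranteed interface.

    from transformers import XCLIPConfig
    
    # Inspect the vision settings of this checkpoint (attribute names assume
    # the Transformers XCLIPVisionConfig layout).
    config = XCLIPConfig.from_pretrained("microsoft/xclip-base-patch32-16-frames")
    print(config.vision_config.patch_size)   # expected: 32
    print(config.vision_config.num_frames)   # expected: 16
    print(config.vision_config.image_size)   # expected: 224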

Training

The model was trained in a fully-supervised fashion on the Kinetics-400 dataset, a standard benchmark for video classification. Preprocessing during training and inference consists of resizing video frames, center cropping them to a fixed 224x224 resolution, and normalizing the RGB channels with the ImageNet mean and standard deviation; a rough per-frame equivalent is sketched below.
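For reference, the following is a rough per-frame sketch of that preprocessing written with torchvision (assumed to be installed). The resize target is an assumption on my part; in practice the XCLIPProcessor used in the guide below applies the correct pipeline for you.

    from torchvision import transforms
    
    # Rough per-frame sketch of the preprocessing described above.
    # The resize target is an assumption; XCLIPProcessor handles the real pipeline.
    frame_transform = transforms.Compose([
        transforms.Resize(224),                  # resize the shorter side
        transforms.CenterCrop(224),              # crop to a fixed 224x224 resolution
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],   # ImageNet mean
                             std=[0.229, 0.224, 0.225]),   # ImageNet std
    ])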

Guide: Running Locally

To run the X-CLIP model locally, follow these steps:

  1. Install Dependencies: Ensure that you have Python installed, along with PyTorch and the Hugging Face Transformers library.

    pip install torch transformers
    
  2. Download the Model: The checkpoint is hosted on the Hugging Face Hub and is downloaded and cached automatically the first time you call from_pretrained in the next step.

  3. Load the Model in Your Script:

    from transformers import XCLIPModel, XCLIPProcessor
    
    model = XCLIPModel.from_pretrained("microsoft/xclip-base-patch32-16-frames")
    processor = XCLIPProcessor.from_pretrained("microsoft/xclip-base-patch32-16-frames")
    
  4. Prepare Your Data: Decode your video and sample 16 frames per clip; the processor handles resizing, center cropping, and normalization as described in the Training section.

  5. Inference: Use the model and processor to run inference on your data; a minimal sketch follows this list.
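The sketch below ties steps 3-5 together. The model identifier comes from this card; the random frames stand in for a real decoded clip, and the text prompts are purely illustrative, so adapt both to your own data.

    import numpy as np
    import torch
    from transformers import XCLIPModel, XCLIPProcessor
    
    model_name = "microsoft/xclip-base-patch32-16-frames"
    processor = XCLIPProcessor.from_pretrained(model_name)
    model = XCLIPModel.from_pretrained(model_name)
    
    # Stand-in clip: 16 RGB frames of 224x224; replace with frames decoded
    # from your own video (e.g. with decord or PyAV).
    video = list(np.random.randint(0, 256, (16, 224, 224, 3), dtype=np.uint8))
    
    # Illustrative label prompts; choose ones that match your use case.
    texts = ["playing guitar", "riding a bike", "cooking"]
    
    inputs = processor(text=texts, videos=video, return_tensors="pt", padding=True)
    
    with torch.no_grad():
        outputs = model(**inputs)
    
    # Video-to-text similarity scores, softmaxed into pseudo-probabilities.
    probs = outputs.logits_per_video.softmax(dim=1)
    print(dict(zip(texts, probs[0].tolist())))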

For optimal performance, especially with large datasets or real-time processing, consider using cloud GPUs such as those provided by AWS, Google Cloud, or Azure.

License

The X-CLIP model is released under the MIT license, allowing for wide usage and modification.
