X-CLIP large-patch14

Microsoft

Introduction

The X-CLIP large-sized model is a video classification model developed by Microsoft for general video-language understanding. It is a minimal extension of CLIP, trained contrastively on (video, text) pairs from the Kinetics-400 dataset, and is suited to zero-shot, few-shot, and fully supervised video classification as well as video-text retrieval.

Architecture

The X-CLIP model uses a patch resolution of 14 and processes 8 frames per video at a resolution of 224x224. Its video and text encoders are trained contrastively so that matching video-text pairs are aligned in a shared embedding space, which is what enables zero-shot classification and video-text retrieval.
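
As a rough sketch, these dimensions can be read off the checkpoint's configuration with the Hugging Face Transformers library. The attribute names below (vision_config.patch_size, image_size, num_frames) are an assumption about the current XCLIPConfig layout and may differ across library versions.

    from transformers import XCLIPConfig

    # Inspect the vision configuration of the large patch-14 checkpoint.
    config = XCLIPConfig.from_pretrained("microsoft/xclip-large-patch14")
    vision = config.vision_config

    print("patch size:", vision.patch_size)        # expected: 14
    print("input resolution:", vision.image_size)  # expected: 224
    print("frames per clip:", vision.num_frames)   # expected: 8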

Training

The model was trained on the Kinetics-400 dataset, which comprises a wide range of video clips for action recognition. During training, video frames are resized, center-cropped to 224x224, and normalized using ImageNet's mean and standard deviation.
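
A minimal per-frame sketch of this preprocessing is shown below using torchvision. The ImageNet statistics are taken from the description above; the exact resize size and normalization values used by the released checkpoint should be verified against its processor configuration.

    import torch
    from torchvision import transforms

    # Resize, center-crop to 224x224, and normalize with ImageNet statistics
    # (assumed here; verify against the checkpoint's processor config).
    frame_transform = transforms.Compose([
        transforms.Resize(224),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225]),
    ])

    def preprocess_clip(frames):
        """Apply the transform to a list of 8 PIL frames and stack them
        into a (num_frames, channels, height, width) tensor."""
        return torch.stack([frame_transform(f) for f in frames])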

Guide: Running Locally

To run X-CLIP locally, follow these steps:

  1. Install Dependencies: Ensure you have Python and PyTorch installed, along with the Hugging Face Transformers library.

    pip install torch transformers
    
  2. Download the Model: Access the model from the Hugging Face model hub.

  3. Prepare Data: Decode your video and sample frames to match the training specifications (8 frames per clip, resized and center-cropped to 224x224).

  4. Run Inference: Use the Transformers library to load the model and run it on your video data, as in the sketch below.
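
The following sketch shows one way to run zero-shot classification with the Transformers API. It assumes the clip has already been decoded into 8 RGB frames as NumPy arrays (for example with decord or PyAV); the random frames and the candidate labels are placeholders.

    import numpy as np
    import torch
    from transformers import XCLIPProcessor, XCLIPModel

    model_id = "microsoft/xclip-large-patch14"
    processor = XCLIPProcessor.from_pretrained(model_id)
    model = XCLIPModel.from_pretrained(model_id)

    # 8 decoded RGB frames (height x width x 3); dummy data stands in for a real clip.
    video = [np.random.randint(0, 256, (360, 640, 3), dtype=np.uint8) for _ in range(8)]

    # Candidate labels for zero-shot classification (placeholders).
    labels = ["playing basketball", "cooking", "playing guitar"]

    inputs = processor(text=labels, videos=video, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)

    # Similarity scores between the video and each text prompt.
    probs = outputs.logits_per_video.softmax(dim=1)
    for label, p in zip(labels, probs[0].tolist()):
        print(f"{label}: {p:.3f}")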

For optimal performance, especially with large models, consider using cloud GPU services such as AWS, Google Cloud, or Azure.

License

The X-CLIP model is released under the MIT License, allowing for wide usage and modification with minimal restrictions.
