X-CLIP base patch16

Microsoft

Introduction

X-CLIP is a video classification model that extends CLIP to video-language understanding. It supports zero-shot, few-shot, and fully supervised video classification as well as video-text retrieval. This base-sized checkpoint uses a patch resolution of 16 and was trained on the Kinetics-400 dataset.

Architecture

The X-CLIP model extends the CLIP architecture to video data and is trained contrastively on (video, text) pairs. Because videos and free-form text are embedded in a shared space, the same checkpoint can score arbitrary label prompts for classification or match videos against captions for retrieval. This model processes 8 frames per video at a resolution of 224x224.
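
To make the contrastive design concrete, the sketch below embeds a clip and two candidate captions separately and scores them with cosine similarity. It is a minimal sketch using the Transformers API: the random pixel values stand in for a real preprocessed clip of 8 frames at 224x224, and the captions are illustrative only.

    import torch
    from transformers import XCLIPModel, XCLIPProcessor

    model = XCLIPModel.from_pretrained("microsoft/xclip-base-patch16")
    processor = XCLIPProcessor.from_pretrained("microsoft/xclip-base-patch16")

    # Placeholder clip: 1 video, 8 frames, 3 channels, 224x224 (random values stand in for real frames).
    pixel_values = torch.randn(1, 8, 3, 224, 224)
    text_inputs = processor(text=["a person dancing", "a dog running"], return_tensors="pt", padding=True)

    with torch.no_grad():
        video_emb = model.get_video_features(pixel_values=pixel_values)
        text_emb = model.get_text_features(**text_inputs)

    # Cosine similarity between the video embedding and each caption embedding.
    video_emb = video_emb / video_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    print(video_emb @ text_emb.T)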

Training

X-CLIP was trained on Kinetics-400, a large-scale video dataset. During preprocessing, each frame is resized along its shorter edge, center-cropped to 224x224, and normalized with the ImageNet mean and standard deviation. The model achieves a top-1 accuracy of 83.8% and a top-5 accuracy of 95.7%.
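
As a sketch of these preprocessing steps with torchvision (an assumption for illustration; the XCLIPProcessor used in the guide below applies the checkpoint's stored preprocessing automatically), the per-frame transforms might look like this:

    import torch
    from torchvision import transforms

    # Mirror the description above: resize the shorter edge (224 assumed here),
    # center-crop to 224x224, and normalize with ImageNet statistics.
    preprocess = transforms.Compose([
        transforms.Resize(224),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225]),
    ])

    # frames: a hypothetical list of 8 PIL images sampled from a clip.
    # clip = torch.stack([preprocess(f) for f in frames])  # shape (8, 3, 224, 224)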

Guide: Running Locally

  1. Setup Environment: Ensure you have Python and PyTorch installed. Install the Hugging Face Transformers library.
    pip install transformers
    
  2. Load Model: Use the Transformers library to load the X-CLIP model.
    from transformers import XCLIPModel, XCLIPProcessor
    
    model = XCLIPModel.from_pretrained("microsoft/xclip-base-patch16")
    processor = XCLIPProcessor.from_pretrained("microsoft/xclip-base-patch16")
    
  3. Prepare Data: Sample 8 frames per clip and resize/normalize them following the training preprocessing; the processor applies the checkpoint's preprocessing for you (see the sketch after this list).
  4. Inference: Run the model for video classification or video-text retrieval, as in the end-to-end sketch below.
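
The end-to-end sketch below ties steps 2-4 together for zero-shot classification. The random frames are placeholders for a real video (any decoder such as decord, PyAV, or OpenCV can supply 8 RGB frames), and the candidate labels are illustrative:

    import numpy as np
    import torch
    from transformers import XCLIPModel, XCLIPProcessor

    model = XCLIPModel.from_pretrained("microsoft/xclip-base-patch16")
    processor = XCLIPProcessor.from_pretrained("microsoft/xclip-base-patch16")

    # 8 RGB frames of shape (H, W, 3); random arrays stand in for real video frames.
    video = [np.random.randint(0, 256, (360, 640, 3), dtype=np.uint8) for _ in range(8)]
    candidate_labels = ["playing guitar", "riding a bike", "cooking"]  # example label prompts

    inputs = processor(text=candidate_labels, videos=[video], return_tensors="pt", padding=True)

    with torch.no_grad():
        outputs = model(**inputs)

    # logits_per_video has shape (num_videos, num_texts); softmax gives per-label probabilities.
    probs = outputs.logits_per_video.softmax(dim=1)
    for label, p in zip(candidate_labels, probs[0].tolist()):
        print(f"{label}: {p:.3f}")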

Cloud GPUs: Consider using cloud services like AWS, Google Cloud, or Azure for access to GPUs, which can accelerate model inference and training.
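
Continuing the sketch above, standard PyTorch device placement is enough to run inference on a GPU when one is available:

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device)                               # model loaded in step 2
    inputs = {k: v.to(device) for k, v in inputs.items()}  # inputs from the sketch above
    with torch.no_grad():
        outputs = model(**inputs)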

License

The X-CLIP model is released under the MIT License, which permits reuse and modification provided the copyright and license notice are retained.
