VidTok
Introduction
VidTok is a state-of-the-art family of video tokenizers developed by Microsoft, offering both continuous and discrete tokenization at a range of compression rates. It outperforms existing video tokenizers by combining an efficient architecture, advanced quantization, and improved training techniques.
Architecture
VidTok's architecture features:
- Efficient Architecture: Spatial and temporal sampling are separated to maintain quality while reducing computational complexity.
- Advanced Quantization: Finite Scalar Quantization (FSQ) is used to mitigate training instability and codebook collapse; a minimal sketch of the FSQ idea appears after this list.
- Enhanced Training: A two-stage training strategy involving pre-training on low-resolution videos followed by fine-tuning on high-resolution ones, allowing better motion dynamics representation.
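To make the quantization step concrete, below is a minimal PyTorch sketch of the FSQ idea. It is not VidTok's actual implementation; the `fsq` helper, the channel count, and the level choices are illustrative. Each latent channel is squashed into a small fixed range and rounded to one of a few integer values, so the codebook is implicit, and a straight-through estimator keeps the rounding step differentiable.

```python
import torch

def round_ste(z: torch.Tensor) -> torch.Tensor:
    """Round to the nearest integer with a straight-through gradient."""
    return z + (z.round() - z).detach()

def fsq(z: torch.Tensor, levels: list[int]) -> torch.Tensor:
    """Finite Scalar Quantization over the last dimension of `z`.

    Channel i is bounded to (-(levels[i]-1)/2, +(levels[i]-1)/2) and rounded,
    giving prod(levels) implicit codes with no learned codebook. Odd level
    counts keep this sketch simple (even counts need a half-step offset in
    the full formulation).
    """
    lv = torch.tensor(levels, dtype=z.dtype, device=z.device)
    half = (lv - 1) / 2
    bounded = torch.tanh(z) * half      # squash each channel into its range
    return round_ste(bounded) / half    # quantize, then rescale to [-1, 1]

# Example: 4 latent channels with 7 * 7 * 7 * 5 = 1715 implicit codes.
z = torch.randn(2, 16, 4, requires_grad=True)   # (batch, tokens, channels)
z_q = fsq(z, levels=[7, 7, 7, 5])
z_q.sum().backward()                             # gradients flow through the STE
```

Because the quantizer has no learned embedding table, there are no unused codebook entries to re-initialize, which is the sense in which FSQ sidesteps codebook collapse.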
Training
Training Data
VidTok's training data comprises:
- Training Set 1: Around 400K low-resolution videos (e.g., 480p) with diverse lighting, motions, and scenarios.
- Training Set 2: About 10K high-resolution videos (e.g., 1080p) featuring varied lighting, motions, and scenarios.
Training Procedure
Detailed training instructions can be found in the VidTok paper and GitHub repository.
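As a rough schematic of the two-stage schedule mentioned above, the loop below pre-trains on low-resolution clips and then fine-tunes on high-resolution ones. It is a generic sketch rather than VidTok's training code: the dataset objects, the model stand-in, and the plain L1 reconstruction loss are assumptions in place of the full objective described in the paper.

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, Dataset

def train_stage(model: torch.nn.Module, data: Dataset, lr: float,
                batch_size: int, epochs: int, device: str = "cuda") -> None:
    """One training stage using a reconstruction-only stand-in objective."""
    loader = DataLoader(data, batch_size=batch_size, shuffle=True)
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train().to(device)
    for _ in range(epochs):
        for clip in loader:                    # clip: (B, C, T, H, W) in [-1, 1]
            clip = clip.to(device)
            recon = model(clip)                # assumed autoencoder forward pass
            loss = F.l1_loss(recon, clip)
            opt.zero_grad()
            loss.backward()
            opt.step()

# Stage 1: pre-train on low-resolution clips (hypothetical dataset objects).
# train_stage(model, low_res_videos, lr=1e-4, batch_size=8, epochs=2)
# Stage 2: fine-tune on high-resolution clips at a lower learning rate.
# train_stage(model, high_res_videos, lr=2e-5, batch_size=2, epochs=1)
```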
Guide: Running Locally
To run VidTok locally, follow these steps:
- Clone the GitHub Repository: Run git clone https://github.com/microsoft/VidTok to obtain the code.
- Install Dependencies: Install the required libraries and packages as described in the repository's setup instructions.
- Run the Model: Execute the scripts provided in the repository to perform video tokenization tasks; a hypothetical inference sketch follows this list.
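The snippet below sketches what an inference call might look like once the repository is installed. It is hypothetical: the `load_model_from_config` helper, the config and checkpoint paths, the `encode`/`decode` methods, and the (B, C, T, H, W) tensor layout are assumptions made for illustration; the repository's own inference scripts are the authoritative entry points.

```python
import torch

# Hypothetical loader; use the entry point provided by the repository's scripts.
# from scripts.inference import load_model_from_config

def reconstruct(model: torch.nn.Module, video: torch.Tensor) -> torch.Tensor:
    """Encode a clip to tokens and decode it back (assumed encode/decode API).

    `video` is assumed to be shaped (batch, channels, frames, height, width)
    and normalized to [-1, 1]; verify the expected layout in the repository.
    """
    with torch.no_grad():
        tokens = model.encode(video)   # continuous latents or discrete FSQ codes
        return model.decode(tokens)

# Usage (paths and shapes are placeholders):
# model = load_model_from_config("configs/<config>.yaml", "checkpoints/<ckpt>.ckpt").eval()
# clip = torch.rand(1, 3, 16, 256, 256) * 2 - 1    # dummy clip in [-1, 1]
# recon = reconstruct(model, clip)
```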
For enhanced performance, especially when dealing with large datasets, consider using cloud GPUs from providers like AWS, Google Cloud, or Azure.
License
VidTok is released under the MIT License, which allows broad freedom to use, modify, and distribute the code. The full license text is available in the GitHub repository.