VidTok
Introduction
VidTok is a state-of-the-art family of video tokenizers developed by Microsoft, offering both continuous and discrete tokenization at a range of compression rates. It outperforms existing video tokenizers by combining an efficient architecture, advanced quantization, and improved training techniques.
Architecture
VidTok's architecture features:
- Efficient Architecture: Spatial and temporal sampling are separated to maintain quality while reducing computational complexity.
- Advanced Quantization: Finite Scalar Quantization (FSQ) is used to mitigate training instability and codebook collapse; a minimal sketch of the FSQ idea appears after this list.
- Enhanced Training: A two-stage training strategy involving pre-training on low-resolution videos followed by fine-tuning on high-resolution ones, allowing better motion dynamics representation.
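To make the quantization step concrete, below is a minimal PyTorch sketch of the FSQ idea. It is not VidTok's actual implementation; the `fsq` helper, the channel count, and the level choices are illustrative. Each latent channel is squashed into a small fixed range and rounded to one of a few integer values, so the codebook is implicit, and a straight-through estimator keeps the rounding step differentiable.

```python
import torch

def round_ste(z: torch.Tensor) -> torch.Tensor:
    """Round to the nearest integer with a straight-through gradient."""
    return z + (z.round() - z).detach()

def fsq(z: torch.Tensor, levels: list[int]) -> torch.Tensor:
    """Finite Scalar Quantization over the last dimension of `z`.

    Channel i is bounded to (-(levels[i]-1)/2, +(levels[i]-1)/2) and rounded,
    giving prod(levels) implicit codes with no learned codebook. Odd level
    counts keep this sketch simple (even counts need a half-step offset in
    the full formulation).
    """
    lv = torch.tensor(levels, dtype=z.dtype, device=z.device)
    half = (lv - 1) / 2
    bounded = torch.tanh(z) * half      # squash each channel into its range
    return round_ste(bounded) / half    # quantize, then rescale to [-1, 1]

# Example: 4 latent channels with 7 * 7 * 7 * 5 = 1715 implicit codes.
z = torch.randn(2, 16, 4, requires_grad=True)   # (batch, tokens, channels)
z_q = fsq(z, levels=[7, 7, 7, 5])
z_q.sum().backward()                             # gradients flow through the STE
```

Because the quantizer has no learned embedding table, there are no unused codebook entries to re-initialize, which is the sense in which FSQ sidesteps codebook collapse.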
Training
Training Data
VidTok's training data comprises:
- Training Set 1: Around 400K low-resolution videos (e.g., 480p) with diverse lighting, motions, and scenarios.
- Training Set 2: About 10K high-resolution videos (e.g., 1080p) featuring varied lighting, motions, and scenarios.
Training Procedure
Detailed training instructions can be found in the VidTok paper and GitHub repository.
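As a rough schematic of the two-stage schedule mentioned above, the loop below pre-trains on low-resolution clips and then fine-tunes on high-resolution ones. It is a generic sketch rather than VidTok's training code: the dataset objects, the model stand-in, and the plain L1 reconstruction loss are assumptions in place of the full objective described in the paper.

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, Dataset

def train_stage(model: torch.nn.Module, data: Dataset, lr: float,
                batch_size: int, epochs: int, device: str = "cuda") -> None:
    """One training stage using a reconstruction-only stand-in objective."""
    loader = DataLoader(data, batch_size=batch_size, shuffle=True)
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train().to(device)
    for _ in range(epochs):
        for clip in loader:                    # clip: (B, C, T, H, W) in [-1, 1]
            clip = clip.to(device)
            recon = model(clip)                # assumed autoencoder forward pass
            loss = F.l1_loss(recon, clip)
            opt.zero_grad()
            loss.backward()
            opt.step()

# Stage 1: pre-train on low-resolution clips (hypothetical dataset objects).
# train_stage(model, low_res_videos, lr=1e-4, batch_size=8, epochs=2)
# Stage 2: fine-tune on high-resolution clips at a lower learning rate.
# train_stage(model, high_res_videos, lr=2e-5, batch_size=2, epochs=1)
```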
Guide: Running Locally
To run VidTok locally, follow these steps:
- Clone the GitHub Repository: Run git clone https://github.com/microsoft/VidTok to obtain the code.
- Install Dependencies: Install the required libraries and packages as described in the repository's setup instructions.
- Run the Model: Execute the scripts provided in the repository to perform video tokenization tasks; a hypothetical inference sketch follows this list.
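The snippet below sketches what an inference call might look like once the repository is installed. It is hypothetical: the `load_model_from_config` helper, the config and checkpoint paths, the `encode`/`decode` methods, and the (B, C, T, H, W) tensor layout are assumptions made for illustration; the repository's own inference scripts are the authoritative entry points.

```python
import torch

# Hypothetical loader; use the entry point provided by the repository's scripts.
# from scripts.inference import load_model_from_config

def reconstruct(model: torch.nn.Module, video: torch.Tensor) -> torch.Tensor:
    """Encode a clip to tokens and decode it back (assumed encode/decode API).

    `video` is assumed to be shaped (batch, channels, frames, height, width)
    and normalized to [-1, 1]; verify the expected layout in the repository.
    """
    with torch.no_grad():
        tokens = model.encode(video)   # continuous latents or discrete FSQ codes
        return model.decode(tokens)

# Usage (paths and shapes are placeholders):
# model = load_model_from_config("configs/<config>.yaml", "checkpoints/<ckpt>.ckpt").eval()
# clip = torch.rand(1, 3, 16, 256, 256) * 2 - 1    # dummy clip in [-1, 1]
# recon = reconstruct(model, clip)
```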
For enhanced performance, especially when dealing with large datasets, consider using cloud GPUs from providers like AWS, Google Cloud, or Azure.
License
VidTok is released under the MIT License, which allows broad freedom to use, modify, and distribute the code. The full license text is available in the GitHub repository.