CogVideoX1.5-5B-SAT

THUDM

Introduction
CogVideoX1.5-5B-SAT is an open-source video generation model developed by the Knowledge Engineering Group (THUDM) at Tsinghua University. It is an enhanced version of CogVideoX that supports 10-second videos and higher resolutions. This release provides the SAT (SwissArmyTransformer) weights and, through the CogVideoX1.5-5B-I2V variant, supports image-to-video generation at various resolutions.

Architecture
The CogVideoX1.5-5B-SAT model comprises several key components:

  • Transformer: Includes weights for both Image-to-Video (I2V) and Text-to-Video (T2V) models.
    • transformer_i2v and transformer_t2v directories contain model state files.
  • VAE: Consistent with the CogVideoX-5B series, requiring no updates. It includes the 3d-vae.pt module.
  • Text Encoder: Matches the diffusers version of CogVideoX-5B, requiring no updates. It includes various configuration and model files necessary for text encoding.
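The component list above implies a particular on-disk layout for the downloaded checkpoint. As a minimal sketch, a small helper can check that the expected entries are present before attempting inference. The names `transformer_i2v`, `transformer_t2v`, and `3d-vae.pt` come from the model card; the `vae/` and `text_encoder/` directory names are assumptions about how the files are arranged:

```python
from pathlib import Path

# Expected checkpoint entries, per the architecture section above.
# Only transformer_i2v, transformer_t2v, and 3d-vae.pt are named in the
# model card; the enclosing vae/ and text_encoder/ paths are assumed.
EXPECTED = [
    "transformer_i2v",   # Image-to-Video transformer weights
    "transformer_t2v",   # Text-to-Video transformer weights
    "vae/3d-vae.pt",     # 3D VAE module, shared with CogVideoX-5B
    "text_encoder",      # text encoder, matches the diffusers CogVideoX-5B
]


def missing_components(root: str, expected=EXPECTED) -> list[str]:
    """Return the expected checkpoint entries absent under `root`."""
    base = Path(root)
    return [name for name in expected if not (base / name).exists()]
```

Running `missing_components("/path/to/CogVideoX1.5-5B-SAT")` returns an empty list when all components are in place, which makes a convenient pre-flight check in an inference script.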

Training
The model card does not describe the training procedure; details are covered in the associated academic paper on arXiv.

Guide: Running Locally

  1. Clone the Repository: Clone the CogVideoX1.5-5B-SAT repository from Hugging Face.
  2. Install Dependencies: Ensure the required libraries and frameworks are installed; for the SAT weights these typically include PyTorch, SwissArmyTransformer, and Hugging Face Transformers.
  3. Download Model Weights: Select and download the appropriate model weights for I2V or T2V as needed.
  4. Inference: Use the downloaded weights to perform inference. Follow the repository's instructions for running the model with your inputs.
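Steps 1 and 3 above can be sketched with the `huggingface_hub` client, fetching only the weights needed for the chosen task. The `repo_id` is the repository named in this card, but the sub-directory glob patterns (`transformer_i2v`, `transformer_t2v`, `vae`, `text_encoder`) are assumptions based on the architecture section, not a confirmed repository layout:

```python
def weight_patterns(mode: str) -> list[str]:
    """Glob patterns selecting only the weights a given mode needs.

    The transformer_i2v / transformer_t2v directory names come from the
    model card; the shared VAE and text encoder are always required.
    """
    if mode not in ("i2v", "t2v"):
        raise ValueError("mode must be 'i2v' or 't2v'")
    return [f"transformer_{mode}/*", "vae/*", "text_encoder/*"]


if __name__ == "__main__":
    # Downloads several GB and needs network access; run deliberately.
    from huggingface_hub import snapshot_download  # pip install huggingface_hub

    snapshot_download(
        repo_id="THUDM/CogVideoX1.5-5B-SAT",
        allow_patterns=weight_patterns("i2v"),
        local_dir="CogVideoX1.5-5B-SAT",
    )
```

Restricting the download with `allow_patterns` avoids pulling both the I2V and T2V transformers when only one is needed.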

For best performance, run inference on cloud GPUs such as those offered by AWS, Google Cloud, or Azure.

License
The CogVideoX1.5-5B-SAT model is released under the CogVideoX LICENSE. Users should review the license for specific terms and conditions.
