Hunyuan Video
Introduction
HunyuanVideo is a novel open-source video foundation model by Tencent, designed for large-scale video generation. It integrates data curation, joint image-video model training, and an efficient infrastructure for training and inference. Boasting over 13 billion parameters, it is the largest open-source video generative model available. HunyuanVideo aims to rival closed-source models, offering high visual quality, motion diversity, and text-video alignment.
Architecture
HunyuanVideo operates in a spatiotemporally compressed latent space produced by a Causal 3D VAE. Text prompts are encoded by a Multimodal Large Language Model (MLLM) and used as conditions for video generation. The system uses a unified image and video generative architecture: a Transformer with "Dual-stream to Single-stream" processing for effective multimodal fusion.
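To make the "Dual-stream to Single-stream" idea concrete, here is a minimal PyTorch sketch: modality-specific blocks first refine video and text tokens separately, then the concatenated sequence passes through shared blocks. The class name, dimensions, and layer counts are illustrative assumptions, not the repository's actual modules.

```python
import torch
import torch.nn as nn

class DualToSingleStreamSketch(nn.Module):
    """Sketch of dual-stream (per-modality) blocks followed by
    single-stream (joint) blocks that fuse video and text tokens."""

    def __init__(self, dim=128, heads=8, n_dual=2, n_single=2):
        super().__init__()
        make = lambda: nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.video_blocks = nn.ModuleList(make() for _ in range(n_dual))
        self.text_blocks = nn.ModuleList(make() for _ in range(n_dual))
        self.joint_blocks = nn.ModuleList(make() for _ in range(n_single))

    def forward(self, video_tokens, text_tokens):
        # Dual-stream phase: each modality is refined independently.
        for vb, tb in zip(self.video_blocks, self.text_blocks):
            video_tokens = vb(video_tokens)
            text_tokens = tb(text_tokens)
        # Single-stream phase: concatenate and fuse with self-attention.
        x = torch.cat([video_tokens, text_tokens], dim=1)
        for jb in self.joint_blocks:
            x = jb(x)
        return x[:, : video_tokens.shape[1]]  # keep the video tokens

model = DualToSingleStreamSketch()
out = model(torch.randn(1, 16, 128), torch.randn(1, 8, 128))
print(out.shape)  # torch.Size([1, 16, 128])
```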
Training
The model is trained in the latent space compressed by the Causal 3D VAE, which sharply reduces the number of video tokens and allows training at the original resolution and frame rate. The MLLM text encoder provides better image-text alignment and complex reasoning, while a prompt-rewrite model adapts user prompts to improve video generation quality.
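A back-of-the-envelope calculation shows why this matters for sequence length. The 4× temporal and 8× spatial compression ratios here are those reported for HunyuanVideo's Causal 3D VAE, while the 2×2 patchify step before the Transformer is an illustrative assumption:

```python
# Token count for a 129-frame, 720x1280 clip after VAE compression.
frames, height, width = 129, 720, 1280

# Causal 3D VAE: the first frame is encoded alone (causal temporal
# axis), the rest are compressed 4x in time and 8x in space.
t_lat = (frames - 1) // 4 + 1            # 33 latent frames
h_lat, w_lat = height // 8, width // 8   # 90 x 160 latent grid

# Assumed 2x2 spatial patchify before the Transformer (illustrative).
tokens = t_lat * (h_lat // 2) * (w_lat // 2)
print(tokens)  # 118800 tokens, versus 129 * 720 * 1280 raw pixels
```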
Model Stats
HunyuanVideo has over 13 billion parameters, making it the largest open-source model of its kind. It supports video generation at resolutions such as 720 × 1280 and 544 × 960 pixels.
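As a quick sanity check on the memory these stats imply (the 13B figure is from the model card; half-precision weights are an assumption):

```python
# Rough footprint of the transformer weights alone.
params = 13e9           # ~13 billion parameters
bytes_per_param = 2     # assuming bf16/fp16 weights
print(f"~{params * bytes_per_param / 1024**3:.0f} GB")  # ~24 GB,
# before activations, the VAE, the text encoder, and sampling buffers.
```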
Guide: Running Locally
- Clone the Repository:

  ```bash
  git clone https://github.com/tencent/HunyuanVideo
  cd HunyuanVideo
  ```
- Set Up Environment:
  - Use Conda for environment setup:

    ```bash
    conda env create -f environment.yml
    conda activate HunyuanVideo
    python -m pip install -r requirements.txt
    ```

  - Install Flash Attention v2:

    ```bash
    python -m pip install git+https://github.com/Dao-AILab/flash-attention.git@v2.5.9.post1
    ```
- Docker Option:
  - Download and load the Docker image (the filename passed to docker load matches the downloaded archive):

    ```bash
    wget https://aivideo.hunyuan.tencent.com/download/HunyuanVideo/hunyuan_video_cu12.tar
    docker load -i hunyuan_video_cu12.tar
    ```
- Inference Example (a batching sketch follows this list):

  ```bash
  python3 sample_video.py \
      --video-size 720 1280 \
      --video-length 129 \
      --infer-steps 30 \
      --prompt "a cat is running, realistic." \
      --flow-reverse \
      --seed 0 \
      --use-cpu-offload \
      --save-path ./results
  ```
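To generate several prompts or seeds in one go, a small wrapper can loop over sample_video.py. This is a hypothetical convenience script, not part of the repository; the flags simply mirror the example above:

```python
import subprocess

# Hypothetical helper: run sample_video.py once per (prompt, seed) pair.
prompts = ["a cat is running, realistic.", "a dog surfing a wave."]
for seed in (0, 1, 2):
    for prompt in prompts:
        subprocess.run(
            [
                "python3", "sample_video.py",
                "--video-size", "720", "1280",
                "--video-length", "129",
                "--infer-steps", "30",
                "--prompt", prompt,
                "--flow-reverse",
                "--seed", str(seed),
                "--use-cpu-offload",
                "--save-path", f"./results/seed{seed}",
            ],
            check=True,  # stop if any single run fails
        )
```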
Cloud GPUs
For optimal performance, use an NVIDIA GPU with at least 60GB of memory. A single 80GB GPU is recommended. Consider using cloud services like AWS or Google Cloud for access to powerful GPUs.
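Before a long run, you can confirm the visible GPU clears the memory bar; a minimal PyTorch check using the 60 GB threshold above:

```python
import torch

# Verify the GPU has enough memory before starting a generation run.
if not torch.cuda.is_available():
    raise SystemExit("No CUDA device visible.")

total_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
print(f"GPU 0: {torch.cuda.get_device_name(0)}, {total_gb:.0f} GB")
if total_gb < 60:
    print("Warning: below the ~60 GB minimum; expect OOM, or use "
          "--use-cpu-offload and a lower resolution.")
```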
License
HunyuanVideo is released under the tencent-hunyuan-community license. For detailed licensing information, refer to the LICENSE file in the repository.