HunyuanVideo
Introduction
HunyuanVideo is an open-source video foundation model for large-scale video generation. It aims to rival closed-source models through a comprehensive training framework and an infrastructure robust enough to train a model with over 13 billion parameters, with the goal of improving video quality, motion diversity, and text alignment.
Architecture
HunyuanVideo compresses videos into a spatio-temporally compressed latent space with a Causal 3D VAE, keeping the token count manageable for the diffusion model. Text is encoded with a Multimodal Large Language Model (MLLM), which improves image-text alignment and reasoning over the prompt. The generator is a "dual-stream to single-stream" hybrid Transformer: video and text tokens are first processed in separate streams and then fused in a shared stream for effective multimodal fusion.
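As a rough illustration of how this compression reduces the token count the diffusion Transformer has to attend over, the sketch below computes a latent shape for a sample clip. The compression ratios (4x in time, 8x in each spatial dimension, 16 latent channels) and the 2x2 spatial patchify step are assumptions used for illustration, not values taken from this card.

```python
def latent_shape(num_frames: int, height: int, width: int,
                 ct: int = 4, cs: int = 8, latent_channels: int = 16):
    """Estimate the Causal 3D VAE latent shape for a video clip.

    Assumes temporal compression `ct` (with the first frame kept causally)
    and spatial compression `cs`; these ratios are illustrative assumptions.
    """
    t = (num_frames - 1) // ct + 1   # causal: the first frame is encoded on its own
    h = height // cs
    w = width // cs
    return latent_channels, t, h, w


def token_count(t: int, h: int, w: int, patch: int = 2) -> int:
    """Tokens seen by the diffusion Transformer after a patch x patch spatial patchify."""
    return t * (h // patch) * (w // patch)


if __name__ == "__main__":
    c, t, h, w = latent_shape(num_frames=129, height=720, width=1280)
    print(f"latent shape: {c} x {t} x {h} x {w}")         # 16 x 33 x 90 x 160
    print(f"transformer tokens: {token_count(t, h, w)}")  # 118800
```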
Training
With over 13 billion parameters, the model is trained on a large-scale dataset with a focus on high visual quality, motion, and stable video generation. Key features include a unified image-and-video architecture, MLLM-based text encoding, and a prompt rewrite mechanism that adapts user prompts to the detailed, descriptive style the model expects (sketched below). In the developers' professional human evaluations, these features allow HunyuanVideo to outperform previous state-of-the-art models.
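The prompt rewrite step can be pictured as a preprocessing pass that expands a terse user prompt into the richer description the model was trained on. The template and `call_llm` hook below are hypothetical illustrations of that idea, not the rewrite model shipped with the repository.

```python
# Hypothetical prompt-rewrite pass: an external LLM (represented by `call_llm`)
# expands a short user prompt into a detailed, descriptive one. Neither the
# template nor the helper is part of the HunyuanVideo repository.
REWRITE_TEMPLATE = (
    "Rewrite the following video prompt into one detailed description. "
    "Keep the original subject and action, and add concrete details about "
    "appearance, motion, camera, and lighting.\n\nPrompt: {prompt}"
)


def rewrite_prompt(prompt: str, call_llm) -> str:
    """Expand a terse prompt; fall back to the original if rewriting fails."""
    try:
        rewritten = call_llm(REWRITE_TEMPLATE.format(prompt=prompt))
        return rewritten.strip() or prompt
    except Exception:
        return prompt


# Example with a trivial stand-in for the rewrite model:
print(rewrite_prompt(
    "a cat is running",
    lambda _: "A ginger cat sprints across a sunlit lawn, fur rippling, "
              "tracked by a low, fast-moving camera.",
))
```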
Guide: Running Locally
To run the HunyuanVideo model locally:
- Clone the repository:

  ```bash
  git clone https://github.com/tencent/HunyuanVideo
  cd HunyuanVideo
  ```
- Set up a Conda environment:

  ```bash
  conda env create -f environment.yml
  conda activate HunyuanVideo
  ```
- Install dependencies:

  ```bash
  python -m pip install -r requirements.txt
  python -m pip install git+https://github.com/Dao-AILab/flash-attention.git@v2.5.9.post1
  ```
- Download pretrained models: follow the download instructions provided in the repository.
- Run inference:

  ```bash
  python3 sample_video.py \
      --video-size 720 1280 \
      --video-length 129 \
      --infer-steps 30 \
      --prompt "a cat is running, realistic." \
      --flow-reverse \
      --seed 0 \
      --use-cpu-offload \
      --save-path ./results
  ```
Cloud GPUs: use an NVIDIA GPU with at least 60GB of memory; an 80GB GPU is recommended for better generation quality.
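If you prefer driving the model from Python rather than through `sample_video.py`, a minimal sketch using the Diffusers integration follows. It assumes your installed `diffusers` release ships `HunyuanVideoPipeline` and that the community weights at `hunyuanvideo-community/HunyuanVideo` are accessible; CPU offloading and VAE tiling are enabled to reduce peak GPU memory.

```python
# Minimal sketch (assumes a recent diffusers with HunyuanVideoPipeline and the
# community checkpoint "hunyuanvideo-community/HunyuanVideo").
import torch
from diffusers import HunyuanVideoPipeline, HunyuanVideoTransformer3DModel
from diffusers.utils import export_to_video

model_id = "hunyuanvideo-community/HunyuanVideo"
transformer = HunyuanVideoTransformer3DModel.from_pretrained(
    model_id, subfolder="transformer", torch_dtype=torch.bfloat16
)
pipe = HunyuanVideoPipeline.from_pretrained(
    model_id, transformer=transformer, torch_dtype=torch.float16
)
pipe.vae.enable_tiling()          # decode the latent video in tiles
pipe.enable_model_cpu_offload()   # keep only the active module on the GPU

frames = pipe(
    prompt="a cat is running, realistic.",
    height=720,
    width=1280,
    num_frames=129,
    num_inference_steps=30,
).frames[0]
export_to_video(frames, "output.mp4", fps=24)
```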
License
The HunyuanVideo model is licensed under the Tencent Hunyuan Community License. For more details, please refer to the LICENSE file provided in the repository.