HunyuanVideo HFIE (legacy, do not use)

jbilcke-hf

Introduction

HunyuanVideo is an open-source video foundation model for large-scale video generation. It aims to rival closed-source models through a comprehensive training framework and an infrastructure robust enough to train models with over 13 billion parameters, targeting high video quality, diverse motion, and strong text alignment.

Architecture

HunyuanVideo operates in a spatio-temporally compressed latent space: a Causal 3D VAE compresses videos into a compact latent representation, keeping the token count manageable for the diffusion model. Text is encoded with a Multimodal Large Language Model (MLLM), which improves image-text alignment and reasoning. The diffusion backbone is a "dual-stream to single-stream" hybrid Transformer, in which video and text tokens are first processed in separate streams and then fused in a shared stream for effective multimodal interaction.
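For intuition about the compression, the sketch below computes the latent shape such a causal 3D VAE produces for one clip, using the ratios reported for HunyuanVideo (4x temporal, 8x8 spatial, 16 latent channels); the `latent_shape` helper is illustrative, not part of the released code.

    # Illustrative only: latent-shape arithmetic for a causal 3D VAE,
    # assuming HunyuanVideo's reported ratios (ct=4, cs=8, 16 channels).
    def latent_shape(num_frames, height, width, ct=4, cs=8, channels=16):
        # A causal VAE keeps the first frame, then compresses each group
        # of ct frames, so T frames map to 1 + (T - 1) // ct latent frames.
        t = 1 + (num_frames - 1) // ct
        return (channels, t, height // cs, width // cs)

    # A 129-frame 720x1280 clip becomes a 16 x 33 x 90 x 160 latent tensor,
    # which the diffusion transformer then patchifies into tokens.
    print(latent_shape(129, 720, 1280))  # (16, 33, 90, 160)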

Training

The 13-billion-parameter model is trained on a large-scale dataset, with an emphasis on high visual quality and stable, coherent motion. Key design choices include a unified architecture for image and video generation, MLLM-based text encoding, and a prompt rewrite mechanism that adapts terse user prompts to the descriptive style seen during training. Together these enable robust training and inference, and the model outperforms state-of-the-art baselines in professional human evaluations.
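The prompt rewrite step is essentially an extra LLM pass that expands a short user prompt into the detailed style the model was trained on. Below is a minimal sketch of the idea; the `rewrite_prompt` helper, the template wording, and the `call_llm` hook are all hypothetical stand-ins, not the released implementation.

    # Hypothetical sketch of a prompt-rewrite pass. The template wording and
    # the call_llm hook are assumptions, not HunyuanVideo's actual rewriter.
    REWRITE_TEMPLATE = (
        "Rewrite the following video prompt as one detailed English "
        "description, preserving the original subject and action: {prompt}"
    )

    def rewrite_prompt(prompt: str, call_llm) -> str:
        # call_llm is any text-in/text-out LLM client supplied by the caller.
        return call_llm(REWRITE_TEMPLATE.format(prompt=prompt))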

Guide: Running Locally

To run the HunyuanVideo model locally:

  1. Clone the repository:

    git clone https://github.com/tencent/HunyuanVideo
    cd HunyuanVideo
    
  2. Set up a Conda environment:

    conda env create -f environment.yml
    conda activate HunyuanVideo
    
  3. Install dependencies:

    python -m pip install -r requirements.txt
    python -m pip install git+https://github.com/Dao-AILab/flash-attention.git@v2.5.9.post1
    
  4. Download pretrained models: follow the download instructions in the HunyuanVideo repository.

  5. Run inference (a diffusers-based alternative is sketched after this list):

    python3 sample_video.py --video-size 720 1280 --video-length 129 --infer-steps 30 --prompt "a cat is running, realistic." --flow-reverse --seed 0 --use-cpu-offload --save-path ./results
    
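As an alternative to the CLI above, HunyuanVideo is also available through the Hugging Face diffusers integration. The sketch below assumes diffusers >= 0.32 and the community-hosted checkpoint hunyuanvideo-community/HunyuanVideo, and uses a reduced resolution so it fits on smaller GPUs:

    import torch
    from diffusers import HunyuanVideoPipeline, HunyuanVideoTransformer3DModel
    from diffusers.utils import export_to_video

    # Assumes diffusers >= 0.32 and the community-hosted checkpoint.
    model_id = "hunyuanvideo-community/HunyuanVideo"
    transformer = HunyuanVideoTransformer3DModel.from_pretrained(
        model_id, subfolder="transformer", torch_dtype=torch.bfloat16
    )
    pipe = HunyuanVideoPipeline.from_pretrained(
        model_id, transformer=transformer, torch_dtype=torch.float16
    )
    pipe.vae.enable_tiling()         # decode the latent video in tiles
    pipe.enable_model_cpu_offload()  # keep only the active module on the GPU

    video = pipe(
        prompt="a cat is running, realistic.",
        height=320,
        width=512,
        num_frames=61,
        num_inference_steps=30,
    ).frames[0]
    export_to_video(video, "output.mp4", fps=15)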

Cloud GPUs: an NVIDIA GPU with at least 60 GB of memory is required for 720p generation; an 80 GB GPU is recommended for the best quality.
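As a quick pre-flight check, you can confirm the visible GPU meets that memory bar before starting a long run; this snippet is a convenience sketch, not part of the repository:

    import torch

    # Convenience check against the reported ~60 GB minimum (80 GB recommended).
    props = torch.cuda.get_device_properties(0)
    total_gb = props.total_memory / 1024**3
    print(f"{props.name}: {total_gb:.0f} GB")
    if total_gb < 60:
        print("Warning: below the 60 GB minimum for 720p; expect OOM, "
              "or reduce --video-size / --video-length.")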

License

The HunyuanVideo model is licensed under the Tencent Hunyuan Community License. For more details, please refer to the LICENSE file provided in the repository.
