HunyuanVideo HFIE (legacy, do not use)

jbilcke-hf

Introduction

HunyuanVideo is an open-source video foundation model for large-scale video generation. It aims to rival closed-source models through a comprehensive training framework and an infrastructure robust enough to train models with over 13 billion parameters, targeting high video quality, diverse motion, and strong text alignment.

Architecture

HunyuanVideo operates in a spatio-temporally compressed latent space: a Causal 3D VAE compresses videos into a compact latent representation, keeping the token count manageable for the diffusion model. Text is encoded with a Multimodal Large Language Model (MLLM), which improves image-text alignment and reasoning. The diffusion backbone is a "dual-stream to single-stream" hybrid Transformer, in which video and text tokens are first processed in separate streams and then fused in a shared stream for effective multimodal interaction.
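For intuition about the compression, the sketch below computes the latent shape such a causal 3D VAE produces for one clip, using the ratios reported for HunyuanVideo (4x temporal, 8x8 spatial, 16 latent channels); the `latent_shape` helper is illustrative, not part of the released code.

    # Illustrative only: latent-shape arithmetic for a causal 3D VAE,
    # assuming HunyuanVideo's reported ratios (ct=4, cs=8, 16 channels).
    def latent_shape(num_frames, height, width, ct=4, cs=8, channels=16):
        # A causal VAE keeps the first frame, then compresses each group
        # of ct frames, so T frames map to 1 + (T - 1) // ct latent frames.
        t = 1 + (num_frames - 1) // ct
        return (channels, t, height // cs, width // cs)

    # A 129-frame 720x1280 clip becomes a 16 x 33 x 90 x 160 latent tensor,
    # which the diffusion transformer then patchifies into tokens.
    print(latent_shape(129, 720, 1280))  # (16, 33, 90, 160)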

Training

The 13-billion-parameter model is trained on a large-scale dataset, with an emphasis on high visual quality and stable, coherent motion. Key design choices include a unified architecture for image and video generation, MLLM-based text encoding, and a prompt rewrite mechanism that adapts terse user prompts to the descriptive style seen during training. Together these enable robust training and inference, and the model outperforms state-of-the-art baselines in professional human evaluations.
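The prompt rewrite step is essentially an extra LLM pass that expands a short user prompt into the detailed style the model was trained on. Below is a minimal sketch of the idea; the `rewrite_prompt` helper, the template wording, and the `call_llm` hook are all hypothetical stand-ins, not the released implementation.

    # Hypothetical sketch of a prompt-rewrite pass. The template wording and
    # the call_llm hook are assumptions, not HunyuanVideo's actual rewriter.
    REWRITE_TEMPLATE = (
        "Rewrite the following video prompt as one detailed English "
        "description, preserving the original subject and action: {prompt}"
    )

    def rewrite_prompt(prompt: str, call_llm) -> str:
        # call_llm is any text-in/text-out LLM client supplied by the caller.
        return call_llm(REWRITE_TEMPLATE.format(prompt=prompt))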

Guide: Running Locally

To run the HunyuanVideo model locally:

  1. Clone the repository:

    git clone https://github.com/tencent/HunyuanVideo
    cd HunyuanVideo
    
  2. Set up a Conda environment:

    conda env create -f environment.yml
    conda activate HunyuanVideo
    
  3. Install dependencies:

    python -m pip install -r requirements.txt
    python -m pip install git+https://github.com/Dao-AILab/flash-attention.git@v2.5.9.post1
    
  4. Download pretrained models: follow the download instructions in the HunyuanVideo repository.

  5. Run inference (a diffusers-based alternative is sketched after this list):

    python3 sample_video.py --video-size 720 1280 --video-length 129 --infer-steps 30 --prompt "a cat is running, realistic." --flow-reverse --seed 0 --use-cpu-offload --save-path ./results
    
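As an alternative to the CLI above, HunyuanVideo is also available through the Hugging Face diffusers integration. The sketch below assumes diffusers >= 0.32 and the community-hosted checkpoint hunyuanvideo-community/HunyuanVideo, and uses a reduced resolution so it fits on smaller GPUs:

    import torch
    from diffusers import HunyuanVideoPipeline, HunyuanVideoTransformer3DModel
    from diffusers.utils import export_to_video

    # Assumes diffusers >= 0.32 and the community-hosted checkpoint.
    model_id = "hunyuanvideo-community/HunyuanVideo"
    transformer = HunyuanVideoTransformer3DModel.from_pretrained(
        model_id, subfolder="transformer", torch_dtype=torch.bfloat16
    )
    pipe = HunyuanVideoPipeline.from_pretrained(
        model_id, transformer=transformer, torch_dtype=torch.float16
    )
    pipe.vae.enable_tiling()         # decode the latent video in tiles
    pipe.enable_model_cpu_offload()  # keep only the active module on the GPU

    video = pipe(
        prompt="a cat is running, realistic.",
        height=320,
        width=512,
        num_frames=61,
        num_inference_steps=30,
    ).frames[0]
    export_to_video(video, "output.mp4", fps=15)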

Cloud GPUs: an NVIDIA GPU with at least 60 GB of memory is required for 720p generation; an 80 GB GPU is recommended for the best quality.
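As a quick pre-flight check, you can confirm the visible GPU meets that memory bar before starting a long run; this snippet is a convenience sketch, not part of the repository:

    import torch

    # Convenience check against the reported ~60 GB minimum (80 GB recommended).
    props = torch.cuda.get_device_properties(0)
    total_gb = props.total_memory / 1024**3
    print(f"{props.name}: {total_gb:.0f} GB")
    if total_gb < 60:
        print("Warning: below the 60 GB minimum for 720p; expect OOM, "
              "or reduce --video-size / --video-length.")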

License

The HunyuanVideo model is licensed under the Tencent Hunyuan Community License. For more details, please refer to the LICENSE file provided in the repository.
