Introduction

Hunyuan-DiT is a text-to-image diffusion transformer with fine-grained understanding of both English and Chinese. Its design combines a purpose-built transformer structure, text encoder, and positional encoding, and the model supports multi-round, multimodal dialogue for generating and refining images. The authors report that it sets a new state of the art in Chinese-to-image generation.

Architecture

Hunyuan-DiT is a latent diffusion model: a pre-trained Variational Autoencoder (VAE) compresses images into a low-dimensional latent space, and the diffusion process runs in that space. The denoising network is a transformer, and text prompts are encoded with a bilingual CLIP encoder and a multilingual T5 encoder. The model supports multi-turn text-to-image generation, so an image can be created and revised iteratively over the course of a dialogue with the user.

Training

Training incorporates a Multimodal Large Language Model (MLLM), which refines image captions and supports multi-round dialogue for image generation. The model was evaluated with a comprehensive protocol involving more than 50 professional human evaluators.

Guide: Running Locally

Requirements

  • GPU: NVIDIA with CUDA support (V100/A100 recommended).
  • GPU memory: 11GB minimum; 32GB recommended for optimal performance.
  • OS: Linux
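
The requirements above can be checked before installing anything. The following is a minimal sketch, assuming `nvidia-smi` is on the PATH; the function name `classify_gpu_memory` and the MiB thresholds are illustrative choices, not part of the project. The thresholds are set slightly below the nominal 11GB and 32GB figures because drivers usually report a little less than the advertised size.

```shell
#!/usr/bin/env bash
# Hypothetical helper: classify a GPU memory size (in MiB) against the
# documented requirements (11GB minimum, 32GB recommended).
# Thresholds are approximate, not official values.
classify_gpu_memory() {
    local mib="$1"
    if [ "$mib" -ge 32000 ]; then
        echo "recommended"
    elif [ "$mib" -ge 11000 ]; then
        echo "minimum"
    else
        echo "insufficient"
    fi
}

# Query every visible GPU, but only if nvidia-smi is available.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=name,memory.total --format=csv,noheader,nounits |
    while IFS=, read -r name mib; do
        mib="${mib# }"   # strip the space that follows the CSV comma
        echo "${name}: ${mib} MiB -> $(classify_gpu_memory "$mib")"
    done
else
    echo "nvidia-smi not found; cannot check GPU memory" >&2
fi
```

A result of "minimum" means the model should run, while "recommended" matches the 32GB configuration listed above.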

Steps

  1. Clone the Repository:

    git clone https://github.com/tencent/HunyuanDiT
    cd HunyuanDiT
    
  2. Set Up Environment: Use Conda to create and activate the environment:

    conda env create -f environment.yml
    conda activate HunyuanDiT
    python -m pip install -r requirements.txt
    
  3. Optional Installation: For acceleration, install Flash Attention v2:

    python -m pip install git+https://github.com/Dao-AILab/flash-attention.git@v2.1.2.post3
    
  4. Download Pretrained Models: Install huggingface-cli and download models:

    python -m pip install "huggingface_hub[cli]"
    mkdir ckpts
    huggingface-cli download Tencent-Hunyuan/HunyuanDiT --local-dir ./ckpts
    
  5. Run Inference:

    • Using Gradio:
      python app/hydit_app.py
      
    • Using Command Line:
      python sample_t2i.py --prompt "渔舟唱晚"
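
To generate images for several prompts in one go, the command-line call above can be wrapped in a loop. This is a minimal sketch, assuming it is run from the repository root with the environment activated; the function name `generate_all` and the `DRY_RUN` variable are hypothetical conveniences, and only the `--prompt` flag shown above is used.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run sample_t2i.py once per prompt.
# Set DRY_RUN=1 to print the commands instead of executing them.
generate_all() {
    local prompt
    for prompt in "$@"; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            printf 'python sample_t2i.py --prompt "%s"\n' "$prompt"
        else
            # Stop at the first failing run so errors are not buried.
            python sample_t2i.py --prompt "$prompt" || return 1
        fi
    done
}
```

For example, `DRY_RUN=1 generate_all "渔舟唱晚" "a fishing boat at dusk"` prints the two commands without loading the model, which is a cheap way to check quoting before a long run.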
      

Suggested Cloud GPUs

If no suitable local GPU is available, consider cloud services such as AWS EC2 or Google Cloud, which offer instances with V100 or A100 GPUs.

License

Hunyuan-DiT is released under the Tencent Hunyuan Community License.
