HALLO2: Long-Duration and High-Resolution Audio-Driven Portrait Image Animation

Introduction

HALLO2 is a project developed by researchers from Fudan University, Baidu Inc., and Nanjing University. It generates long-duration, high-resolution animations of portrait images driven by audio input, animating a still portrait in sync with the audio to produce realistic and detailed results.

Architecture

The HALLO2 framework combines a denoising UNet, a face locator, and image and audio projectors. The architecture integrates various pretrained models for tasks such as face analysis, audio processing, and animation.
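As a rough illustration of how these components might fit together at inference time, the sketch below stubs out each stage. Every class and method name here is hypothetical and chosen only to mirror the description above; the actual project defines its own modules and signatures.

```python
# Hypothetical data flow through a HALLO2-style pipeline.
# All names below are illustrative stubs, not the project's real API.

class FaceLocator:
    """Finds the face region that the animation is anchored to."""
    def locate(self, image):
        # A real implementation would return a face mask / bounding box.
        return {"image": image, "face_mask": "mask"}

class AudioProjector:
    """Projects audio features into the denoiser's conditioning space."""
    def project(self, audio):
        return {"audio_embedding": f"embed({audio})"}

class ImageProjector:
    """Encodes the reference portrait into conditioning embeddings."""
    def project(self, image):
        return {"image_embedding": f"embed({image})"}

class DenoisingUNet:
    """Iteratively denoises latent frames under the given conditions."""
    def denoise(self, conditions):
        # Stub: emit three placeholder frames instead of real latents.
        return [f"frame_{i}" for i in range(3)]

def animate(image, audio):
    """Chain the components: locate the face, build conditions, denoise."""
    conditions = {}
    conditions.update(FaceLocator().locate(image))
    conditions.update(AudioProjector().project(audio))
    conditions.update(ImageProjector().project(image))
    return DenoisingUNet().denoise(conditions)

frames = animate("portrait.png", "speech.wav")
print(len(frames))  # the stub pipeline emits 3 placeholder frames
```

The point of the sketch is the conditioning pattern: the face locator and both projectors each contribute conditioning signals that the denoising UNet consumes together.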

Training

Training for HALLO2 is split into two stages: long-duration animation and high-resolution animation. For long-duration animation, the training data consists of talking-face videos that meet specific face-orientation and face-size criteria, with training distributed across multiple nodes using frameworks like Accelerate. High-resolution animation training uses the VFHQ dataset, with models launched via PyTorch's distributed launch capabilities.

Guide: Running Locally

Basic Steps

  1. Set Up Environment:
    • Use Ubuntu 20.04/22.04 with CUDA 11.8.
    • Create a conda environment and install necessary packages:
      conda create -n hallo python=3.10
      conda activate hallo
      pip install torch==2.2.2 torchvision==0.17.2 torchaudio==2.2.2 --index-url https://download.pytorch.org/whl/cu118
      pip install -r requirements.txt
      apt-get install ffmpeg
      
  2. Download Pretrained Models:
    • Clone the models from the HuggingFace repository:
      git lfs install
      git clone https://huggingface.co/fudan-generative-ai/hallo2 pretrained_models
      
  3. Prepare Inference Data:
    • Ensure source images are square, with the face occupying 50-70% of the image and facing forward.
    • Driving audio must be in WAV format and in English.
  4. Run Inference:
    • Execute inference scripts for long-duration or high-resolution animations:
      python scripts/inference_long.py --config ./configs/inference/long.yaml
      python scripts/video_sr.py --input_path [input_video] --output_path [output_dir]
      

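Before launching the scripts above, the input requirements from step 3 can be sanity-checked with a small standard-library script. This is an illustrative sketch, not part of the project: it reads the width and height from a PNG header to confirm the source image is square, and opens the driving audio with the `wave` module to confirm it is a readable, non-empty WAV file. The face-coverage (50-70%) and English-language requirements still need manual or model-based verification.

```python
import struct
import wave

def png_dimensions(path):
    """Read width/height from a PNG file's IHDR chunk (bytes 16-24)."""
    with open(path, "rb") as f:
        header = f.read(24)
    if header[:8] != b"\x89PNG\r\n\x1a\n":
        raise ValueError("not a PNG file")
    width, height = struct.unpack(">II", header[16:24])
    return width, height

def check_inputs(image_path, audio_path):
    """Verify the source image is square and the audio opens as WAV."""
    w, h = png_dimensions(image_path)
    if w != h:
        raise ValueError(f"source image must be square, got {w}x{h}")
    with wave.open(audio_path, "rb") as wav:
        if wav.getnframes() == 0:
            raise ValueError("driving audio is empty")
    return True
```

A call such as `check_inputs("portrait.png", "speech.wav")` either returns `True` or raises a `ValueError` describing which requirement failed, which makes the check easy to slot into a batch-processing script.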
Cloud GPUs

For enhanced performance, consider using cloud-based GPU services like AWS, Google Cloud, or Azure to handle computationally intensive tasks.

License

HALLO2 is released under the MIT License. Note that some components, such as the high-resolution animation feature, have specific license requirements (S-Lab License 1.0) that must be respected.
