HALLO2: Long-Duration and High-Resolution Audio-Driven Portrait Image Animation

Introduction

HALLO2 is a project developed by researchers from Fudan University, Baidu Inc., and Nanjing University. It generates long-duration, high-resolution animations of portrait images driven by audio input, animating a still portrait in sync with the audio to produce realistic and detailed results.

Architecture

The HALLO2 framework combines a denoising UNet, a face locator, and image and audio projectors. The architecture integrates various pretrained models for tasks such as face analysis, audio processing, and animation.
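As a rough illustration of how these components might fit together at inference time, the sketch below stubs out each stage. Every class and method name here is hypothetical and chosen only to mirror the description above; the actual project defines its own modules and signatures.

```python
# Hypothetical data flow through a HALLO2-style pipeline.
# All names below are illustrative stubs, not the project's real API.

class FaceLocator:
    """Finds the face region that the animation is anchored to."""
    def locate(self, image):
        # A real implementation would return a face mask / bounding box.
        return {"image": image, "face_mask": "mask"}

class AudioProjector:
    """Projects audio features into the denoiser's conditioning space."""
    def project(self, audio):
        return {"audio_embedding": f"embed({audio})"}

class ImageProjector:
    """Encodes the reference portrait into conditioning embeddings."""
    def project(self, image):
        return {"image_embedding": f"embed({image})"}

class DenoisingUNet:
    """Iteratively denoises latent frames under the given conditions."""
    def denoise(self, conditions):
        # Stub: emit three placeholder frames instead of real latents.
        return [f"frame_{i}" for i in range(3)]

def animate(image, audio):
    """Chain the components: locate the face, build conditions, denoise."""
    conditions = {}
    conditions.update(FaceLocator().locate(image))
    conditions.update(AudioProjector().project(audio))
    conditions.update(ImageProjector().project(image))
    return DenoisingUNet().denoise(conditions)

frames = animate("portrait.png", "speech.wav")
print(len(frames))  # the stub pipeline emits 3 placeholder frames
```

The point of the sketch is the conditioning pattern: the face locator and both projectors each contribute conditioning signals that the denoising UNet consumes together.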

Training

Training for HALLO2 is split into two stages: long-duration animation and high-resolution animation. For long-duration animation, the training data consists of talking-face videos that meet specific face-orientation and face-size criteria, with training distributed across multiple nodes using frameworks like Accelerate. High-resolution animation training uses the VFHQ dataset, with models launched via PyTorch's distributed launch capabilities.

Guide: Running Locally

Basic Steps

  1. Set Up Environment:
    • Use Ubuntu 20.04/22.04 with CUDA 11.8.
    • Create a conda environment and install necessary packages:
      conda create -n hallo python=3.10
      conda activate hallo
      pip install torch==2.2.2 torchvision==0.17.2 torchaudio==2.2.2 --index-url https://download.pytorch.org/whl/cu118
      pip install -r requirements.txt
      apt-get install ffmpeg
      
  2. Download Pretrained Models:
    • Clone the models from the HuggingFace repository:
      git lfs install
      git clone https://huggingface.co/fudan-generative-ai/hallo2 pretrained_models
      
  3. Prepare Inference Data:
    • Ensure source images are square, with the face occupying 50-70% of the image and facing forward.
    • Driving audio must be in WAV format and in English.
  4. Run Inference:
    • Execute inference scripts for long-duration or high-resolution animations:
      python scripts/inference_long.py --config ./configs/inference/long.yaml
      python scripts/video_sr.py --input_path [input_video] --output_path [output_dir]
      

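Before launching the scripts above, the input requirements from step 3 can be sanity-checked with a small standard-library script. This is an illustrative sketch, not part of the project: it reads the width and height from a PNG header to confirm the source image is square, and opens the driving audio with the `wave` module to confirm it is a readable, non-empty WAV file. The face-coverage (50-70%) and English-language requirements still need manual or model-based verification.

```python
import struct
import wave

def png_dimensions(path):
    """Read width/height from a PNG file's IHDR chunk (bytes 16-24)."""
    with open(path, "rb") as f:
        header = f.read(24)
    if header[:8] != b"\x89PNG\r\n\x1a\n":
        raise ValueError("not a PNG file")
    width, height = struct.unpack(">II", header[16:24])
    return width, height

def check_inputs(image_path, audio_path):
    """Verify the source image is square and the audio opens as WAV."""
    w, h = png_dimensions(image_path)
    if w != h:
        raise ValueError(f"source image must be square, got {w}x{h}")
    with wave.open(audio_path, "rb") as wav:
        if wav.getnframes() == 0:
            raise ValueError("driving audio is empty")
    return True
```

A call such as `check_inputs("portrait.png", "speech.wav")` either returns `True` or raises a `ValueError` describing which requirement failed, which makes the check easy to slot into a batch-processing script.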
Cloud GPUs

For enhanced performance, consider using cloud-based GPU services like AWS, Google Cloud, or Azure to handle computationally intensive tasks.

License

HALLO2 is released under the MIT License. Note that some components, such as the high-resolution animation feature, have specific license requirements (S-Lab License 1.0) that must be respected.
