MEMO
Introduction
MEMO (Memory-Guided Diffusion for Expressive Talking Video Generation) is a model for synthesizing expressive, realistic talking videos from a reference image and an audio clip using memory-guided diffusion. The project is developed by a team including Longtao Zheng, Yifan Zhang, and others.
Architecture
MEMO uses a memory-guided diffusion approach: a diffusion backbone whose denoising process is conditioned on features from face analysis and vocal separation models. Guiding each denoising step with a memory of previously generated frames keeps the output temporally smooth, identity-consistent, and realistic.
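The sketch below illustrates this idea in Python. It is a conceptual mock-up, not MEMO's actual code: the names (MemoryBank, denoise_step) and feature shapes are invented placeholders standing in for the real face-analysis features, audio features from separated vocals, and the diffusion denoiser. It only shows how a rolling memory of past frames can condition each denoising step.

    # Conceptual sketch only; class/function names and shapes are hypothetical,
    # not MEMO's actual API.
    import torch

    class MemoryBank:
        """Keeps features of recently generated frames to condition later steps."""
        def __init__(self, max_frames: int = 16):
            self.max_frames = max_frames
            self.frames: list[torch.Tensor] = []

        def update(self, frame_feat: torch.Tensor) -> None:
            self.frames.append(frame_feat)
            self.frames = self.frames[-self.max_frames:]

        def context(self) -> torch.Tensor:
            if not self.frames:
                return torch.zeros(1, 64)
            return torch.stack(self.frames).mean(dim=0)  # simple aggregation

    def denoise_step(noisy: torch.Tensor, identity: torch.Tensor,
                     audio: torch.Tensor, memory: torch.Tensor) -> torch.Tensor:
        # Stand-in for the diffusion denoiser: pull the latent toward the
        # combined conditioning signal.
        cond = identity + audio + memory
        return noisy - 0.1 * (noisy - cond)

    identity_feat = torch.randn(1, 64)    # stand-in for face-analysis features
    audio_feats = torch.randn(10, 1, 64)  # stand-in for per-frame vocal features
    memory = MemoryBank()

    frames = []
    for audio_feat in audio_feats:
        latent = torch.randn(1, 64)        # start each frame from noise
        for _ in range(4):                 # a few denoising steps
            latent = denoise_step(latent, identity_feat, audio_feat, memory.context())
        memory.update(latent.detach())     # remember the finished frame
        frames.append(latent)

    print(f"Generated {len(frames)} frame latents.")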
Training
The model is trained on high-quality open-source talking-head datasets, including HDTF, VFHQ, CelebV-HQ, MultiTalk, and MEAD. These datasets are curated and preprocessed to improve the model's ability to generate high-fidelity talking videos. Performance has been tuned for speed and quality on recent NVIDIA GPUs (see Suggested Cloud GPUs below).
Guide: Running Locally
Installation
- Create and activate a Python environment:
    conda create -n memo python=3.10 -y
    conda activate memo
- Install dependencies (a quick environment check is sketched after this list):
    conda install -c conda-forge ffmpeg -y
    pip install -e .
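After installation, a quick sanity check can confirm that ffmpeg is on the PATH and that PyTorch can see a CUDA device. This is a minimal sketch and not part of the MEMO repository:

    # Minimal environment check; not part of the MEMO codebase.
    import shutil
    import torch

    # ffmpeg must be discoverable on PATH for audio/video processing.
    print("ffmpeg:", shutil.which("ffmpeg") or "NOT FOUND")

    # Inference is GPU-heavy; confirm CUDA is visible to PyTorch.
    print("CUDA available:", torch.cuda.is_available())
    if torch.cuda.is_available():
        print("GPU:", torch.cuda.get_device_name(0))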
Inference
Run the following command to perform inference:
python inference.py --config configs/inference.yaml --input_image <IMAGE_PATH> --input_audio <AUDIO_PATH> --output_dir <SAVE_PATH>
Example usage:
python inference.py --config configs/inference.yaml --input_image assets/examples/dicaprio.jpg --input_audio assets/examples/speech.wav --output_dir outputs
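To run the same command over several audio clips, a small wrapper can loop over the inputs and invoke the CLI. This is an illustrative sketch rather than a utility shipped with MEMO; it reuses the flags shown above, and the per-clip output subfolder layout is an assumption.

    # Illustrative batch wrapper around the inference CLI; not part of MEMO itself.
    import subprocess
    from pathlib import Path

    image = "assets/examples/dicaprio.jpg"
    audio_dir = Path("assets/examples")
    output_root = Path("outputs")

    for audio in sorted(audio_dir.glob("*.wav")):
        out_dir = output_root / audio.stem   # one output folder per clip (assumed layout)
        out_dir.mkdir(parents=True, exist_ok=True)
        subprocess.run(
            [
                "python", "inference.py",
                "--config", "configs/inference.yaml",
                "--input_image", image,
                "--input_audio", str(audio),
                "--output_dir", str(out_dir),
            ],
            check=True,
        )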
Suggested Cloud GPUs
The model has been tested on NVIDIA H100 and RTX 4090 GPUs. To reduce inference time, consider cloud services that offer these or comparable GPUs.
License
The code and model are released under the Apache 2.0 License, which permits use, modification, and redistribution with attribution and retention of the license notice. Users must comply with ethical guidelines and avoid misuse, particularly the generation of misleading or unauthorized content.