MEMO
Introduction
MEMO (Memory-Guided Diffusion for Expressive Talking Video Generation) is a model for synthesizing expressive, realistic talking videos from a reference image and an audio clip using memory-guided diffusion. The project is developed by a team including Longtao Zheng, Yifan Zhang, and others.
Architecture
MEMO uses a memory-guided diffusion approach: a diffusion backbone whose denoising process is conditioned on features from face analysis and vocal separation models. Guiding each denoising step with a memory of previously generated frames keeps the output temporally smooth, identity-consistent, and realistic.
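The sketch below illustrates this idea in Python. It is a conceptual mock-up, not MEMO's actual code: the names (MemoryBank, denoise_step) and feature shapes are invented placeholders standing in for the real face-analysis features, audio features from separated vocals, and the diffusion denoiser. It only shows how a rolling memory of past frames can condition each denoising step.

    # Conceptual sketch only; class/function names and shapes are hypothetical,
    # not MEMO's actual API.
    import torch

    class MemoryBank:
        """Keeps features of recently generated frames to condition later steps."""
        def __init__(self, max_frames: int = 16):
            self.max_frames = max_frames
            self.frames: list[torch.Tensor] = []

        def update(self, frame_feat: torch.Tensor) -> None:
            self.frames.append(frame_feat)
            self.frames = self.frames[-self.max_frames:]

        def context(self) -> torch.Tensor:
            if not self.frames:
                return torch.zeros(1, 64)
            return torch.stack(self.frames).mean(dim=0)  # simple aggregation

    def denoise_step(noisy: torch.Tensor, identity: torch.Tensor,
                     audio: torch.Tensor, memory: torch.Tensor) -> torch.Tensor:
        # Stand-in for the diffusion denoiser: pull the latent toward the
        # combined conditioning signal.
        cond = identity + audio + memory
        return noisy - 0.1 * (noisy - cond)

    identity_feat = torch.randn(1, 64)    # stand-in for face-analysis features
    audio_feats = torch.randn(10, 1, 64)  # stand-in for per-frame vocal features
    memory = MemoryBank()

    frames = []
    for audio_feat in audio_feats:
        latent = torch.randn(1, 64)        # start each frame from noise
        for _ in range(4):                 # a few denoising steps
            latent = denoise_step(latent, identity_feat, audio_feat, memory.context())
        memory.update(latent.detach())     # remember the finished frame
        frames.append(latent)

    print(f"Generated {len(frames)} frame latents.")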
Training
The model is trained on high-quality open-source talking-head datasets, including HDTF, VFHQ, CelebV-HQ, MultiTalk, and MEAD. These datasets are curated and preprocessed to improve the model's ability to generate high-fidelity talking videos. Performance has been tuned for speed and quality on recent NVIDIA GPUs (see Suggested Cloud GPUs below).
Guide: Running Locally
Installation
- Create and activate a Python environment:
    conda create -n memo python=3.10 -y
    conda activate memo
- Install dependencies (a quick environment check is sketched after this list):
    conda install -c conda-forge ffmpeg -y
    pip install -e .
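After installation, a quick sanity check can confirm that ffmpeg is on the PATH and that PyTorch can see a CUDA device. This is a minimal sketch and not part of the MEMO repository:

    # Minimal environment check; not part of the MEMO codebase.
    import shutil
    import torch

    # ffmpeg must be discoverable on PATH for audio/video processing.
    print("ffmpeg:", shutil.which("ffmpeg") or "NOT FOUND")

    # Inference is GPU-heavy; confirm CUDA is visible to PyTorch.
    print("CUDA available:", torch.cuda.is_available())
    if torch.cuda.is_available():
        print("GPU:", torch.cuda.get_device_name(0))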
Inference
Run the following command to perform inference:
python inference.py --config configs/inference.yaml --input_image <IMAGE_PATH> --input_audio <AUDIO_PATH> --output_dir <SAVE_PATH>
Example usage:
python inference.py --config configs/inference.yaml --input_image assets/examples/dicaprio.jpg --input_audio assets/examples/speech.wav --output_dir outputs
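To run the same command over several audio clips, a small wrapper can loop over the inputs and invoke the CLI. This is an illustrative sketch rather than a utility shipped with MEMO; it reuses the flags shown above, and the per-clip output subfolder layout is an assumption.

    # Illustrative batch wrapper around the inference CLI; not part of MEMO itself.
    import subprocess
    from pathlib import Path

    image = "assets/examples/dicaprio.jpg"
    audio_dir = Path("assets/examples")
    output_root = Path("outputs")

    for audio in sorted(audio_dir.glob("*.wav")):
        out_dir = output_root / audio.stem   # one output folder per clip (assumed layout)
        out_dir.mkdir(parents=True, exist_ok=True)
        subprocess.run(
            [
                "python", "inference.py",
                "--config", "configs/inference.yaml",
                "--input_image", image,
                "--input_audio", str(audio),
                "--output_dir", str(out_dir),
            ],
            check=True,
        )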
Suggested Cloud GPUs
The model has been tested on NVIDIA H100 and RTX 4090 GPUs. To reduce inference time, consider cloud services that offer these or comparable GPUs.
License
The code and model are released under the Apache 2.0 License, which permits use, modification, and redistribution with attribution and retention of the license notice. Users must comply with ethical guidelines and avoid misuse, particularly the generation of misleading or unauthorized content.