Vevo
Introduction
Vevo is a versatile zero-shot voice imitation framework with controllable timbre and style. It supports a range of voice transformations: style-preserved voice conversion, style conversion (e.g., accent and emotion), and text-to-speech (TTS) with controllable style and timbre. The models are trained on the Emilia Dataset, which contains 101k hours of multilingual speech data.
Architecture
Vevo utilizes several components for voice imitation and synthesis:
- Content Tokenizer: Converts speech to content tokens using a single-codebook VQ-VAE with a vocabulary size of 32.
- Content-Style Tokenizer: Converts speech to content-style tokens with a vocabulary size of 8192.
- Vq32ToVq8192: Predicts content-style tokens from content tokens using an auto-regressive transformer.
- PhoneToVq8192: Predicts content-style tokens from phone tokens using an auto-regressive transformer.
- Vq8192ToMels: Generates mel-spectrograms from content-style tokens with a flow-matching transformer.
- Vocoder: Converts mel-spectrograms to audio using a Vocos-based vocoder.
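The components above form a cascaded pipeline: speech (or phones, for TTS) is tokenized, mapped to content-style tokens, rendered to mel-spectrograms, and vocoded to audio. A minimal sketch of that flow, using an illustrative data layout (the stage records and the chain check are assumptions, not Vevo's actual API), with the vocabulary sizes stated above:

```python
# Illustrative description of Vevo's cascaded voice-conversion path.
# Stage records and check_chain() are a sketch, not the real API.
PIPELINE = [
    {"stage": "Content Tokenizer", "in": "speech", "out": "content tokens", "vocab": 32},
    {"stage": "Vq32ToVq8192", "in": "content tokens", "out": "content-style tokens", "vocab": 8192},
    {"stage": "Vq8192ToMels", "in": "content-style tokens", "out": "mel-spectrogram", "vocab": None},
    {"stage": "Vocoder", "in": "mel-spectrogram", "out": "waveform", "vocab": None},
]

def check_chain(stages):
    """Verify each stage consumes exactly what the previous stage produces."""
    return all(a["out"] == b["in"] for a, b in zip(stages, stages[1:]))
```

For TTS, PhoneToVq8192 replaces the first two stages, predicting content-style tokens directly from phone tokens; the rest of the chain is unchanged.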
Training
The pre-trained models are available for download and use. Trained on the extensive, linguistically diverse Emilia Dataset, they handle a range of tasks from voice conversion to TTS.
Guide: Running Locally
Basic Steps
- Clone the Repository: Download the Amphion GitHub repository.
- Set Up Environment: Ensure you have Python and PyTorch installed. Use a virtual environment for package management.
- Download Pre-trained Models: Use huggingface_hub to download the necessary checkpoints.
- Run Inference: Use the provided script to perform tasks such as zero-shot TTS.
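The download step can be sketched with huggingface_hub's `snapshot_download`. The repository id `amphion/Vevo` and the local directory below are assumptions for illustration; check the model card for the actual id:

```python
from huggingface_hub import snapshot_download

# Assumed repository id -- verify against the actual model card.
REPO_ID = "amphion/Vevo"

def fetch_checkpoints(local_dir: str = "./ckpts/Vevo") -> str:
    """Download the pre-trained checkpoints into local_dir; returns the path."""
    return snapshot_download(repo_id=REPO_ID, local_dir=local_dir)

if __name__ == "__main__":
    print(fetch_checkpoints())
```

Pass `allow_patterns` to `snapshot_download` if you only need a subset of the checkpoints rather than the full repository.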
Cloud GPUs
For optimal performance, especially for large models, consider using cloud GPU services like AWS EC2 or Google Cloud's AI Platform.
License
Vevo is released under the CC-BY-NC-4.0 license, allowing for non-commercial usage with appropriate credit.