Introduction

Vevo is a versatile zero-shot voice imitation framework with controllable timbre and style. It allows for various voice transformations, including style-preserved voice conversion, style conversion such as accent and emotion conversion, and text-to-speech (TTS) with controllable style and timbre. The models are trained on the Emilia Dataset, which contains 101k hours of speech data in multiple languages.

Architecture

Vevo utilizes several components for voice imitation and synthesis:

  • Content Tokenizer: Converts speech to content tokens using a single-codebook VQ-VAE with a vocabulary size of 32.
  • Content-Style Tokenizer: Converts speech to content-style tokens with a vocabulary size of 8192.
  • Vq32ToVq8192: Predicts content-style tokens from content tokens using an auto-regressive transformer.
  • PhoneToVq8192: Predicts content-style tokens from phone tokens using an auto-regressive transformer.
  • Vq8192ToMels: Generates mel-spectrograms from content-style tokens with a flow-matching transformer.
  • Vocoder: Converts mel-spectrograms to audio using a Vocos-based vocoder.
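Both tokenizers are learned vector-quantization models whose core step is a nearest-codebook lookup. The toy sketch below illustrates that lookup only; the codebooks, feature dimension, and frame count are random placeholders, not Vevo's actual weights or shapes.

```python
import numpy as np

def quantize(frames, codebook):
    """Map each feature frame to the index of its nearest codebook vector."""
    # frames: (T, D) speech features; codebook: (K, D) code vectors
    dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)  # (T, K)
    return dists.argmin(axis=1)  # (T,) token ids in [0, K)

rng = np.random.default_rng(0)
content_codebook = rng.normal(size=(32, 8))    # K = 32, as in the content tokenizer
style_codebook = rng.normal(size=(8192, 8))    # K = 8192, as in the content-style tokenizer
frames = rng.normal(size=(100, 8))             # 100 toy feature frames

content_tokens = quantize(frames, content_codebook)
style_tokens = quantize(frames, style_codebook)
```

The small vocabulary (32) forces the content tokens to discard timbre and style detail, while the larger vocabulary (8192) lets the content-style tokens retain it; the Vq32ToVq8192 and PhoneToVq8192 transformers bridge these two token spaces.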

Training

Pre-trained checkpoints are available for download. All models are trained on the Emilia Dataset and cover tasks ranging from voice conversion to TTS.

Guide: Running Locally

Basic Steps

  1. Clone the Repository: Clone the Amphion GitHub repository.
  2. Set Up Environment: Ensure you have Python and PyTorch installed. Use a virtual environment for package management.
  3. Download Pre-trained Models: Use huggingface_hub to download the necessary checkpoints.
  4. Run Inference: Use the provided script to perform tasks such as zero-shot TTS.
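The steps above can be sketched as shell commands. The GitHub URL and Hugging Face repo id follow the Amphion project's public names, but the local paths and the exact setup/inference commands are illustrative; check the repository's README for the definitive ones.

```shell
# 1. Clone the Amphion repository
git clone https://github.com/open-mmlab/Amphion.git
cd Amphion

# 2. Set up an isolated environment with Python and PyTorch
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt  # or follow the repo's own setup instructions

# 3. Download the pre-trained checkpoints via huggingface_hub
huggingface-cli download amphion/Vevo --local-dir ./ckpts

# 4. Run inference with the script shipped in the repo
#    (see the repository's README for the exact script path and arguments)
```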

Cloud GPUs

For optimal performance, especially for large models, consider using cloud GPU services like AWS EC2 or Google Cloud's AI Platform.

License

Vevo is released under the CC-BY-NC-4.0 license, allowing for non-commercial usage with appropriate credit.
