Introduction

viXTTS is a text-to-speech model that clones a voice across languages from a reference audio clip of about 6 seconds. It is adapted from XTTS-v2.0.3 by extending the tokenizer to cover Vietnamese and fine-tuning on the viVoice dataset.

Architecture

viXTTS supports 18 languages: English, Spanish, French, German, Italian, Portuguese, Polish, Turkish, Russian, Dutch, Czech, Arabic, Chinese, Japanese, Hungarian, Korean, Hindi, and Vietnamese. The fine-tuning on the viVoice dataset targets Vietnamese specifically.
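The language list above can be represented as a lookup table and used to validate a language code before synthesis. This is a minimal sketch: the codes are assumed to match those of upstream XTTS-v2 (for example `zh-cn` for Chinese) plus `vi` for Vietnamese, and `check_language` is a hypothetical helper, not part of any library.

```python
# Language codes are an assumption: upstream XTTS-v2 codes plus "vi".
SUPPORTED_LANGUAGES = {
    "en": "English", "es": "Spanish", "fr": "French", "de": "German",
    "it": "Italian", "pt": "Portuguese", "pl": "Polish", "tr": "Turkish",
    "ru": "Russian", "nl": "Dutch", "cs": "Czech", "ar": "Arabic",
    "zh-cn": "Chinese", "ja": "Japanese", "hu": "Hungarian", "ko": "Korean",
    "hi": "Hindi", "vi": "Vietnamese",
}


def check_language(code: str) -> str:
    """Return the normalized language code, or raise ValueError if unsupported."""
    normalized = code.strip().lower()
    if normalized not in SUPPORTED_LANGUAGES:
        raise ValueError(f"Unsupported language code: {code!r}")
    return normalized
```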

Training

viXTTS is fine-tuned from XTTS-v2.0.3: Vietnamese is added to the tokenizer, and the model is then trained on the viVoice dataset. Known limitations include incompatibility with the original TTS library and poor output quality for short Vietnamese sentences.
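One possible workaround for the short-sentence limitation is to merge short sentences with their neighbours before synthesis, so that the model rarely receives a very short input. The sketch below is an assumption, not the authors' method: `merge_short_sentences` is a hypothetical helper, and the `min_words` default of 10 is an illustrative threshold that should be tuned by listening to the output.

```python
import re


def merge_short_sentences(text: str, min_words: int = 10) -> list[str]:
    """Split text into sentences, then merge sentences shorter than
    min_words with the following ones so fewer short chunks are produced."""
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks: list[str] = []
    buffer = ""
    for sentence in sentences:
        buffer = f"{buffer} {sentence}".strip()
        if len(buffer.split()) >= min_words:
            chunks.append(buffer)
            buffer = ""
    if buffer:
        # A short trailing remainder is attached to the previous chunk, if any.
        if chunks:
            chunks[-1] = f"{chunks[-1]} {buffer}"
        else:
            chunks.append(buffer)
    return chunks
```

Each returned chunk would then be passed to the model as one input. A text that is short overall still yields a single short chunk, so this reduces the problem rather than removing it.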

Guide: Running Locally

  1. Set Up the Environment: Ensure Python and the necessary libraries are installed. Clone the repository and navigate to the project directory.
  2. Download the Model: Obtain the model files from the Hugging Face repository.
  3. Install Dependencies: Use a package manager like pip to install required dependencies.
  4. Run the Model: Execute the scripts provided in the repository for testing the model with audio samples.
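The download step, and a sanity check before running the model, can be sketched in Python. The example assumes that the `huggingface_hub` package is installed, that the model is published under the repository ID `capleaf/viXTTS`, and that the checkpoint follows the usual XTTS layout of `config.json`, `model.pth`, and `vocab.json`. None of these details are stated in this document, so verify them against the repository. The helper names are hypothetical.

```python
from pathlib import Path

# Assumptions: the repository ID and file names are not given in this document.
ASSUMED_REPO_ID = "capleaf/viXTTS"
ASSUMED_MODEL_FILES = ("config.json", "model.pth", "vocab.json")


def download_model(local_dir: str, repo_id: str = ASSUMED_REPO_ID) -> Path:
    """Download all files of the model repository into local_dir."""
    # Imported lazily so the file check below works without huggingface_hub.
    from huggingface_hub import snapshot_download

    return Path(snapshot_download(repo_id=repo_id, local_dir=local_dir))


def missing_model_files(model_dir: str, required=ASSUMED_MODEL_FILES) -> list[str]:
    """Return the required file names that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in required if not (root / name).is_file()]
```

After `download_model("model")`, an empty list from `missing_model_files("model")` indicates that the expected files are in place before the repository's inference scripts are run.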

For better performance, run the model on a GPU. Cloud providers such as Google Cloud and AWS offer GPU instances.
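Whether a GPU is actually visible to the process can be checked before the model is loaded. The following sketch assumes PyTorch as the backend and falls back to the CPU when PyTorch or CUDA is unavailable; `pick_device` is a hypothetical helper name.

```python
def pick_device() -> str:
    """Return "cuda" when a CUDA-capable GPU is usable, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed, so only CPU execution is possible.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


print(pick_device())
```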

License

This model is licensed under the Coqui Public Model License (CPML). See the license text for details.
