Introduction

This model is a version of the F5-TTS text-to-speech system fine-tuned for Spanish speech synthesis. It aims to provide high-quality speech covering a range of regional accents for Spanish speakers.

Architecture

The model is fine-tuned from the base model SWivid/F5-TTS. Fine-tuning used 218 hours of audio data, with a batch size of 3200, a maximum of 64 samples per batch, and 1,200,000 training steps.
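
The reported settings can be collected in a small configuration object, which is handy when scripting around the model. This is only a sketch: the class and field names are hypothetical and not taken from the F5-TTS code base, and the naming helper assumes that the checkpoint file name used later in this guide (model_1200000.safetensors) encodes the training step count.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FinetuneConfig:
    """Hypothetical container for the fine-tuning settings reported above."""

    base_model: str = "SWivid/F5-TTS"
    audio_hours: int = 218
    batch_size: int = 3200
    max_samples: int = 64
    training_steps: int = 1_200_000

    def checkpoint_name(self) -> str:
        # Assumption: the checkpoint file is named after the step count.
        return f"model_{self.training_steps}.safetensors"


config = FinetuneConfig()
print(config.checkpoint_name())  # prints "model_1200000.safetensors"
```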

Training

Training data included the Voxpopuli dataset, the TEDx Spanish Corpus, and crowdsourced high-quality Spanish speech data from several regions, including Argentina, Chile, Colombia, Peru, Puerto Rico, and Venezuela.

Guide: Running Locally

Method 1: Manual Model Replacement

  1. Run the Application: Launch the F5-TTS application and note the model file path printed in the terminal.
  2. Replace the Model File:
    • Navigate to the file location.
    • Rename the existing model file to model_1200000.safetensors.bak.
    • Download and save model_1200000.safetensors from the repository to the same location.
  3. Restart the Application: Relaunch to load the updated model.
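
The backup-and-replace part of step 2 can also be scripted. The following is a minimal sketch using only the Python standard library. It assumes the new model_1200000.safetensors has already been downloaded to a local path; the function name and both path arguments are illustrative and not part of F5-TTS.

```python
import shutil
from pathlib import Path


def replace_model_file(installed: Path, downloaded: Path) -> Path:
    """Back up the installed model file, then copy the downloaded one into place.

    Returns the path of the backup file (the original name plus ".bak").
    """
    if not downloaded.is_file():
        raise FileNotFoundError(f"Downloaded model not found: {downloaded}")
    if not installed.is_file():
        raise FileNotFoundError(f"Installed model not found: {installed}")

    backup = installed.with_name(installed.name + ".bak")
    if backup.exists():
        # Refuse to overwrite an earlier backup.
        raise FileExistsError(f"Backup already exists: {backup}")

    installed.rename(backup)             # keep the original as *.bak
    shutil.copy2(downloaded, installed)  # put the new model at the original path
    return backup


# Example call (both paths are placeholders; use the path shown in your terminal):
# replace_model_file(Path("<path from terminal>"), Path("<downloaded file>"))
```

To roll back, delete the new file and rename the .bak file to its original name. Restart the application afterwards, as in step 3.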

Alternative Methods

  • GitHub Repository: Clone the Spanish-F5 repository and follow installation instructions.
  • Google Colab: Run the model in a Google Colab notebook.
    • Change runtime type to T4 GPU and run all cells.
    • Access the public URL provided.
  • Jupyter Notebook: Run using the Spanish_F5.ipynb notebook.

Cloud GPUs

Speech synthesis runs considerably faster on a GPU. If no local GPU is available, consider a cloud GPU, such as the T4 runtime on Google Colab mentioned above or a GPU instance on AWS.
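
Before launching the application, it can help to check whether a GPU is visible at all. The sketch below is one possible check and not part of F5-TTS: it assumes an NVIDIA GPU and relies on the nvidia-smi command-line tool being on the PATH, and the function name is illustrative.

```python
import shutil
import subprocess


def list_nvidia_gpus() -> list[str]:
    """Return the GPU names reported by nvidia-smi, or [] if none are visible."""
    if shutil.which("nvidia-smi") is None:
        return []  # tool not installed, so no NVIDIA driver is available
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
            capture_output=True,
            text=True,
            check=True,
            timeout=30,
        )
    except (subprocess.SubprocessError, OSError):
        return []  # the tool exists but could not query a GPU
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


gpus = list_nvidia_gpus()
print(gpus if gpus else "No NVIDIA GPU detected; consider a cloud GPU runtime.")
```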

License

The F5-TTS model is released under the CC0-1.0 license, allowing for free use, modification, and distribution.
