Introduction

F5-TTS is a text-to-speech (TTS) model developed by SWivid that generates fluent and faithful speech using flow matching. It was trained on the amphion/Emilia-Dataset, and its weights are released under the CC BY-NC 4.0 license.

Architecture

The F5-TTS model is implemented in the F5-TTS library and converts text input into speech output. In line with its flow matching approach, it generates speech by learning to transform noise into speech features. The architecture is designed to handle diverse speech patterns while keeping synthesis fidelity high.
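
To make the flow matching idea concrete, here is a minimal sketch in plain Python. It is a generic toy illustration, not the F5-TTS implementation: it assumes the common straight-line path between a noise sample and a data sample, and it replaces the trained network with an "oracle" that already knows the target velocity. The function names and the three-value "data" vector are hypothetical and chosen only for this example.

```python
import random


def interpolate(x0, x1, t):
    """Point on the straight-line path from noise x0 (t=0) to data x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Velocity of that path; it is constant in t because the path is a line."""
    return [b - a for a, b in zip(x0, x1)]


def euler_sample(x0, velocity_fn, steps=32):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


rng = random.Random(0)
x1 = [0.5, -1.0, 2.0]                    # toy stand-in for data (not real speech features)
x0 = [rng.gauss(0.0, 1.0) for _ in x1]   # Gaussian noise sample


def oracle(x, t):
    # Assumption: a perfectly trained model would predict this velocity.
    return target_velocity(x0, x1)


sample = euler_sample(x0, oracle)
```

In a real system the oracle is replaced by a neural network trained to regress the target velocity at randomly drawn times `t`, conditioned on the text; sampling then follows the same kind of integration loop.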

Training

The model was trained using the amphion/Emilia-Dataset, an in-the-wild dataset that provides a rich source of diverse speech data. The training process emphasizes maintaining the authenticity and fluency of the generated speech.

Guide: Running Locally

To run F5-TTS locally, follow these steps:

  1. Download the Model: Obtain the necessary model files from the provided links.

  2. Setup Directory: Place the downloaded models in a directory structured as follows:

    ckpts/
        E2TTS_Base/
            model_1200000.pt
        F5TTS_Base/
            model_1200000.pt
    
  3. Inference: If you prefer the .safetensors format, use the .safetensors checkpoints for inference and place them in the same layout:

    ckpts/
        E2TTS_Base/
            model_1200000.safetensors
        F5TTS_Base/
            model_1200000.safetensors
    
  4. Consider Cloud GPUs: For efficient model inference, consider using cloud GPU services like AWS, Google Cloud, or Azure.
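
Before running inference, it can help to confirm that the checkpoint files are where the layout in steps 2 and 3 expects them. The following is a small sketch using only the Python standard library; it is not part of the F5-TTS codebase. The helper names `find_checkpoint` and `check_layout` are hypothetical, and preferring the .safetensors file when both formats are present is an assumption made for this example.

```python
from pathlib import Path

# Directory and file names taken from the layout shown in steps 2 and 3.
MODELS = ("E2TTS_Base", "F5TTS_Base")
STEM = "model_1200000"


def find_checkpoint(root, model, prefer_safetensors=True):
    """Return the checkpoint path for `model` under `root`, or None if absent."""
    suffixes = [".safetensors", ".pt"]
    if not prefer_safetensors:
        suffixes.reverse()
    for suffix in suffixes:
        candidate = Path(root) / model / (STEM + suffix)
        if candidate.is_file():
            return candidate
    return None


def check_layout(root="ckpts"):
    """Map each expected model directory to the checkpoint found (or None)."""
    return {model: find_checkpoint(root, model) for model in MODELS}


if __name__ == "__main__":
    for model, path in check_layout().items():
        print(f"{model}: {path if path else 'missing'}")
```

Run from the directory that contains `ckpts/`, this prints one line per model with either the resolved checkpoint path or `missing`.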

License

The F5-TTS model is released under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license because it was trained on the Emilia dataset; the model license reflects the terms of that dataset. The codebase remains under the MIT License.
