F5 TTS — MLX

Introduction

F5 TTS is a non-autoregressive, zero-shot text-to-speech system. It generates mel spectrograms with a flow-matching model built on a diffusion transformer (DiT). This package is a port of the original model to the MLX framework.

Architecture

F5 TTS uses a diffusion transformer to generate mel spectrograms, which are then converted into speech audio. Zero-shot here means the model can follow the voice of a short reference clip without being fine-tuned on that voice.
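
As a rough illustration of what flow-matching generation involves, the sketch below integrates a velocity field from noise at t = 0 to a sample at t = 1 with fixed Euler steps. It is a toy under stated assumptions, not the code used by f5-tts-mlx: the names `conditional_velocity` and `euler_sample` are hypothetical, the "mel frame" is a three-number list, and the velocity comes from a known target rather than from a trained DiT.

```python
import random


def conditional_velocity(x, t, target):
    """Toy stand-in for a learned velocity field (assumption, not the real model).

    On the straight-line path x_t = (1 - t) * x_0 + t * x_1 the velocity is
    x_1 - x_0, which can be rewritten as (x_1 - x_t) / (1 - t).
    In a real system a network predicts this quantity; here the target is known.
    """
    return [(x1 - xt) / (1.0 - t) for xt, x1 in zip(x, target)]


def euler_sample(velocity_fn, x0, steps=32):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with fixed Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt  # the last evaluated t is 1 - dt, so 1 - t never reaches zero
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


if __name__ == "__main__":
    rng = random.Random(0)
    target = [0.5, -1.0, 2.0]  # hypothetical "mel frame"
    noise = [rng.gauss(0.0, 1.0) for _ in target]
    sample = euler_sample(lambda x, t: conditional_velocity(x, t, target), noise)
    print(sample)
```

All positions of the sample are updated together at every step, which is the sense in which this style of generation is non-autoregressive.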

Training

The original F5 TTS weights have been converted for use with MLX. Because synthesis is non-autoregressive, the model produces the frames of an utterance in parallel rather than one at a time, which keeps generation fast.

Guide: Running Locally

  1. Installation: Install the F5 TTS MLX package using pip:

    pip install f5-tts-mlx
    
  2. Basic Usage: Generate speech from text using the command:

    python -m f5_tts_mlx.generate --text "The quick brown fox jumped over the lazy dog."
    
  3. Using Reference Audio: To use your own reference audio sample, make sure it is a mono, 24 kHz WAV file that is 5 to 10 seconds long:

    python -m f5_tts_mlx.generate \
    --text "The quick brown fox jumped over the lazy dog." \
    --ref-audio /path/to/audio.wav \
    --ref-text "This is the caption for the reference audio."
    
  4. Audio Conversion: Convert audio files to the required format (mono, 24 kHz, 16-bit samples, trimmed to 10 seconds) using ffmpeg:

    ffmpeg -i /path/to/audio.wav -ac 1 -ar 24000 -sample_fmt s16 -t 10 /path/to/output_audio.wav
    
  5. Python Usage: Generate speech from Python with the generate function (remaining arguments are elided here):

    from f5_tts_mlx.generate import generate
    
    audio = generate(text="Hello world.", ...)
    
  • Cloud GPUs: For more intensive tasks or faster processing, consider using cloud GPU services such as AWS, Google Cloud, or Azure.
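
The reference-audio requirements and command-line flags from the steps above can be sketched in Python using only the standard library. This is a minimal sketch, not part of the f5-tts-mlx package: the helper names `check_reference_audio` and `build_generate_command` are hypothetical, and the only flags used are the ones shown in the guide (`--text`, `--ref-audio`, `--ref-text`).

```python
import sys
import wave


def check_reference_audio(path, sample_rate=24000, min_seconds=5.0, max_seconds=10.0):
    """Return a list of problems; an empty list means the WAV file matches the guide."""
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        rate = wav.getframerate()
        seconds = wav.getnframes() / float(rate)

    problems = []
    if channels != 1:
        problems.append(f"expected mono audio, found {channels} channels")
    if rate != sample_rate:
        problems.append(f"expected {sample_rate} Hz, found {rate} Hz")
    if not (min_seconds <= seconds <= max_seconds):
        problems.append(
            f"expected {min_seconds}-{max_seconds} s of audio, found {seconds:.2f} s"
        )
    return problems


def build_generate_command(text, ref_audio=None, ref_text=None):
    """Build the argument list for the command-line generator shown in the guide."""
    cmd = [sys.executable, "-m", "f5_tts_mlx.generate", "--text", text]
    if ref_audio is not None:
        cmd += ["--ref-audio", ref_audio]
    if ref_text is not None:
        cmd += ["--ref-text", ref_text]
    return cmd
```

A caller could run `check_reference_audio` on a clip first and, if it returns an empty list, pass the result of `build_generate_command` to `subprocess.run(cmd, check=True)`. A clip that fails the check can be converted with the ffmpeg command in step 4.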

License

This project is licensed under the MIT License.
