F5 TTS MLX (lucasnewman)
Introduction
F5 TTS is a non-autoregressive, zero-shot text-to-speech system that uses a flow-matching mel spectrogram generator with a diffusion transformer (DiT). This version runs on the MLX framework: the weights of the original model have been reshaped for use with MLX.
Architecture
F5 TTS uses a diffusion transformer, trained with flow matching, to generate a mel spectrogram from the input text. The mel spectrogram is then converted into an audio waveform. Because the model is zero-shot, it can follow a short reference recording of a voice without being fine-tuned on that voice.
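To make the flow-matching idea concrete, here is a minimal toy sketch of the sampling loop: start from Gaussian noise and take Euler steps along a velocity field until the sample reaches the data. It is not the package's code. The `velocity` function, the `sample` helper, and the 4-value "mel frame" are hypothetical, and the velocity is computed from a known target instead of being predicted by a diffusion transformer.

```python
import random


def velocity(x, t, target):
    """Toy stand-in for the learned velocity field (an assumption).

    On the straight-line path x_t = (1 - t) * noise + t * target, the
    velocity that carries x_t to the target is (target - x_t) / (1 - t).
    A real model would predict this from text and reference audio; here
    the target is known, purely for illustration.
    """
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]


def sample(target, steps=8, seed=0):
    """Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 with Euler steps."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]  # start from Gaussian noise
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt  # t stays below 1, so (1 - t) is never zero
        v = velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


target_frame = [0.2, -1.0, 0.5, 0.0]  # hypothetical 4-bin "mel frame"
print(sample(target_frame))  # ends very close to target_frame
```

All positions of the frame are updated together at every step, which is what makes this style of generation non-autoregressive: the number of steps does not grow with the length of the output.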
Training
The MLX version uses the original F5 TTS weights, adapted for the framework. The model was trained for non-autoregressive text-to-speech synthesis, so it generates the whole utterance in parallel instead of one step at a time.
Guide: Running Locally
- Installation: Install the F5 TTS MLX package using pip:
  pip install f5-tts-mlx
- Basic Usage: Generate speech from text using the command:
  python -m f5_tts_mlx.generate --text "The quick brown fox jumped over the lazy dog."
- Using Reference Audio: To use your own reference audio sample, ensure it is a mono, 24 kHz WAV file of 5-10 seconds:
  python -m f5_tts_mlx.generate \
    --text "The quick brown fox jumped over the lazy dog." \
    --ref-audio /path/to/audio.wav \
    --ref-text "This is the caption for the reference audio."
- Audio Conversion: Convert audio files to the required format using ffmpeg:
  ffmpeg -i /path/to/audio.wav -ac 1 -ar 24000 -sample_fmt s16 -t 10 /path/to/output_audio.wav
- Loading Pretrained Model: Load a pretrained model in Python:
  from f5_tts_mlx.generate import generate
  audio = generate(text="Hello world.", ...)
- Cloud GPUs: For more intensive tasks or faster processing, consider cloud services such as AWS, Google Cloud, or Azure. MLX primarily targets Apple silicon, so confirm that the instance type you choose supports it.
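The reference-audio requirements and command-line flags from the guide above can be combined into a small preflight helper. The sketch below uses only the Python standard library. The names `check_reference_audio` and `build_generate_command` are hypothetical and are not part of f5-tts-mlx. It also assumes that `--ref-audio` and `--ref-text` are always passed together, as in the example command.

```python
import sys
import wave


def check_reference_audio(path):
    """Return a list of problems; an empty list means the file looks usable.

    Checks the constraints stated in the guide: mono, 24 kHz, 5-10 seconds.
    """
    problems = []
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        rate = wav.getframerate()
        duration = wav.getnframes() / rate
    if channels != 1:
        problems.append(f"expected mono, found {channels} channels")
    if rate != 24000:
        problems.append(f"expected 24000 Hz, found {rate} Hz")
    if not 5.0 <= duration <= 10.0:
        problems.append(f"expected 5-10 s, found {duration:.2f} s")
    return problems


def build_generate_command(text, ref_audio=None, ref_text=None):
    """Build an argv list for the generate module using only the flags shown above."""
    cmd = [sys.executable, "-m", "f5_tts_mlx.generate", "--text", text]
    if ref_audio is not None:
        if not ref_text:
            # Assumption: reference audio needs a matching transcript.
            raise ValueError("ref_text is required when ref_audio is given")
        cmd += ["--ref-audio", ref_audio, "--ref-text", ref_text]
    return cmd
```

A typical use would be to call `check_reference_audio` first, run the ffmpeg conversion if it reports problems, and then pass the list from `build_generate_command` to `subprocess.run(cmd, check=True)`. Building the command as a list avoids shell quoting issues with text that contains quotes or spaces.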
License
This project is licensed under the MIT License.