F5-TTS Stabilized LJSpeech
Introduction
The F5-TTS model is a fine-tuned text-to-speech system based on the LJSpeech dataset. It focuses on enhancing stability to prevent choppiness, mispronunciations, repetitions, and skipped words. The model utilizes phoneme conversion for text input during training and a duration predictor during inference.
Architecture
- Base Model: SWivid/F5-TTS
- Phoneme Alignment: Text input is converted into phonemes for training (see the sketch after this list).
- Duration Predictor: Utilized during inference for timing predictions.
- Source Code: https://github.com/SWivid/F5-TTS
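The card does not name the grapheme-to-phoneme frontend used for this conversion. As an illustration only, here is a minimal sketch of turning input text into phonemes with the phonemizer package and its espeak backend; both are assumptions, not confirmed parts of this model's pipeline.

# Illustrative only: phonemizer + espeak are assumed stand-ins for the
# unspecified grapheme-to-phoneme frontend used during training.
from phonemizer import phonemize

text = "The quick brown fox jumps over the lazy dog."

# Convert raw text into an IPA phoneme string
# (requires espeak-ng to be installed on the system).
phonemes = phonemize(text, language="en-us", backend="espeak", strip=True)
print(phonemes)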
Training
- Total Training Steps: 130,000
- Configuration:
- Learning Rate: 1e-05
- Batch Size per GPU: 2000 (measured in frames)
- Max Samples: 64
- Gradient Accumulation Steps: 1
- Max Gradient Norm: 1
- Epochs: 144
- Warmup Updates: 5838
- Checkpoints Saved Every: 11676 updates
- Mixed Precision: FP16
- Logger: W&B (Weights & Biases)
- Optimizer: bnb_optimizer
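For reference, the settings above can be collected into a single training configuration. A minimal sketch in Python follows; the key names are hypothetical and do not necessarily match the actual F5-TTS config schema, and bnb_optimizer is assumed to refer to a bitsandbytes 8-bit optimizer.

# Hypothetical summary of the training configuration listed above;
# key names are illustrative, not the real F5-TTS config keys.
train_config = {
    "learning_rate": 1e-5,
    "batch_size_per_gpu": 2000,    # measured in frames
    "batch_size_type": "frame",
    "max_samples": 64,
    "grad_accumulation_steps": 1,
    "max_grad_norm": 1,
    "epochs": 144,
    "num_warmup_updates": 5838,
    "save_per_updates": 11676,     # checkpoint interval
    "total_updates": 130_000,      # total training steps
    "mixed_precision": "fp16",
    "logger": "wandb",             # Weights & Biases
    "optimizer": "bnb_optimizer",  # assumed: bitsandbytes 8-bit optimizer
}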
Guide: Running Locally
- Clone the Repository:
git clone https://github.com/SWivid/F5-TTS
- Install Dependencies: Install the required Python packages, typically listed in a requirements.txt file:
pip install -r requirements.txt
- Configure Environment: Set up any necessary environment variables or configuration files.
- Run the Model: Execute the model with the appropriate script, often something like:
python run.py
- Cloud GPUs: For optimal performance, consider using cloud services such as AWS EC2 GPU instances or Google Cloud Platform's Compute Engine with GPUs. A quick GPU sanity check is sketched below.
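Before launching training or inference on a GPU instance, it is worth verifying that PyTorch (installed with the project's dependencies) can actually see the GPU. A minimal sanity check, assuming PyTorch is available:

# Quick check that a CUDA-capable GPU is visible to PyTorch before running the model.
import torch

if torch.cuda.is_available():
    print("CUDA device:", torch.cuda.get_device_name(0))
else:
    print("No CUDA device found; the model will fall back to CPU and run slowly.")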
License
This model is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) license, which permits free use, modification, and distribution for non-commercial purposes, provided derivative works are shared under the same terms.