tortoise tts v2
jbetker
Introduction
Tortoise is a text-to-speech (TTS) program designed for strong multi-voice capabilities and realistic prosody and intonation. It supports generating speech with various voices, including completely random ones, and allows for extensive voice customization using reference clips.
Architecture
Tortoise TTS is inspired by OpenAI's DALL-E and combines an autoregressive decoder with a diffusion decoder to process speech data. The system consists of five separate models that work together. Detailed documentation of the architecture is available separately.
Training
The models were trained on approximately 50,000 hours of speech data over several months, using a homelab server with 8 RTX 3090 GPUs. Training used custom software. The author does not plan to release the training configurations or methodology, citing ethical concerns about misuse.
Guide: Running Locally
Installation
To run Tortoise TTS locally, ensure you have an NVIDIA GPU. Follow these steps:
- Install PyTorch by following the PyTorch installation guide.
- Clone the repository and install:
git clone https://github.com/neonbjb/tortoise-tts.git
cd tortoise-tts
python setup.py install
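Before installing, it can help to confirm that the NVIDIA GPU requirement above is met. The following is a minimal sketch, not part of Tortoise itself: the helper name nvidia_gpu_visible is hypothetical, and it assumes the NVIDIA driver utilities (nvidia-smi) are on the PATH when a GPU is present. It uses only the Python standard library, so it runs before PyTorch is installed.

```python
import shutil
import subprocess


def nvidia_gpu_visible() -> bool:
    """Return True if `nvidia-smi` is on PATH and lists at least one GPU.

    Hypothetical helper for illustration; not part of the Tortoise codebase.
    """
    exe = shutil.which("nvidia-smi")
    if exe is None:
        # Driver utilities are not installed or not on PATH.
        return False
    try:
        # `nvidia-smi -L` prints one line per detected device.
        result = subprocess.run(
            [exe, "-L"],
            capture_output=True,
            text=True,
            timeout=10,
            check=False,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0 and "GPU" in result.stdout


if __name__ == "__main__":
    if nvidia_gpu_visible():
        print("NVIDIA GPU detected")
    else:
        print("No NVIDIA GPU detected")
```

A negative result only means nvidia-smi could not report a device; once PyTorch is installed, its own CUDA check is the more direct test.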
Usage
- To speak a phrase with random voices:
python tortoise/do_tts.py --text "I'm going to speak this" --voice random --preset fast
- For reading large text files:
python tortoise/read.py --textfile <your text to be read> --voice random
- If no suitable local GPU is available, or when generating long outputs, consider using a cloud GPU for faster processing.
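The two commands above can also be driven from a Python script, for example to batch several phrases. The sketch below only assembles and runs the same command lines shown in this guide. The function names (build_do_tts_command, build_read_command, run_in_repo) are hypothetical, and it assumes the script is run against a local clone of the repository where tortoise/do_tts.py and tortoise/read.py exist.

```python
import shlex
import subprocess
import sys
from typing import List


def build_do_tts_command(text: str, voice: str = "random", preset: str = "fast") -> List[str]:
    """Argument list equivalent to the do_tts.py command shown in this guide."""
    return [
        sys.executable, "tortoise/do_tts.py",
        "--text", text,
        "--voice", voice,
        "--preset", preset,
    ]


def build_read_command(textfile: str, voice: str = "random") -> List[str]:
    """Argument list equivalent to the read.py command shown in this guide."""
    return [
        sys.executable, "tortoise/read.py",
        "--textfile", textfile,
        "--voice", voice,
    ]


def run_in_repo(command: List[str], repo_dir: str = "tortoise-tts") -> int:
    """Run a command from the cloned repository directory; return its exit code.

    `repo_dir` is an assumption: the folder created by `git clone`.
    """
    return subprocess.run(command, cwd=repo_dir, check=False).returncode


if __name__ == "__main__":
    # Print the commands instead of running them, so this is safe without a GPU.
    print(shlex.join(build_do_tts_command("I'm going to speak this")))
    print(shlex.join(build_read_command("book.txt")))
```

Passing the arguments as a list avoids shell quoting problems with text that contains apostrophes or quotation marks, such as the example phrase.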
License
Tortoise TTS is an open-source project. If you use it in research, please cite the repository; a BibTeX entry is available on the project's GitHub page. To support responsible use, a companion classifier, Tortoise-Detect, is provided to help identify audio generated by the Tortoise model.