Tortoise TTS v2

Introduction

Tortoise is a text-to-speech (TTS) program designed for strong multi-voice capabilities and realistic prosody and intonation. It supports generating speech with various voices, including completely random ones, and allows for extensive voice customization using reference clips.

Architecture

Tortoise TTS is inspired by OpenAI's DALL-E and combines an autoregressive decoder with a diffusion decoder to generate speech. The system consists of five separate models that work in tandem; a detailed design write-up is linked from the project repository.

Training

The models were trained on approximately 50,000 hours of speech data over several months, using a homelab server with 8 RTX 3090 GPUs. Training used custom software, and there are no plans to release the training configurations or methodology because of ethical concerns about misuse.

Guide: Running Locally

Installation

To run Tortoise TTS locally, ensure you have an NVIDIA GPU. Follow these steps:

Install PyTorch by following the PyTorch installation guide.

Clone the repository and install:

git clone https://github.com/neonbjb/tortoise-tts.git
cd tortoise-tts
python setup.py install
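
Before running the commands in the Usage section, it can help to confirm that a CUDA-capable GPU is actually visible. The sketch below is a best-effort check, not part of Tortoise: the helper name has_nvidia_gpu is hypothetical, and the nvidia-smi fallback only shows that the driver tools are on the PATH, not that PyTorch can use the device.

```python
import importlib.util
import shutil


def has_nvidia_gpu() -> bool:
    """Best-effort check that an NVIDIA GPU is usable (hypothetical helper)."""
    # Preferred check: ask PyTorch directly, if it is installed.
    if importlib.util.find_spec("torch") is not None:
        import torch
        return bool(torch.cuda.is_available())
    # Fallback: the NVIDIA driver normally ships the nvidia-smi tool.
    return shutil.which("nvidia-smi") is not None


if __name__ == "__main__":
    print("NVIDIA GPU available:", has_nvidia_gpu())
```

If this prints False after PyTorch is installed, the PyTorch build may be CPU-only or the driver may be missing.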

Usage

To speak a phrase with random voices:

python tortoise/do_tts.py --text "I'm going to speak this" --voice random --preset fast
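
The same command can be driven from a Python script. The following is a minimal sketch that only wraps the command line shown above; the function names build_tts_command and speak are hypothetical, and it assumes the script is run from the repository root so that tortoise/do_tts.py resolves.

```python
import subprocess
import sys


def build_tts_command(text, voice="random", preset="fast",
                      script="tortoise/do_tts.py"):
    """Return the argument list for the do_tts.py invocation shown above."""
    # Only the flags that appear in the command above are used here.
    return [sys.executable, script,
            "--text", text, "--voice", voice, "--preset", preset]


def speak(text, **kwargs):
    """Run do_tts.py as a subprocess; raises CalledProcessError on failure."""
    subprocess.run(build_tts_command(text, **kwargs), check=True)


if __name__ == "__main__":
    # Print the command instead of running it, so this works without a GPU.
    print(build_tts_command("I'm going to speak this"))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with text that contains apostrophes or quotation marks.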

For reading large text files:

python tortoise/read.py --textfile <your text to be read> --voice random
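
One common way to handle long inputs is to break the text into sentence-sized pieces before synthesis. The sketch below illustrates that idea only; it is not Tortoise's own splitting logic, and the function name split_into_chunks and the 200-character default are assumptions chosen for illustration.

```python
import re


def split_into_chunks(text, max_chars=200):
    """Group whole sentences into chunks of at most max_chars where possible."""
    # Split after '.', '!' or '?' when followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow, so close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    # Note: a single sentence longer than max_chars is kept whole.
    return chunks


if __name__ == "__main__":
    for chunk in split_into_chunks("One. Two. Three.", max_chars=9):
        print(chunk)
```

Each chunk could then be written to its own file or passed to the synthesis command separately.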

Consider using a cloud GPU for faster generation, especially when reading long texts.

License

Tortoise TTS is an open-source project. If you use it in research, please cite the repository; a BibTeX entry is available on the project's GitHub page. A companion classifier, Tortoise-Detect, supports responsible use by identifying audio generated by the Tortoise model.
