Introduction

StyleTTS2 is a text-to-speech (TTS) model, available here on Hugging Face as an ONNX conversion of the original PyTorch model yl4579/StyleTTS2-LibriTTS. The conversion is meant for TTS inference and is intended for CPU usage.

Architecture

The model is a direct conversion from PyTorch to ONNX. It is split into four parts so that each part can be loaded lazily, and it powers a WebUI for text-to-speech tasks. The ONNX model has not been performance optimized and is primarily intended for CPU usage. It is related to, but distinct from, a variant called Kokoro, which uses a different decoder architecture.
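The lazy-loading idea can be illustrated with a minimal sketch. It assumes ONNX Runtime (the onnxruntime package) as the runtime, and the names LazyParts, load_cpu_session, and the example file name are hypothetical; the actual part file names and the WebUI's loading code are not described here.

```python
from pathlib import Path
from typing import Any, Callable, Dict, List


def load_cpu_session(path: Path) -> Any:
    """Open one ONNX file on the CPU with ONNX Runtime (assumed runtime)."""
    # Imported here so the package is only required when a part is opened.
    import onnxruntime as ort
    return ort.InferenceSession(str(path), providers=["CPUExecutionProvider"])


class LazyParts:
    """Open each model part only the first time it is requested."""

    def __init__(self, model_dir: str,
                 loader: Callable[[Path], Any] = load_cpu_session) -> None:
        self.model_dir = Path(model_dir)
        self.loader = loader
        self._cache: Dict[str, Any] = {}

    def get(self, filename: str) -> Any:
        # Load on first access, then reuse the cached session.
        if filename not in self._cache:
            self._cache[filename] = self.loader(self.model_dir / filename)
        return self._cache[filename]

    def loaded(self) -> List[str]:
        """Names of the parts that have been opened so far."""
        return sorted(self._cache)
```

With this pattern, a part that is never requested is never read from disk, which keeps startup time and memory use lower than opening all four parts up front.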

Training

The original model was trained on a subset of LibriTTS by the authors of StyleTTS 2. The current ONNX model inherits the weights and architecture from the original PyTorch model without modification.

Guide: Running Locally

  1. Get the Model: Clone the repository or download the ONNX model files from Hugging Face.
  2. Install Dependencies: Install an ONNX runtime and any other libraries your inference code needs.
  3. Inference: Run the model on CPU. For GPU usage, the original PyTorch version is recommended because it performs better.
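After downloading, a reasonable first check is to list the ONNX files and inspect what inputs each one expects, since the input names and shapes are not given here. The sketch below assumes ONNX Runtime is installed (pip install onnxruntime); the helper names find_onnx_files and describe_inputs are illustrative and not part of the model repository.

```python
from pathlib import Path
from typing import List, Tuple


def find_onnx_files(model_dir: str) -> List[Path]:
    """Return every .onnx file under model_dir, sorted by path."""
    return sorted(Path(model_dir).rglob("*.onnx"))


def describe_inputs(onnx_path: Path) -> List[Tuple[str, list, str]]:
    """Open one file on the CPU and list (name, shape, type) for each input."""
    # Assumption: ONNX Runtime is the runtime in use.
    import onnxruntime as ort
    session = ort.InferenceSession(
        str(onnx_path), providers=["CPUExecutionProvider"]
    )
    return [(arg.name, arg.shape, arg.type) for arg in session.get_inputs()]
```

Calling describe_inputs on each path returned by find_onnx_files shows the tensors that must be supplied to session.run for that part.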

Cloud GPUs

This ONNX model is intended for CPU usage. For GPU deployment, use the original PyTorch model instead; cloud services such as AWS, Google Cloud, or Azure offer GPU instances that can run it.

License

The ONNX model follows the MIT license, inheriting it from the original StyleTTS2 model. This allows for flexible use and distribution, subject to the terms of the license.
