Tsukasa_Speech

Respair

Introduction

Tsukasa Speech is a Japanese text-to-speech (TTS) model designed to improve the naturalness and expressiveness of generated speech. Developed as a personal project, it builds on the StyleTTS2 architecture, with significant modifications for better performance and control.

Architecture

Tsukasa Speech employs the StyleTTS2 framework with several enhancements:

  • Incorporates mLSTM layers instead of standard PyTorch LSTM layers.
  • Utilizes a retrained PL-Bert, Pitch Extractor, and Text Aligner.
  • Replaces WavLM with Whisper's encoder as the speech language model (SLM).
  • Uses a 48 kHz configuration for improved audio quality.
  • Enhances non-verbal sound generation, including sighs and pauses.
  • Introduces a novel method for sampling style vectors and supports promptable speech synthesis.
  • Implements a Smart Phonemization algorithm capable of handling Romaji or mixed inputs.
  • Fixes issues with distributed data parallel (DDP) and BF16 training.

Training

The model was trained on approximately 800 hours of high-quality studio data sourced from games and novels, with a focus on "anime Japanese." Training is organized in stages:

  • First stage: Basic model training.
  • Second stage: Advanced feature training.
  • Third stage: Kotodama and prompt encoding (not currently planned).

Training used 8x A40s and 2x V100s (32 GB each) and ran for about three weeks, with a total carbon footprint of approximately 66.6 kg CO2eq.

Guide: Running Locally

Prerequisites

  1. Python version 3.11 or higher.
  2. Clone the repository:
    git clone https://huggingface.co/Respair/Tsukasa_Speech
    cd Tsukasa_Speech
    
  3. Install dependencies:
    pip install -r requirements.txt
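
The Python requirement in step 1 can be checked up front with a few lines of standard-library code. This is a minimal sketch rather than part of the repository; `python_is_supported` and `MIN_PYTHON` are hypothetical names introduced here for illustration.

```python
import sys

MIN_PYTHON = (3, 11)  # minimum version stated in the prerequisites


def python_is_supported(version=None, minimum=MIN_PYTHON) -> bool:
    """Return True when (major, minor) of `version` meets the minimum.

    `version` defaults to the running interpreter's sys.version_info.
    """
    if version is None:
        version = sys.version_info
    return tuple(version[:2]) >= minimum


found = ".".join(str(part) for part in sys.version_info[:3])
if python_is_supported():
    print(f"Python {found} meets the 3.11+ requirement")
else:
    print(f"Python {found} is too old; 3.11 or higher is required")
```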
    

Inference

Run the Gradio demo:

    python app_tsuka.py

Alternatively, use the provided inference notebook.

Suggested Cloud GPUs

For best performance, consider a cloud service that offers GPUs such as the NVIDIA A40 or V100, the same models used for training.
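
Before launching the demo on a rented machine, it can help to confirm that PyTorch actually sees the GPU. The sketch below assumes PyTorch is among the installed dependencies (the model is PyTorch-based, but the contents of requirements.txt are not shown here); `describe_device` is a hypothetical helper, not part of the repository.

```python
def describe_device() -> str:
    """Return a one-line summary of the first CUDA device PyTorch can see."""
    try:
        import torch  # assumed to be installed via requirements.txt
    except ImportError:
        return "PyTorch is not installed; install the dependencies first"

    if not torch.cuda.is_available():
        return "No CUDA GPU is visible to PyTorch"

    # Properties of device 0: marketing name and total memory in bytes.
    props = torch.cuda.get_device_properties(0)
    mem_gib = props.total_memory / 1024**3
    return f"{props.name} with {mem_gib:.0f} GiB of memory"


print(describe_device())
```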

License

The project is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License (cc-by-nc-4.0).
