Tsukasa Speech
Introduction
Tsukasa Speech is a Japanese text-to-speech (TTS) model designed to improve the naturalness and expressiveness of generated speech. Developed by Respair as a personal project, it builds on the StyleTTS2 architecture with significant modifications for better performance and controllability.
Architecture
Tsukasa Speech employs the StyleTTS2 framework with several enhancements:
- Incorporates mLSTM layers instead of standard PyTorch LSTM layers.
- Utilizes a retrained PL-BERT, Pitch Extractor, and Text Aligner.
- Replaces WavLM with Whisper's encoder for the SLM (speech language model) component.
- Uses a 48 kHz configuration for improved audio quality.
- Enhances non-verbal sound generation, including sighs and pauses.
- Introduces a novel method for sampling style vectors, enabling promptable speech synthesis.
- Implements a Smart Phonemization algorithm capable of handling Romaji or mixed inputs (see the sketch after this list).
- Addresses distributed data parallel (DDP) and BF16 training issues.
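The Smart Phonemization step could look roughly like the following: a minimal sketch (not the model's actual implementation) that transliterates Romaji runs into kana before handing the text to a conventional Japanese G2P front end. The regex heuristic and the romkan and pyopenjtalk packages are assumptions made for illustration.
# Minimal sketch of a Romaji-aware phonemizer (illustrative only; not the
# algorithm shipped with Tsukasa Speech). Assumes the third-party packages
# romkan (Romaji-to-kana) and pyopenjtalk (Japanese G2P) are installed.
import re

import pyopenjtalk
import romkan

ASCII_RUN = re.compile(r"[A-Za-z']+")

def smart_phonemize(text: str) -> str:
    """Convert mixed Romaji/kana/kanji input into a phoneme string."""
    # Transliterate ASCII (Romaji) runs into hiragana, leaving native
    # Japanese segments untouched.
    normalized = ASCII_RUN.sub(lambda m: romkan.to_hiragana(m.group(0)), text)
    # Hand the now fully-Japanese string to a standard G2P front end.
    return pyopenjtalk.g2p(normalized, kana=False)

print(smart_phonemize("ohayou、元気ですか？"))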
Training
The model was trained on approximately 800 hours of high-quality studio data sourced from games and novels, focusing on "anime Japanese." It underwent several stages of training:
- First stage: Basic model training.
- Second stage: Advanced feature training.
- Third stage: Kotodama and prompt encoding (not currently planned).
Training utilized 8x A40s and 2x V100s (32GB each), running for about three weeks with a total carbon footprint of approximately 66.6 kg CO2eq.
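For reference, such figures are typically estimated as energy consumed (GPU power draw × runtime × utilization) multiplied by the grid's carbon intensity. The sketch below only illustrates that formula; every numeric value in it is an assumption, not the measurement behind the reported 66.6 kg figure.
# Back-of-the-envelope CO2eq estimate (all values are illustrative assumptions;
# the ~66.6 kg figure above is the authors' reported number, not this output).
A40_KW = 0.30          # NVIDIA A40 board power, roughly 300 W
V100_KW = 0.25         # NVIDIA V100 (32GB) board power, roughly 250 W
HOURS = 3 * 7 * 24     # about three weeks of wall-clock training
UTILIZATION = 0.5      # assumed average GPU utilization
INTENSITY = 0.1        # assumed grid carbon intensity, kg CO2eq per kWh

energy_kwh = (8 * A40_KW + 2 * V100_KW) * HOURS * UTILIZATION
print(f"~{energy_kwh:.0f} kWh, ~{energy_kwh * INTENSITY:.1f} kg CO2eq")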
Guide: Running Locally
Prerequisites
- Python version 3.11 or higher.
- Clone the repository:
git clone https://huggingface.co/Respair/Tsukasa_Speech
cd Tsukasa_Speech
- Install dependencies:
pip install -r requirements.txt
Inference
Run the Gradio demo:
python app_tsuka.py
Alternatively, use the provided inference notebook.
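If you want to drive the running demo from code rather than the browser, the Gradio Python client can be pointed at the local server. This is only a sketch: the endpoint name and argument list are placeholders and must be adapted to whatever app_tsuka.py actually exposes (see the demo's "Use via API" panel).
# Sketch of querying the locally running Gradio demo with gradio_client.
# The api_name and the argument list are placeholders; check the demo's
# "Use via API" panel or app_tsuka.py for the real endpoint signature.
from gradio_client import Client

client = Client("http://127.0.0.1:7860")    # default local Gradio address
result = client.predict(
    "こんにちは、今日はいい天気ですね。",    # text to synthesize (placeholder argument)
    api_name="/predict",                     # placeholder endpoint name
)
print(result)  # typically a path to the generated audio file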
Suggested Cloud GPUs
For optimal performance, consider using cloud services that provide access to GPUs like NVIDIA A40 or V100.
License
The project is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License (cc-by-nc-4.0).