tts-tacotron2-kss-ko

tensorspeech

Introduction

This repository provides a pretrained Tacotron2 model with Guided Attention, trained on the KSS (Korean Single Speaker) dataset. It is part of the TensorFlowTTS library for text-to-speech synthesis.

Architecture

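Tacotron2 is a sequence-to-sequence neural network that synthesizes speech by predicting mel spectrograms from text; a separate vocoder is needed to turn those spectrograms into a waveform. This checkpoint was trained with Guided Attention, which encourages the attention between text positions and audio frames to stay close to a monotonic, roughly diagonal alignment and so helps produce more natural speech.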
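To make the idea of Guided Attention concrete, here is a minimal sketch of one common formulation of the guided attention penalty: attention weights far from the diagonal are multiplied by a larger weight, so off-diagonal alignments cost more. The function names and the width parameter `g=0.2` are illustrative assumptions, not values taken from this model's training configuration, and TensorFlowTTS's own implementation may differ in detail.

```python
import math


def guided_attention_weights(text_len, mel_len, g=0.2):
    """Penalty matrix W[n][t] = 1 - exp(-((n/N - t/T)^2) / (2 * g^2)).

    W is 0 on the diagonal and approaches 1 far away from it.
    g (assumed value) controls how wide the tolerated band is.
    """
    weights = []
    for n in range(text_len):
        row = []
        for t in range(mel_len):
            d = n / text_len - t / mel_len
            row.append(1.0 - math.exp(-(d * d) / (2.0 * g * g)))
        weights.append(row)
    return weights


def guided_attention_loss(attention, g=0.2):
    """Mean of attention[n][t] * W[n][t] over a text_len x mel_len matrix."""
    text_len = len(attention)
    mel_len = len(attention[0])
    w = guided_attention_weights(text_len, mel_len, g)
    total = sum(
        attention[n][t] * w[n][t]
        for n in range(text_len)
        for t in range(mel_len)
    )
    return total / (text_len * mel_len)


# Toy alignments: a perfectly diagonal one and a reversed one.
diagonal = [[1.0 if n == t else 0.0 for t in range(4)] for n in range(4)]
reversed_alignment = [[1.0 if t == 3 - n else 0.0 for t in range(4)] for n in range(4)]

print(guided_attention_loss(diagonal))            # no penalty on the diagonal
print(guided_attention_loss(reversed_alignment))  # positive penalty
```

During training, a term of this kind is added to the usual spectrogram losses; it is not used at inference time.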

Training

The model was trained on the KSS dataset, a single-speaker Korean speech corpus. Training details are described in the Tacotron2 paper and the Guided Attention paper.

Guide: Running Locally

  1. Install TensorFlowTTS:
    Execute the following command to install the required library:

    pip install TensorFlowTTS
    
  2. Converting Text to Mel Spectrogram:

    • Import necessary libraries and modules:
      import numpy as np
      import soundfile as sf
      import yaml
      import tensorflow as tf
      from tensorflow_tts.inference import AutoProcessor, TFAutoModel
      
    • Load the pretrained models:
      processor = AutoProcessor.from_pretrained("tensorspeech/tts-tacotron2-kss-ko")
      tacotron2 = TFAutoModel.from_pretrained("tensorspeech/tts-tacotron2-kss-ko")
      
    • Convert text to mel spectrogram:
      text = "신은 우리의 수학 문제에는 관심이 없다. 신은 다만 경험적으로 통합할 뿐이다."
      input_ids = processor.text_to_sequence(text)
      decoder_output, mel_outputs, stop_token_prediction, alignment_history = tacotron2.inference(
          input_ids=tf.expand_dims(tf.convert_to_tensor(input_ids, dtype=tf.int32), 0),
          input_lengths=tf.convert_to_tensor([len(input_ids)], tf.int32),
          speaker_ids=tf.convert_to_tensor([0], dtype=tf.int32),
      )
      
  3. Cloud GPUs:
    For efficient processing, consider using cloud-based GPUs such as those offered by Google Cloud, AWS, or Azure.
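Step 2 stops at a mel spectrogram, which is not yet audible. The sketch below shows one way the `mel_outputs` tensor from step 2 might be turned into a WAV file with a neural vocoder. The helper name `mel_to_wav`, the vocoder checkpoint `tensorspeech/tts-mb_melgan-kss-ko`, and the 22050 Hz sample rate are assumptions based on the usual TensorFlowTTS examples, so check them against the library documentation before relying on them.

```python
def mel_to_wav(mel_outputs, out_path="output.wav", sample_rate=22050):
    """Hypothetical helper: vocode a mel spectrogram and write it to disk.

    mel_outputs is assumed to be the batched tensor returned by
    tacotron2.inference(...) in step 2.
    """
    # Heavy dependencies are imported only when the helper is called.
    import soundfile as sf
    from tensorflow_tts.inference import TFAutoModel

    # Assumed vocoder checkpoint trained on the same KSS dataset.
    vocoder = TFAutoModel.from_pretrained("tensorspeech/tts-mb_melgan-kss-ko")

    # Take the first (only) item in the batch and drop the channel axis.
    audio = vocoder.inference(mel_outputs)[0, :, 0]

    # Write 16-bit PCM audio at the assumed sample rate.
    sf.write(out_path, audio.numpy(), sample_rate, "PCM_16")
    return out_path
```

With the variables from step 2 in scope, a call such as `mel_to_wav(mel_outputs)` would write `output.wav` to the current directory.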

License

This project is licensed under the Apache-2.0 License, which allows for both personal and commercial use.
