tensorspeech

TTS-FastSpeech2-KSS-KO

Introduction

TTS-FastSpeech2-KSS-KO is a pretrained FastSpeech2 model for Korean text-to-speech, trained on the KSS dataset. It is loaded and run through the TensorFlowTTS library.

Architecture

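The model is based on FastSpeech2, a non-autoregressive architecture that converts text into a mel spectrogram. Instead of learning an attention alignment between text and audio frames, it predicts a duration for each input token and expands the encoder output to match, so all mel frames are generated in parallel. This is the source of its speed advantage over autoregressive models.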
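FastSpeech2 aligns text with audio frames through a length regulator rather than an attention alignment. The following is a minimal sketch of that expansion step in plain Python, not the TensorFlowTTS implementation; the function name `length_regulate` and the toy vectors are assumptions made for illustration.

```python
def length_regulate(encoder_outputs, durations):
    """Expand token-level vectors to frame level by repeating each one.

    encoder_outputs: list of per-token hidden vectors (lists of floats).
    durations: list of non-negative ints, frames assigned to each token.
    """
    if len(encoder_outputs) != len(durations):
        raise ValueError("need exactly one duration per input token")
    frames = []
    for vector, n_frames in zip(encoder_outputs, durations):
        # Copy the vector so frames do not share one mutable list.
        frames.extend(list(vector) for _ in range(n_frames))
    return frames


# Three toy token vectors; the second token is assigned zero frames.
hidden = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
frames = length_regulate(hidden, [2, 0, 3])
print(len(frames))  # 5 frames: 2 + 0 + 3
```

Because the output length is fixed by the sum of the durations, every frame can be decoded at once instead of one step at a time.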

Training

The model is trained on KSS (Korean Single Speaker Speech), a single-speaker Korean speech dataset. It converts text inputs into mel spectrograms; a separate vocoder is needed to turn those spectrograms into audio waveforms.
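A mel spectrogram is a sequence of frames, and each frame corresponds to a fixed number of audio samples once a vocoder has rendered it. As a rough sketch, the audio length can be estimated from the frame count. The hop size of 256 samples and the 22050 Hz sample rate used as defaults here are assumptions about the preprocessing configuration, not values stated in this card, and `mel_frames_to_seconds` is a hypothetical helper.

```python
def mel_frames_to_seconds(n_frames, hop_size=256, sample_rate=22050):
    """Estimate audio duration in seconds from a mel frame count.

    hop_size and sample_rate are assumed defaults; replace them with the
    values from the preprocessing config the model was trained with.
    """
    if n_frames < 0:
        raise ValueError("n_frames must be non-negative")
    return n_frames * hop_size / sample_rate


# Example: a spectrogram with 500 frames.
seconds = mel_frames_to_seconds(500)
print(f"{seconds:.2f} s")
```

With a spectrogram tensor shaped [batch, frames, mel bins], the frame count would be the size of the second axis.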

Guide: Running Locally

  1. Install TensorFlowTTS with pip:

    pip install TensorFlowTTS
    
  2. Convert Text to Mel Spectrogram:

    import tensorflow as tf
    from tensorflow_tts.inference import AutoProcessor, TFAutoModel
    
    # Load the text processor and the pretrained FastSpeech2 weights.
    processor = AutoProcessor.from_pretrained("tensorspeech/tts-fastspeech2-kss-ko")
    fastspeech2 = TFAutoModel.from_pretrained("tensorspeech/tts-fastspeech2-kss-ko")
    
    # Convert the Korean input text into a sequence of symbol IDs.
    text = "신은 우리의 수학 문제에는 관심이 없다. 신은 다만 경험적으로 통합할 뿐이다."
    input_ids = processor.text_to_sequence(text)
    
    # Run inference on a batch of one. A ratio of 1.0 leaves the predicted
    # duration, pitch (F0), and energy unchanged.
    mel_before, mel_after, duration_outputs, _, _ = fastspeech2.inference(
        input_ids=tf.expand_dims(tf.convert_to_tensor(input_ids, dtype=tf.int32), 0),
        speaker_ids=tf.convert_to_tensor([0], dtype=tf.int32),
        speed_ratios=tf.convert_to_tensor([1.0], dtype=tf.float32),
        f0_ratios=tf.convert_to_tensor([1.0], dtype=tf.float32),
        energy_ratios=tf.convert_to_tensor([1.0], dtype=tf.float32),
    )
    
  3. Hardware Recommendations: For optimal performance, especially when handling large datasets or deploying in production, consider using cloud GPUs such as those provided by AWS, Google Cloud, or Azure.
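The `speed_ratios` argument passed to `inference` in step 2 controls speaking rate by scaling the predicted token durations. The sketch below illustrates one plausible way such scaling works, assuming the duration predictor outputs log(duration + 1) and that the ratio is multiplied in before rounding. This is an assumption about the library's behavior, so check it against your installed TensorFlowTTS version; `apply_speed_ratio` is a hypothetical name.

```python
import math


def apply_speed_ratio(log_durations, speed_ratio=1.0):
    """Convert log-scale duration predictions into integer frame counts.

    Assumes each prediction is log(duration + 1). Negative durations are
    clamped to zero, then scaled by speed_ratio and rounded.
    """
    frame_counts = []
    for log_d in log_durations:
        duration = max(math.exp(log_d) - 1.0, 0.0)
        frame_counts.append(int(round(duration * speed_ratio)))
    return frame_counts


# Two tokens whose predicted durations are 2 and 4 frames.
predictions = [math.log(3.0), math.log(5.0)]
normal = apply_speed_ratio(predictions, 1.0)
stretched = apply_speed_ratio(predictions, 2.0)
print(normal, stretched)
```

Under this assumption a ratio above 1.0 lengthens every token and slows the speech, while a ratio below 1.0 shortens it.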

License

The project is licensed under the Apache-2.0 License, allowing for both personal and commercial use, modification, and distribution of the model and its codebase.
