tts-fastspeech2-baker-ch

tensorspeech

Introduction

This is a pretrained FastSpeech2 text-to-speech model for Chinese, trained on the Baker dataset and distributed as part of the TensorFlowTTS library. It converts text into mel spectrograms; a separate vocoder is needed to turn those spectrograms into audio.

Architecture

FastSpeech2 generates mel spectrograms from text, which a vocoder can then convert into audio. It is a non-autoregressive model: it predicts all mel frames in parallel rather than one frame at a time, which is what makes it fast. Predicted duration, pitch (F0), and energy values condition the output, and the inference call exposes matching speed_ratios, f0_ratios, and energy_ratios arguments. The model is implemented in TensorFlow.
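
The duration-based expansion step in FastSpeech-style models (often called a length regulator) can be illustrated with a small sketch. This is a simplified, pure-Python illustration and not the TensorFlowTTS implementation: the function name `length_regulate` and the `duration_scale` parameter are hypothetical, and real models operate on batched tensors rather than Python lists.

```python
from typing import List, Sequence


def length_regulate(
    phoneme_vectors: Sequence[Sequence[float]],
    durations: Sequence[int],
    duration_scale: float = 1.0,
) -> List[List[float]]:
    """Repeat each phoneme vector once per predicted output frame.

    Hypothetical helper for illustration only.
    """
    if len(phoneme_vectors) != len(durations):
        raise ValueError("need exactly one duration per phoneme vector")
    frames: List[List[float]] = []
    for vector, duration in zip(phoneme_vectors, durations):
        # Scale the duration, round to whole frames, and never go negative.
        n_frames = max(0, round(duration * duration_scale))
        frames.extend(list(vector) for _ in range(n_frames))
    return frames


# Three toy 2-dimensional "phoneme" vectors lasting 2, 1, and 3 frames.
hidden = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
frames = length_regulate(hidden, durations=[2, 1, 3])
print(len(frames))  # one entry per output mel frame
```

Because every frame count is known before decoding starts, the decoder can produce all frames at once, which is where the speed advantage over autoregressive models comes from.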

Training

The model was trained on the Baker dataset, a Mandarin Chinese speech corpus commonly used for speech synthesis, so it is intended for Mandarin text-to-speech applications.

Guide: Running Locally

  1. Install TensorFlowTTS:
    Execute the following command to install the required library:

    pip install TensorFlowTTS
    
  2. Convert Text to Mel Spectrogram:
    Use the following Python script to convert text to a mel spectrogram:

    import tensorflow as tf
    from tensorflow_tts.inference import AutoProcessor, TFAutoModel

    # Load the text processor and the pretrained model from the same repository.
    processor = AutoProcessor.from_pretrained("tensorspeech/tts-fastspeech2-baker-ch")
    fastspeech2 = TFAutoModel.from_pretrained("tensorspeech/tts-fastspeech2-baker-ch")

    # Convert the input text to a sequence of symbol IDs.
    text = "这是一个开源的端到端中文语音合成系统"
    input_ids = processor.text_to_sequence(text, inference=True)

    # Run inference on a batch of one. A ratio of 1.0 leaves the predicted
    # duration, pitch (F0), and energy unchanged.
    mel_before, mel_after, duration_outputs, _, _ = fastspeech2.inference(
        input_ids=tf.expand_dims(tf.convert_to_tensor(input_ids, dtype=tf.int32), 0),
        speaker_ids=tf.convert_to_tensor([0], dtype=tf.int32),
        speed_ratios=tf.convert_to_tensor([1.0], dtype=tf.float32),
        f0_ratios=tf.convert_to_tensor([1.0], dtype=tf.float32),
        energy_ratios=tf.convert_to_tensor([1.0], dtype=tf.float32),
    )
    
  3. Cloud GPU Recommendation:
    For faster inference, consider running the model on a GPU, for example through a cloud provider such as AWS, Google Cloud, or Azure.
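
The script in step 2 stops at the mel spectrogram, so a vocoder is still needed to obtain a waveform. Once a waveform is available as floats in the range [-1.0, 1.0], it could be written to disk roughly as follows. This is a minimal sketch using only the Python standard library: the helper name `save_wav` is hypothetical, the 24000 Hz sample rate is an assumption that should be checked against the model's configuration, and a generated sine tone stands in for real vocoder output.

```python
import math
import os
import struct
import tempfile
import wave
from typing import Sequence


def save_wav(path: str, samples: Sequence[float], sample_rate: int = 24000) -> None:
    """Write mono float samples in [-1.0, 1.0] as 16-bit PCM (hypothetical helper)."""
    pcm = bytearray()
    for sample in samples:
        # Clip out-of-range values, then scale to the signed 16-bit range.
        clipped = max(-1.0, min(1.0, sample))
        pcm += struct.pack("<h", int(clipped * 32767))
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes = 16 bits per sample
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(bytes(pcm))


# Stand-in for vocoder output: one second of a 440 Hz tone at the assumed rate.
tone = [0.5 * math.sin(2 * math.pi * 440 * n / 24000) for n in range(24000)]
out_path = os.path.join(tempfile.gettempdir(), "fastspeech2_demo.wav")
save_wav(out_path, tone)
```

If the soundfile package is installed, its write function is a common alternative for the same task.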

License

The model and its associated code are licensed under the Apache 2.0 License, which permits use, modification, and redistribution provided that the license text and copyright notices are retained.
