microsoft/speecht5_tts

Introduction

SpeechT5 TTS is a SpeechT5 model fine-tuned for speech synthesis (text-to-speech) on the LibriTTS dataset. It is part of the SpeechT5 framework, which performs unified-modal encoder-decoder pre-training for spoken language processing.

Architecture

The SpeechT5 framework consists of a shared encoder-decoder network and six modal-specific pre/post-nets for speech/text. It processes input speech/text through pre-nets, uses the encoder-decoder for sequence-to-sequence transformation, and generates output using post-nets. The framework leverages large-scale unlabeled speech and text data to learn unified-modal representations. It employs a cross-modal vector quantization approach to align textual and speech information.
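
In the Hugging Face transformers implementation, these components appear as submodules of the loaded checkpoint. The sketch below simply lists the top-level children of the TTS model; the exact module names depend on the installed transformers version and are shown only for illustration:

    from transformers import SpeechT5ForTextToSpeech

    # Load the TTS checkpoint and print its top-level submodules; the shared
    # encoder-decoder and the speech post-net appear as separate children.
    model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
    for name, module in model.named_children():
        print(f"{name}: {module.__class__.__name__}")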

Training

The underlying SpeechT5 model is pre-trained on large-scale unlabeled speech and text data to learn a unified-modal representation that benefits both speech and text modeling; this TTS checkpoint is then fine-tuned on the LibriTTS dataset for speech synthesis. Specific fine-tuning hyperparameters and precision details are not provided.

Guide: Running Locally

  1. Install Dependencies:
    Install the necessary libraries (a working PyTorch installation is assumed):

    pip install --upgrade pip
    pip install --upgrade transformers sentencepiece datasets[audio]
    
  2. Run Inference via TTS Pipeline:
    Use the TTS pipeline to synthesize speech. SpeechT5 requires a speaker embedding (an x-vector) that defines the voice characteristics:

    from transformers import pipeline
    from datasets import load_dataset
    import soundfile as sf
    import torch

    synthesiser = pipeline("text-to-speech", "microsoft/speecht5_tts")
    # load an example x-vector speaker embedding to define the voice
    embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
    speaker_embedding = torch.tensor(embeddings_dataset[7306]["xvector"]).unsqueeze(0)
    speech = synthesiser("Hello, my dog is cooler than you!", forward_params={"speaker_embeddings": speaker_embedding})
    sf.write("speech.wav", speech["audio"], samplerate=speech["sampling_rate"])
    
  3. Run Inference via Modelling Code:
    For more control, use the processor, model, and vocoder directly. A speaker embedding is again required to select the voice:

    from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan
    from datasets import load_dataset
    import torch
    import soundfile as sf

    processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
    model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
    vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

    inputs = processor(text="Hello, my dog is cute.", return_tensors="pt")

    # load an example x-vector speaker embedding to define the voice
    embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
    speaker_embeddings = torch.tensor(embeddings_dataset[7306]["xvector"]).unsqueeze(0)

    speech = model.generate_speech(inputs["input_ids"], speaker_embeddings, vocoder=vocoder)
    sf.write("speech.wav", speech.numpy(), samplerate=16000)
    

Cloud GPUs: For faster inference, consider using cloud GPU instances such as AWS EC2, Google Cloud Platform, or Azure.
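
As a minimal sketch (assuming a CUDA-capable machine, the same checkpoints as above, and the example x-vector speaker embedding used in the guide; the input text is arbitrary), the model, vocoder, and inputs can be moved onto the GPU before generation:

    import torch
    import soundfile as sf
    from datasets import load_dataset
    from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan

    # fall back to CPU when no CUDA device is available
    device = "cuda" if torch.cuda.is_available() else "cpu"

    processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
    model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts").to(device)
    vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan").to(device)

    inputs = processor(text="Hello from the cloud.", return_tensors="pt").to(device)
    embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
    speaker_embeddings = torch.tensor(embeddings_dataset[7306]["xvector"]).unsqueeze(0).to(device)

    # generate_speech returns the waveform on the model's device
    speech = model.generate_speech(inputs["input_ids"], speaker_embeddings, vocoder=vocoder)
    sf.write("speech.wav", speech.cpu().numpy(), samplerate=16000)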

License

This model is released under the MIT License.
