microsoft/speecht5_tts

Introduction

SpeechT5 TTS is a SpeechT5 model fine-tuned for speech synthesis (text-to-speech) on the LibriTTS dataset. It is part of the SpeechT5 framework, which performs unified-modal encoder-decoder pre-training for spoken language processing.

Architecture

The SpeechT5 framework consists of a shared encoder-decoder network and six modal-specific pre/post-nets for speech/text. It processes input speech/text through pre-nets, uses the encoder-decoder for sequence-to-sequence transformation, and generates output using post-nets. The framework leverages large-scale unlabeled speech and text data to learn unified-modal representations. It employs a cross-modal vector quantization approach to align textual and speech information.
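
In the Hugging Face transformers implementation, these components appear as submodules of the loaded checkpoint. The sketch below simply lists the top-level children of the TTS model; the exact module names depend on the installed transformers version and are shown only for illustration:

    from transformers import SpeechT5ForTextToSpeech

    # Load the TTS checkpoint and print its top-level submodules; the shared
    # encoder-decoder and the speech post-net appear as separate children.
    model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
    for name, module in model.named_children():
        print(f"{name}: {module.__class__.__name__}")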

Training

The underlying SpeechT5 model is pre-trained on large-scale unlabeled speech and text data to learn a unified-modal representation that benefits both speech and text modeling; this TTS checkpoint is then fine-tuned on the LibriTTS dataset for speech synthesis. Specific fine-tuning hyperparameters and precision details are not provided.

Guide: Running Locally

  1. Install Dependencies:
    Install the necessary libraries (a working PyTorch installation is assumed):

    pip install --upgrade pip
    pip install --upgrade transformers sentencepiece datasets[audio]
    
  2. Run Inference via TTS Pipeline:
    Use the TTS pipeline to synthesize speech. SpeechT5 requires a speaker embedding (an x-vector) that defines the voice characteristics:

    from transformers import pipeline
    from datasets import load_dataset
    import soundfile as sf
    import torch

    synthesiser = pipeline("text-to-speech", "microsoft/speecht5_tts")
    # load an example x-vector speaker embedding to define the voice
    embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
    speaker_embedding = torch.tensor(embeddings_dataset[7306]["xvector"]).unsqueeze(0)
    speech = synthesiser("Hello, my dog is cooler than you!", forward_params={"speaker_embeddings": speaker_embedding})
    sf.write("speech.wav", speech["audio"], samplerate=speech["sampling_rate"])
    
  3. Run Inference via Modelling Code:
    For more control, use the processor, model, and vocoder directly. A speaker embedding is again required to select the voice:

    from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan
    from datasets import load_dataset
    import torch
    import soundfile as sf

    processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
    model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
    vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

    inputs = processor(text="Hello, my dog is cute.", return_tensors="pt")

    # load an example x-vector speaker embedding to define the voice
    embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
    speaker_embeddings = torch.tensor(embeddings_dataset[7306]["xvector"]).unsqueeze(0)

    speech = model.generate_speech(inputs["input_ids"], speaker_embeddings, vocoder=vocoder)
    sf.write("speech.wav", speech.numpy(), samplerate=16000)
    

Cloud GPUs: For faster inference, consider using cloud GPU instances such as AWS EC2, Google Cloud Platform, or Azure.
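
As a minimal sketch (assuming a CUDA-capable machine, the same checkpoints as above, and the example x-vector speaker embedding used in the guide; the input text is arbitrary), the model, vocoder, and inputs can be moved onto the GPU before generation:

    import torch
    import soundfile as sf
    from datasets import load_dataset
    from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan

    # fall back to CPU when no CUDA device is available
    device = "cuda" if torch.cuda.is_available() else "cpu"

    processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
    model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts").to(device)
    vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan").to(device)

    inputs = processor(text="Hello from the cloud.", return_tensors="pt").to(device)
    embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
    speaker_embeddings = torch.tensor(embeddings_dataset[7306]["xvector"]).unsqueeze(0).to(device)

    # generate_speech returns the waveform on the model's device
    speech = model.generate_speech(inputs["input_ids"], speaker_embeddings, vocoder=vocoder)
    sf.write("speech.wav", speech.cpu().numpy(), samplerate=16000)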

License

This model is released under the MIT License.
