microsoft/speecht5_tts

Introduction
SpeechT5 is a text-to-speech (TTS) model fine-tuned for speech synthesis on the LibriTTS dataset. It is part of the SpeechT5 framework, which performs unified-modal encoder-decoder pre-training for spoken language processing.
Architecture
The SpeechT5 framework consists of a shared encoder-decoder network and six modal-specific (speech/text) pre-nets and post-nets. Input speech or text is first mapped into a common representation by the relevant pre-net, the shared encoder-decoder performs the sequence-to-sequence transformation, and a post-net generates the output in the target modality. The framework leverages large-scale unlabeled speech and text data to learn unified-modal representations, and it employs a cross-modal vector quantization approach to align textual and speech information.
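To make that data flow concrete, here is a minimal, illustrative PyTorch sketch of the design: modal-specific pre-nets feed a shared encoder-decoder, and a post-net maps decoder states back to the output modality. All module names, sizes, and layer counts below are invented for illustration and are far smaller than the real model.

```python
import torch
import torch.nn as nn

class SpeechT5Sketch(nn.Module):
    """Toy sketch of the SpeechT5 layout; not the real implementation."""

    def __init__(self, vocab_size=100, n_mels=80, d_model=256):
        super().__init__()
        # Modal-specific pre-nets map each modality into the shared space.
        # (Three of the six pre/post-nets are shown: text encoder pre-net,
        # speech decoder pre-net, speech decoder post-net.)
        self.text_prenet = nn.Embedding(vocab_size, d_model)
        self.speech_prenet = nn.Linear(n_mels, d_model)
        # Shared encoder-decoder backbone used by every task.
        self.backbone = nn.Transformer(
            d_model=d_model, nhead=4,
            num_encoder_layers=2, num_decoder_layers=2,
            batch_first=True,
        )
        # Modal-specific post-net: project decoder states back to mel frames.
        self.speech_postnet = nn.Linear(d_model, n_mels)

    def forward(self, text_ids, prev_mels):
        src = self.text_prenet(text_ids)     # text encoder pre-net
        tgt = self.speech_prenet(prev_mels)  # speech decoder pre-net
        hidden = self.backbone(src, tgt)     # shared encoder-decoder
        return self.speech_postnet(hidden)   # speech decoder post-net

model = SpeechT5Sketch()
mels = model(torch.randint(0, 100, (1, 12)), torch.zeros(1, 20, 80))
print(mels.shape)  # torch.Size([1, 20, 80])
```

The remaining pre/post-nets (speech encoder pre-net, text decoder pre-net and post-net) follow the same pattern, which is what lets one shared backbone serve TTS, ASR, and other speech/text tasks.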
Training
This checkpoint was fine-tuned for TTS on the LibriTTS dataset, on top of SpeechT5 pre-training that learns a unified-modal representation from large-scale unlabeled speech and text. Specific fine-tuning hyperparameters and precision details are not provided in the model card.
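While the recipe itself is not documented, the Transformers processor exposes the pieces needed to build TTS training pairs. A rough sketch, in which a placeholder transcript and a silent waveform stand in for a real LibriTTS utterance:

```python
import numpy as np
from transformers import SpeechT5Processor

processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")

# Stand-ins for one LibriTTS utterance: its transcript and 16 kHz waveform.
text = "the quick brown fox jumps over the lazy dog"
waveform = np.zeros(16000, dtype=np.float32)  # one second of placeholder audio

# The processor tokenizes the transcript and converts the target audio into
# log-mel spectrogram labels for sequence-to-sequence TTS training.
example = processor(
    text=text,
    audio_target=waveform,
    sampling_rate=16000,
    return_tensors="pt",
)
print(example["input_ids"].shape, example["labels"].shape)
```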
Guide: Running Locally
- Install Dependencies:

  Install the necessary libraries:

  ```bash
  pip install --upgrade pip
  pip install --upgrade transformers sentencepiece datasets[audio]
  ```
- Run Inference via TTS Pipeline:

  Use the TTS pipeline to synthesize speech. SpeechT5 conditions generation on a speaker embedding (x-vector) that selects the voice; here one is loaded from the CMU ARCTIC x-vectors dataset (see the sketch after this list for computing your own):

  ```python
  import torch
  import soundfile as sf
  from datasets import load_dataset
  from transformers import pipeline

  synthesiser = pipeline("text-to-speech", "microsoft/speecht5_tts")

  # Load a pre-computed x-vector; you can substitute your own speaker embedding.
  embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
  speaker_embedding = torch.tensor(embeddings_dataset[7306]["xvector"]).unsqueeze(0)

  speech = synthesiser(
      "Hello, my dog is cooler than you!",
      forward_params={"speaker_embeddings": speaker_embedding},
  )
  sf.write("speech.wav", speech["audio"], samplerate=speech["sampling_rate"])
  ```
- Run Inference via Modelling Code:

  For more control, use the processor, model, and vocoder directly:

  ```python
  import torch
  import soundfile as sf
  from datasets import load_dataset
  from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan

  processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
  model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
  vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

  inputs = processor(text="Hello, my dog is cute.", return_tensors="pt")

  # Speaker embedding (x-vector) that determines the output voice.
  embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
  speaker_embeddings = torch.tensor(embeddings_dataset[7306]["xvector"]).unsqueeze(0)

  speech = model.generate_speech(inputs["input_ids"], speaker_embeddings, vocoder=vocoder)
  sf.write("speech.wav", speech.numpy(), samplerate=16000)
  ```
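Both examples above load a stock x-vector from a dataset. To make the output match a specific voice, you can compute your own speaker embedding; one common route is SpeechBrain's VoxCeleb speaker encoder. The speechbrain dependency and its exact API below are assumptions about your environment (newer releases expose the same class under speechbrain.inference):

```python
import torch
from speechbrain.pretrained import EncoderClassifier

# Assumes `pip install speechbrain`.
speaker_model = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-xvect-voxceleb",
    savedir="/tmp/spkrec-xvect-voxceleb",
)

# waveform: 16 kHz mono audio from the target speaker (placeholder here).
waveform = torch.zeros(16000)
with torch.no_grad():
    embedding = speaker_model.encode_batch(waveform.unsqueeze(0))
    embedding = torch.nn.functional.normalize(embedding, dim=2).squeeze(1)

# `embedding` now has shape (1, 512) and can be passed as speaker_embeddings.
```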
Cloud GPUs: For enhanced performance, consider using cloud-based GPUs such as AWS EC2, Google Cloud Platform, or Azure.
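If you do run on a GPU, inference only needs the usual device placement on top of the modelling-code example above. A minimal sketch (the zero speaker embedding is a placeholder; substitute a real x-vector):

```python
import torch
import soundfile as sf
from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan

device = "cuda" if torch.cuda.is_available() else "cpu"

processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts").to(device)
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan").to(device)

inputs = processor(text="Hello from the GPU.", return_tensors="pt").to(device)
speaker_embeddings = torch.zeros(1, 512, device=device)  # placeholder x-vector

with torch.no_grad():
    speech = model.generate_speech(inputs["input_ids"], speaker_embeddings, vocoder=vocoder)

# Bring the waveform back to the CPU before writing it to disk.
sf.write("speech.wav", speech.cpu().numpy(), samplerate=16000)
```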
License
This model is released under the MIT License.