tts_en_fastpitch
nvidia
Introduction
The NVIDIA FastPitch model is a fully-parallel text-to-speech system based on FastSpeech. It targets English (en-US) and is compatible with NVIDIA Riva for production-grade deployments. It offers prosody control over pitch and over the duration of individual phonemes, and it uses an unsupervised speech-text aligner.
Architecture
FastPitch is a fully-parallel Transformer architecture conditioned on fundamental frequency (F0) contours. It predicts pitch contours during inference, which lets the generated speech be more expressive and better matched to the semantics of the utterance. It synthesizes mel spectrograms with a higher real-time factor than Tacotron2, meaning it produces more audio per unit of compute time.
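The conditioning described above can be illustrated with a toy sketch in plain Python. It is not NeMo's implementation: `add_pitch`, `length_regulate`, `pitch_weight`, and `pace` are hypothetical names chosen for illustration, and the real model uses learned embeddings and tensors rather than lists. The sketch only shows the idea of adding a per-token pitch value to token features and then repeating each token for its number of output frames.

```python
def add_pitch(token_features, pitch_per_token, pitch_weight=1.0):
    """Toy stand-in for a pitch embedding: shift each token's features by its pitch."""
    return [
        [value + pitch_weight * pitch for value in features]
        for features, pitch in zip(token_features, pitch_per_token)
    ]


def length_regulate(token_features, durations, pace=1.0):
    """Repeat each token's feature vector for its pace-scaled duration in frames."""
    frames = []
    for features, duration in zip(token_features, durations):
        # Round half up so the frame count is deterministic for .5 cases.
        n_frames = int(duration / pace + 0.5)
        frames.extend(list(features) for _ in range(n_frames))
    return frames


# Two tokens with two-dimensional features (illustrative values only).
token_features = [[0.0, 1.0], [2.0, 3.0]]
conditioned = add_pitch(token_features, pitch_per_token=[0.5, -0.5])
frames = length_regulate(conditioned, durations=[2, 1])
slow_frames = length_regulate(conditioned, durations=[2, 1], pace=0.5)
```

Because every frame is derived from per-token values without any frame-to-frame recurrence, all frames can be computed in parallel, which is the property the fully-parallel design relies on. Changing the pitch values or the durations before expansion is what gives control over prosody.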
Training
The model was trained with the NeMo toolkit for 1000 epochs, using the scripts and configurations available in the NeMo GitHub repository. The training dataset is LJSpeech, sampled at 22050 Hz. LJSpeech is a single-speaker corpus, so the model generates a female voice with an American accent.
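Since the model was trained on 22050 Hz audio, it can be useful to confirm that a recording you want to compare against the model's output uses the same rate. The following is a minimal sketch using only Python's standard `wave` module, so it assumes uncompressed PCM WAV files; the helper names `wav_sample_rate` and `matches_model_rate` are hypothetical.

```python
import wave

EXPECTED_SAMPLE_RATE = 22050  # sample rate stated for the LJSpeech training data


def wav_sample_rate(path):
    """Return the sample rate in Hz stored in a PCM WAV file's header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def matches_model_rate(path, expected=EXPECTED_SAMPLE_RATE):
    """Check whether a WAV file uses the sample rate the model was trained on."""
    return wav_sample_rate(path) == expected
```

A file that fails this check would need to be resampled before a like-for-like comparison.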
Guide: Running Locally
- Install NVIDIA NeMo (with PyTorch already installed):
pip install "nemo_toolkit[all]"
- Load the FastPitch model:
from nemo.collections.tts.models import FastPitchModel
spec_generator = FastPitchModel.from_pretrained("nvidia/tts_en_fastpitch")
- Load a vocoder (e.g., HiFiGAN):
from nemo.collections.tts.models import HifiGanModel
model = HifiGanModel.from_pretrained(model_name="nvidia/tts_hifigan")
- Generate audio:
import soundfile as sf
# Convert text to tokens, tokens to a mel spectrogram, and the spectrogram to a waveform.
parsed = spec_generator.parse("You can type your sentence here to get nemo to produce speech.")
spectrogram = spec_generator.generate_spectrogram(tokens=parsed)
audio = model.convert_spectrogram_to_audio(spec=spectrogram)
- Save to disk:
# Write the first item in the batch as a 22050 Hz WAV file.
sf.write("speech.wav", audio.to('cpu').detach().numpy()[0], 22050)
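The steps above can be combined into one helper. This is a sketch rather than an official NeMo recipe: the function name `synthesize` and its parameters are hypothetical, the local variable `vocoder` plays the role of `model` in the steps, and the `.eval()` and `torch.no_grad()` calls are standard PyTorch inference practice added here as an assumption. The imports live inside the function so that defining it does not require NeMo to be installed.

```python
def synthesize(text, out_path="speech.wav", sample_rate=22050):
    """Generate speech for `text` with FastPitch + HiFiGAN and save it as a WAV file."""
    # Heavy dependencies are imported lazily, only when the function is called.
    import soundfile as sf
    import torch
    from nemo.collections.tts.models import FastPitchModel, HifiGanModel

    # Load both pretrained models and switch them to inference mode.
    spec_generator = FastPitchModel.from_pretrained("nvidia/tts_en_fastpitch").eval()
    vocoder = HifiGanModel.from_pretrained(model_name="nvidia/tts_hifigan").eval()

    # No gradients are needed for synthesis.
    with torch.no_grad():
        tokens = spec_generator.parse(text)
        spectrogram = spec_generator.generate_spectrogram(tokens=tokens)
        audio = vocoder.convert_spectrogram_to_audio(spec=spectrogram)

    # Save the first item in the batch at the model's sample rate.
    sf.write(out_path, audio.to("cpu").detach().numpy()[0], sample_rate)
    return out_path
```

Calling `synthesize("Hello from FastPitch.")` would download the checkpoints on first use and write `speech.wav` to the current directory. Loading the models on every call is wasteful for repeated synthesis, so a longer-running program would load them once and reuse them.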
For optimal performance, consider deploying on cloud GPUs using services like AWS, Google Cloud, or Azure.
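To judge whether a given machine is fast enough, one option is to measure the real-time factor directly. The sketch below assumes the convention in which the factor is seconds of audio produced per second of compute, so higher is better; some sources use the inverse. The function name `real_time_factor` and the `synthesize_fn` callback are hypothetical.

```python
import time


def real_time_factor(synthesize_fn, sample_rate=22050):
    """Return seconds of audio generated per second of wall-clock time."""
    start = time.perf_counter()
    samples = synthesize_fn()  # expected to return a sequence of audio samples
    elapsed = time.perf_counter() - start

    audio_seconds = len(samples) / sample_rate
    if elapsed <= 0:
        # Guard against a clock too coarse to measure a very fast call.
        return float("inf")
    return audio_seconds / elapsed


def fake_synthesize():
    """Stand-in for a real model call: returns one second of silence."""
    return [0.0] * 22050


rtf = real_time_factor(fake_synthesize)
```

With a real model, `synthesize_fn` would wrap the spectrogram generation and vocoder calls and return the waveform as a one-dimensional array. On a GPU, the device should be synchronized before the clock is read, since CUDA kernels run asynchronously and the timing would otherwise be too optimistic.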
License
This model is licensed under the Creative Commons Attribution 4.0 International License (cc-by-4.0).