tts_en_fastpitch

nvidia

Introduction

The NVIDIA FastPitch model is a fully-parallel text-to-speech model based on FastSpeech. It is trained for English (en-US) and is compatible with NVIDIA Riva for production-grade deployments. It offers prosody control over pitch and individual phoneme duration, and it uses an unsupervised speech-text aligner.
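
The two prosody controls can be pictured with a toy sketch in plain Python. This is not NeMo code: `regulate_length`, `shift_pitch`, and the `pace` parameter are hypothetical names chosen for illustration. The mechanism shown (repeating each phoneme-level value for its duration in frames, as FastSpeech-style models do) is a simplified assumption about how duration control works.

```python
def regulate_length(phoneme_values, durations, pace=1.0):
    """Expand phoneme-level values to frame level.

    Each value is repeated for its duration (in frames). A pace above 1.0
    shortens every phoneme (faster speech); below 1.0 lengthens it.
    """
    if pace <= 0:
        raise ValueError("pace must be positive")
    frames = []
    for value, duration in zip(phoneme_values, durations):
        frames.extend([value] * max(0, round(duration / pace)))
    return frames


def shift_pitch(pitch_hz, semitones):
    """Shift a pitch contour (in Hz) by a number of semitones."""
    factor = 2 ** (semitones / 12)
    return [pitch * factor for pitch in pitch_hz]


# Three phonemes lasting 2, 1, and 3 frames
print(regulate_length(["a", "b", "c"], [2, 1, 3]))  # ['a', 'a', 'b', 'c', 'c', 'c']

# Raising a 220 Hz contour by one octave (12 semitones) doubles it
print(shift_pitch([220.0], 12))  # [440.0]
```

Changing a single entry in `durations` or `pitch_hz` alters only that phoneme, which is what control over individual phonemes means in practice.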

Architecture

FastPitch is a fully-parallel Transformer architecture conditioned on fundamental frequency (F0) contours. It predicts pitch contours during inference, which makes the generated speech more expressive and better matched to the meaning of the utterance. It also synthesizes mel spectrograms with a higher real-time factor than Tacotron2.
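
To make the real-time factor comparison concrete, here is a minimal sketch of how such a figure could be computed. It assumes the convention where the factor is seconds of audio produced per second of compute, so higher is faster. Some sources use the inverse ratio. The function names are illustrative, not part of NeMo.

```python
def audio_seconds(num_samples, sample_rate=22050):
    """Duration of a waveform in seconds, given its sample count."""
    return num_samples / sample_rate


def real_time_factor(audio_duration, synthesis_duration):
    """Seconds of audio generated per second of compute (assumed convention)."""
    if synthesis_duration <= 0:
        raise ValueError("synthesis_duration must be positive")
    return audio_duration / synthesis_duration


# 44100 samples at 22050 Hz is 2 s of audio; generated in 0.5 s of compute
print(real_time_factor(audio_seconds(44100), 0.5))  # 4.0
```

In a real measurement, `synthesis_duration` would come from timing the spectrogram generation call, for example with `time.perf_counter()`.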

Training

The model was trained with the NeMo toolkit for 1000 epochs, using the training scripts and configurations available in the NeMo GitHub repository. The training dataset is LJSpeech, sampled at 22050 Hz, and the model generates a female English voice with an American accent.

Guide: Running Locally

  1. Install NVIDIA NeMo and PyTorch:

    pip install "nemo_toolkit[all]"
    
  2. Load FastPitch Model:

    from nemo.collections.tts.models import FastPitchModel
    spec_generator = FastPitchModel.from_pretrained("nvidia/tts_en_fastpitch")
    
  3. Load Vocoder (e.g., HiFiGAN):

    from nemo.collections.tts.models import HifiGanModel
    model = HifiGanModel.from_pretrained(model_name="nvidia/tts_hifigan")
    
  4. Generate Audio:

    import soundfile as sf
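    # Convert the input text to tokens, then the tokens to a mel spectrogram
    parsed = spec_generator.parse("You can type your sentence here to get nemo to produce speech.")
    spectrogram = spec_generator.generate_spectrogram(tokens=parsed)
    # Run the vocoder to turn the spectrogram into a waveform
    audio = model.convert_spectrogram_to_audio(spec=spectrogram)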
    
  5. Save to Disk:

    sf.write("speech.wav", audio.to('cpu').detach().numpy()[0], 22050)
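
If `soundfile` is not available, the save step can be approximated with Python's standard `wave` module. The sketch below is an assumption-laden alternative, not part of the NeMo workflow: `write_wav_int16` is a hypothetical helper, and a generated sine tone stands in for the model's audio so the example runs without NeMo installed.

```python
import math
import os
import struct
import tempfile
import wave

SAMPLE_RATE = 22050  # matches the sample rate of the model's training data


def write_wav_int16(path, samples, sample_rate=SAMPLE_RATE):
    """Write mono float samples in [-1.0, 1.0] as 16-bit PCM; return duration in seconds."""
    # Clip to the valid range, then scale to signed 16-bit integers
    ints = [int(round(max(-1.0, min(1.0, s)) * 32767)) for s in samples]
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes per sample = 16-bit
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(struct.pack("<%dh" % len(ints), *ints))
    return len(ints) / sample_rate


# Stand-in for model output: half a second of a 440 Hz tone
tone = [0.3 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE)
        for n in range(SAMPLE_RATE // 2)]
out_path = os.path.join(tempfile.gettempdir(), "tone_check.wav")
duration = write_wav_int16(out_path, tone)
```

With real model output, the `samples` argument would be the one-dimensional waveform from step 5, for example `audio.to('cpu').detach().numpy()[0].tolist()`.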
    

Inference is considerably faster on a GPU. If you do not have one locally, cloud GPU instances from providers such as AWS, Google Cloud, or Azure are an option.

License

This model is licensed under the Creative Commons Attribution 4.0 International License (cc-by-4.0).
