nvidia/tts_hifigan

Introduction

HiFiGAN is a generative adversarial network (GAN) vocoder that converts mel spectrograms into audio. This checkpoint is trained and released by NVIDIA, implemented in PyTorch, and distributed as part of the NVIDIA NeMo toolkit. It targets English speech synthesis and produces high-quality audio, particularly when paired with a spectrogram generator such as FastPitch.

Architecture

The HiFiGAN model comprises one generator and two discriminators: a multi-scale and a multi-period discriminator. The generator and discriminators are trained adversarially, together with additional losses that improve training stability and performance. The generator uses transposed convolutions to upsample mel spectrograms to the audio sampling rate.
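
The upsampling step can be illustrated with a few lines of PyTorch. The sketch below is not the NeMo implementation (the real generator adds multi-receptive-field residual blocks and is trained against the two discriminators); it only shows how a stack of transposed convolutions stretches an 80-band mel spectrogram to the waveform rate, assuming a typical 256-sample hop length at 22050 Hz. All layer sizes here are illustrative assumptions.

    import torch
    import torch.nn as nn

    class ToyUpsampler(nn.Module):
        """Illustrative only: not the NeMo HiFiGAN generator."""
        def __init__(self, n_mels=80, channels=128, upsample_rates=(8, 8, 2, 2)):
            super().__init__()
            layers = [nn.Conv1d(n_mels, channels, kernel_size=7, padding=3)]
            ch = channels
            for r in upsample_rates:
                # Each transposed convolution stretches the time axis by a factor of r
                layers += [nn.LeakyReLU(0.1),
                           nn.ConvTranspose1d(ch, ch // 2, kernel_size=2 * r, stride=r, padding=r // 2)]
                ch //= 2
            layers += [nn.LeakyReLU(0.1), nn.Conv1d(ch, 1, kernel_size=7, padding=3), nn.Tanh()]
            self.net = nn.Sequential(*layers)

        def forward(self, mel):          # mel: [batch, n_mels, frames]
            return self.net(mel)         # audio: [batch, 1, frames * 8*8*2*2]

    mel = torch.randn(1, 80, 100)        # 100 mel frames
    print(ToyUpsampler()(mel).shape)     # torch.Size([1, 1, 25600]) -> 256x upsampling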

Training

HiFiGAN was trained with the NeMo toolkit on the LJSpeech dataset sampled at 22050 Hz, producing a female English voice with an American accent. Training ran for several epochs using the training scripts and configuration files available in the NeMo GitHub repository.
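
NeMo training scripts follow a Hydra-plus-PyTorch-Lightning pattern; a rough sketch of such an entry point is shown below. The config path, config name, and exact calls here are assumptions for illustration; the authoritative script and YAML configs are the ones in the NeMo GitHub repository.

    # Hedged sketch of a NeMo-style training entry point; the real script and
    # configs in the NeMo repository (examples/tts) may differ in detail.
    import pytorch_lightning as pl
    from nemo.collections.tts.models import HifiGanModel
    from nemo.core.config import hydra_runner
    from nemo.utils.exp_manager import exp_manager

    @hydra_runner(config_path="conf/hifigan", config_name="hifigan")
    def main(cfg):
        trainer = pl.Trainer(**cfg.trainer)                  # trainer options come from the YAML config
        exp_manager(trainer, cfg.get("exp_manager", None))   # sets up logging and checkpointing
        model = HifiGanModel(cfg=cfg.model, trainer=trainer)
        trainer.fit(model)

    if __name__ == "__main__":
        main()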

Guide: Running Locally

  1. Installation: Install NVIDIA NeMo after setting up the latest version of PyTorch.

    pip install nemo_toolkit['all']
    
  2. Model Setup:

    • Load the FastPitch model for spectrogram generation:
      from nemo.collections.tts.models import FastPitchModel
      spec_generator = FastPitchModel.from_pretrained("nvidia/tts_en_fastpitch")
      
    • Load the HiFiGAN vocoder:
      from nemo.collections.tts.models import HifiGanModel
      model = HifiGanModel.from_pretrained(model_name="nvidia/tts_hifigan")
      
  3. Audio Generation:

    • Generate audio from a given text:
      import soundfile as sf
      # Parse the text, generate a mel spectrogram, then vocode it to a waveform
      parsed = spec_generator.parse("You can type your sentence here to get nemo to produce speech.")
      spectrogram = spec_generator.generate_spectrogram(tokens=parsed)
      audio = model.convert_spectrogram_to_audio(spec=spectrogram)
      # Save to disk: move to CPU, detach, and drop the batch dimension before writing
      sf.write("speech.wav", audio.to('cpu').detach().numpy()[0], 22050)
      
  4. Cloud GPU Recommendation: For efficient training and inference, consider using cloud-based GPUs such as those provided by AWS, Google Cloud, or Azure; a GPU inference sketch follows this list.
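
The guide above runs the models wherever they load by default; on a machine (local or cloud) with a GPU, the same pipeline can be placed on the device explicitly. The snippet below is a hedged sketch rather than part of the official model card: it reuses the calls from the guide and adds standard PyTorch device handling with gradients disabled.

    # Hedged end-to-end sketch (not taken from the model card): move both models
    # to a GPU when one is available and disable autograd during synthesis.
    import torch
    import soundfile as sf
    from nemo.collections.tts.models import FastPitchModel, HifiGanModel

    device = "cuda" if torch.cuda.is_available() else "cpu"
    spec_generator = FastPitchModel.from_pretrained("nvidia/tts_en_fastpitch").eval().to(device)
    vocoder = HifiGanModel.from_pretrained(model_name="nvidia/tts_hifigan").eval().to(device)

    with torch.inference_mode():                      # no gradients needed for synthesis
        tokens = spec_generator.parse("Running on a GPU keeps synthesis fast.")
        spectrogram = spec_generator.generate_spectrogram(tokens=tokens)
        audio = vocoder.convert_spectrogram_to_audio(spec=spectrogram)

    sf.write("speech.wav", audio.cpu().numpy()[0], 22050)   # 22050 Hz output, batch dim dropped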

License

The HiFiGAN model is released under the Creative Commons Attribution 4.0 International License (CC BY 4.0), allowing use, distribution, and modification with proper attribution.
