tts-hifigan-ljspeech

SpeechBrain

Introduction

This repository provides tools for using a HiFi-GAN vocoder trained on the LJSpeech dataset. The vocoder converts input mel spectrograms into waveforms and is typically used after a TTS model that maps text to a spectrogram. The output sampling rate is 22050 Hz.

Architecture

The HiFi-GAN vocoder performs the final stage of speech synthesis, converting mel spectrograms into waveforms. It is trained on LJSpeech, a single-speaker dataset. While it can generalize to other speakers to some degree, multi-speaker vocoders trained on datasets such as LibriTTS are recommended for best results on unseen voices.

Training

The model was trained using the SpeechBrain framework. To train the model from scratch, follow these steps:

  1. Clone SpeechBrain:

    git clone https://github.com/speechbrain/speechbrain/
    
  2. Install Dependencies:

    cd speechbrain
    pip install -r requirements.txt
    pip install -e .
    
  3. Run Training:

    cd recipes/LJSpeech/TTS/vocoder/hifi_gan/
    python train.py hparams/train.yaml --data_folder /path/to/LJspeech
    

Training results, including models and logs, can be accessed here.

Guide: Running Locally

Basic Steps

  1. Install SpeechBrain:

    pip install speechbrain
    
  2. Basic Usage:

    import torch
    from speechbrain.inference.vocoders import HIFIGAN
    
    hifi_gan = HIFIGAN.from_hparams(source="speechbrain/tts-hifigan-ljspeech", savedir="pretrained_models/tts-hifigan-ljspeech")
    mel_specs = torch.rand(2, 80, 298)
    waveforms = hifi_gan.decode_batch(mel_specs)
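In the snippet above, the dummy input has shape (batch, n_mels, n_frames) = (2, 80, 298). Assuming a hop length of 256 samples (the default in the LJSpeech HiFi-GAN recipe; check your own hparams), each mel frame corresponds to 256 output samples at 22050 Hz, so the implied audio duration can be estimated with a quick sketch:

```python
# Rough duration implied by an (batch, n_mels, n_frames) mel spectrogram.
# HOP_LENGTH = 256 is an assumption (the LJSpeech recipe default), not a
# value read from the loaded model.
SAMPLE_RATE = 22050
HOP_LENGTH = 256

def mel_frames_to_seconds(n_frames, hop_length=HOP_LENGTH, sample_rate=SAMPLE_RATE):
    """Approximate audio duration produced from n_frames mel frames."""
    return n_frames * hop_length / sample_rate

print(round(mel_frames_to_seconds(298), 2))  # ~3.46 seconds of audio
```

This is useful for sanity-checking that the decoded waveform length matches expectations for the spectrograms you feed in.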
    
  3. Using with TTS:

    import torchaudio
    from speechbrain.inference.TTS import Tacotron2
    from speechbrain.inference.vocoders import HIFIGAN
    
    # Load the TTS model (text -> mel spectrogram) and the vocoder (mel -> waveform)
    tacotron2 = Tacotron2.from_hparams(source="speechbrain/tts-tacotron2-ljspeech", savedir="pretrained_models/tts-tacotron2-ljspeech")
    hifi_gan = HIFIGAN.from_hparams(source="speechbrain/tts-hifigan-ljspeech", savedir="pretrained_models/tts-hifigan-ljspeech")
    
    mel_output, mel_length, alignment = tacotron2.encode_text("Mary had a little lamb")
    waveforms = hifi_gan.decode_batch(mel_output)
    torchaudio.save('example_TTS.wav', waveforms.squeeze(1), 22050)
    

Cloud GPUs

For enhanced performance, especially during inference, using cloud GPUs is recommended. Popular platforms include AWS, Google Cloud, and Azure, which provide access to powerful GPUs on demand.

License

The project is licensed under the Apache-2.0 License, allowing for broad use and modification of the software.
