tts-hifigan-ljspeech

SpeechBrain

Introduction

This repository provides tools for using a HiFi-GAN vocoder trained on the LJSpeech dataset. The vocoder converts input mel spectrograms into waveforms and is typically used after a TTS model that maps text to a spectrogram. The output sampling rate is 22050 Hz.

Architecture

The HiFi-GAN vocoder performs the final stage of speech synthesis, converting mel spectrograms into waveforms. It is trained on LJSpeech, a single-speaker dataset. While it can generalize to other speakers to some degree, multi-speaker vocoders trained on datasets such as LibriTTS are recommended for best results on unseen voices.

Training

The model was trained using the SpeechBrain framework. To train the model from scratch, follow these steps:

  1. Clone SpeechBrain:

    git clone https://github.com/speechbrain/speechbrain/
    
  2. Install Dependencies:

    cd speechbrain
    pip install -r requirements.txt
    pip install -e .
    
  3. Run Training:

    cd recipes/LJSpeech/TTS/vocoder/hifi_gan/
    python train.py hparams/train.yaml --data_folder /path/to/LJspeech
    

Training results, including models and logs, can be accessed here.

Guide: Running Locally

Basic Steps

  1. Install SpeechBrain:

    pip install speechbrain
    
  2. Basic Usage:

    import torch
    from speechbrain.inference.vocoders import HIFIGAN
    
    hifi_gan = HIFIGAN.from_hparams(source="speechbrain/tts-hifigan-ljspeech", savedir="pretrained_models/tts-hifigan-ljspeech")
    mel_specs = torch.rand(2, 80, 298)
    waveforms = hifi_gan.decode_batch(mel_specs)
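In the snippet above, the dummy input has shape (batch, n_mels, n_frames) = (2, 80, 298). Assuming a hop length of 256 samples (the default in the LJSpeech HiFi-GAN recipe; check your own hparams), each mel frame corresponds to 256 output samples at 22050 Hz, so the implied audio duration can be estimated with a quick sketch:

```python
# Rough duration implied by an (batch, n_mels, n_frames) mel spectrogram.
# HOP_LENGTH = 256 is an assumption (the LJSpeech recipe default), not a
# value read from the loaded model.
SAMPLE_RATE = 22050
HOP_LENGTH = 256

def mel_frames_to_seconds(n_frames, hop_length=HOP_LENGTH, sample_rate=SAMPLE_RATE):
    """Approximate audio duration produced from n_frames mel frames."""
    return n_frames * hop_length / sample_rate

print(round(mel_frames_to_seconds(298), 2))  # ~3.46 seconds of audio
```

This is useful for sanity-checking that the decoded waveform length matches expectations for the spectrograms you feed in.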
    
  3. Using with TTS:

    import torchaudio
    from speechbrain.inference.TTS import Tacotron2
    from speechbrain.inference.vocoders import HIFIGAN
    
    # Load the TTS model (text -> mel spectrogram) and the vocoder (mel -> waveform)
    tacotron2 = Tacotron2.from_hparams(source="speechbrain/tts-tacotron2-ljspeech", savedir="pretrained_models/tts-tacotron2-ljspeech")
    hifi_gan = HIFIGAN.from_hparams(source="speechbrain/tts-hifigan-ljspeech", savedir="pretrained_models/tts-hifigan-ljspeech")
    
    mel_output, mel_length, alignment = tacotron2.encode_text("Mary had a little lamb")
    waveforms = hifi_gan.decode_batch(mel_output)
    torchaudio.save('example_TTS.wav', waveforms.squeeze(1), 22050)
    

Cloud GPUs

For enhanced performance, especially during inference, using cloud GPUs is recommended. Popular platforms include AWS, Google Cloud, and Azure, which provide access to powerful GPUs on demand.

License

The project is licensed under the Apache-2.0 License, allowing for broad use and modification of the software.
