Parler-TTS Mini v1
Introduction
Parler-TTS Mini v1 is a lightweight text-to-speech (TTS) model developed to generate high-quality, natural-sounding speech. It allows control over features such as gender, background noise, speaking rate, pitch, and reverberation through a simple text prompt. The model is part of the Parler-TTS project, which provides TTS training resources and dataset pre-processing code.
Architecture
Parler-TTS Mini v1 is built with the Transformers library. It was trained on 45,000 hours of audio data and supports speaker specification and prosody control. The model ships with 34 named speaker profiles, which can be referenced in the text description to keep a consistent voice across generations.
Training
Parler-TTS was trained using open-source datasets, including mls_eng, libritts_r_filtered, and others. The training aimed to reproduce the work described in the paper "Natural language guidance of high-fidelity text-to-speech with synthetic annotations" by Dan Lyth and Simon King. The training code and model weights are publicly available under a permissive license, allowing for community-driven improvements and customizations.
Guide: Running Locally
Installation
To use Parler-TTS, install it via:
pip install git+https://github.com/huggingface/parler-tts.git
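The examples below also use the soundfile library to write WAV files. If it is not already present in your environment, install it with:
pip install soundfile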
Generating Speech
To generate speech with a random voice:
import torch
from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer
import soundfile as sf

# Run on a GPU if available, otherwise fall back to CPU
device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Load the model and its tokenizer from the Hugging Face Hub
model = ParlerTTSForConditionalGeneration.from_pretrained("parler-tts/parler-tts-mini-v1").to(device)
tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler-tts-mini-v1")

# `prompt` is the text to be spoken; `description` controls the voice characteristics
prompt = "Hey, how are you doing today?"
description = "A female speaker delivers a slightly expressive and animated speech with a moderate speed and pitch."

# Tokenize the description (conditioning input) and the prompt (transcript) separately
input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)
prompt_input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

# Generate the waveform and save it as a WAV file at the model's sampling rate
generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
audio_arr = generation.cpu().numpy().squeeze()
sf.write("parler_tts_out.wav", audio_arr, model.config.sampling_rate)
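In a notebook, you can also listen to the output directly instead of (or in addition to) writing it to disk. The following is a minimal sketch that reuses the audio_arr and model objects from the example above and relies on IPython's built-in audio widget.
# Optional: play the generated audio inline in a Jupyter notebook.
# Reuses `audio_arr` and `model` from the example above.
from IPython.display import Audio

Audio(audio_arr, rate=model.config.sampling_rate)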
Using a Specific Speaker
Specify the speaker in the text description:
description = "Jon's voice is monotone yet slightly fast in delivery, with a very close recording that almost has no background noise."
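The rest of the generation call is unchanged. As a sketch, assuming the model, tokenizer, and device objects from the previous example are still loaded (the output filename here is only illustrative):
# Reuses `model`, `tokenizer`, and `device` from the previous example.
prompt = "Hey, how are you doing today?"
description = "Jon's voice is monotone yet slightly fast in delivery, with a very close recording that almost has no background noise."

input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)
prompt_input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
audio_arr = generation.cpu().numpy().squeeze()
sf.write("parler_tts_jon_out.wav", audio_arr, model.config.sampling_rate)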
Cloud GPUs
For optimal performance, especially for larger models, consider using cloud GPUs from providers like AWS, Google Cloud, or Azure.
License
This model is licensed under the Apache 2.0 license, allowing for broad usage and modification rights.