Parler-TTS Mini v1

Introduction

Parler-TTS Mini v1 is a lightweight text-to-speech (TTS) model developed to generate high-quality, natural-sounding speech. It allows control over features such as gender, background noise, speaking rate, pitch, and reverberation through a simple text prompt. The model is part of the Parler-TTS project, which provides TTS training resources and dataset pre-processing code.

Architecture

Parler-TTS Mini v1 builds on the Transformers library and was trained on 45,000 hours of audio data. It supports speaker specification and prosody control, and ships with 34 speaker profiles that enable varied yet consistent speech generation.

Training

Parler-TTS was trained using open-source datasets, including mls_eng, libritts_r_filtered, and others. The training aimed to reproduce the work described in the paper "Natural language guidance of high-fidelity text-to-speech with synthetic annotations" by Dan Lyth and Simon King. The training code and model weights are publicly available under a permissive license, allowing for community-driven improvements and customizations.

Guide: Running Locally

Installation

To use Parler-TTS, install it via:

pip install git+https://github.com/huggingface/parler-tts.git

Generating Speech

To generate speech with a random voice:

import torch
from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer
import soundfile as sf

device = "cuda:0" if torch.cuda.is_available() else "cpu"

model = ParlerTTSForConditionalGeneration.from_pretrained("parler-tts/parler-tts-mini-v1").to(device)
tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler-tts-mini-v1")

prompt = "Hey, how are you doing today?"
description = "A female speaker delivers a slightly expressive and animated speech with a moderate speed and pitch."

# The description conditions the voice characteristics; the prompt is the text to be spoken.
input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)
prompt_input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
audio_arr = generation.cpu().numpy().squeeze()
sf.write("parler_tts_out.wav", audio_arr, model.config.sampling_rate)
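The output of model.generate above is a 1-D waveform sampled at model.config.sampling_rate Hz, so a clip's duration is simply its sample count divided by the rate. A minimal sketch of that relationship, using a synthetic NumPy array in place of real model output (the 44,100 Hz rate here is an assumption for illustration; in practice read it from model.config.sampling_rate):

```python
import numpy as np

# Assumed sampling rate for this sketch; use model.config.sampling_rate with a real model.
sampling_rate = 44_100

# Stand-in for the generated waveform: 2 seconds of silence as float32 samples.
audio_arr = np.zeros(sampling_rate * 2, dtype=np.float32)

# Duration in seconds = number of samples / samples per second.
duration_s = len(audio_arr) / sampling_rate
print(duration_s)  # 2.0
```

The same arithmetic is useful for sanity-checking generated files, e.g. confirming that a written WAV is roughly as long as the spoken prompt would suggest.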

Using a Specific Speaker

Specify the speaker in the text description:

description = "Jon's voice is monotone yet slightly fast in delivery, with a very close recording that almost has no background noise."
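Because descriptions are free-form text, they can also be assembled programmatically. A small sketch of a hypothetical helper (build_description is not part of the parler_tts API; it is just string composition in the prompting style shown above):

```python
# Hypothetical helper, not part of parler_tts: compose a free-text voice
# description from a speaker name and a few style attributes.
def build_description(speaker: str, pace: str = "moderate", noise: str = "almost no") -> str:
    return (
        f"{speaker}'s voice is monotone yet {pace} in delivery, "
        f"with a very close recording that has {noise} background noise."
    )

# Build the same kind of description shown above, varying only the pace.
description = build_description("Jon", pace="slightly fast")
print(description)
```

The resulting string can be tokenized and passed as input_ids to model.generate exactly like the hand-written descriptions in the earlier example.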

Cloud GPUs

For optimal performance, especially for larger models, consider using cloud GPUs from providers like AWS, Google Cloud, or Azure.

License

This model is licensed under the Apache 2.0 license, allowing for broad usage and modification rights.
