Parler-TTS Mini Multilingual v1.1

Introduction
Parler-TTS Mini Multilingual v1.1 is a multilingual text-to-speech (TTS) model that extends its predecessor by adding a wider range of speaker names and descriptions. It supports eight European languages (English, French, Spanish, Portuguese, Polish, German, Italian, and Dutch) and benefits from improved data tokenization, making it easier to extend to additional languages.
Architecture
The model employs a dual-tokenizer system: one tokenizer handles the text prompt to be spoken, while a second handles the voice description. It is built on the transformers library and trained on multiple datasets to support multilingual TTS. The architecture allows speaker-specific voice generation by naming one of 16 pre-trained speakers in the description.
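As a minimal sketch of the dual-tokenizer setup (mirroring the loading code in the guide below), the prompt tokenizer and the description tokenizer are loaded separately; the variable names here are illustrative:

from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer

model = ParlerTTSForConditionalGeneration.from_pretrained("parler-tts/parler-tts-mini-multilingual-v1.1")
# Tokenizer for the (multilingual) text prompt to be spoken.
prompt_tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler-tts-mini-multilingual-v1.1")
# Separate tokenizer for the voice description, taken from the model's text encoder.
description_tokenizer = AutoTokenizer.from_pretrained(model.config.text_encoder._name_or_path)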
Training
Parler-TTS Mini Multilingual v1.1 was trained on approximately 9,200 hours of non-English data and 580 hours of English data. The training data combines a cleaned version of CML-TTS, the non-English portion of Multilingual LibriSpeech, and the LibriTTS-R English dataset. Collaboration with the Hugging Face, Quantum Squadra, and AI4Bharat teams helped refine the tokenization process for improved language support.
Guide: Running Locally
Basic Steps
- Installation: Install the Parler-TTS library from GitHub:

pip install git+https://github.com/huggingface/parler-tts.git
- Random Voice Generation: The following script generates speech from a text prompt and a free-text voice description:

import torch
from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer
import soundfile as sf

device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Load the model and the two tokenizers (one for the prompt, one for the description).
model = ParlerTTSForConditionalGeneration.from_pretrained("parler-tts/parler-tts-mini-multilingual-v1.1").to(device)
tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler-tts-mini-multilingual-v1.1")
description_tokenizer = AutoTokenizer.from_pretrained(model.config.text_encoder._name_or_path)

prompt = "Salut toi, comment vas-tu aujourd'hui?"
description = "A female speaker delivers a slightly expressive and animated speech with a moderate speed and pitch."

# Tokenize the voice description and the text prompt separately.
input_ids = description_tokenizer(description, return_tensors="pt").input_ids.to(device)
prompt_input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

# Generate the waveform and write it to disk.
generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
audio_arr = generation.cpu().numpy().squeeze()
sf.write("parler_tts_out.wav", audio_arr, model.config.sampling_rate)
- Using a Specific Speaker: To keep a consistent voice across generations, name one of the 16 pre-trained speakers in the description, for example (see the sketch after this list):

description = "Daniel's voice is monotone yet slightly fast in delivery."
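A minimal sketch of speaker-specific generation, reusing the model, tokenizers, and device loaded in the previous step; the English prompt and output filename are illustrative choices, not from the model card:

# Reuses model, tokenizer, description_tokenizer, and device from the previous step.
prompt = "Hey, how are you doing today?"
description = "Daniel's voice is monotone yet slightly fast in delivery."

input_ids = description_tokenizer(description, return_tensors="pt").input_ids.to(device)
prompt_input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
sf.write("parler_tts_daniel.wav", generation.cpu().numpy().squeeze(), model.config.sampling_rate)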
Cloud GPUs
For faster inference, it is recommended to run the model on a cloud GPU available through platforms such as AWS, GCP, or Azure.
License
This model is licensed under the Apache 2.0 license, allowing for extensive modification and distribution.