Parler-TTS Mini Multilingual v1.1


Introduction

Parler-TTS Mini Multilingual v1.1 is a multilingual text-to-speech (TTS) model that extends its predecessor with a wider range of speaker names and descriptions. It supports eight European languages (English, French, Spanish, Portuguese, Polish, German, Italian, and Dutch) and benefits from improved data tokenization, which makes it easier to extend to additional languages.

Architecture

The model employs a dual-tokenizer system: one tokenizer for the transcript prompt and a separate one for the voice description. It is built on the transformers library and trained on a mix of multilingual datasets. The architecture supports speaker-specific voice generation through a roster of 16 pre-trained speakers that can be selected by name in the description.
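The dual-tokenizer split can be sketched as a small helper (the function name here is illustrative, not part of the Parler-TTS API): the description and the prompt are encoded by different tokenizers before generation.

```python
def encode_inputs(prompt, description, prompt_tokenizer, description_tokenizer):
    """Encode a TTS request using Parler-TTS's dual-tokenizer scheme.

    The prompt tokenizer (the model's own) covers the multilingual text to be
    spoken; the description tokenizer (taken from the underlying text encoder)
    handles the English voice description.
    """
    input_ids = description_tokenizer(description, return_tensors="pt").input_ids
    prompt_input_ids = prompt_tokenizer(prompt, return_tensors="pt").input_ids
    return input_ids, prompt_input_ids
```

The concrete tokenizer objects are loaded in the guide below; the point here is only that the two text inputs take different encoding paths.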

Training

Parler-TTS Mini Multilingual v1.1 was trained on approximately 9,200 hours of non-English data and 580 hours of English data. The training corpus combines a cleaned version of CML-TTS and the non-English portion of Multilingual LibriSpeech with the LibriTTS-R English dataset. Collaboration with the Hugging Face, Quantum Squadra, and AI4Bharat teams helped refine the tokenization process for better language support.

Guide: Running Locally

Basic Steps

  1. Installation:
    Install the Parler-TTS library using the following command:

    pip install git+https://github.com/huggingface/parler-tts.git
    
  2. Random Voice Generation:

    import torch
    from parler_tts import ParlerTTSForConditionalGeneration
    from transformers import AutoTokenizer
    import soundfile as sf
    
    device = "cuda:0" if torch.cuda.is_available() else "cpu"
    model = ParlerTTSForConditionalGeneration.from_pretrained("parler-tts/parler-tts-mini-multilingual-v1.1").to(device)
    
    # Two tokenizers: the model's own for the (multilingual) prompt, and the
    # text encoder's for the English voice description.
    tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler-tts-mini-multilingual-v1.1")
    description_tokenizer = AutoTokenizer.from_pretrained(model.config.text_encoder._name_or_path)
    
    prompt = "Salut toi, comment vas-tu aujourd'hui?"
    description = "A female speaker delivers a slightly expressive and animated speech with a moderate speed and pitch."
    input_ids = description_tokenizer(description, return_tensors="pt").input_ids.to(device)
    prompt_input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
    
    # The description conditions the voice; the prompt is the text to be spoken.
    generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
    audio_arr = generation.cpu().numpy().squeeze()
    sf.write("parler_tts_out.wav", audio_arr, model.config.sampling_rate)
    
  3. Using a Specific Speaker: Adapt the text description to specify a speaker, for example:

    description = "Daniel's voice is monotone yet slightly fast in delivery."
    
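Only the description string changes when pinning a speaker; a small helper (illustrative, not part of the Parler-TTS API) makes the choice explicit. "Daniel" is one of the 16 pre-trained speakers; the full roster is listed on the model card.

```python
def speaker_description(name: str, style: str) -> str:
    # The model keys on the speaker's name appearing in the description;
    # the rest of the sentence still steers pace, pitch, and expressivity.
    return f"{name}'s voice is {style}"

description = speaker_description("Daniel", "monotone yet slightly fast in delivery.")
# Then tokenize with description_tokenizer and call model.generate()
# exactly as in the random-voice example above.
```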

Cloud GPUs

For efficient inference, it is recommended to run the model on cloud GPUs available through platforms such as AWS, GCP, or Azure.

License

This model is licensed under the Apache 2.0 license, allowing for extensive modification and distribution.
