Parler-TTS Mini v1.1

parler-tts

Introduction

Parler-TTS Mini v1.1 is a lightweight text-to-speech (TTS) model that generates high-quality, natural-sounding speech. Characteristics such as gender, background noise, speaking rate, pitch, and reverberation are controlled through a simple text description. Compared with its predecessor, the model uses a more advanced prompt tokenizer, which allows for multilingual training.
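To make the idea of text-based control concrete, here is a minimal sketch of how a description string might be assembled from individual attributes. The helper `build_description`, its parameter names, and the exact wording of the sentence are illustrative assumptions, not part of the Parler-TTS library; the model simply receives free-form text.

```python
def build_description(gender="female", rate="moderate", pitch="moderate",
                      background="very clear audio", reverberation="very close up"):
    """Compose a free-form voice description (hypothetical helper, wording is illustrative)."""
    return (
        f"A {gender} speaker talks at a {rate} speed with a {pitch} pitch. "
        f"The recording has {background}, and the voice sounds {reverberation}."
    )


# Override only the attributes that should differ from the defaults.
description = build_description(gender="male", rate="slow", pitch="low")
print(description)
```

The resulting string can be used as the `description` value in the generation code of the local guide.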

Architecture

Parler-TTS Mini v1.1 uses two tokenizers: one for the prompt (the text to be spoken) and one for the description (how it should be spoken). It supports 34 speakers with distinct characteristics, offering flexibility in voice generation. The prompt tokenizer is derived from the unsloth/llama-2-7b tokenizer, which gives it a larger vocabulary and byte fallback.
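Byte fallback means that text not covered by the vocabulary is encoded as its raw UTF-8 bytes instead of being mapped to a single unknown token. The following toy sketch illustrates the principle only: it works character by character with a made-up vocabulary (`toy_vocab`), whereas the real tokenizer operates on subword pieces. The function name is hypothetical.

```python
def tokenize_with_byte_fallback(text, vocab):
    """Toy character-level tokenizer: unknown characters become UTF-8 byte tokens."""
    tokens = []
    for ch in text:
        if ch in vocab:
            tokens.append(ch)
        else:
            # Fall back to one token per UTF-8 byte, written as <0xNN>.
            tokens.extend(f"<0x{b:02X}>" for b in ch.encode("utf-8"))
    return tokens


# Assumed toy vocabulary: lowercase ASCII letters and the space character.
toy_vocab = set("abcdefghijklmnopqrstuvwxyz ")
tokens = tokenize_with_byte_fallback("café", toy_vocab)
print(tokens)  # ['c', 'a', 'f', '<0xC3>', '<0xA9>']
```

Because every string can be expressed as bytes, no input is lost to an unknown token, which is what makes such a tokenizer convenient for multilingual text.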

Training

The model was trained on 45,000 hours of audio data drawn from several datasets, including mls_eng and libritts. It keeps the same training configuration as its predecessor, Parler-TTS Mini v1, and differs in its tokenization, which improves multilingual handling.

Guide: Running Locally

To run Parler-TTS locally, follow these steps:

  1. Install the Library:

    pip install git+https://github.com/huggingface/parler-tts.git
    
  2. Load the Model:
    The model class comes from the parler_tts package, and both tokenizers are loaded with AutoTokenizer from the transformers library, as shown in the code in step 3. A GPU is not required, but generation is considerably faster on one.

  3. Generate Speech:

    import torch
    from parler_tts import ParlerTTSForConditionalGeneration
    from transformers import AutoTokenizer
    import soundfile as sf
    
    # Use the first CUDA GPU if one is available, otherwise fall back to the CPU.
    device = "cuda:0" if torch.cuda.is_available() else "cpu"

    # Load the model, the prompt tokenizer, and the description tokenizer.
    model = ParlerTTSForConditionalGeneration.from_pretrained("parler-tts/parler-tts-mini-v1.1").to(device)
    tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler-tts-mini-v1.1")
    description_tokenizer = AutoTokenizer.from_pretrained(model.config.text_encoder._name_or_path)
    
    # prompt: the text to be spoken; description: how it should be spoken.
    prompt = "Hey, how are you doing today?"
    description = "A female speaker delivers a slightly expressive and animated speech with a moderate speed and pitch."
    
    # The description goes through the description tokenizer, the prompt through the prompt tokenizer.
    input_ids = description_tokenizer(description, return_tensors="pt").input_ids.to(device)
    prompt_input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
    
    # Generate the waveform, move it to the CPU, and save it as a WAV file.
    generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
    audio_arr = generation.cpu().numpy().squeeze()
    sf.write("parler_tts_out.wav", audio_arr, model.config.sampling_rate)
    
  4. Cloud GPUs:
    If no local GPU is available, consider a cloud GPU instance from a provider such as AWS EC2, Google Cloud Platform, or Azure to speed up generation.
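Step 3 saves the audio with soundfile. Where that dependency is not available, a minimal sketch using only Python's standard library could look like the following. It assumes a mono, one-dimensional sequence of float samples roughly in the range [-1, 1] and writes 16-bit PCM; `write_wav_int16` is a hypothetical helper, not part of Parler-TTS.

```python
import struct
import wave


def write_wav_int16(path, samples, sampling_rate):
    """Write mono float samples (assumed to lie in [-1, 1]) as a 16-bit PCM WAV file."""
    frames = bytearray()
    for sample in samples:
        # Clip to the valid range, then scale to the signed 16-bit integer range.
        clipped = max(-1.0, min(1.0, float(sample)))
        frames += struct.pack("<h", int(round(clipped * 32767)))
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes per sample = 16 bit
        wav_file.setframerate(int(sampling_rate))
        wav_file.writeframes(bytes(frames))


# Possible usage with the variables from step 3 (not executed here):
# write_wav_int16("parler_tts_out.wav", audio_arr, model.config.sampling_rate)
```

The per-sample loop is slow for long clips; it is meant as a dependency-free fallback rather than a replacement for soundfile.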

License

Parler-TTS Mini v1.1 is released under the Apache 2.0 license, permitting free use, distribution, and modification with attribution.
