tts_transformer-en-200_speaker-cv4

facebook

Introduction

tts_transformer-en-200_speaker-cv4 is a text-to-speech (TTS) model developed by Facebook and built with the fairseq library. It synthesizes English speech in 200 distinct male and female voices, with the speaker selected at random by default. The model is trained on the Common Voice v4 dataset.

Architecture

The model uses a Transformer architecture for text-to-speech (arXiv:1809.08895) and is built on fairseq S^2, fairseq's speech synthesis toolkit (arXiv:2109.06912). It supports multiple speakers and can use HiFi-GAN as its vocoder.
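Multi-speaker synthesis means every request is tied to a speaker index. The sketch below illustrates one way to pick such an index, either an explicit one or a random one. `choose_speaker` and `N_SPEAKERS` are hypothetical names introduced here for illustration, not part of fairseq, and the count of 200 is taken from the model description rather than read from the checkpoint.

```python
import random
from typing import Optional

N_SPEAKERS = 200  # assumption: taken from the model description, not the checkpoint


def choose_speaker(requested: Optional[int] = None,
                   n_speakers: int = N_SPEAKERS,
                   rng: Optional[random.Random] = None) -> int:
    """Return a valid speaker index (hypothetical helper, not a fairseq API)."""
    if n_speakers <= 0:
        raise ValueError("n_speakers must be positive")
    if requested is None:
        # No speaker requested: draw one uniformly at random.
        rng = rng or random.Random()
        return rng.randrange(n_speakers)
    # Clamp out-of-range requests into [0, n_speakers - 1].
    return max(0, min(requested, n_speakers - 1))
```

Passing a seeded `random.Random` makes the random choice reproducible. If the installed fairseq version's `TTSHubInterface.get_model_input` accepts a `speaker` argument (an assumption to check against your version), the returned index could be supplied there to keep the voice fixed across requests.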

Training

The model was trained on the English portion of the Mozilla Common Voice v4 dataset, a crowd-sourced collection of recordings from many different speakers. Training was carried out with fairseq S^2, which its paper (arXiv:2109.06912) describes as a scalable and integrable speech synthesis toolkit.

Guide: Running Locally

To run the model locally, follow these steps:

  1. Install Dependencies: Make sure Python is installed along with fairseq and IPython (IPython is used only for the audio player in step 4). Install fairseq with pip:

    pip install fairseq
    
  2. Import the Loader: fairseq's checkpoint_utils provides load_model_ensemble_and_task_from_hf_hub, which downloads a checkpoint from the Hugging Face Hub and loads it:

    from fairseq.checkpoint_utils import load_model_ensemble_and_task_from_hf_hub
    
  3. Download and Load the Model:

    models, cfg, task = load_model_ensemble_and_task_from_hf_hub(
        "facebook/tts_transformer-en-200_speaker-cv4",
        arg_overrides={"vocoder": "hifigan", "fp16": False}
    )
    model = models[0]
    
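  4. Build the Generator and Generate Speech:

    from fairseq.models.text_to_speech.hub_interface import TTSHubInterface
    import IPython.display as ipd
    
    # Merge the dataset configuration into the model configuration.
    TTSHubInterface.update_cfg_with_data_cfg(cfg, task.data_cfg)
    # Recent fairseq releases index into this argument (models[0]), so pass a list.
    generator = task.build_generator([model], cfg)
    
    text = "Hello, this is a test run."
    sample = TTSHubInterface.get_model_input(task, text)
    wav, rate = TTSHubInterface.get_prediction(task, model, generator, sample)
    
    # Displays an audio player when run in a Jupyter/IPython notebook.
    ipd.Audio(wav, rate=rate)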
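
The audio player in step 4 only appears inside a notebook. When running from a plain script, the waveform can be written to a file instead. The following is a minimal sketch using only the Python standard library; `save_wav` is a hypothetical helper, and it assumes the waveform is a one-dimensional sequence of floats scaled to [-1, 1]. A synthetic tone stands in for the model output so the example runs without fairseq.

```python
import math
import os
import struct
import tempfile
import wave
from typing import Iterable


def save_wav(samples: Iterable[float], rate: int, path: str) -> int:
    """Write mono float samples in [-1.0, 1.0] as 16-bit PCM; return the frame count."""
    frames = bytearray()
    count = 0
    for s in samples:
        # Clip to the valid range, then scale to a signed 16-bit integer.
        clipped = max(-1.0, min(1.0, float(s)))
        frames += struct.pack("<h", int(round(clipped * 32767)))
        count += 1
    with wave.open(path, "wb") as f:
        f.setnchannels(1)          # mono
        f.setsampwidth(2)          # 2 bytes = 16 bits per sample
        f.setframerate(int(rate))
        f.writeframes(bytes(frames))
    return count


# Stand-in signal: one second of a 440 Hz tone at 22050 Hz.
# With the model loaded, wav.tolist() and rate from step 4 would be passed instead
# (assumption: wav is a 1-D float tensor or array scaled to [-1, 1]).
demo_rate = 22050
demo = [0.5 * math.sin(2 * math.pi * 440 * n / demo_rate) for n in range(demo_rate)]
out_path = os.path.join(tempfile.gettempdir(), "demo_tone.wav")
save_wav(demo, demo_rate, out_path)
```

Clipping before scaling avoids integer overflow if a sample falls slightly outside the expected range.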
    

For faster synthesis, consider running the model on a GPU. If no local GPU is available, cloud providers such as AWS, Google Cloud, or Azure offer GPU instances.

License

The model and its associated code are subject to the terms outlined by Facebook and fairseq. For detailed information, refer to the respective repositories and documentation.
