facebook/tts_transformer-en-ljspeech

Introduction

tts_transformer-en-ljspeech is a text-to-speech model built with the fairseq library. It produces a single-speaker female English voice, is trained on the LJSpeech dataset, and offers a scalable, easy-to-integrate speech synthesis solution.

Architecture

This model uses a Transformer architecture for text-to-speech, as described in the accompanying papers on arXiv and documented in the fairseq repository on GitHub. It synthesizes English speech and is part of the fairseq S^2 speech synthesis framework.
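Once fairseq is installed (see the guide below), a quick way to inspect the architecture is to load the checkpoint from the Hugging Face Hub and print the model class and parameter count. This is a minimal sketch using the same hub helper as in the guide; the exact class name and size reported depend on the installed fairseq version.

    from fairseq.checkpoint_utils import load_model_ensemble_and_task_from_hf_hub

    # Download the checkpoint and rebuild the model (same helper used in the guide below)
    models, cfg, task = load_model_ensemble_and_task_from_hf_hub(
        "facebook/tts_transformer-en-ljspeech",
        arg_overrides={"vocoder": "hifigan", "fp16": False}
    )
    model = models[0]

    # Report the Transformer TTS model class and its total parameter count
    print(type(model).__name__)
    print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.1f}M parameters")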

Training

tts_transformer-en-ljspeech was trained on the LJSpeech dataset, roughly 24 hours of high-quality recordings of a single female English speaker. Training aims to produce natural-sounding, intelligible synthesized speech.

Guide: Running Locally

To run the model locally, follow these steps:

  1. Install fairseq: Ensure that you have the fairseq library installed. It can be installed via pip:

    pip install fairseq
    
  2. Load the Model: Use the provided Python script to load the model and generate speech:

    from fairseq.checkpoint_utils import load_model_ensemble_and_task_from_hf_hub
    from fairseq.models.text_to_speech.hub_interface import TTSHubInterface
    import IPython.display as ipd

    # Download the checkpoint from the Hugging Face Hub and build the model, config, and task,
    # using the HiFi-GAN vocoder and full-precision inference
    models, cfg, task = load_model_ensemble_and_task_from_hf_hub(
        "facebook/tts_transformer-en-ljspeech",
        arg_overrides={"vocoder": "hifigan", "fp16": False}
    )
    model = models[0]
    TTSHubInterface.update_cfg_with_data_cfg(cfg, task.data_cfg)
    # build_generator expects a list of models
    generator = task.build_generator([model], cfg)

    text = "Hello, this is a test run."

    # Convert the text to model input and synthesize the waveform
    sample = TTSHubInterface.get_model_input(task, text)
    wav, rate = TTSHubInterface.get_prediction(task, model, generator, sample)

    # Play the audio inline (requires an IPython/Jupyter environment)
    ipd.Audio(wav, rate=rate)
    
  3. Run the Code: Execute the script in an environment that supports IPython, such as a Jupyter notebook, to listen to the generated audio. To use the output outside a notebook, write it to an audio file instead, as sketched after this list.

  4. Cloud GPUs: For faster synthesis, especially when generating many utterances or longer passages, consider using cloud-based GPU services such as AWS, Google Cloud, or Azure.
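To save the synthesized speech rather than (or in addition to) playing it inline, the waveform returned by get_prediction can be written to a WAV file. This is a minimal sketch assuming the soundfile package is installed and that wav is a 1-D floating-point tensor; check both against your fairseq version.

    import soundfile as sf
    import torch

    # wav and rate come from TTSHubInterface.get_prediction above.
    # Move the waveform to CPU and convert it to a NumPy array before writing.
    audio = wav.detach().cpu().numpy() if torch.is_tensor(wav) else wav
    sf.write("tts_output.wav", audio, rate)  # rate is the sample rate returned by the model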

License

The model and code are available under the terms specified in the fairseq GitHub repository, which should be consulted for specific licensing details.
