facebook/tts_transformer-ru-cv7_css10

Introduction

tts_transformer-ru-cv7_css10 is a text-to-speech transformer model built with Meta AI's Fairseq framework. It synthesizes Russian speech from text with a single-speaker male voice, and was pre-trained on the Common Voice v7 dataset and fine-tuned on the CSS10 dataset.

Architecture

The model uses the transformer architecture, which is well suited to sequence-to-sequence tasks such as text-to-speech. It is built with Fairseq's S² toolkit for scalable and integrable speech synthesis and uses the HiFi-GAN vocoder to convert the predicted spectrograms into waveforms for improved audio quality.
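
As a rough illustration of how the vocoder choice surfaces in the API, the snippet below requests HiFi-GAN when the checkpoint is loaded (the same call appears in the full guide further down). The "griffin_lim" alternative mentioned in the comment is an assumption based on fairseq S²'s general vocoder options and may not be bundled with this checkpoint.

    from fairseq.checkpoint_utils import load_model_ensemble_and_task_from_hf_hub
    
    # Select the vocoder at load time; HiFi-GAN turns the predicted
    # spectrograms into waveforms. (A lighter "griffin_lim" option exists in
    # fairseq S², but its availability for this checkpoint is an assumption.)
    models, cfg, task = load_model_ensemble_and_task_from_hf_hub(
        "facebook/tts_transformer-ru-cv7_css10",
        arg_overrides={"vocoder": "hifigan", "fp16": False},
    )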

Training

The model was pre-trained on the Common Voice v7 dataset, a large, crowdsourced collection of multi-speaker voice recordings. It was then fine-tuned on the Russian portion of CSS10, a single-speaker dataset, to improve the consistency and quality of the generated Russian voice.

Guide: Running Locally

To run this model locally, follow these steps:

  1. Install Fairseq: Ensure you have Fairseq installed in your Python environment. You can do this using pip:

    pip install fairseq
    
  2. Load the Model: Use the Python script below to load the model and generate audio; a short sketch for saving the resulting waveform to a file follows these steps.

    from fairseq.checkpoint_utils import load_model_ensemble_and_task_from_hf_hub
    from fairseq.models.text_to_speech.hub_interface import TTSHubInterface
    import IPython.display as ipd
    
    # Download the checkpoint from the Hugging Face Hub and build the model,
    # config, and task, requesting the HiFi-GAN vocoder.
    models, cfg, task = load_model_ensemble_and_task_from_hf_hub(
        "facebook/tts_transformer-ru-cv7_css10",
        arg_overrides={"vocoder": "hifigan", "fp16": False}
    )
    model = models[0]
    TTSHubInterface.update_cfg_with_data_cfg(cfg, task.data_cfg)
    # build_generator expects a list of models, not a single model.
    generator = task.build_generator([model], cfg)
    
    text = "Здравствуйте, это пробный запуск."  # "Hello, this is a test run."
    
    # Preprocess the text into model input, synthesize the waveform,
    # and play it in the notebook.
    sample = TTSHubInterface.get_model_input(task, text)
    wav, rate = TTSHubInterface.get_prediction(task, model, generator, sample)
    
    ipd.Audio(wav, rate=rate)
    
  3. Cloud GPUs: For faster inference, consider running the model on a cloud GPU, such as those offered by AWS, Google Cloud, or Azure; this can significantly reduce the time needed to synthesize audio.
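
To keep the synthesized speech, the waveform returned in step 2 can be written to disk instead of only being played inline. The following is a minimal sketch, assuming wav is the 1-D float tensor and rate the sampling rate returned by get_prediction; it relies on the third-party soundfile package (pip install soundfile), which is not part of Fairseq:

    import soundfile as sf
    
    # get_prediction returns a 1-D float tensor and the sampling rate;
    # soundfile expects a NumPy array, so move the tensor to CPU and convert.
    sf.write("output.wav", wav.cpu().numpy(), rate)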

License

The model and code are subject to the licensing terms provided by Meta AI and the Fairseq framework. Users should review these terms to ensure compliance with usage rights and restrictions.
