xm_transformer_unity_en-hk

by facebook

Introduction

The xm_transformer_unity_en-hk model is a speech-to-speech translation model that translates English speech to Hokkien speech using the two-pass (UnitY) decoder framework from the Fairseq library. It is trained with both supervised data in the TED domain and weakly supervised data from the TED and Audiobook domains.

Architecture

The model is implemented in the Fairseq library and follows the two-pass UnitY design, in which the decoder first generates target text and then predicts discrete acoustic units. For speech synthesis, it is paired with the facebook/unit_hifigan_HK_layer12.km2500_frame_TAT-TTS unit-based HiFi-GAN vocoder.

Training

The training process involves the use of supervised data in the TED domain and weakly supervised data from both the TED and Audiobook domains. For detailed training information, refer to the research publication.

Guide: Running Locally

To run the model locally, follow these steps:

  1. Install Required Libraries:

    • Ensure you have Python and pip installed.
    • Install fairseq, torchaudio, and huggingface_hub using pip.
  2. Load the Model:

    • Use Fairseq's load_model_ensemble_and_task_from_hf_hub function to load the model from the Hugging Face Hub.
  3. Prepare Input Audio:

    • Convert your input audio to a 16,000 Hz, mono-channel format.
  4. Generate Predictions:

    • Use the S2THubInterface to process the audio and extract predictions.
  5. Perform Speech Synthesis:

    • Download and configure the vocoder model using CodeHiFiGANVocoder.
    • Generate synthesized speech using the vocoder.
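The five steps above can be sketched as a single Python function. This is a hedged sketch based on the usage pattern published with Fairseq speech-to-speech models: exact keyword arguments (e.g. `arg_overrides`, `config_yaml`) may differ across fairseq versions, and the heavy dependencies are imported inside the function so the module loads without them.

```python
def translate_en_to_hokkien(audio_path, cache_dir=None):
    """Translate a 16 kHz mono English speech file to Hokkien speech.

    Returns (waveform, sample_rate) for the synthesized Hokkien audio.
    Sketch only: argument names follow the fairseq model-card style and
    may need adjusting for your fairseq version.
    """
    import json

    import torchaudio
    from fairseq import hub_utils
    from fairseq.checkpoint_utils import load_model_ensemble_and_task_from_hf_hub
    from fairseq.models.speech_to_text.hub_interface import S2THubInterface
    from fairseq.models.text_to_speech import CodeHiFiGANVocoder
    from fairseq.models.text_to_speech.hub_interface import VocoderHubInterface
    from huggingface_hub import snapshot_download

    # Step 2: load the two-pass (UnitY) speech-to-unit model from the Hub.
    models, cfg, task = load_model_ensemble_and_task_from_hf_hub(
        "facebook/xm_transformer_unity_en-hk",
        arg_overrides={"config_yaml": "config.yaml", "task": "speech_to_text"},
        cache_dir=cache_dir,
    )
    model = models[0].cpu()
    cfg["task"].cpu = True
    generator = task.build_generator([model], cfg)

    # Step 3: the model expects 16,000 Hz mono input; resample beforehand
    # (e.g. with torchaudio.functional.resample) if your file differs.
    audio, _ = torchaudio.load(audio_path)

    # Step 4: run the two-pass decoder to obtain discrete acoustic units.
    sample = S2THubInterface.get_model_input(task, audio)
    units = S2THubInterface.get_prediction(task, model, generator, sample)

    # Step 5: download the unit HiFi-GAN vocoder and synthesize speech.
    vocoder_dir = snapshot_download(
        "facebook/unit_hifigan_HK_layer12.km2500_frame_TAT-TTS",
        cache_dir=cache_dir,
        library_name="fairseq",
    )
    x = hub_utils.from_pretrained(
        vocoder_dir,
        "model.pt",
        ".",
        archive_map=CodeHiFiGANVocoder.hub_models(),
        config_yaml="config.json",
        fp16=False,
        is_vocoder=True,
    )
    with open(f"{x['args']['data']}/config.json") as f:
        vocoder_cfg = json.load(f)
    vocoder = CodeHiFiGANVocoder(x["args"]["model_path"][0], vocoder_cfg)
    tts_model = VocoderHubInterface(vocoder_cfg, vocoder)

    tts_sample = tts_model.get_model_input(units)
    wav, sr = tts_model.get_prediction(tts_sample)
    return wav, sr
```

A call such as `wav, sr = translate_en_to_hokkien("input.wav")` downloads both models on first use (several GB), so expect the first run to be slow; the resulting waveform can be saved with `torchaudio.save` or played back in a notebook.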

Suggestion: Cloud GPUs

For better performance, consider using cloud GPU services such as AWS, Google Cloud, or Microsoft Azure to handle the computational demands of the model.

License

This model is licensed under the Creative Commons Attribution-NonCommercial 4.0 International (cc-by-nc-4.0) license.
