xm_transformer_s2ut_en-hk

facebook

Introduction

xm_transformer_s2ut_en-hk is a speech-to-speech translation model that translates English speech into Hokkien speech. It follows the speech-to-unit translation (S2UT) approach with a single-pass decoder and was trained on data from the TED-talk and audiobook domains, using both supervised and weakly supervised learning.

Architecture

The model is implemented in the Fairseq library and targets audio-to-audio tasks, specifically speech-to-speech translation. It uses the speech-to-unit translation (S2UT) architecture with a single-pass decoder: the model predicts discrete acoustic units of the target speech directly from the input audio, and a unit vocoder then converts those units into a waveform.

Training

The model was trained on supervised data from the TED domain and weakly supervised data from the TED and audiobook domains. For details on the training methodology, refer to the associated research publication.

Guide: Running Locally

To run the model locally, follow these steps:

  1. Install Prerequisites: Ensure you have Python and the necessary libraries installed, including fairseq, torchaudio, and huggingface_hub.

  2. Download Model: Use the huggingface_hub to download the model and its dependencies to your local cache directory.

    from huggingface_hub import snapshot_download
    cache_dir = snapshot_download("facebook/xm_transformer_s2ut_en-hk")
    
  3. Load and Configure Model: Utilize the Fairseq library to load the model ensemble and task configuration from the Hugging Face model hub.

    from fairseq.checkpoint_utils import load_model_ensemble_and_task_from_hf_hub
    models, cfg, task = load_model_ensemble_and_task_from_hf_hub(
        "facebook/xm_transformer_s2ut_en-hk",
        arg_overrides={"config_yaml": "config.yaml", "task": "speech_to_text"},
    )
    
  4. Prepare Audio: The model expects 16 kHz (16000 Hz), single-channel (mono) input. Load the audio with torchaudio, resampling and downmixing first if needed.

    import torchaudio
    audio, _ = torchaudio.load("/path/to/an/audio/file")
    
  5. Generate Translation: Build a generator from the task configuration and run inference to obtain the translated discrete speech units.

  6. Speech Synthesis: Convert the predicted units into a Hokkien waveform using the CodeHiFiGANVocoder unit vocoder.

    from fairseq.models.text_to_speech import CodeHiFiGANVocoder
    # construct with a vocoder checkpoint path and its configuration
    vocoder = CodeHiFiGANVocoder(...)
    
  7. Play Audio: Use IPython to play the generated audio.

    import IPython.display as ipd
    ipd.Audio(wav, rate=sr)
    

For optimal performance, consider using cloud GPUs to handle the computational load of processing and synthesizing audio inputs.

License

The model is licensed under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license. This allows for non-commercial use with appropriate attribution.
