xm_transformer_s2ut_hk-en Model

Introduction

xm_transformer_s2ut_hk-en is a speech-to-speech translation model that translates Hokkien speech into English speech. Built with fairseq, it uses a single-pass decoder (S2UT, speech-to-unit translation) and was trained on both supervised and weakly supervised data drawn from TED, TAT, and Hokkien dramas. The model is available on Hugging Face under the cc-by-nc-4.0 license.

Architecture

The model is built on fairseq and is published as an audio-to-audio model. It accepts 16000Hz mono-channel audio as input. Speech synthesis is handled by a separate component, the facebook/unit_hifigan_mhubert_vp_en_es_fr_it3_400k_layer11_km1000_lj_dur vocoder, which turns the translation model's predicted units into an English waveform.

Training

The training was conducted on a mix of data sources:

- Supervised data: the TED, TAT, and drama domains.
- Weakly supervised data: the drama domain.

For detailed insights into the training process, refer to the research publication.

Guide: Running Locally

To run the model locally, follow these steps:

Install Prerequisites:
- Ensure Python is installed.
- Install the necessary libraries: fairseq, torchaudio, huggingface_hub, and IPython (the last is only needed for audio playback in a notebook).
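
Before going further, it can help to confirm that the packages can be found by the interpreter. The following is a minimal sketch using only the standard library. The helper name `missing_packages` is hypothetical, and `huggingface_hub` is listed on the assumption that the Hub download step depends on it.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level package names that the interpreter cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Import names assumed for this guide; adjust to your environment.
    required = ["fairseq", "torchaudio", "IPython", "huggingface_hub"]
    missing = missing_packages(required)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages were found.")
```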

Download Model:

Use the Hugging Face Hub to download the model:

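```
import os
from fairseq.checkpoint_utils import load_model_ensemble_and_task_from_hf_hub

# Optional cache location; None lets fairseq fall back to its default.
cache_dir = os.getenv("HUGGINGFACE_HUB_CACHE")

models, cfg, task = load_model_ensemble_and_task_from_hf_hub(
    "facebook/xm_transformer_s2ut_hk-en",
    arg_overrides={"config_yaml": "config.yaml", "task": "speech_to_text"},
    cache_dir=cache_dir,
)
```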

Load Audio File:
- Ensure your audio file is in 16000Hz mono channel format.
- Load it using torchaudio:
```
import torchaudio

# Returns a (channels, samples) tensor and the file's sample rate.
audio, _ = torchaudio.load("/path/to/an/audio/file")
```
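
The 16000Hz mono requirement can be checked before the file is loaded. The sketch below uses Python's standard `wave` module, so it only applies to uncompressed WAV files. The function name `is_16k_mono_wav` is hypothetical.

```python
import wave

EXPECTED_RATE = 16000  # Hz, the sample rate the model expects
EXPECTED_CHANNELS = 1  # mono


def is_16k_mono_wav(path):
    """Return True if the WAV file at `path` is 16000Hz and single-channel."""
    with wave.open(path, "rb") as wav_file:
        rate = wav_file.getframerate()
        channels = wav_file.getnchannels()
    return rate == EXPECTED_RATE and channels == EXPECTED_CHANNELS
```

Files that fail the check need to be resampled or downmixed first; `torchaudio.functional.resample` is one option for changing the sample rate.
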

Make Predictions:
- Run the loaded audio through the translation model to obtain its prediction, then pass that prediction to the synthesizer to generate English speech.
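
The prediction step can be sketched as follows. This follows the usage pattern published on the model's Hugging Face card as best it can be reproduced here, so it should be checked against the card and the installed fairseq version before use. The wrapper names `translate_to_units` and `units_to_waveform` are hypothetical; `task`, `model`, and `cfg` are assumed to come from the download step and `audio` from the loading step. Imports sit inside the functions so the definitions load even where fairseq is not installed.

```python
import json
from pathlib import Path


def translate_to_units(task, model, cfg, audio):
    """Run the S2UT model on a loaded waveform and return its predicted units."""
    from fairseq.models.speech_to_text.hub_interface import S2THubInterface

    generator = task.build_generator([model], cfg)
    sample = S2THubInterface.get_model_input(task, audio)
    return S2THubInterface.get_prediction(task, model, generator, sample)


def units_to_waveform(units, cache_dir=None):
    """Synthesize speech from predicted units with the unit HiFi-GAN vocoder."""
    from fairseq import hub_utils
    from fairseq.models.text_to_speech import CodeHiFiGANVocoder
    from fairseq.models.text_to_speech.hub_interface import VocoderHubInterface
    from huggingface_hub import snapshot_download

    library_name = "fairseq"
    cache_dir = cache_dir or (Path.home() / ".cache" / library_name).as_posix()
    # Download the vocoder named in the Architecture section.
    vocoder_dir = snapshot_download(
        "facebook/unit_hifigan_mhubert_vp_en_es_fr_it3_400k_layer11_km1000_lj_dur",
        cache_dir=cache_dir,
        library_name=library_name,
    )
    loaded = hub_utils.from_pretrained(
        vocoder_dir,
        "model.pt",
        ".",
        archive_map=CodeHiFiGANVocoder.hub_models(),
        config_yaml="config.json",
        fp16=False,
        is_vocoder=True,
    )
    with open(f"{loaded['args']['data']}/config.json") as f:
        vocoder_cfg = json.load(f)

    vocoder = CodeHiFiGANVocoder(loaded["args"]["model_path"][0], vocoder_cfg)
    tts_model = VocoderHubInterface(vocoder_cfg, vocoder)
    tts_sample = tts_model.get_model_input(units)
    wav, sample_rate = tts_model.get_prediction(tts_sample)
    return wav, sample_rate
```

With `models`, `cfg`, `task`, and `audio` from the earlier steps, `units = translate_to_units(task, models[0], cfg, audio)` followed by `wav, sr = units_to_waveform(units)` would yield a waveform that `IPython.display.Audio(wav, rate=sr)` can play in a notebook.
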

Environment:
- A GPU makes processing more efficient. If one is not available locally, consider a cloud service with GPU support, such as AWS, Google Cloud, or Azure.
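
Whether a GPU is actually visible can be checked at runtime. A minimal sketch, assuming PyTorch is the backend that fairseq runs on; the helper name `pick_device` is hypothetical, and it falls back to the CPU when PyTorch is not installed or no CUDA device is present.

```python
def pick_device():
    """Return "cuda" when PyTorch sees a CUDA GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # Without PyTorch there is nothing to run on a GPU.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


if __name__ == "__main__":
    print("Inference device:", pick_device())
```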

License

The model is released under the cc-by-nc-4.0 license, which permits sharing and adaptation for non-commercial purposes as long as appropriate credit is given.
