xm_transformer_unity_en-hk
Introduction
The facebook/xm_transformer_unity_en-hk model is a speech-to-speech translation model that translates English to Hokkien using the two-pass decoder (UnitY) framework from the Fairseq library. It leverages both supervised and weakly supervised data from the TED and Audiobook domains.
Architecture
The model is built with the Fairseq library and is trained on the MuST-C dataset. For speech synthesis, it uses the facebook/unit_hifigan_HK_layer12.km2500_frame_TAT-TTS vocoder model.
Training
The training process involves the use of supervised data in the TED domain and weakly supervised data from both the TED and Audiobook domains. For detailed training information, refer to the research publication.
Guide: Running Locally
To run the model locally, follow these steps:
1. Install Required Libraries:
   - Ensure you have Python and pip installed.
   - Install Fairseq, Torchaudio, and Hugging Face Hub using pip.
2. Load the Model:
   - Use the load_model_ensemble_and_task_from_hf_hub function from Fairseq to load the model.
3. Prepare Input Audio:
   - Convert your audio input to 16,000 Hz mono format.
4. Generate Predictions:
   - Use the S2THubInterface to process the audio and extract predictions.
5. Perform Speech Synthesis:
   - Download and configure the vocoder model using CodeHiFiGANVocoder.
   - Generate synthesized speech using the vocoder.
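The steps above can be sketched end to end in Python. This is a minimal sketch following the usual Fairseq speech-to-speech hub workflow, not a verified recipe for this exact checkpoint: the arg_overrides values, the input/output file names, and the tensor shapes passed to torchaudio.save are assumptions, and running it downloads large checkpoints from the Hugging Face Hub.

```python
# Assumed prerequisites: pip install fairseq torchaudio huggingface_hub
import json

import torchaudio
from fairseq import hub_utils
from fairseq.checkpoint_utils import load_model_ensemble_and_task_from_hf_hub
from fairseq.models.speech_to_text.hub_interface import S2THubInterface
from fairseq.models.text_to_speech import CodeHiFiGANVocoder
from fairseq.models.text_to_speech.hub_interface import VocoderHubInterface
from huggingface_hub import snapshot_download

# 1) Load the translation model from the Hugging Face Hub.
models, cfg, task = load_model_ensemble_and_task_from_hf_hub(
    "facebook/xm_transformer_unity_en-hk",
    arg_overrides={"config_yaml": "config.yaml", "task": "speech_to_text"},
)
model = models[0].cpu()
cfg["task"].cpu = True
generator = task.build_generator([model], cfg)

# 2) Prepare input audio: 16,000 Hz, mono ("english_input.wav" is a placeholder).
audio, sr = torchaudio.load("english_input.wav")
if sr != 16_000:
    audio = torchaudio.functional.resample(audio, sr, 16_000)

# 3) Translate: produce discrete speech units for the Hokkien output.
sample = S2THubInterface.get_model_input(task, audio)
units = S2THubInterface.get_prediction(task, model, generator, sample)

# 4) Synthesize speech from the units with the HiFi-GAN vocoder.
vocoder_dir = snapshot_download(
    "facebook/unit_hifigan_HK_layer12.km2500_frame_TAT-TTS",
    library_name="fairseq",
)
x = hub_utils.from_pretrained(
    vocoder_dir,
    "model.pt",
    ".",
    config_yaml="config.json",
    fp16=False,
    is_vocoder=True,
)
with open(f"{x['args']['data']}/config.json") as f:
    vocoder_cfg = json.load(f)
vocoder = CodeHiFiGANVocoder(x["args"]["model_path"][0], vocoder_cfg)
tts_model = VocoderHubInterface(vocoder_cfg, vocoder)

tts_sample = tts_model.get_model_input(units)
wav, out_sr = tts_model.get_prediction(tts_sample)
torchaudio.save("hokkien_output.wav", wav.unsqueeze(0), out_sr)
```

On CPU this runs slowly; moving the model and generator to a GPU (and dropping the cpu overrides) is the practical configuration, which is also why the cloud GPU suggestion below applies.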
Suggestion: Cloud GPUs
For better performance, consider using cloud GPU services such as AWS, Google Cloud, or Microsoft Azure to handle the computational demands of the model.
License
This model is licensed under the Creative Commons Attribution-NonCommercial 4.0 International (cc-by-nc-4.0) license.