xm_transformer_s2ut_en-hk
facebook
Introduction
XM_TRANSFORMER_S2UT_EN-HK is a speech-to-speech translation model that converts English audio into Hokkien speech using a single-pass decoder. It was trained on data from TED talks and audiobooks, using both supervised and weakly supervised learning.
Architecture
The model is built with the Fairseq library and supports audio-to-audio tasks, specifically speech-to-speech translation. It uses the S2UT (speech-to-unit translation) architecture with a single-pass decoder: the model predicts discrete speech units from the input audio, and a vocoder then synthesizes the translated speech from those units.
Training
The model was trained on supervised data from the TED domain combined with weakly supervised data from both the TED and Audiobook domains. For details on the training methodology, refer to the associated research publication.
Guide: Running Locally
To run the model locally, follow these steps:
- Install Prerequisites: Ensure you have Python and the necessary libraries installed, including fairseq, torchaudio, and huggingface_hub.
- Download Model: Use huggingface_hub to download the model and its dependencies to your local cache directory.
    from huggingface_hub import snapshot_download
    cache_dir = snapshot_download("facebook/xm_transformer_s2ut_en-hk")
- Load and Configure Model: Utilize the Fairseq library to load the model ensemble and task configuration from the Hugging Face model hub.
    from fairseq.checkpoint_utils import load_model_ensemble_and_task_from_hf_hub
    models, cfg, task = load_model_ensemble_and_task_from_hf_hub(
        "facebook/xm_transformer_s2ut_en-hk",
        arg_overrides={"config_yaml": "config.yaml", "task": "speech_to_text"},
    )
- Prepare Audio: Ensure your input audio is in 16000 Hz mono channel format and load it using torchaudio.
    import torchaudio
    audio, _ = torchaudio.load("/path/to/an/audio/file")
- Generate Translation: Use the task's generator to obtain the translated speech units.
- Speech Synthesis: Synthesize the translated speech using the CodeHiFiGANVocoder model.
    from fairseq.models.text_to_speech import CodeHiFiGANVocoder
    vocoder = CodeHiFiGANVocoder(...)
- Play Audio: Use IPython to play the generated audio.
    import IPython.display as ipd
    # wav is the synthesized waveform and sr its sample rate, both from the synthesis step
    ipd.Audio(wav, rate=sr)
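The Generate Translation step is described above without code. A minimal sketch of it follows, assuming the S2THubInterface helper from fairseq.models.speech_to_text.hub_interface (not named in the steps above, so treat its use as an assumption) and the models, cfg, task, and audio objects produced by the earlier steps. The function name translate_to_units is hypothetical and exists only for illustration.

```python
def translate_to_units(models, cfg, task, audio):
    """Sketch: run one waveform through the model and return the predicted units."""
    # Imported lazily so the function can be defined without fairseq installed.
    # Assumption: S2THubInterface is the helper used for this model family.
    from fairseq.models.speech_to_text.hub_interface import S2THubInterface

    # The hub loader returns a list of models (an ensemble); use the first one.
    model = models[0]

    # "The task's generator" from the step description.
    generator = task.build_generator([model], cfg)

    # Wrap the raw waveform in the input format the model expects,
    # then decode it into the translated speech units.
    sample = S2THubInterface.get_model_input(task, audio)
    return S2THubInterface.get_prediction(task, model, generator, sample)
```

Called as unit = translate_to_units(models, cfg, task, audio), the result is what the Speech Synthesis step would then pass to the vocoder.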
For optimal performance, consider using cloud GPUs to handle the computational load of processing and synthesizing audio inputs.
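Before sending a file to a GPU machine, it can help to confirm that it already meets the 16000 Hz mono requirement from the Prepare Audio step. The sketch below uses only Python's standard wave module, so it applies to uncompressed PCM WAV files only. The helper names describe_wav and is_model_ready are hypothetical.

```python
import wave

REQUIRED_RATE = 16000    # Hz, as stated in the Prepare Audio step
REQUIRED_CHANNELS = 1    # mono


def describe_wav(path):
    """Return (sample_rate, channels) read from a PCM WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate(), wav_file.getnchannels()


def is_model_ready(path):
    """True if the file is already 16000 Hz mono; otherwise it needs
    resampling and/or downmixing before being passed to the model."""
    rate, channels = describe_wav(path)
    return rate == REQUIRED_RATE and channels == REQUIRED_CHANNELS
```

A file that fails this check would need to be resampled or downmixed with an audio tool of your choice before loading it with torchaudio.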
License
The model is licensed under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license. This allows for non-commercial use with appropriate attribution.