xm_transformer_600m-en_tr-multi_domain
Introduction
The xm_transformer_600m-en_tr-multi_domain model is a speech-to-text translation model developed by Facebook AI using Fairseq's S2T framework. It translates English speech into Turkish text and, when paired with a Turkish text-to-speech model, supports end-to-end English-to-Turkish speech-to-speech translation, leveraging large multilingual speech translation datasets.
Architecture
The model is based on the W2V2-Transformer architecture (a wav2vec 2.0 speech encoder feeding a Transformer decoder) and is trained with Fairseq's tooling. It targets audio-to-audio use, specifically speech-to-speech translation. Training draws on datasets such as MuST-C, CoVoST 2, Multilingual LibriSpeech, Common Voice v7, and CCMatrix.
Training
The model was fine-tuned from pretrained checkpoints to handle multilingual speech translation efficiently, combining speech translation with downstream speech synthesis to achieve high performance in the English-Turkish domain.
Guide: Running Locally
To run the model locally, follow these steps:
- Install Dependencies: Ensure you have Python and the required libraries, including Fairseq, Torchaudio, and huggingface_hub.
- Load the Model: Use `fairseq.checkpoint_utils` (its `load_model_ensemble_and_task_from_hf_hub` helper) to load the model ensemble and task configuration from the Hugging Face Hub.
- Prepare Audio Input: Convert your audio file to a 16000 Hz, mono-channel format using Torchaudio.
- Get Predictions: Use `S2THubInterface` for speech-to-text predictions and `TTSHubInterface` for speech synthesis.
- Playback: Use IPython to play back the synthesized audio.
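The steps above can be sketched end to end as follows. This is a minimal, non-authoritative example: the audio path is a placeholder, and the Turkish TTS checkpoint name (`facebook/tts_transformer-tr-cv7`) and its `arg_overrides` are assumptions about a companion model rather than something stated in this card.

```python
import IPython.display as ipd
import torchaudio
from fairseq.checkpoint_utils import load_model_ensemble_and_task_from_hf_hub
from fairseq.models.speech_to_text.hub_interface import S2THubInterface
from fairseq.models.text_to_speech.hub_interface import TTSHubInterface

# Load the English->Turkish speech-to-text translation model from the Hub
models, cfg, task = load_model_ensemble_and_task_from_hf_hub(
    "facebook/xm_transformer_600m-en_tr-multi_domain",
    arg_overrides={"config_yaml": "config.yaml"},
)
model = models[0]
generator = task.build_generator([model], cfg)

# Input must already be 16000 Hz, mono
audio, _ = torchaudio.load("/path/to/an/audio/file")
sample = S2THubInterface.get_model_input(task, audio)
turkish_text = S2THubInterface.get_prediction(task, model, generator, sample)

# Speech synthesis with an assumed Turkish TTS companion checkpoint
tts_models, tts_cfg, tts_task = load_model_ensemble_and_task_from_hf_hub(
    "facebook/tts_transformer-tr-cv7",
    arg_overrides={"vocoder": "griffin_lim", "fp16": False},
)
tts_model = tts_models[0]
TTSHubInterface.update_cfg_with_data_cfg(tts_cfg, tts_task.data_cfg)
tts_generator = tts_task.build_generator([tts_model], tts_cfg)

tts_sample = TTSHubInterface.get_model_input(tts_task, turkish_text)
wav, sr = TTSHubInterface.get_prediction(tts_task, tts_model, tts_generator, tts_sample)
ipd.Audio(wav, rate=sr)  # play back in a notebook
```

Note that both `load_model_ensemble_and_task_from_hf_hub` calls download checkpoints from the Hub on first use, so expect a sizable initial download.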
For optimal performance, consider using cloud GPUs from providers like AWS, Google Cloud, or Azure.
License
The model and its components are subject to the licensing terms provided by Hugging Face and associated libraries, which should be reviewed before use.