unit_hifigan_mhubert_vp_en_es_fr_it3_400k_layer11_km1000_lj_dur

facebook

Introduction

The unit_hifigan_mhubert_vp_en_es_fr_it3_400k_layer11_km1000_lj_dur model, developed by Facebook AI, is a speech-to-speech translation model built on fairseq's S2UT (speech-to-unit translation) framework. It supports Spanish-English translation and was trained on several datasets, including mTEDx, CoVoST 2, Europarl-ST, and VoxPopuli.

Architecture

The model is built with the fairseq library, a toolkit for sequence-to-sequence tasks. Input audio is first converted into a sequence of discrete units, translation operates on those units, and the CodeHiFiGAN vocoder then synthesizes speech from the resulting unit sequence.
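To make "discrete units" concrete, here is a toy sketch in plain Python. It assumes, based only on the `km1000` and `dur` parts of the model name, that units are k-means cluster indices over frame-level features and that consecutive repeats are collapsed before synthesis. The centroids and frames below are invented for illustration; the real model would use learned centroids over mHuBERT features, not these values.

```python
def nearest_unit(frame, centroids):
    """Return the index of the centroid closest to `frame` (squared Euclidean distance)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(centroids)), key=lambda i: sq_dist(frame, centroids[i]))


def frames_to_units(frames, centroids):
    """Map each feature frame to the index of its nearest centroid."""
    return [nearest_unit(frame, centroids) for frame in frames]


def collapse_repeats(units):
    """Drop consecutive duplicates, keeping the first unit of each run."""
    reduced = []
    for unit in units:
        if not reduced or reduced[-1] != unit:
            reduced.append(unit)
    return reduced


# Hypothetical 2-D "features" and three centroids (a real codebook would have 1000).
centroids = [(0.0, 0.0), (1.0, 1.0), (4.0, 0.0)]
frames = [(0.1, -0.1), (0.2, 0.0), (0.9, 1.2), (3.8, 0.1), (4.1, -0.2)]

units = frames_to_units(frames, centroids)   # [0, 0, 1, 2, 2]
reduced = collapse_repeats(units)            # [0, 1, 2]
```

The point of the sketch is only that the vocoder consumes a sequence of integers rather than a waveform or text.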

Training

The model was trained on datasets such as mTEDx, CoVoST 2, Europarl-ST, and VoxPopuli. These corpora cover a wide range of speakers, accents, and speaking styles, which is intended to improve the accuracy of speech-to-speech translation on varied input.

Guide: Running Locally

To run the model locally, follow these steps:

  1. Install Dependencies: Make sure the fairseq, torchaudio, and huggingface_hub libraries are installed in your Python environment.
  2. Download Model: Use the snapshot_download function from huggingface_hub to download the model files to a cache directory of your choice.
  3. Load Model: Use load_model_ensemble_and_task_from_hf_hub (from fairseq.checkpoint_utils) to load the model, its configuration, and the task.
  4. Prepare Audio: Make sure your input audio is sampled at 16,000 Hz with a single (mono) channel.
  5. Process Audio: Use S2THubInterface to build the model input from the audio and predict speech units.
  6. Synthesize Speech: Use VocoderHubInterface to synthesize audio from the speech units, then play it with IPython.display.Audio.
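The steps above can be sketched as follows. This is a non-authoritative outline that follows the usage pattern commonly shown for fairseq S2UT models, not a verified recipe for every fairseq version. Assumptions: the vocoder repository id is assembled from the page title and the "facebook" owner tag; `s2ut_repo` is a placeholder for the id of the speech-to-unit translation model you pair with the vocoder, which this page does not name; the helper names and the `arg_overrides` values are illustrative. Heavy imports sit inside the functions so the module can be imported without fairseq installed.

```python
import json
import os
from pathlib import Path

# Assumed repo id: "facebook" owner tag + the model name from the page title.
VOCODER_REPO = "facebook/unit_hifigan_mhubert_vp_en_es_fr_it3_400k_layer11_km1000_lj_dur"
TARGET_SAMPLE_RATE = 16000  # step 4: 16,000 Hz, mono


def resolve_cache_dir(library_name="fairseq"):
    """Return HUGGINGFACE_HUB_CACHE if set, otherwise ~/.cache/<library_name>."""
    return os.getenv("HUGGINGFACE_HUB_CACHE") or (Path.home() / ".cache" / library_name).as_posix()


def load_audio_16k_mono(path):
    """Step 4: load audio, average channels to mono, resample to 16 kHz if needed."""
    import torchaudio

    waveform, sample_rate = torchaudio.load(path)  # shape: (channels, frames)
    waveform = waveform.mean(dim=0, keepdim=True)  # shape: (1, frames)
    if sample_rate != TARGET_SAMPLE_RATE:
        waveform = torchaudio.functional.resample(waveform, sample_rate, TARGET_SAMPLE_RATE)
    return waveform


def speech_to_units(s2ut_repo, waveform, cache_dir):
    """Steps 3 and 5: load the translation model on CPU and predict speech units."""
    from fairseq.checkpoint_utils import load_model_ensemble_and_task_from_hf_hub
    from fairseq.models.speech_to_text.hub_interface import S2THubInterface

    models, cfg, task = load_model_ensemble_and_task_from_hf_hub(
        s2ut_repo,  # placeholder: supply the S2UT model id yourself
        arg_overrides={"config_yaml": "config.yaml", "task": "speech_to_text"},
        cache_dir=cache_dir,
    )
    model = models[0].cpu()
    cfg["task"].cpu = True
    generator = task.build_generator([model], cfg)
    sample = S2THubInterface.get_model_input(task, waveform)
    return S2THubInterface.get_prediction(task, model, generator, sample)


def units_to_waveform(units, cache_dir):
    """Steps 2 and 6: download the vocoder and synthesize audio from the units."""
    from fairseq import hub_utils
    from fairseq.models.text_to_speech import CodeHiFiGANVocoder
    from fairseq.models.text_to_speech.hub_interface import VocoderHubInterface
    from huggingface_hub import snapshot_download

    local_dir = snapshot_download(VOCODER_REPO, cache_dir=cache_dir, library_name="fairseq")
    loaded = hub_utils.from_pretrained(
        local_dir,
        "model.pt",
        ".",
        archive_map=CodeHiFiGANVocoder.hub_models(),
        config_yaml="config.json",
        fp16=False,
        is_vocoder=True,
    )
    with open(f"{loaded['args']['data']}/config.json") as f:
        vocoder_cfg = json.load(f)
    vocoder = CodeHiFiGANVocoder(loaded["args"]["model_path"][0], vocoder_cfg)
    tts_model = VocoderHubInterface(vocoder_cfg, vocoder)
    tts_sample = tts_model.get_model_input(units)
    return tts_model.get_prediction(tts_sample)  # (wav, sample_rate)
```

A typical call sequence would be `waveform = load_audio_16k_mono(path)`, then `units = speech_to_units(s2ut_repo, waveform, resolve_cache_dir())`, then `wav, sr = units_to_waveform(units, resolve_cache_dir())`. In a notebook, `IPython.display.Audio(wav, rate=sr)` plays the result.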

Cloud GPUs

For faster processing, especially when working with large amounts of audio, consider using cloud-based GPU services such as AWS EC2, Google Cloud Platform, or Azure.

License

This model is licensed under the Creative Commons Attribution-NonCommercial 4.0 International license (CC BY-NC 4.0). The license allows sharing and adapting the model for non-commercial purposes, provided appropriate credit is given.
