seamless m4t v2 large

facebook

Introduction

SeamlessM4T v2 is a foundational multilingual, multimodal machine translation model developed by Meta for translating speech and text across nearly 100 languages. A single model handles speech-to-speech, speech-to-text, text-to-speech, and text-to-text translation, as well as automatic speech recognition.

Architecture

SeamlessM4T v2 uses the UnitY2 architecture, which features hierarchical character-to-unit upsampling and non-autoregressive text-to-unit decoding. Compared with its predecessor, SeamlessM4T v1, this architecture improves both output quality and inference speed. The model supports 101 languages for speech input, 96 for text input/output, and 35 for speech output.

Training

Detailed evaluation metrics are published for SeamlessM4T-Large (v2) and SeamlessM4T-Medium (v1). The training and evaluation processes are documented, so the reported results can be reproduced on the provided datasets, such as FLEURS, CoVoST2, and CVSS-C.

Guide: Running Locally

  1. Install Dependencies:

    • Install the Hugging Face Transformers library and SentencePiece:
      pip install git+https://github.com/huggingface/transformers.git sentencepiece
      
  2. Run Sample Code:

    • Use the Transformers library to generate speech from text or audio inputs. Here's a basic example that converts English text to Russian speech:
      from transformers import AutoProcessor, SeamlessM4Tv2Model
      
      # Load the processor (tokenizer and feature extractor) and the model weights
      processor = AutoProcessor.from_pretrained("facebook/seamless-m4t-v2-large")
      model = SeamlessM4Tv2Model.from_pretrained("facebook/seamless-m4t-v2-large")
      
      # Text to speech: tokenize the English text, then generate a Russian waveform
      text_inputs = processor(text="Hello, my dog is cute", src_lang="eng", return_tensors="pt")
      audio_array_from_text = model.generate(**text_inputs, tgt_lang="rus")[0].cpu().numpy().squeeze()
      
  3. Listen to Audio:

    • Play the result with IPython's Audio display in a Jupyter notebook, or save it as a .wav file using a library like scipy.

  4. Hardware Recommendations:

    • For optimal performance, especially for larger models like SeamlessM4T-Large, consider using cloud GPUs from providers like AWS, Google Cloud, or Azure.
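
For step 3, saving the waveform to disk can be sketched as follows. This is a minimal illustration rather than the model card's own code: it swaps in Python's standard-library wave module in place of scipy so that it runs without extra dependencies. The helper name save_wav, the output filename, and the synthetic 440 Hz tone (a stand-in for audio_array_from_text) are hypothetical, and the 16 kHz rate is an assumption; with the real model, read the rate from model.config.sampling_rate instead.

```python
import math
import struct
import wave


def save_wav(samples, path, sample_rate=16000):
    """Write mono float samples in [-1.0, 1.0] to a 16-bit PCM WAV file."""
    frames = bytearray()
    for sample in samples:
        # Clip to the valid range, then scale to a signed 16-bit integer.
        clipped = max(-1.0, min(1.0, float(sample)))
        frames += struct.pack("<h", int(round(clipped * 32767)))
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)            # mono
        wav_file.setsampwidth(2)            # 2 bytes = 16-bit PCM
        wav_file.setframerate(sample_rate)  # assumed 16 kHz here
        wav_file.writeframes(bytes(frames))


# Stand-in for audio_array_from_text: one second of a 440 Hz tone.
tone = [0.3 * math.sin(2 * math.pi * 440 * n / 16000) for n in range(16000)]
save_wav(tone, "out_from_text.wav")
```

With scipy installed, the equivalent call would be scipy.io.wavfile.write(path, rate, data), which accepts the NumPy array directly.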

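The same model can also return translated text instead of audio. The sketch below assumes the Transformers API for this model, where passing generate_speech=False to generate yields text tokens that the processor can decode. The wrapper function translate_text and its default arguments are hypothetical names chosen for illustration; the imports are done inside the function so that defining it does not load the library or the model.

```python
def translate_text(text, src_lang="eng", tgt_lang="fra",
                   model_id="facebook/seamless-m4t-v2-large"):
    """Translate text to text with SeamlessM4T v2 (downloads the model on first call)."""
    # Imported lazily: the heavy dependency is only needed when the function runs.
    from transformers import AutoProcessor, SeamlessM4Tv2Model

    processor = AutoProcessor.from_pretrained(model_id)
    model = SeamlessM4Tv2Model.from_pretrained(model_id)

    inputs = processor(text=text, src_lang=src_lang, return_tensors="pt")
    # generate_speech=False requests text tokens rather than a waveform.
    output_tokens = model.generate(**inputs, tgt_lang=tgt_lang, generate_speech=False)
    return processor.decode(output_tokens[0].tolist()[0], skip_special_tokens=True)
```

Calling translate_text("Hello, my dog is cute") would then return the French translation as a string. In practice, load the processor and model once and reuse them, since reloading on every call is slow.
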
License

SeamlessM4T v2 is released under the CC BY-NC 4.0 license, which permits non-commercial use with proper attribution.
