SeamlessM4T v2 Large Model

Introduction

SeamlessM4T v2 is a foundational multilingual and multimodal machine translation model developed by Meta for high-quality translation across speech and text in nearly 100 languages. It supports various tasks including speech-to-speech, speech-to-text, text-to-speech, and text-to-text translation, as well as automatic speech recognition.

Architecture

SeamlessM4T v2 utilizes the UnitY2 architecture, which features hierarchical character-to-unit upsampling and non-autoregressive text-to-unit decoding. This architecture enhances both quality and inference speed compared to its predecessor, SeamlessM4T v1. The model supports 101 languages for speech input, 96 for text input/output, and 35 for speech output.

Training

Detailed evaluation metrics are published for SeamlessM4T-Large (v2) and SeamlessM4T-Medium (v1). The training and evaluation processes are documented, so results can be reproduced with the provided datasets, such as FLEURS, CoVoST2, and CVSS-C.

Guide: Running Locally

Install Dependencies:
- Install the Hugging Face Transformers library and SentencePiece:
```
pip install git+https://github.com/huggingface/transformers.git sentencepiece
```

Run Sample Code:

Use the Transformers library to generate speech samples from text and audio inputs. Here's a basic example for converting English text to Russian speech:

```
from transformers import AutoProcessor, SeamlessM4Tv2Model

# Load the processor (tokenizer and feature extractor) and the model weights
processor = AutoProcessor.from_pretrained("facebook/seamless-m4t-v2-large")
model = SeamlessM4Tv2Model.from_pretrained("facebook/seamless-m4t-v2-large")

# Text to speech: tokenize the English text, then generate a Russian waveform
text_inputs = processor(text="Hello, my dog is cute", src_lang="eng", return_tensors="pt")
audio_array_from_text = model.generate(**text_inputs, tgt_lang="rus")[0].cpu().numpy().squeeze()
```
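
The same `generate` call can serve the other tasks listed in the introduction. The sketch below is a small, hypothetical helper (neither `TASKS` nor `build_generate_kwargs` is part of Transformers) that maps a task name to keyword arguments for `model.generate`. It assumes that `generate_speech=False` switches the output from a waveform to text tokens, and that speech recognition is run as speech-to-text with the target language set to the spoken language.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TaskSpec:
    input_modality: str     # "speech" or "text": what the processor receives
    generate_speech: bool   # True -> waveform output, False -> text tokens


# Hypothetical lookup table for the five tasks named in the introduction.
TASKS = {
    "s2st": TaskSpec("speech", True),   # speech-to-speech translation
    "s2tt": TaskSpec("speech", False),  # speech-to-text translation
    "t2st": TaskSpec("text", True),     # text-to-speech translation
    "t2tt": TaskSpec("text", False),    # text-to-text translation
    "asr": TaskSpec("speech", False),   # automatic speech recognition
}


def build_generate_kwargs(task: str, src_lang: str, tgt_lang: Optional[str] = None) -> dict:
    """Return keyword arguments intended for model.generate(...)."""
    key = task.lower()
    spec = TASKS[key]
    if key == "asr":
        # Assumption: ASR transcribes into the language that was spoken.
        tgt_lang = src_lang
    elif tgt_lang is None:
        raise ValueError("tgt_lang is required for translation tasks")
    return {"tgt_lang": tgt_lang, "generate_speech": spec.generate_speech}
```

With the objects from the example above, a text-to-text call might look like `model.generate(**text_inputs, **build_generate_kwargs("t2tt", "eng", "fra"))`. The returned token IDs would then need to be decoded with the processor rather than treated as audio.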

Listen to Audio:
- Either use IPython's Audio display within a Jupyter notebook or save the audio output as a .wav file using a library like scipy.
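
As an illustration of the save-to-file option, here is a minimal sketch that writes a mono waveform to a 16-bit PCM .wav file using only Python's standard library instead of scipy. The helper name `save_wav` is hypothetical, and the 16 kHz default is an assumption; in practice the rate should be read from `model.config.sampling_rate`. A synthetic sine tone stands in for `audio_array_from_text` so the snippet runs without downloading the model.

```python
import array
import math
import sys
import wave


def save_wav(path, samples, sample_rate=16000):
    """Write mono float samples in [-1.0, 1.0] to a 16-bit PCM WAV file."""
    pcm = array.array("h")  # signed 16-bit integers
    for x in samples:
        clipped = max(-1.0, min(1.0, float(x)))  # guard against overflow
        pcm.append(int(round(clipped * 32767)))
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV data is little-endian
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes per sample
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm.tobytes())


# Stand-in for the model output: one second of a 440 Hz tone at 16 kHz.
tone = [0.3 * math.sin(2 * math.pi * 440 * n / 16000) for n in range(16000)]

if __name__ == "__main__":
    save_wav("out_from_text.wav", tone)
```

With the real model output, the call would be along the lines of `save_wav("out_from_text.wav", audio_array_from_text, sample_rate=model.config.sampling_rate)`. In a notebook, `IPython.display.Audio(audio_array_from_text, rate=model.config.sampling_rate)` plays the same array inline.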

Hardware Recommendations:
- For optimal performance, especially for larger models like SeamlessM4T-Large, consider using cloud GPUs from providers like AWS, Google Cloud, or Azure.

License

SeamlessM4T v2 is released under the CC BY-NC 4.0 license, which permits non-commercial use with proper attribution.
