s2t-medium-mustc-multilingual-st

Introduction
The s2t-medium-mustc-multilingual-st model is a Speech-to-Text Transformer (S2T) model developed for end-to-end multilingual speech translation (ST). It performs both automatic speech recognition (ASR) and speech translation. Built on a transformer-based seq2seq architecture, it can translate English speech into multiple languages, including French and German.
Architecture
The S2T model employs a transformer-based sequence-to-sequence (encoder-decoder) architecture. A convolutional downsampler shortens speech inputs to roughly a quarter of their original length before they are fed into the encoder. The model generates transcripts and translations autoregressively and is trained with the standard cross-entropy loss.
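As a rough illustration of that 4× subsampling, two stride-2 convolutional layers reduce the frame count to about a quarter. The kernel size, stride, and padding below are assumptions chosen for illustration, not values read from the released model configuration:

```python
def conv_out_len(n_frames, kernel=5, stride=2, padding=2):
    # Standard 1-D convolution output-length formula
    return (n_frames + 2 * padding - kernel) // stride + 1

def downsampled_len(n_frames):
    # Two stride-2 conv layers shorten the input to roughly a quarter
    return conv_out_len(conv_out_len(n_frames))
```

For example, a 1000-frame utterance comes out of this downsampler at 250 frames.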
Training
Training Data
The model is trained on the MuST-C dataset, a comprehensive multilingual speech translation corpus. The dataset consists of hundreds of hours of audio from English TED Talks, aligned with manual transcriptions and translations in several languages.
Training Procedure
- Preprocessing: Speech data is pre-processed to extract 80-channel log mel-filter bank features using PyKaldi or torchaudio, followed by CMVN. Texts are lowercased and tokenized with SentencePiece with a vocabulary size of 10,000.
- Training: The model uses SpecAugment for data augmentation and is trained with cross-entropy loss. The encoder is pre-trained for multilingual ASR, and for multilingual models, a target language ID token is used as the BOS token.
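The CMVN step mentioned above (cepstral mean and variance normalization) can be sketched with NumPy. This is a minimal per-utterance version for illustration, not the exact implementation used in the released preprocessing pipeline:

```python
import numpy as np

def cmvn(features, eps=1e-8):
    # features: (num_frames, num_mel_bins) log mel-filter bank matrix.
    # Subtract the per-dimension mean and divide by the per-dimension
    # standard deviation, both computed over the utterance.
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    return (features - mean) / (std + eps)
```

After normalization, each filter-bank dimension has approximately zero mean and unit variance over the utterance.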
Guide: Running Locally
- Install Dependencies:

  Install transformers with the extra speech dependencies:

  ```
  pip install "transformers[speech,sentencepiece]"
  ```

  Alternatively, install the packages separately:

  ```
  pip install torchaudio sentencepiece
  ```
- Load the Model:

  ```python
  from transformers import Speech2TextForConditionalGeneration, Speech2TextProcessor

  model = Speech2TextForConditionalGeneration.from_pretrained("facebook/s2t-medium-mustc-multilingual-st")
  processor = Speech2TextProcessor.from_pretrained("facebook/s2t-medium-mustc-multilingual-st")
  ```
- Prepare the Data:

  ```python
  import soundfile as sf
  from datasets import load_dataset

  def map_to_array(batch):
      # Read the audio file into a float array of samples
      speech, _ = sf.read(batch["file"])
      batch["speech"] = speech
      return batch

  ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")
  ds = ds.map(map_to_array)
  ```
- Generate Translations:

  ```python
  inputs = processor(ds["speech"][0], sampling_rate=16_000, return_tensors="pt")
  generated_ids = model.generate(
      inputs["input_features"],
      attention_mask=inputs["attention_mask"],
      forced_bos_token_id=processor.tokenizer.lang_code_to_id["fr"],
  )
  translation_fr = processor.batch_decode(generated_ids, skip_special_tokens=True)
  ```

  To translate into a different target language, pass that language's code (for example, "de" for German) to lang_code_to_id.
- Cloud GPU Recommendations: For faster inference, consider a cloud GPU from a provider such as AWS, Google Cloud, or Azure.
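The forced_bos_token_id argument used in the generation step pins the first generated token to the target-language ID, which is how the multilingual decoder is told which language to produce. Here is a toy greedy decoding loop, not the transformers implementation, that illustrates the mechanism:

```python
import numpy as np

def greedy_decode(logits_fn, bos_id, forced_bos_id, eos_id, max_len=10):
    # logits_fn maps the current token prefix to next-token logits.
    tokens = [bos_id]
    for step in range(max_len):
        if step == 0:
            # Force the first generated token to the target-language ID,
            # mirroring in spirit what forced_bos_token_id does in generate().
            next_tok = forced_bos_id
        else:
            next_tok = int(np.argmax(logits_fn(tokens)))
        tokens.append(next_tok)
        if next_tok == eos_id:
            break
    return tokens
```

Every subsequent token is chosen from the model's distribution as usual; only the first position is overridden.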
License
The model and its components are released under the MIT License, allowing for broad usage and adaptation.