s2t-medium-mustc-multilingual-st

Introduction
The s2t-medium-mustc-multilingual-st model is a Speech-to-Text Transformer (S2T) model developed for end-to-end multilingual speech translation (ST). It performs both automatic speech recognition (ASR) and speech translation. Built on a transformer-based seq2seq architecture, it can translate English speech into multiple languages, including French and German.
Architecture
The S2T model employs a transformer-based sequence-to-sequence (encoder-decoder) architecture. A convolutional downsampler shortens speech inputs to roughly a quarter of their original length before they are fed into the encoder. The model generates transcripts and translations autoregressively and is trained with the standard cross-entropy loss.
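As a rough illustration of that 4× subsampling, two stride-2 convolutional layers reduce the frame count to about a quarter. The kernel size, stride, and padding below are assumptions chosen for illustration, not values read from the released model configuration:

```python
def conv_out_len(n_frames, kernel=5, stride=2, padding=2):
    # Standard 1-D convolution output-length formula
    return (n_frames + 2 * padding - kernel) // stride + 1

def downsampled_len(n_frames):
    # Two stride-2 conv layers shorten the input to roughly a quarter
    return conv_out_len(conv_out_len(n_frames))
```

For example, a 1000-frame utterance comes out of this downsampler at 250 frames.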
Training
Training Data
The model is trained on the MuST-C dataset, a comprehensive multilingual speech translation corpus. The dataset consists of hundreds of hours of audio from English TED Talks, aligned with manual transcriptions and translations in several languages.
Training Procedure
- Preprocessing: Speech data is pre-processed to extract 80-channel log mel-filter bank features using PyKaldi or torchaudio, followed by CMVN. Texts are lowercased and tokenized with SentencePiece with a vocabulary size of 10,000.
- Training: The model uses SpecAugment for data augmentation and is trained with cross-entropy loss. The encoder is pre-trained for multilingual ASR, and for multilingual models, a target language ID token is used as the BOS token.
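The CMVN step mentioned above (cepstral mean and variance normalization) can be sketched with NumPy. This is a minimal per-utterance version for illustration, not the exact implementation used in the released preprocessing pipeline:

```python
import numpy as np

def cmvn(features, eps=1e-8):
    # features: (num_frames, num_mel_bins) log mel-filter bank matrix.
    # Subtract the per-dimension mean and divide by the per-dimension
    # standard deviation, both computed over the utterance.
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    return (features - mean) / (std + eps)
```

After normalization, each filter-bank dimension has approximately zero mean and unit variance over the utterance.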
Guide: Running Locally
- Install Dependencies:

  Install transformers with the extra speech dependencies:

  ```
  pip install "transformers[speech,sentencepiece]"
  ```

  Alternatively, install the packages separately:

  ```
  pip install torchaudio sentencepiece
  ```
- Load the Model:

  ```python
  from transformers import Speech2TextForConditionalGeneration, Speech2TextProcessor

  model = Speech2TextForConditionalGeneration.from_pretrained("facebook/s2t-medium-mustc-multilingual-st")
  processor = Speech2TextProcessor.from_pretrained("facebook/s2t-medium-mustc-multilingual-st")
  ```
- Prepare the Data:

  ```python
  import soundfile as sf
  from datasets import load_dataset

  def map_to_array(batch):
      # Read the audio file into a float array of samples
      speech, _ = sf.read(batch["file"])
      batch["speech"] = speech
      return batch

  ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")
  ds = ds.map(map_to_array)
  ```
- Generate Translations:

  ```python
  inputs = processor(ds["speech"][0], sampling_rate=16_000, return_tensors="pt")
  generated_ids = model.generate(
      inputs["input_features"],
      attention_mask=inputs["attention_mask"],
      forced_bos_token_id=processor.tokenizer.lang_code_to_id["fr"],
  )
  translation_fr = processor.batch_decode(generated_ids, skip_special_tokens=True)
  ```

  To translate into a different target language, pass that language's code (for example, "de" for German) to lang_code_to_id.
- Cloud GPU Recommendations: For faster inference, consider a cloud GPU from a provider such as AWS, Google Cloud, or Azure.
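The forced_bos_token_id argument used in the generation step pins the first generated token to the target-language ID, which is how the multilingual decoder is told which language to produce. Here is a toy greedy decoding loop, not the transformers implementation, that illustrates the mechanism:

```python
import numpy as np

def greedy_decode(logits_fn, bos_id, forced_bos_id, eos_id, max_len=10):
    # logits_fn maps the current token prefix to next-token logits.
    tokens = [bos_id]
    for step in range(max_len):
        if step == 0:
            # Force the first generated token to the target-language ID,
            # mirroring in spirit what forced_bos_token_id does in generate().
            next_tok = forced_bos_id
        else:
            next_tok = int(np.argmax(logits_fn(tokens)))
        tokens.append(next_tok)
        if next_tok == eos_id:
            break
    return tokens
```

Every subsequent token is chosen from the model's distribution as usual; only the first position is overridden.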
License
The model and its components are released under the MIT License, allowing for broad usage and adaptation.