S2T-SMALL-LIBRISPEECH-ASR
facebook
Introduction
S2T-SMALL-LIBRISPEECH-ASR is a Speech to Text Transformer (S2T) model for automatic speech recognition (ASR). It is an end-to-end sequence-to-sequence transformer trained with autoregressive cross-entropy loss to generate transcripts directly from speech features, making it suitable for end-to-end speech recognition applications.
Architecture
The S2T model employs a sequence-to-sequence transformer architecture. It processes input speech features through an encoder and generates text transcripts via an autoregressive decoder. The model is trained using standard autoregressive cross-entropy loss.
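For a concrete look at the hyperparameters behind this encoder-decoder architecture, the checkpoint's configuration can be inspected through the transformers library. This is a minimal sketch, not part of the original card: the attribute names follow the Speech2TextConfig API, and the printed values come from the downloaded checkpoint.
from transformers import Speech2TextConfig

# Load the checkpoint's configuration and print the core transformer dimensions.
config = Speech2TextConfig.from_pretrained("facebook/s2t-small-librispeech-asr")
print(config.encoder_layers, config.decoder_layers, config.d_model, config.encoder_attention_heads)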
Training
The S2T-SMALL-LIBRISPEECH-ASR model is trained on the LibriSpeech ASR Corpus, which includes approximately 1000 hours of 16 kHz English speech. The training data is pre-processed to extract 80-channel log mel-filter bank features, and transcripts are tokenized with a SentencePiece model using a vocabulary size of 10,000. SpecAugment is applied during training to improve robustness.
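To illustrate the feature pipeline described above, the sketch below runs the model's feature extractor on a stand-in waveform. This is an assumption-laden example rather than part of the original card: the low-level random noise is only a placeholder for real 16 kHz speech, and the extractor requires torchaudio to be installed.
import numpy as np
from transformers import Speech2TextFeatureExtractor

# Placeholder waveform: one second of low-level noise at 16 kHz standing in for real speech.
feature_extractor = Speech2TextFeatureExtractor.from_pretrained("facebook/s2t-small-librispeech-asr")
waveform = (0.01 * np.random.randn(16_000)).astype(np.float32)
features = feature_extractor(waveform, sampling_rate=16_000, return_tensors="pt").input_features
print(features.shape)  # (batch, num_frames, 80) log mel-filter bank features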
Guide: Running Locally
- Setup: Ensure you have Python and pip installed, then install the required packages (datasets is needed for the example dataset below):
pip install torch transformers datasets torchaudio sentencepiece
- Load Model and Processor:
from transformers import Speech2TextProcessor, Speech2TextForConditionalGeneration
from datasets import load_dataset

model = Speech2TextForConditionalGeneration.from_pretrained("facebook/s2t-small-librispeech-asr")
processor = Speech2TextProcessor.from_pretrained("facebook/s2t-small-librispeech-asr")
- Prepare Dataset:
ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")
- Generate Transcriptions:
input_features = processor(ds[0]["audio"]["array"], sampling_rate=16_000, return_tensors="pt").input_features
generated_ids = model.generate(input_features=input_features)
transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)
- Cloud GPUs: For efficient processing, consider using cloud GPUs from providers such as AWS, Google Cloud, or Azure; see the sketch after this list for moving the model and inputs onto a GPU.
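As a minimal sketch (assuming a CUDA-capable GPU and the model, processor, and input_features variables defined in the steps above), inference can be moved onto the GPU before generation:
import torch

# Use the GPU when one is available; otherwise stay on CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
input_features = input_features.to(device)
generated_ids = model.generate(input_features=input_features)
transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)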
License
The S2T-SMALL-LIBRISPEECH-ASR model is released under the MIT License.