s2t-small-librispeech-asr


Introduction

S2T-SMALL-LIBRISPEECH-ASR is a Speech to Text Transformer (S2T) model for automatic speech recognition (ASR). It is an end-to-end sequence-to-sequence transformer trained with autoregressive cross-entropy loss to generate transcripts directly from speech features, and it can be used as-is for end-to-end speech recognition.

Architecture

The S2T model uses a sequence-to-sequence transformer architecture: a transformer encoder processes the input speech features, and an autoregressive transformer decoder generates the text transcript token by token. Training uses a standard autoregressive cross-entropy loss.
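
The exact layer counts and hidden sizes of this checkpoint are not listed here, but they can be read from its configuration. The following sketch (assuming only that the transformers library is installed) prints a few of these fields rather than asserting specific values:

    from transformers import Speech2TextConfig

    # Inspect the encoder/decoder dimensions shipped with this checkpoint.
    config = Speech2TextConfig.from_pretrained("facebook/s2t-small-librispeech-asr")
    print("encoder layers:", config.encoder_layers)
    print("decoder layers:", config.decoder_layers)
    print("model dimension:", config.d_model)
    print("attention heads:", config.encoder_attention_heads)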

Training

The S2T-SMALL-LIBRISPEECH-ASR model is trained on the LibriSpeech ASR Corpus, which contains approximately 1,000 hours of 16 kHz English speech. The audio is pre-processed into 80-channel log mel-filter bank features, and the transcripts are tokenized with SentencePiece using a vocabulary of 10,000 tokens. SpecAugment is applied during training to improve robustness.
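
The same pre-processing is bundled with the checkpoint's processor, so the filter-bank and vocabulary settings described above can be verified directly. A minimal sketch, assuming transformers and sentencepiece are installed:

    from transformers import Speech2TextProcessor

    processor = Speech2TextProcessor.from_pretrained("facebook/s2t-small-librispeech-asr")
    # Feature extractor: 80-channel log mel-filter bank features at 16 kHz.
    print("mel bins:", processor.feature_extractor.num_mel_bins)
    print("sampling rate:", processor.feature_extractor.sampling_rate)
    # Tokenizer: SentencePiece model with a vocabulary of roughly 10,000 tokens.
    print("vocab size:", processor.tokenizer.vocab_size)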

Guide: Running Locally

  1. Setup: Ensure you have Python and pip installed. Then, install the required packages using:

    pip install torch torchaudio transformers datasets sentencepiece
    
  2. Load Model and Processor:

    from transformers import Speech2TextProcessor, Speech2TextForConditionalGeneration
    from datasets import load_dataset
    
    model = Speech2TextForConditionalGeneration.from_pretrained("facebook/s2t-small-librispeech-asr")
    processor = Speech2TextProcessor.from_pretrained("facebook/s2t-small-librispeech-asr")
    
  3. Prepare Dataset:

    ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")
    
  4. Generate Transcriptions:

    input_features = processor(ds[0]["audio"]["array"], sampling_rate=16_000, return_tensors="pt").input_features
    generated_ids = model.generate(input_features=input_features)
    transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)
    
  5. Cloud GPUs: For faster processing, consider using cloud GPUs from providers such as AWS, Google Cloud, or Azure; a minimal device-placement sketch follows this list.
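
A device-placement sketch, assuming the model, processor, and input_features from the steps above and falling back to the CPU when no GPU is present:

    import torch

    # Run generation on a GPU when one is available, otherwise on the CPU.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device)
    generated_ids = model.generate(input_features=input_features.to(device))
    transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)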

License

The S2T-SMALL-LIBRISPEECH-ASR model is released under the MIT License.
