asr wav2vec2 commonvoice en

speechbrain

Introduction

SpeechBrain's ASR-WAV2VEC2-COMMONVOICE-EN is a pretrained model for automatic speech recognition (ASR) in English. It is trained on the CommonVoice dataset and runs within the SpeechBrain toolkit, which is built on PyTorch.

Architecture

The ASR system consists of two primary components:

  • Tokenizer: A unigram tokenizer that converts words into subword units, trained on CommonVoice English train transcriptions.
  • Acoustic Model: A pretrained Wav2Vec 2.0 model combined with two DNN layers and fine-tuned on CommonVoice English. Its output is passed to a CTC decoder. The model expects audio sampled at 16 kHz; input audio is normalized (resampled and reduced to a single channel) as necessary during transcription.
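
To make the CTC decoding step more concrete, here is a minimal sketch of greedy CTC decoding in plain Python. It is an illustration under stated assumptions, not the decoder SpeechBrain ships: the blank index of 0, the function name ctc_greedy_decode, and the toy frame scores are all hypothetical, and the real model may use a different search strategy.

```python
from itertools import groupby

BLANK = 0  # assumed index of the CTC blank symbol (hypothetical choice)


def ctc_greedy_decode(frame_scores, blank=BLANK):
    """Greedy CTC decoding: argmax per frame, collapse repeats, drop blanks."""
    # Pick the highest-scoring token index for every frame.
    best = [max(range(len(scores)), key=scores.__getitem__) for scores in frame_scores]
    # Merge consecutive duplicates, then remove the blank symbol.
    return [token for token, _ in groupby(best) if token != blank]


# Toy scores for 6 frames over a 3-symbol vocabulary (0 = blank).
frames = [
    [0.1, 0.8, 0.1],
    [0.1, 0.7, 0.2],
    [0.9, 0.05, 0.05],
    [0.2, 0.7, 0.1],
    [0.1, 0.2, 0.7],
    [0.1, 0.2, 0.7],
]
decoded = ctc_greedy_decode(frames)
print(decoded)  # [1, 1, 2]
```

The blank frame between the two runs of token 1 is what allows the same token to appear twice in a row in the output. The resulting indices would then be mapped back to subword units by the tokenizer.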

Training

Training the model involves using the SpeechBrain toolkit. Steps include:

  1. Cloning the SpeechBrain repository.
  2. Installing dependencies from requirements.txt and setting up the project.
  3. Running the training script located in recipes/CommonVoice/ASR/seq2seq with appropriate hyperparameters and data.

Training results, including models and logs, are available via a shared Google Drive link.

Guide: Running Locally

To run the model locally:

  1. Install Dependencies:

    pip install speechbrain transformers
    
  2. Transcribe Audio:

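    from speechbrain.inference.ASR import EncoderDecoderASR

    # Fetch the pretrained model from the Hugging Face Hub and cache it in savedir.
    asr_model = EncoderDecoderASR.from_hparams(source="speechbrain/asr-wav2vec2-commonvoice-en", savedir="pretrained_models/asr-wav2vec2-commonvoice-en")
    # Transcribe the example file from the model repository and print the text.
    print(asr_model.transcribe_file("speechbrain/asr-wav2vec2-commonvoice-en/example.wav"))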
    
  3. GPU Inference: Pass run_opts={"device":"cuda"} to from_hparams to run inference on the GPU.

  4. Parallel Inference: Refer to the provided Colab notebook for batch processing instructions.
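
The GPU option from step 3 can be sketched as follows. The helper names pick_run_opts and load_asr_model are hypothetical and exist only for illustration; the from_hparams arguments are the ones used in step 2, with run_opts added. The SpeechBrain import is kept inside the loader so the device helper can be used on its own.

```python
def pick_run_opts(cuda_available: bool) -> dict:
    """Build the run_opts dict: use the GPU when available, otherwise the CPU."""
    return {"device": "cuda" if cuda_available else "cpu"}


def load_asr_model(cuda_available: bool):
    """Load the pretrained model on the selected device (hypothetical helper)."""
    # Imported lazily so pick_run_opts works without SpeechBrain installed.
    from speechbrain.inference.ASR import EncoderDecoderASR

    return EncoderDecoderASR.from_hparams(
        source="speechbrain/asr-wav2vec2-commonvoice-en",
        savedir="pretrained_models/asr-wav2vec2-commonvoice-en",
        run_opts=pick_run_opts(cuda_available),
    )
```

A typical call would be load_asr_model(torch.cuda.is_available()) after importing torch, followed by transcribe_file as in step 2.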

For faster inference on long or numerous recordings, consider a cloud GPU from a provider such as Google Cloud, AWS, or Azure.
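
For the batch processing mentioned in step 4, recordings of different lengths are usually padded to a common length and accompanied by their relative lengths. The sketch below shows only that preparation step in plain Python; pad_batch is a hypothetical helper, and the commented lines assume that the model's transcribe_batch method accepts a padded waveform tensor together with lengths expressed as fractions of the longest item. The Colab notebook remains the reference for the full procedure.

```python
def pad_batch(waveforms):
    """Zero-pad waveforms to the longest one and return relative lengths."""
    max_len = max(len(w) for w in waveforms)
    # Right-pad each waveform with zeros up to the longest length in the batch.
    padded = [list(w) + [0.0] * (max_len - len(w)) for w in waveforms]
    # Relative length of each waveform, as a fraction of the longest one.
    rel_lens = [len(w) / max_len for w in waveforms]
    return padded, rel_lens


padded, rel_lens = pad_batch([[0.1, 0.2, 0.3, 0.4], [0.5, 0.6]])
print(padded)    # [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.0, 0.0]]
print(rel_lens)  # [1.0, 0.5]

# Assumed usage with a loaded model (requires torch and speechbrain):
# import torch
# words, tokens = asr_model.transcribe_batch(
#     torch.tensor(padded), torch.tensor(rel_lens)
# )
```

Passing relative lengths lets the model ignore the padded region of the shorter recordings.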

License

The ASR-WAV2VEC2-COMMONVOICE-EN model is released under the Apache-2.0 License, which permits broad use and modification subject to the license terms.
