asr wav2vec2 commonvoice en
speechbrain
Introduction
SpeechBrain's ASR-WAV2VEC2-COMMONVOICE-EN is a pretrained model for English automatic speech recognition. It is trained on the CommonVoice dataset and runs within the SpeechBrain toolkit, which is built on PyTorch.
Architecture
The ASR system consists of two primary components:
- Tokenizer: A unigram tokenizer that converts words into subword units, trained on CommonVoice English train transcriptions.
- Acoustic Model: Combines a pretrained Wav2Vec 2.0 model with two DNN layers, fine-tuned on CommonVoice English, followed by a CTC decoder. The model processes audio sampled at 16 kHz, normalizing input audio as necessary during transcription.
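To make the last stage of the acoustic model more concrete, here is a minimal sketch of greedy CTC decoding: consecutive repeated frame labels are merged and blank symbols are dropped. It is an illustration only, not SpeechBrain's decoder. The blank index, the toy vocabulary, and the use of "_" as a word-boundary marker are assumptions made for this example.

```python
from itertools import groupby
from typing import List, Sequence

BLANK = 0  # assumed index of the CTC blank symbol (illustrative only)


def ctc_greedy_collapse(frame_ids: Sequence[int], blank: int = BLANK) -> List[int]:
    """Collapse per-frame label ids: merge consecutive repeats, then drop blanks."""
    merged = [token for token, _ in groupby(frame_ids)]
    return [token for token in merged if token != blank]


# Hypothetical subword vocabulary; "_" stands in for a word-boundary marker.
VOCAB = {1: "_he", 2: "llo", 3: "_world"}

# Pretend these are the most likely label ids for 11 consecutive audio frames.
frames = [0, 1, 1, 0, 2, 2, 2, 0, 0, 3, 3]
tokens = ctc_greedy_collapse(frames)
text = "".join(VOCAB[t] for t in tokens).replace("_", " ").strip()
print(text)  # hello world
```

Note that a blank between two identical labels keeps them distinct, which is how CTC can emit doubled units.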
Training
The model is trained with the SpeechBrain toolkit. The steps are:
- Cloning the SpeechBrain repository.
- Installing dependencies from requirements.txt and setting up the project.
- Running the training script located in recipes/CommonVoice/ASR/seq2seq with appropriate hyperparameters and data.
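The final step can be sketched as a small helper that assembles the launch command. This is a sketch under stated assumptions: the script name train.py, the YAML hyperparameter file, and the --data_folder override follow the usual SpeechBrain recipe layout but are not given in the text above, and the file and folder values are placeholders.

```python
import sys
from pathlib import Path
from typing import List

# Recipe directory named in the steps above, relative to the cloned repository.
RECIPE_DIR = Path("recipes/CommonVoice/ASR/seq2seq")


def build_train_command(hparams_file: str, data_folder: str) -> List[str]:
    """Return the argv for launching the recipe's training script.

    Assumption: the recipe exposes a train.py that takes a YAML
    hyperparameter file followed by --key=value overrides.
    """
    return [sys.executable, "train.py", hparams_file, f"--data_folder={data_folder}"]


# Placeholder values; substitute the real YAML file and dataset location.
cmd = build_train_command("hparams/<your_config>.yaml", "/path/to/CommonVoice/en")
print("cd", RECIPE_DIR, "&&", " ".join(cmd))
# To actually launch training from the repository root:
#   import subprocess; subprocess.run(cmd, cwd=RECIPE_DIR, check=True)
```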
Training results, including models and logs, are available via a shared Google Drive link.
Guide: Running Locally
To run the model locally:
- Install Dependencies:
  pip install speechbrain transformers
- Transcribe Audio:
  from speechbrain.inference.ASR import EncoderDecoderASR
  asr_model = EncoderDecoderASR.from_hparams(source="speechbrain/asr-wav2vec2-commonvoice-en", savedir="pretrained_models/asr-wav2vec2-commonvoice-en")
  asr_model.transcribe_file("speechbrain/asr-wav2vec2-commonvoice-en/example.wav")
- GPU Inference: Pass run_opts={"device":"cuda"} when calling from_hparams to run inference on a GPU.
- Parallel Inference: Refer to the provided Colab notebook for batch processing instructions.
For optimal performance, consider using cloud GPUs such as those offered by Google Cloud, AWS, or Azure.
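The GPU option described above can be combined with a CPU fallback, as in the sketch below. The helper names pick_run_opts and transcribe_files are hypothetical and exist only for this example. The model is loaded once and reused for every file, and the heavy imports are deferred so the device-selection helper can be used on its own.

```python
from typing import Dict, List


def pick_run_opts(cuda_available: bool) -> Dict[str, str]:
    """Choose the run_opts dict: CUDA when available, otherwise CPU."""
    return {"device": "cuda" if cuda_available else "cpu"}


def transcribe_files(paths: List[str]) -> List[str]:
    """Load the pretrained model once and transcribe each file in turn."""
    # Imported lazily so pick_run_opts works without these packages installed.
    import torch
    from speechbrain.inference.ASR import EncoderDecoderASR

    asr_model = EncoderDecoderASR.from_hparams(
        source="speechbrain/asr-wav2vec2-commonvoice-en",
        savedir="pretrained_models/asr-wav2vec2-commonvoice-en",
        run_opts=pick_run_opts(torch.cuda.is_available()),
    )
    return [asr_model.transcribe_file(path) for path in paths]


# Example call (downloads the model on first use):
#   transcribe_files(["speechbrain/asr-wav2vec2-commonvoice-en/example.wav"])
```

This processes files one at a time. True batched inference is covered by the Colab notebook mentioned above.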
License
The ASR-WAV2VEC2-COMMONVOICE-EN model is licensed under the Apache-2.0 License, which permits broad use and modification provided the license terms are followed.