asr transformer aishell

speechbrain

Introduction

ASR-Transformer-AISHELL is an end-to-end automatic speech recognition (ASR) system for Mandarin Chinese, developed as part of the SpeechBrain project and pretrained on the AISHELL dataset. It pairs a subword tokenizer with a transformer-based acoustic model to convert speech to text.

Architecture

The ASR system consists of two main components:

  • A Tokenizer that converts words into subword units using a unigram model trained on LibriSpeech transcriptions.
  • An Acoustic Model combining a transformer encoder with a joint decoder integrating Connectionist Temporal Classification (CTC) and transformer-based decoding.
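
To illustrate what "joint" decoding means here, the sketch below interpolates a transformer decoder log-probability with a CTC log-probability for each candidate transcription and keeps the best one. It is a minimal illustration under stated assumptions, not SpeechBrain's beam search: the function names, the candidate scores, and the ctc_weight value are all hypothetical.

```python
def joint_score(attention_logp: float, ctc_logp: float, ctc_weight: float = 0.5) -> float:
    """Interpolate decoder and CTC log-probabilities for one hypothesis."""
    if not 0.0 <= ctc_weight <= 1.0:
        raise ValueError("ctc_weight must be in [0, 1]")
    return (1.0 - ctc_weight) * attention_logp + ctc_weight * ctc_logp


def pick_best(hypotheses: dict, ctc_weight: float = 0.5) -> str:
    """Return the hypothesis text with the highest joint score."""
    return max(hypotheses, key=lambda text: joint_score(*hypotheses[text], ctc_weight))


# Hypothetical scores: text -> (decoder log-prob, CTC log-prob)
candidates = {"你好世界": (-1.0, -3.0), "你好视界": (-1.5, -1.0)}
best = pick_best(candidates, ctc_weight=0.5)
```

With ctc_weight set to 0, the choice depends on the decoder score alone. Raising the weight lets the CTC score overrule a hypothesis the decoder prefers.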

The system is trained on single-channel audio sampled at 16 kHz. When transcribing a file, the code automatically normalizes the input audio (resampling and mono channel selection) if needed.
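
As a rough picture of what that normalization involves, the following pure-Python sketch mixes multi-channel frames down to mono and resamples to 16 kHz by linear interpolation. It is an assumption-laden illustration only: to_mono and resample_linear are hypothetical helpers, and the library's own implementation (which is what actually runs) is likely to differ, for example by using a proper resampling filter.

```python
def to_mono(frames):
    """Average the channels of each frame: [[left, right], ...] -> [mono, ...]."""
    return [sum(frame) / len(frame) for frame in frames]


def resample_linear(samples, src_rate, dst_rate=16000):
    """Resample by linear interpolation (illustrative; no anti-aliasing filter)."""
    if src_rate == dst_rate or len(samples) < 2:
        return list(samples)
    n_out = int(len(samples) * dst_rate / src_rate)
    step = src_rate / dst_rate
    out = []
    for i in range(n_out):
        pos = i * step                      # position in the source signal
        left = int(pos)
        right = min(left + 1, len(samples) - 1)
        frac = pos - left
        out.append(samples[left] * (1.0 - frac) + samples[right] * frac)
    return out


# Hypothetical stereo signal at 32 kHz, reduced to mono at 16 kHz
stereo = [[0.0, 2.0], [2.0, 4.0], [4.0, 6.0], [6.0, 8.0]]
mono_16k = resample_linear(to_mono(stereo), src_rate=32000)
```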

Training

The model was trained with SpeechBrain. To retrain it from scratch:

  1. Clone the SpeechBrain repository:
    git clone https://github.com/speechbrain/speechbrain/
    
  2. Install dependencies:
    cd speechbrain
    pip install -r requirements.txt
    pip install -e .
    
  3. Execute the training script:
    cd recipes/AISHELL-1/ASR/transformer/
    python train.py hparams/train_ASR_transformer.yaml --data_folder=your_data_folder
    

Training data and results are available on Google Drive.

Guide: Running Locally

  1. Install SpeechBrain:
    pip install speechbrain
    
  2. Transcribe Audio Files:
    from speechbrain.inference.ASR import EncoderDecoderASR

    # Download the pretrained model (cached in savedir), then transcribe one file
    asr_model = EncoderDecoderASR.from_hparams(source="speechbrain/asr-transformer-aishell", savedir="pretrained_models/asr-transformer-aishell")
    print(asr_model.transcribe_file("your_audio_file.flac"))
    
  3. Inference on GPU: pass run_opts={"device":"cuda"} to from_hparams when loading the model:
    asr_model = EncoderDecoderASR.from_hparams(source="speechbrain/asr-transformer-aishell", savedir="pretrained_models/asr-transformer-aishell", run_opts={"device":"cuda"})
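
The steps above handle one file at a time. To transcribe a whole folder, a small wrapper like the one below may help. It is a sketch, not part of SpeechBrain: transcribe_folder is a hypothetical helper, the extension list is an assumption, and the function accepts any callable that maps a file path to text so it can be tried without loading a model.

```python
from pathlib import Path


def transcribe_folder(folder, transcribe_fn, extensions=(".wav", ".flac")):
    """Apply transcribe_fn to each audio file in folder, in name order.

    Returns a dict mapping file name -> transcription.
    """
    results = {}
    for path in sorted(Path(folder).iterdir()):
        # Skip sub-directories and files that are not in the audio extension list
        if path.is_file() and path.suffix.lower() in extensions:
            results[path.name] = transcribe_fn(str(path))
    return results
```

With the model loaded as in step 2, the call would look like transcribe_folder("my_recordings", asr_model.transcribe_file), where "my_recordings" is a placeholder folder name.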
    

For faster inference or training, consider cloud GPUs, such as AWS EC2 instances equipped with NVIDIA V100 or A100 GPUs.
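
Code that has to run on machines both with and without a GPU can choose the device at runtime. The sketch below is one way to do that, assuming PyTorch is the backend (it is a SpeechBrain dependency). pick_run_opts is a hypothetical helper name, and it falls back to the CPU when no CUDA device is reported.

```python
def pick_run_opts():
    """Return run_opts selecting CUDA when PyTorch reports a usable GPU."""
    try:
        import torch
        has_cuda = torch.cuda.is_available()
    except ImportError:
        # PyTorch is not installed in this environment; default to CPU
        has_cuda = False
    return {"device": "cuda" if has_cuda else "cpu"}


run_opts = pick_run_opts()
```

The resulting dictionary can be passed as run_opts=run_opts in the from_hparams call shown in the guide.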

License

The ASR-Transformer-AISHELL is licensed under the Apache 2.0 license, allowing for broad use and modification within the terms specified.
