nvidia/stt_zh_conformer_transducer_large
Introduction

The NVIDIA Conformer-Transducer Large model is designed for Automatic Speech Recognition (ASR) in Mandarin. It is a large-scale model with approximately 120 million parameters, capable of transcribing Mandarin speech into text.

Architecture

The model is based on the Conformer-Transducer architecture, an autoregressive variant of the Conformer model. It employs Transducer (RNN-T) loss and decoding in place of the more common CTC loss, so each emitted token is conditioned on the tokens emitted before it, which typically improves transcription accuracy. Detailed information about the model can be found in the NVIDIA documentation.
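
To make the autoregressive distinction concrete, here is a toy greedy Transducer-style decoding loop. This is an illustrative sketch only, not NeMo's actual decoder: `step_fn` stands in for the combined prediction/joint network, and the vocabulary, blank index, and `max_symbols` cap are all assumptions for the example.

```python
import numpy as np

def greedy_transducer_decode(encoder_out, step_fn, blank_id=0, max_symbols=10):
    """Toy greedy RNN-T decoding.

    encoder_out: iterable of acoustic frame vectors (one per time step).
    step_fn(frame, prev_label) -> logits over the vocabulary (index 0 = blank).

    Unlike CTC, which scores every frame independently, each prediction here
    conditions on the previously emitted label (the autoregressive part).
    """
    hypothesis = []
    prev = blank_id
    for frame in encoder_out:
        # A frame may emit several labels before yielding blank;
        # max_symbols guards against an infinite inner loop.
        for _ in range(max_symbols):
            logits = step_fn(frame, prev)
            label = int(np.argmax(logits))
            if label == blank_id:
                break  # blank: advance to the next acoustic frame
            hypothesis.append(label)
            prev = label
    return hypothesis
```

In the real model, `step_fn` is a learned prediction network plus joint network; the loop structure (emit until blank, then advance one frame) is the part this sketch shares with actual RNN-T greedy decoding.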

Training

The model was trained with the NVIDIA NeMo toolkit for several hundred epochs on the AISHELL-2 dataset, a Mandarin speech corpus. Training followed the example scripts and configuration files available in the NeMo toolkit repository.

Guide: Running Locally

  1. Install NVIDIA NeMo: Ensure you have the latest version of PyTorch installed, then run:

    pip install nemo_toolkit['all']
    
  2. Load the Model: Instantiate the model in Python using NeMo:

    import nemo.collections.asr as nemo_asr
    asr_model = nemo_asr.models.EncDecRNNTBPEModel.from_pretrained("nvidia/stt_zh_conformer_transducer_large")
    
  3. Transcribe Audio: Transcribe an audio file by passing its path to the model:

    asr_model.transcribe([PATH_TO_THE_AUDIO])
    
  4. Transcribe Multiple Files: Use the provided script to transcribe a directory of audio files:

    python [NEMO_GIT_FOLDER]/examples/asr/transcribe_speech.py \
      pretrained_name="nvidia/stt_zh_conformer_transducer_large" \
      audio_dir="<DIRECTORY CONTAINING AUDIO FILES>"
    
  5. Input Requirements: The model requires 16 kHz (16,000 Hz) mono-channel audio in WAV format.

  6. Cloud GPUs: For large-scale or resource-intensive tasks, consider using cloud-based GPUs such as AWS EC2 with NVIDIA GPUs or Azure NV-series instances.
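
Before transcribing (step 3), it can help to verify that an input file meets the format requirement from step 5. A minimal stdlib sketch, where the helper name `check_wav_format` is illustrative and not part of NeMo:

```python
import wave

def check_wav_format(path, expected_rate=16000, expected_channels=1):
    """Return True if a WAV file matches the model's input requirements:
    16 kHz sample rate and a single (mono) channel."""
    with wave.open(path, "rb") as wav:
        return (wav.getframerate() == expected_rate
                and wav.getnchannels() == expected_channels)
```

Files that fail this check can be resampled and downmixed with a tool such as `sox` or `ffmpeg` before being passed to `asr_model.transcribe`.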

License

The model is distributed under the Creative Commons Attribution 4.0 International (CC-BY-4.0) license; by using the model, you agree to its terms.
