parakeet rnnt 1.1b

nvidia

Introduction

Parakeet-RNNT-1.1B is an Automatic Speech Recognition (ASR) model that transcribes English speech into lowercase text. Developed jointly by the NVIDIA NeMo and Suno.ai teams, it is an XXL version of the FastConformer Transducer model with approximately 1.1 billion parameters. It was trained on a mix of public and private datasets and reports competitive Word Error Rate (WER) across several speech recognition benchmarks.

Architecture

Parakeet-RNNT-1.1B uses the FastConformer architecture, an optimized version of the Conformer model that applies depthwise-separable convolutional downsampling for efficiency. It is trained in a multitask setup with a Transducer decoder (RNNT) loss. Both training and inference run through the NeMo toolkit, which also supports further fine-tuning.
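
To give a rough sense of why depthwise-separable convolutions are cheaper, the sketch below compares weight counts for a standard 2D convolution and a depthwise-separable one. The channel count and kernel size are illustrative assumptions, not the model's actual configuration, and biases are ignored.

```python
def standard_conv_params(in_channels: int, out_channels: int, kernel: int) -> int:
    """Weights in a standard 2D convolution (bias ignored)."""
    return kernel * kernel * in_channels * out_channels


def depthwise_separable_params(in_channels: int, out_channels: int, kernel: int) -> int:
    """Weights in a depthwise conv followed by a 1x1 pointwise conv (bias ignored)."""
    depthwise = kernel * kernel * in_channels  # one k x k filter per input channel
    pointwise = in_channels * out_channels     # 1x1 conv that mixes channels
    return depthwise + pointwise


# Illustrative values only (assumed, not taken from the model).
channels, kernel = 256, 3
print(standard_conv_params(channels, channels, kernel))        # 589824
print(depthwise_separable_params(channels, channels, kernel))  # 67840
```

With these assumed values the separable variant needs roughly one ninth of the weights, which is the kind of saving the downsampling module benefits from.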

Training

The model was trained for several hundred epochs with the NeMo toolkit on a corpus of 64,000 hours of English speech: about 40,000 hours of private data plus 24,000 hours from public datasets such as LibriSpeech, Fisher Corpus, and Mozilla Common Voice. Performance is measured by Word Error Rate (WER), and the model achieves competitive results across various benchmarks.
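
For readers unfamiliar with the metric, WER is the number of word substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the number of reference words. The following is a minimal sketch of that calculation using a word-level edit distance; the function name `word_error_rate` is hypothetical and this is not the evaluation code used for the model.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One deleted word out of four reference words.
print(word_error_rate("turn the lights on", "turn lights on"))  # 0.25
```

Because the model emits lowercase text, references would normally be lowercased and stripped of punctuation before scoring; that normalization is left out of this sketch.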

Guide: Running Locally

Basic Steps

  1. Install Dependencies: Ensure you have the latest version of PyTorch and install NVIDIA NeMo.

    pip install "nemo_toolkit[all]"
    
  2. Load the Model: Instantiate the model using the NeMo toolkit.

    import nemo.collections.asr as nemo_asr
    asr_model = nemo_asr.models.EncDecRNNTBPEModel.from_pretrained(model_name="nvidia/parakeet-rnnt-1.1b")
    
  3. Prepare Audio Files: Ensure the audio files are in 16000 Hz mono-channel WAV format.

  4. Transcribe Audio: Use the model to transcribe audio files.

    asr_model.transcribe(['<audio_file_path>'])
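
Before transcribing, it can help to confirm that a file really matches the format in step 3. The snippet below is a small sketch using only Python's standard `wave` module; the helper name `is_16khz_mono_wav` is hypothetical and not part of NeMo, and `wave` reads only uncompressed PCM WAV files.

```python
import wave


def is_16khz_mono_wav(path: str) -> bool:
    """Return True if the PCM WAV file at `path` is 16000 Hz and single-channel."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate() == 16000 and wav_file.getnchannels() == 1
```

Files that fail the check would need to be resampled or downmixed with an audio tool of your choice before being passed to `asr_model.transcribe`.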
    

Cloud GPUs

For faster transcription, especially on large datasets, consider running the model on a cloud GPU instance from a provider such as AWS EC2, Google Cloud Platform, or Azure.

License

Parakeet-RNNT-1.1B is released under the CC-BY-4.0 license, which permits use, distribution, and modification provided appropriate credit is given to the creators. See the Creative Commons Attribution 4.0 International license text for details.
