parakeet tdt_ctc 110m
nvidia
Introduction
parakeet-tdt_ctc-110m is an Automatic Speech Recognition (ASR) model developed by NVIDIA and Suno.ai. It transcribes English speech with punctuation and capitalization. The model is based on a Hybrid FastConformer architecture and is designed to transcribe up to 20 minutes of audio in one pass.
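As a rough illustration of the 20-minute figure, the sketch below uses Python's standard `wave` module to measure a WAV file's duration before it is sent to the model. The helper names `wav_duration_seconds` and `fits_in_one_pass` and the constant `MAX_SECONDS` are hypothetical and not part of NeMo; in practice the usable length may also depend on available GPU memory.

```python
import wave

# Assumption: the 20-minute figure quoted above, expressed in seconds.
MAX_SECONDS = 20 * 60


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def fits_in_one_pass(path, max_seconds=MAX_SECONDS):
    """Return True if the file is short enough for a single transcription call."""
    return wav_duration_seconds(path) <= max_seconds
```

A file that fails this check would need to be split into shorter segments before transcription.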
Architecture
This model uses a Hybrid FastConformer-TDT-CTC architecture. FastConformer is an optimized version of the Conformer model that uses 8x depthwise-separable convolutional downsampling to shorten the sequence the encoder processes. The shorter sequence makes full attention affordable on long audio and contributes to the model's high transcription speed.
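To see why heavier downsampling matters for long audio, the following back-of-the-envelope sketch counts encoder frames for a 20-minute clip. It assumes a 10 ms feature hop and compares 8x downsampling (as in FastConformer) with 4x (as in the original Conformer). The hop size and the function name `encoder_frames` are assumptions for illustration, not values read from this model's configuration.

```python
def encoder_frames(duration_s, hop_ms=10, downsampling=8):
    """Approximate number of frames the encoder attends over."""
    feature_frames = int(duration_s * 1000 / hop_ms)
    return feature_frames // downsampling


duration_s = 20 * 60  # 20 minutes of audio

frames_8x = encoder_frames(duration_s, downsampling=8)
frames_4x = encoder_frames(duration_s, downsampling=4)

# Full self-attention cost grows with the square of the sequence length.
attention_ratio = (frames_4x / frames_8x) ** 2

print(frames_8x, frames_4x, attention_ratio)  # 15000 30000 4.0
```

Under these assumptions, halving the sequence length reduces the size of the attention score matrix by a factor of four.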
Training
The model was trained with the NeMo toolkit, using the example training script and configuration files that NeMo provides, and was fine-tuned for 20,000 steps. The training data comprises 36,000 hours of English speech drawn from private data and public datasets such as LibriSpeech, Fisher Corpus, and VCTK.
Guide: Running Locally
-
Setup Environment: Install PyTorch and the NVIDIA NeMo toolkit.
pip install "nemo_toolkit[all]"
-
Instantiate Model: Load the model using NeMo.
import nemo.collections.asr as nemo_asr
asr_model = nemo_asr.models.ASRModel.from_pretrained(model_name="nvidia/parakeet-tdt_ctc-110m")
-
Download Sample Audio:
wget https://dldata-public.s3.us-east-2.amazonaws.com/2086-149220-0033.wav
-
Transcribe Audio: Use the model to transcribe audio.
output = asr_model.transcribe(['2086-149220-0033.wav'])
print(output)
-
Transcribe Multiple Files: Use the provided script for batch transcription.
python [NEMO_GIT_FOLDER]/examples/asr/transcribe_speech.py pretrained_name="nvidia/parakeet-tdt_ctc-110m" audio_dir="<DIRECTORY CONTAINING AUDIO FILES>"
Cloud GPUs: Consider using cloud services with NVIDIA A100 GPUs for optimal performance.
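The steps above can also be combined into a small Python helper that transcribes every WAV file in a directory. This is a minimal sketch, not part of NeMo: the names `transcribe_directory` and `hypothesis_text` are hypothetical, and the handling of return types is an assumption, since `transcribe` has returned plain strings in some NeMo releases and objects with a `.text` attribute in others.

```python
from pathlib import Path


def hypothesis_text(item):
    """Return plain text from either a string or an object with a .text attribute."""
    return item if isinstance(item, str) else item.text


def transcribe_directory(asr_model, audio_dir, batch_size=4):
    """Transcribe every .wav file in audio_dir and map file name -> text."""
    files = sorted(str(p) for p in Path(audio_dir).glob("*.wav"))
    if not files:
        return {}
    outputs = asr_model.transcribe(files, batch_size=batch_size)
    # Assumption: some NeMo versions return a (best, all) tuple for hybrid models.
    if isinstance(outputs, tuple):
        outputs = outputs[0]
    return {Path(f).name: hypothesis_text(o) for f, o in zip(files, outputs)}


# Usage (requires NeMo and the model loaded as in the guide):
#   import nemo.collections.asr as nemo_asr
#   asr_model = nemo_asr.models.ASRModel.from_pretrained(
#       model_name="nvidia/parakeet-tdt_ctc-110m")
#   print(transcribe_directory(asr_model, "."))
```

Passing the model in as an argument keeps the helper independent of how the model was loaded, so it can be exercised with a stand-in object before running on a GPU.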
License
The model is licensed under the CC-BY-4.0 license. Users must comply with the terms and conditions outlined in the CC-BY-4.0 license.