stt_en_fastconformer_tdt_large
Introduction
The STT En FastConformer TDT Large model by NVIDIA is designed for automatic speech recognition (ASR) in English. It transcribes audio into text, outputting lowercase English without punctuation. The model is based on FastConformer architecture and is suitable for both commercial and non-commercial use.
Architecture
FastConformer is an enhanced version of the Conformer architecture that uses depthwise-separable convolutional downsampling. This checkpoint is trained with a hybrid Transducer decoder and Connectionist Temporal Classification (CTC) loss, and it uses a Google SentencePiece tokenizer with a vocabulary size of 1024.
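As a quick sanity check of the details above, a loaded checkpoint exposes its tokenizer and decoder configuration. This is a minimal sketch that assumes the model has been loaded as shown in the guide further down; attribute layout follows NeMo's public model interface but can differ between NeMo versions.

import nemo.collections.asr as nemo_asr

# Load the checkpoint (same call as in the guide below).
asr_model = nemo_asr.models.EncDecRNNTModel.from_pretrained(model_name="nvidia/stt_en_fastconformer_tdt_large")

# SentencePiece tokenizer; the expected vocabulary size is 1024.
print("vocab size:", asr_model.tokenizer.vocab_size)

# Decoder section of the model configuration.
print(asr_model.cfg.decoder)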
Training
The model was trained with the NVIDIA NeMo toolkit on roughly 24,000 hours of English speech drawn from datasets such as LibriSpeech, the Fisher Corpus, and Mozilla Common Voice. Training used a multitask setup focused on minimizing word error rate (WER).
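For reference, WER is the word-level edit distance (substitutions + insertions + deletions) divided by the number of reference words. The snippet below is a self-contained illustration of the metric, not NVIDIA's evaluation code.

def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between the first i reference words and the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))  # 1 error / 6 words ≈ 0.167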
Guide: Running Locally
- Set Up Environment:
  - Install the NVIDIA NeMo toolkit.
  - Ensure Python and the necessary dependencies are installed (see the verification snippet below).
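A quick way to confirm the environment is ready is to import NeMo and print its version. The install command in the comment is the commonly documented route and may need adjusting for your Python/CUDA setup, so treat it as an assumption.

# Typical install command (per the NeMo docs); adjust for your environment:
#   pip install -U "nemo_toolkit[asr]"
import nemo
import nemo.collections.asr as nemo_asr  # raises ImportError if the ASR collection is missing

print("NeMo version:", nemo.__version__)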
- Download Model:
  - Load the pretrained checkpoint from Python:

import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.EncDecRNNTModel.from_pretrained(model_name="nvidia/stt_en_fastconformer_tdt_large")
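To avoid re-downloading the checkpoint on every run, NeMo models can be written to and restored from a local .nemo archive; the file name here is only an example.

# Save the downloaded checkpoint to a local archive (example path).
asr_model.save_to("stt_en_fastconformer_tdt_large.nemo")

# Restore it later without a network download.
asr_model = nemo_asr.models.EncDecRNNTModel.restore_from("stt_en_fastconformer_tdt_large.nemo")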
- Prepare Audio Files:
  - Ensure audio files are in .wav format, mono-channel, sampled at 16000 Hz (a conversion sketch follows).
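If your recordings are not already 16 kHz mono WAV, convert them first. This sketch assumes the third-party librosa and soundfile packages, which are not part of the model's own requirements.

import librosa
import soundfile as sf

# Load any supported audio file, downmix to mono, and resample to 16 kHz.
audio, sr = librosa.load("input_audio.mp3", sr=16000, mono=True)

# Write a 16-bit PCM mono WAV matching the model's expected input format.
sf.write("audio_file.wav", audio, sr, subtype="PCM_16")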
- Transcribe Audio:
  - Use the transcribe function:

asr_model.transcribe(['audio_file.wav'])
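transcribe also accepts several files at once. The exact return type (plain strings vs. hypothesis objects) varies across NeMo releases, so treat this batching example as a sketch.

files = ['audio_file_1.wav', 'audio_file_2.wav']  # example paths

# One hypothesis is returned per input file.
results = asr_model.transcribe(files, batch_size=2)

for path, text in zip(files, results):
    print(path, '->', text)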
- Running on Cloud GPUs:
  - Consider using cloud-based GPUs such as the NVIDIA A100 for enhanced performance and scalability (see the device sketch below).
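On a GPU instance the loaded model can be moved to the device before transcription; this uses standard PyTorch device handling rather than anything specific to this checkpoint.

import torch

# Move the model to the GPU if one is available (e.g. an A100 on a cloud instance).
if torch.cuda.is_available():
    asr_model = asr_model.cuda()

print("Running on:", next(asr_model.parameters()).device)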
License
The model is released under the CC-BY-4.0 license, which allows adaptation and redistribution with appropriate credit. The full license text is available at https://creativecommons.org/licenses/by/4.0/.