stt_en_fastconformer_tdt_large
Introduction
The STT En FastConformer TDT Large model by NVIDIA is designed for automatic speech recognition (ASR) in English. It transcribes audio into text, outputting lowercase English without punctuation. The model is based on FastConformer architecture and is suitable for both commercial and non-commercial use.
Architecture
FastConformer is an enhanced version of the Conformer architecture that uses depthwise-separable convolutional downsampling. This checkpoint is trained with a hybrid Transducer decoder and Connectionist Temporal Classification (CTC) loss, and it uses a Google SentencePiece tokenizer with a vocabulary size of 1024.
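As a quick sanity check of the details above, a loaded checkpoint exposes its tokenizer and decoder configuration. This is a minimal sketch that assumes the model has been loaded as shown in the guide further down; attribute layout follows NeMo's public model interface but can differ between NeMo versions.

import nemo.collections.asr as nemo_asr

# Load the checkpoint (same call as in the guide below).
asr_model = nemo_asr.models.EncDecRNNTModel.from_pretrained(model_name="nvidia/stt_en_fastconformer_tdt_large")

# SentencePiece tokenizer; the expected vocabulary size is 1024.
print("vocab size:", asr_model.tokenizer.vocab_size)

# Decoder section of the model configuration.
print(asr_model.cfg.decoder)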
Training
The model was trained with the NVIDIA NeMo toolkit on roughly 24,000 hours of English speech drawn from datasets such as LibriSpeech, the Fisher Corpus, and Mozilla Common Voice. Training used a multitask setup focused on minimizing word error rate (WER).
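For reference, WER is the word-level edit distance (substitutions + insertions + deletions) divided by the number of reference words. The snippet below is a self-contained illustration of the metric, not NVIDIA's evaluation code.

def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between the first i reference words and the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))  # 1 error / 6 words ≈ 0.167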
Guide: Running Locally
- Set Up Environment:
  - Install the NVIDIA NeMo toolkit.
  - Ensure Python and the necessary dependencies are installed (see the verification snippet below).
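A quick way to confirm the environment is ready is to import NeMo and print its version. The install command in the comment is the commonly documented route and may need adjusting for your Python/CUDA setup, so treat it as an assumption.

# Typical install command (per the NeMo docs); adjust for your environment:
#   pip install -U "nemo_toolkit[asr]"
import nemo
import nemo.collections.asr as nemo_asr  # raises ImportError if the ASR collection is missing

print("NeMo version:", nemo.__version__)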
- Download Model:
  - Load the pretrained checkpoint from Python:

import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.EncDecRNNTModel.from_pretrained(model_name="nvidia/stt_en_fastconformer_tdt_large")
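To avoid re-downloading the checkpoint on every run, NeMo models can be written to and restored from a local .nemo archive; the file name here is only an example.

# Save the downloaded checkpoint to a local archive (example path).
asr_model.save_to("stt_en_fastconformer_tdt_large.nemo")

# Restore it later without a network download.
asr_model = nemo_asr.models.EncDecRNNTModel.restore_from("stt_en_fastconformer_tdt_large.nemo")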
- Prepare Audio Files:
  - Ensure audio files are in .wav format, mono-channel, sampled at 16000 Hz (a conversion sketch follows).
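If your recordings are not already 16 kHz mono WAV, convert them first. This sketch assumes the third-party librosa and soundfile packages, which are not part of the model's own requirements.

import librosa
import soundfile as sf

# Load any supported audio file, downmix to mono, and resample to 16 kHz.
audio, sr = librosa.load("input_audio.mp3", sr=16000, mono=True)

# Write a 16-bit PCM mono WAV matching the model's expected input format.
sf.write("audio_file.wav", audio, sr, subtype="PCM_16")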
- Transcribe Audio:
  - Use the transcribe function:

asr_model.transcribe(['audio_file.wav'])
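transcribe also accepts several files at once. The exact return type (plain strings vs. hypothesis objects) varies across NeMo releases, so treat this batching example as a sketch.

files = ['audio_file_1.wav', 'audio_file_2.wav']  # example paths

# One hypothesis is returned per input file.
results = asr_model.transcribe(files, batch_size=2)

for path, text in zip(files, results):
    print(path, '->', text)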
- Running on Cloud GPUs:
  - Consider using cloud-based GPUs such as the NVIDIA A100 for enhanced performance and scalability (see the device sketch below).
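On a GPU instance the loaded model can be moved to the device before transcription; this uses standard PyTorch device handling rather than anything specific to this checkpoint.

import torch

# Move the model to the GPU if one is available (e.g. an A100 on a cloud instance).
if torch.cuda.is_available():
    asr_model = asr_model.cuda()

print("Running on:", next(asr_model.parameters()).device)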
License
The model is released under the CC-BY-4.0 license, which allows adaptation and redistribution with appropriate credit. The full license text is available at https://creativecommons.org/licenses/by/4.0/.