stt_ru_conformer_transducer_large
Introduction
The NVIDIA STT RU Conformer Transducer Large model is designed for automatic speech recognition (ASR) of Russian-language audio. It transcribes audio into lowercase Cyrillic text and is built on the Conformer-Transducer architecture with around 120 million parameters. Decoding is autoregressive (Transducer-based), and the model has been trained on approximately 1636 hours of Russian speech data.
Architecture
The Conformer-Transducer model is an autoregressive variant of the Conformer model optimized for ASR. It uses Transducer loss and decoding. Detailed architecture information can be found in the NeMo documentation.
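To make the objective concrete, below is a minimal sketch of the Transducer (RNNT) loss on random tensors. It uses torchaudio's rnnt_loss as a stand-in for NeMo's internal implementation; the tensor shapes, vocabulary size, and blank index are illustrative assumptions, not values taken from this model's configuration.

```python
# Minimal Transducer (RNNT) loss sketch on random tensors.
# torchaudio's rnnt_loss stands in for NeMo's internal loss; the shapes,
# vocabulary size, and blank index below are illustrative assumptions.
import torch
import torchaudio.functional as F

batch, time_steps, target_len, vocab = 2, 10, 5, 34  # assumed toy sizes

# Joint-network output has shape (batch, time, target_len + 1, vocab)
logits = torch.randn(batch, time_steps, target_len + 1, vocab, requires_grad=True)
targets = torch.randint(1, vocab, (batch, target_len), dtype=torch.int32)  # no blanks
logit_lengths = torch.full((batch,), time_steps, dtype=torch.int32)
target_lengths = torch.full((batch,), target_len, dtype=torch.int32)

loss = F.rnnt_loss(logits, targets, logit_lengths, target_lengths, blank=0)
loss.backward()
print(loss.item())
```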
Training
The model was trained using the NeMo toolkit across several hundred epochs. The training utilized a composite dataset called NeMo ASRSET, which includes datasets like Mozilla Common Voice, Golos, Russian LibriSpeech, and SOVA. The vocabulary comprises 33 characters, and the tokenizer was developed using training text transcripts. Rare symbols with diacritics were replaced during preprocessing.
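As an illustration of the preprocessing step mentioned above, the snippet below lowercases text and strips combining stress marks, one plausible way of removing rare diacritic symbols; the exact replacement rules used for training are not specified here, so this is only an assumption.

```python
# Illustrative transcript normalization: lowercase and strip combining
# stress marks (U+0301). This is an assumed approximation of the
# "rare symbols with diacritics" replacement, not the exact training recipe.
def normalize_transcript(text: str) -> str:
    return text.lower().replace("\u0301", "")

print(normalize_transcript("Приве\u0301т мир"))  # -> "привет мир"
```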
Guide: Running Locally
- Install PyTorch and NeMo Toolkit: Ensure you have the latest version of PyTorch installed, then install the NeMo toolkit with:
  pip install nemo_toolkit['all']
- Instantiate the Model: Import the ASR collection from NeMo and load the pre-trained model:
  import nemo.collections.asr as nemo_asr
  asr_model = nemo_asr.models.EncDecRNNTBPEModel.from_pretrained("nvidia/stt_ru_conformer_transducer_large")
- Transcribe Audio: To transcribe a single audio file:
  asr_model.transcribe(['<your_audio>.wav'])
  For multiple audio files, use the provided script:
  python [NEMO_GIT_FOLDER]/examples/asr/transcribe_speech.py --pretrained_name="nvidia/stt_ru_conformer_transducer_large" --audio_dir="<DIRECTORY CONTAINING AUDIO FILES>"
- System Requirements: The model accepts 16 kHz, mono-channel WAV audio as input. For best throughput, a GPU is recommended; cloud GPUs from AWS, Google Cloud, or Azure work well. A sketch for converting arbitrary audio to the expected format is shown after this list.
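Since the model expects 16 kHz mono WAV input, other formats need to be converted first. The snippet below is a minimal sketch that resamples an audio file with librosa, writes it with soundfile, and transcribes it; the file names are placeholders, and librosa/soundfile are assumed to be installed alongside NeMo.

```python
# Sketch: convert an arbitrary audio file to 16 kHz mono WAV, then transcribe.
# File names are placeholders; librosa and soundfile are assumed to be installed.
import librosa
import soundfile as sf
import nemo.collections.asr as nemo_asr

# Resample to 16 kHz mono and write a WAV file the model can consume.
audio, _ = librosa.load("input_audio.mp3", sr=16000, mono=True)
sf.write("input_audio_16k.wav", audio, 16000)

asr_model = nemo_asr.models.EncDecRNNTBPEModel.from_pretrained(
    "nvidia/stt_ru_conformer_transducer_large"
)
print(asr_model.transcribe(["input_audio_16k.wav"]))
```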
License
The NVIDIA STT RU Conformer Transducer Large model is released under the CC-BY-4.0 license.