stt_uk_citrinet_1024_gamma_0_25

nvidia

NVIDIA STT UK Citrinet 1024 Gamma 0.25

Introduction

The NVIDIA Streaming Citrinet 1024 model is designed for Automatic Speech Recognition (ASR) of Ukrainian-language audio. It is a non-autoregressive model variant, fine-tuned from a Russian Citrinet-1024 model via cross-language transfer learning. The model transcribes speech into lowercase Ukrainian text, using the Ukrainian alphabet plus spaces and apostrophes.

Architecture

The model architecture is based on the Streaming Citrinet-1024, utilizing Connectionist Temporal Classification (CTC) loss/decoding. It contains approximately 141 million parameters and is optimized for streaming performance. It is compatible with NVIDIA Riva for deployment in production environments.
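To make the CTC decoding step concrete, here is a minimal sketch of greedy CTC collapse: merge consecutive repeated symbols, then drop the blank token. The alphabet and blank symbol below are chosen for illustration only; the real model's vocabulary comes from its tokenizer.

```python
BLANK = "_"  # illustrative CTC blank symbol (not the model's actual token)

def ctc_greedy_collapse(frame_symbols):
    """Collapse repeated symbols, then drop blanks, as greedy CTC decoding does."""
    out = []
    prev = None
    for s in frame_symbols:
        if s != prev and s != BLANK:  # merge consecutive repeats, skip blanks
            out.append(s)
        prev = s
    return "".join(out)

# Frame-level argmax symbols -> collapsed transcript
print(ctc_greedy_collapse(list("__пп_рри_ввііт_")))  # "привіт" ("hello")
```

Non-autoregressive CTC models emit one distribution per audio frame independently, which is what makes this simple collapse rule (rather than step-by-step generation) sufficient for decoding.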

Training

The model was trained using NVIDIA's NeMo toolkit over 1000 epochs. It utilized the Mozilla Common Voice Corpus 10.0 dataset, excluding development and test data, with a total of 69 hours of Ukrainian speech. The Russian model was initially trained on a combination of datasets including Mozilla Common Voice V7 Ru, Ru LibriSpeech, Sber GOLOS, and SOVA datasets.

Guide: Running Locally

  1. Install Dependencies: Ensure the latest version of PyTorch is installed, then install NVIDIA NeMo:

    pip install "nemo_toolkit[all]"
    
  2. Instantiate the Model: Use the NeMo library to load the pre-trained model:

    import nemo.collections.asr as nemo_asr
    asr_model = nemo_asr.models.EncDecCTCModel.from_pretrained("nvidia/stt_uk_citrinet_1024_gamma_0_25")
    
  3. Transcribe Audio: To transcribe one or more files directly in Python:

    asr_model.transcribe(['<Path of audio file(s)>'])
    

    To batch-transcribe a directory of files from the command line:

    python [NEMO_GIT_FOLDER]/examples/asr/transcribe_speech.py \
    pretrained_name="nvidia/stt_uk_citrinet_1024_gamma_0_25" \
    audio_dir="<DIRECTORY CONTAINING AUDIO FILES>"
    
  4. Input Requirements: The model requires 16 kHz (16000 Hz) mono-channel WAV files as input.
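If your audio is not already 16 kHz mono WAV, it must be converted first. A minimal sketch using ffmpeg via `subprocess` follows; it assumes ffmpeg is installed and on your PATH, and the file names are placeholders.

```python
import subprocess

def ffmpeg_16k_mono_cmd(src, dst):
    """Build an ffmpeg command converting src to 16 kHz mono WAV at dst."""
    return [
        "ffmpeg", "-y", "-i", src,
        "-ac", "1",       # one audio channel (mono)
        "-ar", "16000",   # 16 kHz sample rate
        dst,
    ]

def to_16k_mono_wav(src, dst):
    """Run the conversion, raising if ffmpeg exits with an error."""
    subprocess.run(ffmpeg_16k_mono_cmd(src, dst), check=True)

# Example (placeholder file names):
# to_16k_mono_wav("recording.mp3", "recording_16k.wav")
```

The resulting WAV file can then be passed to `asr_model.transcribe(...)` as shown in step 3.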

Suggestion for Cloud GPUs

For enhanced performance, especially with large datasets, consider using cloud GPU services like AWS, Google Cloud, or Azure.

License

The model is released under the CC BY 4.0 license, allowing for sharing and adaptation with attribution.
