Introduction

NVIDIA's CANARY-1B is a multilingual, multitask model for automatic speech recognition (ASR) and automatic speech translation (AST). It transcribes speech in English, German, French, and Spanish, translates between English and the other three languages, and achieves strong results on public ASR and translation benchmarks.

Architecture

CANARY-1B is an encoder-decoder model with a FastConformer encoder and a Transformer decoder, 24 layers each. It uses a concatenated SentencePiece tokenizer, built from per-language sub-tokenizers, to handle multiple languages efficiently. Task-specific prompt tokens select the task (ASR or AST), the source and target languages, and whether punctuation and capitalization are produced.
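To illustrate the concatenated-tokenizer idea (this is a toy sketch, not NeMo's implementation): several per-language sub-tokenizers are joined into one vocabulary by giving each sub-tokenizer a fixed ID offset, so every language keeps its own region of the token space.

```python
class ToyConcatTokenizer:
    """Toy word-level stand-in for a concatenated SentencePiece tokenizer."""

    def __init__(self, vocabs):
        # vocabs: dict mapping language -> list of tokens for that language
        self.offsets = {}   # per-language starting ID
        self.tables = {}    # per-language token -> local ID
        offset = 0
        for lang, tokens in vocabs.items():
            self.offsets[lang] = offset
            self.tables[lang] = {tok: i for i, tok in enumerate(tokens)}
            offset += len(tokens)
        self.total_vocab_size = offset

    def encode(self, text, lang):
        # Route to the language's sub-tokenizer, then shift into the shared ID space.
        table, off = self.tables[lang], self.offsets[lang]
        return [table[tok] + off for tok in text.split()]


tok = ToyConcatTokenizer({
    "en": ["hello", "world"],
    "de": ["hallo", "welt"],
})
print(tok.encode("hello world", "en"))  # → [0, 1] (English range)
print(tok.encode("hallo welt", "de"))   # → [2, 3] (same positions, German offset)
```

Because IDs never overlap across languages, the decoder can emit tokens for any supported language from a single shared output layer.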

Training

The model was trained using NVIDIA's NeMo toolkit over 150,000 steps on 128 NVIDIA A100 GPUs. Training utilized dynamic bucketing and a batch duration of 360 seconds per GPU. The dataset comprises 85,000 hours of speech data from both public sources and proprietary collections, with significant contributions from English, German, French, and Spanish datasets.
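The idea behind a duration-based batch budget can be sketched as follows. This is a deliberately simplified greedy packer; NeMo's actual dynamic bucketing (via Lhotse) is more sophisticated, but the principle is the same: batch size varies so that each batch's total audio stays under a fixed budget such as 360 seconds.

```python
def pack_by_duration(durations, budget=360.0):
    """Greedily pack utterance durations (seconds) into batches under a budget."""
    batches, current, total = [], [], 0.0
    for d in sorted(durations):  # sorting groups similar lengths, reducing padding
        if current and total + d > budget:
            batches.append(current)
            current, total = [], 0.0
        current.append(d)
        total += d
    if current:
        batches.append(current)
    return batches


batches = pack_by_duration([30, 120, 200, 90, 45, 300], budget=360)
print(batches)  # → [[30, 45, 90, 120], [200], [300]]
```

Batching by total duration rather than by utterance count keeps GPU memory use roughly constant whether a batch holds many short clips or a few long ones.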

Guide: Running Locally

  1. Install Dependencies:
    Ensure you have Cython and the latest version of PyTorch installed. Then, install NeMo:

    pip install git+https://github.com/NVIDIA/NeMo.git@r1.23.0#egg=nemo_toolkit[asr]
    
  2. Load the Model:
    Use the following code to load the pre-trained model:

    from nemo.collections.asr.models import EncDecMultiTaskModel
    canary_model = EncDecMultiTaskModel.from_pretrained('nvidia/canary-1b')
    
  3. Input Configuration:
    Prepare a JSONL manifest file with audio file paths and language specifications.

  4. Run Inference:

    predicted_text = canary_model.transcribe("<path to input manifest file>", batch_size=16)
    
  5. Cloud GPUs:
    Consider using cloud services like AWS, Google Cloud, or Azure for GPU resources if local hardware is insufficient.
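The manifest from step 3 can be written with a few lines of Python. This is a minimal sketch: the field names follow the format described on the model card (`taskname` selects ASR vs. AST, `pnc` toggles punctuation and capitalization), while the audio path and duration are placeholder values.

```python
import json

# One JSON object per line ("JSONL"); each object describes one utterance.
entries = [
    {
        "audio_filepath": "/data/sample_en.wav",  # placeholder path
        "duration": 5.2,        # audio length in seconds
        "taskname": "asr",      # "asr" to transcribe, "ast" to translate
        "source_lang": "en",    # language spoken in the audio
        "target_lang": "en",    # same as source_lang for plain ASR
        "pnc": "yes",           # produce punctuation and capitalization
        "answer": "na",
    },
]

with open("input_manifest.json", "w") as f:
    for entry in entries:
        f.write(json.dumps(entry) + "\n")
```

The resulting `input_manifest.json` is what you pass to `canary_model.transcribe(...)` in step 4.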

License

The CANARY-1B model is released under the Creative Commons Attribution-NonCommercial 4.0 (CC BY-NC 4.0) license, which permits use, distribution, and modification for non-commercial purposes with required attribution.
