ssl_en_nest_xlarge_v1.0

nvidia

Introduction

The NVIDIA NEST XLARGE EN model is designed for speech self-supervised learning, suitable for use as a frozen speech feature extractor or for weight initialization in downstream speech processing tasks. This model contains approximately 600 million parameters and has been trained on a dataset of roughly 100,000 hours of English audio.

Architecture

The NEST framework employs a 24-layer FastConformer encoder with a linear classifier decoder. During pretraining, random blocks of input frames are masked, and the training audio is augmented with speaker and noise perturbation. The model is optimized with a cross-entropy loss computed only on the masked positions. The input is 16 kHz mono-channel audio in WAV format, and the output is a sequence of audio features.
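The random block masking described above can be sketched in a few lines. This is a minimal illustration, not NEST's actual implementation: the block size, masking ratio, and the strategy of drawing independent (possibly overlapping) block starts are assumed values chosen for clarity.

```python
import random

def random_block_mask(num_frames, block_size=40, mask_ratio=0.1, seed=0):
    """Mask contiguous blocks of frames at random.

    block_size and mask_ratio are illustrative, not the model's
    actual pretraining hyperparameters.
    """
    rng = random.Random(seed)
    mask = [False] * num_frames
    # Draw enough block starts so that roughly mask_ratio of the
    # frames end up masked (overlapping blocks simply merge).
    num_blocks = max(1, int(num_frames * mask_ratio / block_size))
    for _ in range(num_blocks):
        start = rng.randrange(0, max(1, num_frames - block_size))
        for i in range(start, min(start + block_size, num_frames)):
            mask[i] = True
    return mask

mask = random_block_mask(1000)
print(sum(mask))  # count of masked frames; at most num_blocks * block_size
```

During pretraining, the cross-entropy loss would then be accumulated only over the positions where `mask` is `True`, so the encoder must reconstruct the masked content from the surrounding context.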

Training

The model is trained with NVIDIA's NeMo Framework on datasets including LibriLight and VoxPopuli. Data collection and labeling follow a hybrid method that combines automated and human processes.

Guide: Running Locally

To run the model locally, follow these steps:

  1. Install NVIDIA NeMo Framework:

    • Clone the NeMo repository from GitHub.
    • Follow the installation guidelines to set up the environment.
  2. Instantiate the Model:

    # Download and instantiate the pretrained NEST model
    from nemo.collections.asr.models import EncDecDenoiseMaskedTokenPredModel

    nest_model = EncDecDenoiseMaskedTokenPredModel.from_pretrained(model_name="nvidia/ssl_en_nest_xlarge_v1.0")
    
  3. Use as Weight Initialization:

    • Use the weight-initialization script provided with NeMo to initialize a downstream model (e.g., an ASR model) from the pretrained NEST encoder.
  4. Use as Feature Extractor:

    • Use the feature-extraction script provided with NeMo to export frame-level audio features for tasks such as speaker verification.
  5. Hardware Recommendations:

    • For optimal performance, use an NVIDIA GPU such as the A6000 or A100, locally or in the cloud.
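For the feature-extractor use case above, a common recipe is to mean-pool the frame-level features into a single utterance embedding and compare embeddings with cosine similarity. The sketch below uses random stand-in features; in practice they would come from the NEST encoder via the NeMo feature-extraction script, and the pooling and scoring shown here are one conventional choice, not the only one.

```python
import math
import random

def mean_pool(frames):
    """Average frame-level features (T x D) into one utterance embedding."""
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / len(frames) for d in range(dim)]

def cosine(a, b):
    """Cosine similarity between two embeddings."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Stand-in frame features (50 and 60 frames, 8 dims); real NEST
# features would have many more dimensions.
rng = random.Random(0)
utt_a = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(50)]
utt_b = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(60)]

emb_a, emb_b = mean_pool(utt_a), mean_pool(utt_b)
score = cosine(emb_a, emb_b)  # compare against a tuned verification threshold
```

For speaker verification, the score would be thresholded: same-speaker pairs should score higher than different-speaker pairs, with the threshold tuned on held-out trials.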

License

The model is released under the CC-BY-4.0 license, which permits both commercial and non-commercial use under its terms. See the CC-BY-4.0 license text for details.
