speakerverification_en_titanet_large

nvidia

Introduction

The NVIDIA SpeakerVerification EN TitaNet Large model extracts speaker embeddings from audio, supporting speaker verification and diarization tasks. It is the "large" variant of the TitaNet model, containing approximately 23 million parameters. The model accepts 16 kHz (16,000 Hz) mono-channel audio files and outputs fixed-size speaker embeddings.

Architecture

The TitaNet architecture is based on depth-wise separable 1D convolutions, specifically optimized for speaker verification and diarization tasks. Detailed information about the architecture can be found in the NeMo documentation.
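The main appeal of depth-wise separable convolutions is the parameter savings over dense convolutions. The sketch below compares the two parameter counts; the channel and kernel sizes are illustrative, not TitaNet's actual configuration.

```python
# Parameter-count comparison between a standard 1D convolution and the
# depth-wise separable form TitaNet builds on. Channel and kernel sizes
# are illustrative only, not TitaNet's actual configuration.

def standard_conv1d_params(c_in, c_out, k):
    # One dense kernel mixes channels and time simultaneously.
    return c_in * c_out * k

def separable_conv1d_params(c_in, c_out, k):
    # Depth-wise pass (one k-tap filter per input channel) followed by
    # a point-wise 1x1 convolution that mixes channels.
    return c_in * k + c_in * c_out

c_in, c_out, k = 256, 256, 11
dense = standard_conv1d_params(c_in, c_out, k)
separable = separable_conv1d_params(c_in, c_out, k)
print(dense, separable, round(dense / separable, 1))  # → 720896 68352 10.5
```

For these sizes the separable form needs roughly a tenth of the parameters, which is how TitaNet stays at ~23M parameters while remaining deep.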

Training

The model was trained using NVIDIA's NeMo toolkit over several hundred epochs with datasets such as Voxceleb-1, Voxceleb-2, Fisher, Switchboard, Librispeech, and SRE. The performance is evaluated using Equal Error Rate (EER) for speaker verification and Diarization Error Rate (DER) for speaker diarization.
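Equal Error Rate is the operating point where the false-accept and false-reject rates meet. A minimal sketch of that computation from verification trial scores follows; the score lists are toy values, not results from this model.

```python
# Hedged sketch of Equal Error Rate (EER): sweep candidate thresholds
# over the trial scores and report the point where the false-accept
# rate (FAR) and false-reject rate (FRR) are closest.

def eer(genuine, impostor):
    best = None
    for t in sorted(genuine + impostor):
        far = sum(s >= t for s in impostor) / len(impostor)  # false accepts
        frr = sum(s < t for s in genuine) / len(genuine)     # false rejects
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2)
    return best[1]

# Toy cosine-similarity scores for same-speaker and different-speaker trials.
genuine = [0.9, 0.8, 0.75, 0.6, 0.55]
impostor = [0.7, 0.5, 0.4, 0.3, 0.2]
print(eer(genuine, impostor))  # → 0.2
```

A lower EER means the score distributions of same-speaker and different-speaker trials overlap less.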

Guide: Running Locally

  1. Installation:

    • Install the NVIDIA NeMo toolkit after setting up the latest version of PyTorch:
      pip install "nemo_toolkit[all]"
      
  2. Model Usage:

    • Instantiate the model using:
      import nemo.collections.asr as nemo_asr
      speaker_model = nemo_asr.models.EncDecSpeakerLabelModel.from_pretrained("nvidia/speakerverification_en_titanet_large")
      
    • Extract embeddings:
      emb = speaker_model.get_embedding("an255-fash-b.wav")
      
    • Verify speakers:
      speaker_model.verify_speakers("an255-fash-b.wav", "cen7-fash-b.wav")
      
  3. Cloud GPUs:

    • For faster embedding extraction over large datasets, consider cloud GPUs from providers such as AWS or Google Cloud.
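Conceptually, the verification step above reduces to scoring two embeddings and thresholding the score. The sketch below assumes cosine similarity and a threshold of 0.7; the exact scoring and tuned threshold inside NeMo's verify_speakers may differ, and the embedding vectors here are illustrative.

```python
# Hedged sketch of the decision behind verify_speakers: score two
# speaker embeddings by cosine similarity and accept the pair as the
# same speaker if the score clears a threshold. Vectors and the 0.7
# threshold are illustrative assumptions, not NeMo's internals.
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def same_speaker(emb_a, emb_b, threshold=0.7):
    return cosine_similarity(emb_a, emb_b) >= threshold

emb1 = [0.2, 0.9, 0.4]
emb2 = [0.25, 0.85, 0.35]  # nearly the same direction -> same speaker
emb3 = [-0.8, 0.1, 0.6]    # very different direction -> different speaker
print(same_speaker(emb1, emb2), same_speaker(emb1, emb3))  # → True False
```

Because the decision depends only on embedding direction, embeddings can be cached and compared cheaply across many trials.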

License

This model is released under the CC-BY-4.0 license. Use of the model implies acceptance of the terms and conditions of that license.
