WavLM-Base-Plus-SD

Microsoft

Introduction

WavLM-Base-Plus-SD is a model developed by Microsoft for speaker diarization, part of the broader WavLM project aimed at full-stack speech processing. WavLM is pre-trained with self-supervised learning on unlabeled audio and is designed to transfer well across a wide range of speech tasks, not just automatic speech recognition.

Architecture

WavLM builds upon the HuBERT framework, adding a gated relative position bias to the Transformer to improve speech recognition. It also employs an utterance mixing training strategy for better speaker discrimination, in which overlapped utterances are created unsupervisedly and incorporated during training (sketched below).
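
The utterance mixing idea can be illustrated with a short sketch. This is a simplification for intuition, not WavLM's actual training code; the function name mix_utterances and the 50% overlap cap are assumptions made here for illustration:

```python
import torch

def mix_utterances(main: torch.Tensor, other: torch.Tensor, max_ratio: float = 0.5) -> torch.Tensor:
    """Overlay a random crop of `other` onto `main` at a random offset.

    Illustrative sketch of utterance mixing: the overlapped segment covers
    at most `max_ratio` of the main utterance, so the main speaker stays dominant.
    """
    seg_len = int(torch.randint(1, int(len(main) * max_ratio) + 1, (1,)))
    src = torch.randint(0, len(other) - seg_len + 1, (1,)).item()
    dst = torch.randint(0, len(main) - seg_len + 1, (1,)).item()
    mixed = main.clone()
    mixed[dst:dst + seg_len] += other[src:src + seg_len]
    return mixed

# Example: overlap two 1-second, 16 kHz waveforms
a, b = torch.randn(16000), torch.randn(16000)
overlapped = mix_utterances(a, b)
```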

Training

The model was pre-trained on roughly 94,000 hours of audio in total:

  • 60,000 hours from Libri-Light
  • 10,000 hours from GigaSpeech
  • 24,000 hours from VoxPopuli

The model was then fine-tuned for speaker diarization on the LibriMix dataset, using just a linear layer on top of the network to map frame-level outputs to speaker activity labels.
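
As a rough illustration of this setup, the diarization head amounts to a per-frame linear projection from encoder hidden states to speaker logits. The sketch below assumes WavLM-Base's hidden size of 768 and the two-speaker LibriMix setting; the frame count is arbitrary:

```python
import torch
import torch.nn as nn

hidden_size, num_speakers = 768, 2      # WavLM-Base hidden size; 2 speakers in LibriMix
head = nn.Linear(hidden_size, num_speakers)

# hidden_states: (batch, num_frames, hidden_size) from the WavLM encoder
hidden_states = torch.randn(1, 292, hidden_size)
logits = head(hidden_states)            # (batch, num_frames, num_speakers)
```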

Guide: Running Locally

To run the WavLM-Base-Plus-SD model for speaker diarization locally, follow these steps (a runnable sketch follows the list):

  1. Install Libraries: Ensure you have transformers, datasets, and torch installed.
  2. Load Dataset: Use datasets to load a sample dataset, such as librispeech_asr_demo.
  3. Initialize Components:
    • Import Wav2Vec2FeatureExtractor and WavLMForAudioFrameClassification.
    • Load the pre-trained model and feature extractor from Hugging Face.
  4. Process Audio:
    • Extract features from the audio input.
    • Pass features through the model to obtain logits.
    • Convert logits to probabilities using a sigmoid function.
  5. Determine Labels: Classify frames based on a probability threshold.
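
Putting the steps together, here is a minimal end-to-end sketch following the usage pattern for this checkpoint; the 0.5 decision threshold is a common default rather than a tuned value:

```python
import torch
from datasets import load_dataset
from transformers import Wav2Vec2FeatureExtractor, WavLMForAudioFrameClassification

# Steps 1-3: load a sample dataset and the pre-trained model + feature extractor
dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("microsoft/wavlm-base-plus-sd")
model = WavLMForAudioFrameClassification.from_pretrained("microsoft/wavlm-base-plus-sd")

# Step 4: extract features and run the model to obtain frame-level logits
audio = dataset[0]["audio"]["array"]
inputs = feature_extractor(audio, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Convert logits to per-speaker probabilities
probabilities = torch.sigmoid(logits[0])

# Step 5: threshold to get per-frame speaker activity, shape (num_frames, num_speakers)
labels = (probabilities > 0.5).long()
```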

For enhanced performance, consider using cloud GPUs from providers like AWS or Google Cloud.

License

The model is licensed under the terms outlined in the official repository.
