WavLM-Base-Plus-SD
Introduction
WavLM-Base-Plus-SD is a model developed by Microsoft for speaker diarization, part of the broader WavLM project aimed at full-stack speech processing tasks. It leverages self-supervised learning techniques to enhance performance across various speech-related tasks.
Architecture
WavLM builds upon the HuBERT framework, enhancing it with a gated relative position bias to improve speech recognition capabilities. It also employs an utterance mixing training strategy for better speaker discrimination, where additional overlapped utterances are created and used during training.
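The utterance mixing idea can be illustrated with a short sketch. This is a simplified, assumed version of the strategy (the actual WavLM training pipeline samples secondary utterances from the same batch and handles energy scaling); the function name `mix_utterance` and its parameters are illustrative, not from the paper.

```python
import numpy as np

def mix_utterance(primary, secondary, max_overlap=0.5, rng=None):
    """Overlay a (possibly truncated) secondary utterance onto the primary
    waveform at a random offset. max_overlap caps how much of the primary
    can be covered, so the primary speaker remains dominant."""
    rng = rng or np.random.default_rng()
    mixed = primary.copy()
    overlap_len = min(len(secondary), int(len(primary) * max_overlap))
    start = int(rng.integers(0, len(primary) - overlap_len + 1))
    mixed[start:start + overlap_len] += secondary[:overlap_len]
    return mixed

# 1-second primary and 0.3-second secondary utterance at 16 kHz
primary = np.random.randn(16000).astype(np.float32)
secondary = np.random.randn(4800).astype(np.float32)
mixed = mix_utterance(primary, secondary)
```

The overlapped mixture keeps the primary transcription as the training target, which pushes the model to attend to the dominant speaker and discriminate it from interference.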
Training
The model was pre-trained on a large dataset comprising:
- 60,000 hours from Libri-Light
- 10,000 hours from GigaSpeech
- 24,000 hours from VoxPopuli
For speaker diarization, WavLM was fine-tuned on the LibriMix dataset using a simple linear layer on top of the network outputs to produce per-frame speaker labels.
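The fine-tuning head described above amounts to a per-frame linear classifier over the encoder's hidden states. A minimal sketch, assuming the base model's hidden size of 768 and two speakers (the class name `FrameDiarizationHead` is illustrative, not the library's):

```python
import torch
import torch.nn as nn

class FrameDiarizationHead(nn.Module):
    """Linear per-frame classifier mapping encoder hidden states to speaker logits."""
    def __init__(self, hidden_size=768, num_speakers=2):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, num_speakers)

    def forward(self, hidden_states):
        # hidden_states: (batch, frames, hidden_size) -> (batch, frames, num_speakers)
        return self.classifier(hidden_states)

head = FrameDiarizationHead()
logits = head(torch.randn(1, 99, 768))  # shape: (1, 99, 2)
```

Each frame gets one logit per speaker, so overlapping speech is handled naturally: both speakers can be active in the same frame.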
Guide: Running Locally
To run the WavLM-Base-Plus-SD model for speaker diarization locally, follow these steps:
- Install Libraries: Ensure you have `transformers`, `datasets`, and `torch` installed.
- Load Dataset: Use `datasets` to load a sample dataset, such as `librispeech_asr_demo`.
- Initialize Components:
  - Import `Wav2Vec2FeatureExtractor` and `WavLMForAudioFrameClassification`.
  - Load the pre-trained model and feature extractor from Hugging Face.
- Process Audio:
  - Extract features from the audio input.
  - Pass features through the model to obtain logits.
  - Convert logits to probabilities using a sigmoid function.
- Determine Labels: Classify frames based on a probability threshold.
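The steps above can be sketched as follows. The model-loading calls are shown as comments because they trigger a sizable download; the `frames_to_labels` helper and the 0.5 threshold are assumed illustrations of the sigmoid-and-threshold step, not the model card's exact snippet.

```python
import torch

# Full pipeline (requires network access and a model download):
#   from transformers import Wav2Vec2FeatureExtractor, WavLMForAudioFrameClassification
#   from datasets import load_dataset
#   ds = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
#   feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("microsoft/wavlm-base-plus-sd")
#   model = WavLMForAudioFrameClassification.from_pretrained("microsoft/wavlm-base-plus-sd")
#   inputs = feature_extractor(ds[0]["audio"]["array"], sampling_rate=16000, return_tensors="pt")
#   with torch.no_grad():
#       logits = model(**inputs).logits[0]  # (frames, num_speakers)

def frames_to_labels(logits, threshold=0.5):
    """Map per-frame logits to binary speaker-activity labels via sigmoid + threshold."""
    return (torch.sigmoid(logits) > threshold).long()

# Demonstration on synthetic logits: 4 frames, 2 speakers.
logits = torch.tensor([[3.0, -3.0], [-3.0, 3.0], [3.0, 3.0], [-3.0, -3.0]])
labels = frames_to_labels(logits)
# labels: [[1, 0], [0, 1], [1, 1], [0, 0]]
```

Because each speaker gets an independent sigmoid, a frame can be labeled active for zero, one, or both speakers, which is what allows the model to flag overlapped speech.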
For enhanced performance, consider using cloud GPUs from providers like AWS or Google Cloud.
License
The model is licensed under the terms outlined here.