pyannote/segmentation-3.0

Introduction

The pyannote/segmentation-3.0 model performs speaker segmentation using PyTorch. It processes 10-second chunks of mono audio sampled at 16kHz and outputs per-frame speaker activity for up to three speakers, including frames where speakers overlap. Its output can power downstream applications such as voice activity detection and overlapped speech detection.
Architecture
For each chunk, the model outputs a matrix of per-frame scores over seven classes: non-speech, each of the three speakers alone, and each of the three possible pairs of overlapping speakers. This powerset multi-class encoding turns overlapped speech into ordinary classes of a single multi-class problem, rather than independent per-speaker activations.
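To make the encoding concrete, here is a minimal sketch (plain Python, not pyannote's internal API) that enumerates the powerset classes for three speakers with at most two active at once, yielding the seven classes described above:

```python
from itertools import combinations

# Enumerate powerset classes for up to 3 speakers with at most 2
# active simultaneously (an assumption matching the class layout
# described above; pyannote's internals may differ).
speakers = ["spk1", "spk2", "spk3"]
max_simultaneous = 2

classes = [()]  # the empty subset encodes non-speech
for size in range(1, max_simultaneous + 1):
    classes.extend(combinations(speakers, size))

for index, subset in enumerate(classes):
    print(index, subset or "non-speech")
# 7 classes total: non-speech, 3 single speakers, 3 overlapping pairs
```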
Training
The model was trained with pyannote.audio 3.0.0 on a combination of datasets: AISHELL, AliMeeting, AMI, AVA-AVD, DIHARD, Ego4D, MSDWild, REPERE, and VoxConverse. The underlying method is described in the paper "Powerset multi-class cross entropy loss for neural speaker diarization" by Alexis Plaquet and Hervé Bredin.
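The core idea of that loss is to map each frame's multi-label target (which speakers are active) to a single powerset class index, then train with ordinary cross entropy. Below is a minimal sketch of that mapping, assuming the hypothetical 7-class layout enumerated earlier; it illustrates the concept only and is not pyannote's actual training code:

```python
import torch
import torch.nn.functional as F

# Hypothetical powerset layout matching the enumeration above:
# 0 = non-speech, 1-3 = single speakers, 4-6 = overlapping pairs.
POWERSET = [frozenset(), frozenset({0}), frozenset({1}), frozenset({2}),
            frozenset({0, 1}), frozenset({0, 2}), frozenset({1, 2})]
CLASS_INDEX = {subset: i for i, subset in enumerate(POWERSET)}

def multilabel_to_powerset(targets: torch.Tensor) -> torch.Tensor:
    """Map (frames, 3) binary speaker activity to (frames,) class indices."""
    return torch.tensor([
        CLASS_INDEX[frozenset(torch.nonzero(frame).flatten().tolist())]
        for frame in targets
    ])

# Toy targets: one powerset class per frame, decoded to binary labels.
class_ids = torch.randint(0, 7, (100,))
targets = torch.zeros(100, 3)
for frame, class_id in enumerate(class_ids):
    for speaker in POWERSET[int(class_id)]:
        targets[frame, speaker] = 1.0

logits = torch.randn(100, 7)  # stand-in for per-frame model output
loss = F.cross_entropy(logits, multilabel_to_powerset(targets))
```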
Guide: Running Locally
- Install dependencies: install pyannote.audio with pip install pyannote.audio.
- Accept user conditions: accept the user conditions for pyannote/segmentation-3.0 on the Hugging Face platform.
- Access token: create an access token from your Hugging Face settings.
- Instantiate the model (a voice activity detection sketch follows this list):
```python
from pyannote.audio import Model

# Load the pretrained segmentation model; a valid Hugging Face
# access token is required.
model = Model.from_pretrained(
    "pyannote/segmentation-3.0",
    use_auth_token="HUGGINGFACE_ACCESS_TOKEN_GOES_HERE",
)
```
- Suggested cloud GPUs: for faster processing, consider cloud GPU services such as AWS, Google Cloud, or Azure.
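With the model instantiated, it can be plugged into one of pyannote.audio's pipelines. The sketch below applies it to voice activity detection; the audio path is a placeholder, and the hyperparameter values shown are illustrative rather than tuned settings:

```python
from pyannote.audio import Model
from pyannote.audio.pipelines import VoiceActivityDetection

model = Model.from_pretrained(
    "pyannote/segmentation-3.0",
    use_auth_token="HUGGINGFACE_ACCESS_TOKEN_GOES_HERE",
)

# Build a voice activity detection pipeline on top of the segmentation model.
pipeline = VoiceActivityDetection(segmentation=model)
pipeline.instantiate({
    "min_duration_on": 0.0,   # remove speech regions shorter than this (seconds)
    "min_duration_off": 0.0,  # fill non-speech gaps shorter than this (seconds)
})

# "audio.wav" is a placeholder path to a 16kHz mono file.
vad = pipeline("audio.wav")
for segment, _, label in vad.itertracks(yield_label=True):
    print(f"{label}: {segment.start:.1f}s - {segment.end:.1f}s")
```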
License
The model is released under the MIT License, allowing open-source use and modification. Note that users may receive communications about premium pyannote.audio models and services.