Speaker Diarization 3.1
Introduction
Speaker Diarization 3.1 is an open-source pipeline from the Pyannote library for identifying and segmenting speakers in audio recordings. This version removes the onnxruntime dependency and runs entirely in PyTorch, which simplifies deployment and may improve inference speed. The pipeline expects mono audio sampled at 16kHz and returns its output as pyannote.core.Annotation instances.
Architecture
The architecture of Speaker Diarization 3.1 leverages the PyTorch framework for both speaker segmentation and embedding. The pipeline downmixes stereo or multi-channel audio to mono and resamples any audio not at 16kHz automatically. It integrates with the broader Pyannote ecosystem, requiring version 3.1 or higher of pyannote.audio.
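
As an illustration of this preprocessing, the sketch below performs an equivalent downmix and resample with torchaudio. It is not required before calling the pipeline, which handles this internally; the file name is a placeholder.

    import torchaudio
    import torchaudio.functional as F

    # load a multi-channel recording at an arbitrary sample rate (placeholder file name)
    waveform, sample_rate = torchaudio.load("stereo_44k.wav")

    # downmix to mono by averaging channels (one common way to downmix)
    mono = waveform.mean(dim=0, keepdim=True)

    # resample to the 16kHz rate expected by the models
    mono_16k = F.resample(mono, orig_freq=sample_rate, new_freq=16000)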
Training
The pipeline has been benchmarked across multiple datasets, processed fully automatically: no manual voice activity detection and no manually specified number of speakers. Benchmarking uses a stringent diarization error rate (DER) setup, with no forgiveness collar and overlapped speech included in scoring.
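
For reference, such a DER can be computed with the pyannote.metrics package. A sketch, assuming reference and hypothesis RTTM files with placeholder names and a placeholder file URI:

    from pyannote.database.util import load_rttm
    from pyannote.metrics.diarization import DiarizationErrorRate

    # load reference and hypothesis annotations from RTTM files (placeholder names and URI)
    reference = load_rttm("reference.rttm")["my_file"]
    hypothesis = load_rttm("hypothesis.rttm")["my_file"]

    # no forgiveness collar, overlapped speech included in scoring
    metric = DiarizationErrorRate(collar=0.0, skip_overlap=False)
    der = metric(reference, hypothesis)
    print(f"DER = {100 * der:.1f}%")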
Guide: Running Locally
- Install Requirements:
  - Install pyannote.audio version 3.1 using pip:

        pip install pyannote.audio

  - Accept the user conditions for pyannote/segmentation-3.0 and pyannote/speaker-diarization-3.1 on Hugging Face.
  - Create an access token from your Hugging Face settings.
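
  Optionally, the access token can be cached locally so it does not have to be hard-coded in each script. A minimal sketch using huggingface_hub (installed alongside pyannote.audio); the token string is a placeholder:

      # cache the Hugging Face access token locally
      # (placeholder token string; use your own token)
      from huggingface_hub import login

      login(token="HUGGINGFACE_ACCESS_TOKEN_GOES_HERE")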
- Initialize the Pipeline:

      from pyannote.audio import Pipeline

      pipeline = Pipeline.from_pretrained(
          "pyannote/speaker-diarization-3.1",
          use_auth_token="HUGGINGFACE_ACCESS_TOKEN_GOES_HERE")
- Run Diarization:

      diarization = pipeline("audio.wav")
      with open("audio.rttm", "w") as rttm:
          diarization.write_rttm(rttm)
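
  The result is a pyannote.core.Annotation, so individual speaker turns can also be inspected directly; a short sketch, assuming the diarization object from the step above:

      # iterate over speaker turns in the diarization result
      for turn, _, speaker in diarization.itertracks(yield_label=True):
          print(f"{speaker}: {turn.start:.1f}s -> {turn.end:.1f}s")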
- Processing on GPU:

      import torch

      pipeline.to(torch.device("cuda"))
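
  If the target machine may or may not have a GPU, a small defensive sketch is to check availability first:

      import torch

      # move the pipeline to GPU only when one is actually available
      if torch.cuda.is_available():
          pipeline.to(torch.device("cuda"))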
- Additional Options:
  - From Memory:

        import torchaudio

        waveform, sample_rate = torchaudio.load("audio.wav")
        diarization = pipeline({"waveform": waveform, "sample_rate": sample_rate})

  - Monitor Progress:

        from pyannote.audio.pipelines.utils.hook import ProgressHook

        with ProgressHook() as hook:
            diarization = pipeline("audio.wav", hook=hook)

  - Control Speaker Count:

        # exact number of speakers, when known in advance
        diarization = pipeline("audio.wav", num_speakers=2)

        # or lower and upper bounds on the number of speakers
        diarization = pipeline("audio.wav", min_speakers=2, max_speakers=5)
Recommended Cloud GPUs: For faster processing, consider GPU instances from cloud providers such as AWS, Google Cloud, or Azure.
License
Speaker Diarization 3.1 is released under the MIT License, ensuring it remains open-source and freely available for use and modification.