Speaker Diarization 3.1

pyannote

Introduction

Speaker Diarization 3.1 is an open-source pipeline from the pyannote.audio library for identifying and segmenting speakers in audio recordings. This version removes the onnxruntime dependency and runs entirely on PyTorch, which simplifies deployment and may speed up inference. It expects mono audio sampled at 16 kHz and returns diarization results as Annotation instances.

Architecture

Speaker Diarization 3.1 uses the PyTorch framework for both speaker segmentation and speaker embedding. The pipeline automatically downmixes stereo or multi-channel audio to mono and resamples any audio not already at 16 kHz. It integrates with the broader pyannote ecosystem and requires pyannote.audio version 3.1 or higher.
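The automatic downmix step amounts to averaging the per-sample values across channels. A minimal illustrative sketch in plain Python (the pipeline itself operates on PyTorch tensors; this function is not part of the library API):

```python
def downmix_to_mono(channels):
    # Average each sample position across all channels -- conceptually
    # what the pipeline's automatic downmix does (illustrative only).
    n = len(channels)
    return [sum(samples) / n for samples in zip(*channels)]

# Two channels, three samples each.
stereo = [[0.5, 1.0, 0.25], [0.5, 0.0, 0.75]]
mono = downmix_to_mono(stereo)  # -> [0.5, 0.5, 0.5]
```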

Benchmarking

The pipeline has been benchmarked across multiple datasets, running fully automatically: no manual voice activity detection and no preset number of speakers. Evaluation uses a stringent diarization error rate (DER) setup that scores overlapped speech and applies no forgiveness collar.
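DER in this setup is the sum of false alarm, missed detection, and speaker confusion durations divided by the total duration of reference speech. A minimal sketch of that computation (the benchmark itself uses pyannote's metric implementation; this function is illustrative):

```python
def diarization_error_rate(false_alarm, missed_detection, confusion, total_speech):
    # DER as described above: all error durations (overlapped speech
    # included, no forgiveness collar) over total reference speech.
    return (false_alarm + missed_detection + confusion) / total_speech

# 2 s false alarm + 3 s missed + 5 s confusion over 100 s of speech.
der = diarization_error_rate(2.0, 3.0, 5.0, 100.0)  # -> 0.1 (10% DER)
```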

Guide: Running Locally

  1. Install Requirements:

    • Install pyannote.audio version 3.1 using pip:
      pip install "pyannote.audio>=3.1"
      
    • Accept user conditions for pyannote/segmentation-3.0 and pyannote/speaker-diarization-3.1.
    • Create an access token from Hugging Face settings.
  2. Initialize the Pipeline:

    from pyannote.audio import Pipeline
    pipeline = Pipeline.from_pretrained(
      "pyannote/speaker-diarization-3.1",
      use_auth_token="HUGGINGFACE_ACCESS_TOKEN_GOES_HERE"
    )
    
  3. Run Diarization:

    diarization = pipeline("audio.wav")
    with open("audio.rttm", "w") as rttm:
        diarization.write_rttm(rttm)
    
  4. Processing on GPU:

    import torch
    pipeline.to(torch.device("cuda"))
    
  5. Additional Options:

    • From Memory:
      import torchaudio
      waveform, sample_rate = torchaudio.load("audio.wav")
      diarization = pipeline({"waveform": waveform, "sample_rate": sample_rate})
      
    • Monitor Progress:
      from pyannote.audio.pipelines.utils.hook import ProgressHook
      with ProgressHook() as hook:
          diarization = pipeline("audio.wav", hook=hook)
      
    • Control Speaker Count:
      diarization = pipeline("audio.wav", num_speakers=2)
      diarization = pipeline("audio.wav", min_speakers=2, max_speakers=5)
      
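The RTTM file written in step 3 contains one space-separated SPEAKER record per speech turn. A minimal sketch of that record layout, for readers who want to generate or parse such files downstream (the field order follows the RTTM convention; this helper is illustrative, not part of the library API):

```python
def to_rttm_line(uri, start, duration, speaker):
    # One SPEAKER record: channel is 1, and the unused fields
    # (orthography, type, confidence, lookahead) are "<NA>".
    return (f"SPEAKER {uri} 1 {start:.3f} {duration:.3f} "
            f"<NA> <NA> {speaker} <NA> <NA>")

line = to_rttm_line("audio", 0.5, 2.25, "SPEAKER_00")
# -> "SPEAKER audio 1 0.500 2.250 <NA> <NA> SPEAKER_00 <NA> <NA>"
```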

Recommended Cloud GPUs: GPU instances on cloud services such as AWS, Google Cloud, or Azure can substantially speed up processing.

License

Speaker Diarization 3.1 is released under the MIT License, ensuring it remains open-source and freely available for use and modification.
