segmentation 3.0

pyannote

Introduction

The pyannote/segmentation-3.0 model is a PyTorch model for speaker segmentation. It ingests 10 seconds of mono audio sampled at 16 kHz and outputs speaker diarization for up to three speakers, including regions where their speech overlaps. The same output also supports downstream tasks such as voice activity detection and overlapped speech detection.

Architecture

The model outputs a (num_frames, num_classes) matrix with seven classes: non-speech, each of the three speakers alone, and each of the three possible pairs of simultaneous speakers. This powerset multi-class encoding represents overlapped speech directly as dedicated classes rather than as independent per-speaker labels.
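
As a minimal sketch of this output, assuming the access steps from the guide below have been completed (the token string is a placeholder), running the model on a 10-second chunk yields one row per frame and one column per powerset class:

    import torch
    from pyannote.audio import Model

    # placeholder token; accepting the model's user conditions is required first
    model = Model.from_pretrained(
        "pyannote/segmentation-3.0",
        use_auth_token="HUGGINGFACE_ACCESS_TOKEN_GOES_HERE",
    )
    model.eval()

    # random audio stands in for a real recording:
    # one batch, one mono channel, 10 seconds at 16 kHz
    waveform = torch.randn(1, 1, 16000 * 10)
    with torch.no_grad():
        scores = model(waveform)

    print(scores.shape)  # (batch, num_frames, 7)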

Training

The model was trained with pyannote.audio 3.0.0 on a combination of datasets: AISHELL, AliMeeting, AMI, AVA-AVD, DIHARD, Ego4D, MSDWild, REPERE, and VoxConverse. The approach is described in the paper "Powerset multi-class cross entropy loss for neural speaker diarization" by Alexis Plaquet and Hervé Bredin.
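
To make the encoding concrete, here is a small illustrative sketch (not taken from the paper) that enumerates the powerset classes for three speakers with at most two active at once:

    from itertools import combinations

    speakers = ["spk1", "spk2", "spk3"]

    # non-speech, each speaker alone, then each pair of overlapping speakers
    classes = [frozenset()]
    classes += [frozenset(combo) for k in (1, 2) for combo in combinations(speakers, k)]

    print(len(classes))  # 1 + 3 + 3 = 7
    for index, members in enumerate(classes):
        print(index, sorted(members) or ["non-speech"])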

Guide: Running Locally

  1. Install Dependencies:
    Install pyannote.audio using pip:

    pip install pyannote.audio
    
  2. Accept User Conditions:
    Accept the user conditions for pyannote/segmentation-3.0 on the Hugging Face platform.

  3. Access Token:
    Create an access token at hf.co/settings/tokens.

  4. Instantiate the Model (a usage sketch follows this list):

    from pyannote.audio import Model
    model = Model.from_pretrained(
      "pyannote/segmentation-3.0", 
      use_auth_token="HUGGINGFACE_ACCESS_TOKEN_GOES_HERE"
    )
    
  5. Suggested Cloud GPUs:
    Consider cloud GPU services such as AWS, Google Cloud, or Azure for faster inference and training.
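
Once the model is instantiated, it can drive the voice activity detection use case mentioned in the introduction. A minimal sketch, assuming the model object from step 4 and a local file named audio.wav:

    from pyannote.audio.pipelines import VoiceActivityDetection

    pipeline = VoiceActivityDetection(segmentation=model)

    # drop speech/non-speech regions shorter than these durations (in seconds)
    pipeline.instantiate({
        "min_duration_on": 0.0,
        "min_duration_off": 0.0,
    })

    vad = pipeline("audio.wav")

    # vad is a pyannote.core.Annotation listing speech regions
    for segment, _, label in vad.itertracks(yield_label=True):
        print(f"{label}: {segment.start:.1f}s -> {segment.end:.1f}s")

Swapping VoiceActivityDetection for OverlappedSpeechDetection from the same module applies the model to overlapped speech detection in the same way.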

License

The model is released under the MIT License, permitting open use and modification. Note, however, that accepting the user conditions may lead to communications about premium pyannote.audio models and services.
