Introduction

The Pyannote Audio Speaker Embedding model is one of the neural building blocks that pyannote.audio provides for speaker diarization. It is designed for tasks such as speaker recognition, verification, and identification. The model is built on pyannote.audio 2.1 and uses a modified x-vector architecture with trainable SincNet features.

Architecture

The model employs the canonical x-vector TDNN-based architecture, replacing the traditional filter banks with trainable SincNet features. Because the front-end filters are learned directly from the raw waveform together with the rest of the network, the model produces speaker embeddings well suited to recognition, verification, and identification tasks.
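
To see this layout for yourself, the short sketch below loads the pretrained weights and prints the underlying PyTorch module; it assumes you have already accepted the user conditions and created an access token, as described in the guide further down.

from pyannote.audio import Model

# Load the pretrained speaker embedding model (requires a Hugging Face access token)
model = Model.from_pretrained("pyannote/embedding", use_auth_token="ACCESS_TOKEN_GOES_HERE")

# The model is a regular PyTorch module: printing it shows the trainable SincNet
# front end followed by the TDNN (x-vector) layers and the final embedding layer
print(model)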

Training

The model achieves a 2.8% equal error rate (EER) on the VoxCeleb 1 test set using cosine distance without voice activity detection (VAD) or probabilistic linear discriminant analysis (PLDA). These additional methods can further improve performance.
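
To make the evaluation setup concrete, here is a minimal sketch of how a cosine-distance verification decision works; the embedding dimension, random vectors, and the 0.5 threshold are placeholders for illustration, not values from the published benchmark.

import numpy as np
from scipy.spatial.distance import cosine

# Placeholder embeddings standing in for two utterances
embedding_a = np.random.randn(512)
embedding_b = np.random.randn(512)

# Cosine distance: smaller values mean the two utterances are more likely
# to come from the same speaker
distance = cosine(embedding_a, embedding_b)

# Hypothetical operating threshold; in practice it is tuned on a development set,
# and the EER is the point where false accepts and false rejects are equally frequent
threshold = 0.5
print(f"distance={distance:.3f}, same speaker: {distance < threshold}")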

Guide: Running Locally

Basic Steps

  1. Visit hf.co/pyannote/embedding to accept user conditions.
  2. Create an access token at hf.co/settings/tokens.
  3. Instantiate the pretrained model:
    from pyannote.audio import Model
    # Replace ACCESS_TOKEN_GOES_HERE with the token created in step 2
    model = Model.from_pretrained("pyannote/embedding", use_auth_token="ACCESS_TOKEN_GOES_HERE")
    
  4. Use the model for inference (the two embeddings are compared just after this list):
    from pyannote.audio import Inference
    # window="whole" pools over the entire file, producing one embedding per audio file
    inference = Inference(model, window="whole")
    embedding1 = inference("speaker1.wav")
    embedding2 = inference("speaker2.wav")
    
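The embeddings returned above can then be scored with the same cosine distance used in the evaluation. A minimal sketch, assuming embedding1 and embedding2 come from step 4 and may be returned as 1-D arrays:

import numpy as np
from scipy.spatial.distance import cdist

# Cosine distance between the two whole-file embeddings;
# np.atleast_2d makes the call work whether the arrays are 1-D or 2-D
distance = cdist(np.atleast_2d(embedding1), np.atleast_2d(embedding2), metric="cosine")[0, 0]
print(f"cosine distance: {distance:.3f}")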

Running on GPU

For enhanced performance, run the model on a GPU:

import torch

# Move inference (and the wrapped model) to the GPU before extracting embeddings
inference.to(torch.device("cuda"))
embedding = inference("audio.wav")
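
If a GPU may or may not be present, a defensive variant (an assumption on my part, not something the original guide prescribes) is to fall back to the CPU:

import torch

# Pick the GPU when available, otherwise stay on the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
inference.to(device)
embedding = inference("audio.wav")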

Cloud GPUs

Consider using a cloud GPU provider such as AWS, Google Cloud, or Azure when no suitable local GPU is available.

License

The Pyannote Audio Speaker Embedding model is licensed under the MIT License, allowing for open-source usage with minimal restrictions.
