pyannote/embedding
Introduction
The pyannote/embedding speaker embedding model is built from pyannote.audio's neural building blocks for speaker diarization. It is designed for tasks such as speaker recognition, verification, and identification. The model is based on pyannote.audio 2.1 and uses a modified x-vector architecture with trainable SincNet features.
Architecture
The model employs the canonical x-vector TDNN-based architecture, replacing traditional filter banks with trainable SincNet features. This approach enhances the model's ability to extract meaningful speaker embeddings suitable for various voice and speech applications.
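Because the pretrained model is a standard PyTorch module, its layer structure can be checked directly once it has been loaded as shown in the guide below; printing the module is a generic PyTorch idiom for listing the SincNet front-end and TDNN blocks, not a pyannote-specific API.

# assumes `model` has already been loaded with Model.from_pretrained (see the guide below)
print(model)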
Training
The model achieves a 2.8% equal error rate (EER) on the VoxCeleb 1 test set using cosine distance without voice activity detection (VAD) or probabilistic linear discriminant analysis (PLDA). These additional methods can further improve performance.
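Here, EER is the operating point at which the false acceptance rate equals the false rejection rate. The snippet below is a minimal, generic sketch of how an EER could be estimated from verification trial scores with scikit-learn; the labels and scores arrays are illustrative placeholders, and this is not the official VoxCeleb evaluation protocol.

import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(labels, scores):
    # labels: 1 for same-speaker trials, 0 for different-speaker trials
    # scores: similarity scores, higher meaning "more likely the same speaker"
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1 - tpr
    # the EER lies where false positive and false negative rates cross
    idx = np.nanargmin(np.abs(fnr - fpr))
    return (fpr[idx] + fnr[idx]) / 2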
Guide: Running Locally
Basic Steps
- Visit hf.co/pyannote/embedding to accept user conditions.
- Create an access token at hf.co/settings/tokens.
- Instantiate the pretrained model:
from pyannote.audio import Model
model = Model.from_pretrained("pyannote/embedding",
                              use_auth_token="ACCESS_TOKEN_GOES_HERE")
- Use the model for inference:
from pyannote.audio import Inference
inference = Inference(model, window="whole")
embedding1 = inference("speaker1.wav")
embedding2 = inference("speaker2.wav")
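The two embeddings can then be compared with the same cosine distance used in the evaluation above. This sketch assumes whole-window inference returns each embedding as a (1, D) NumPy array; smaller distances indicate more similar voices, and a verification decision requires an application-specific threshold.

from scipy.spatial.distance import cdist

# cosine distance between the two whole-file speaker embeddings
distance = cdist(embedding1, embedding2, metric="cosine")[0, 0]
print(distance)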
Running on GPU
For enhanced performance, run the model on a GPU:
import torch
inference.to(torch.device("cuda"))
embedding = inference("audio.wav")
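When it is unclear whether a GPU is present, a common PyTorch pattern (not specific to pyannote) is to fall back to the CPU automatically:

import torch

# pick CUDA when available, otherwise stay on the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
inference.to(device)
embedding = inference("audio.wav")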
Cloud GPUs
Consider using cloud GPU providers like AWS, Google Cloud, or Azure for efficient processing.
License
The Pyannote Audio Speaker Embedding model is licensed under the MIT License, allowing for open-source usage with minimal restrictions.