WavLM-Base-Plus (Microsoft)
Introduction
WavLM-Base-Plus is a pre-trained speech model developed by Microsoft, designed for various speech processing tasks. It is based on the HuBERT framework and aims to handle full-stack speech tasks, focusing on both content modeling and speaker identity preservation.
Architecture
The model is built on a Transformer structure, enhanced with gated relative position bias to improve speech recognition capabilities. An utterance mixing training strategy is employed to enhance speaker discrimination. The training dataset includes 94,000 hours of speech data from sources like Libri-Light, GigaSpeech, and VoxPopuli.
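The utterance-mixing idea can be illustrated with a toy sketch: a segment of a second utterance is overlaid onto the main one at a reduced level, so the model learns to keep track of the primary speaker. This is a simplified illustration in NumPy, not WavLM's actual training recipe (the SNR-based scaling and function name here are assumptions for demonstration).

```python
import numpy as np

def mix_utterances(main, other, snr_db=10.0, rng=None):
    """Toy utterance mixing: overlay a random segment of `other` onto
    `main`, scaled so it sits `snr_db` decibels below the main signal.
    Illustrative only; the real WavLM recipe differs in detail."""
    if rng is None:
        rng = np.random.default_rng(0)
    # Loop the secondary utterance if it is shorter than the main one.
    if len(other) < len(main):
        reps = int(np.ceil(len(main) / len(other)))
        other = np.tile(other, reps)
    # Pick a random segment of the secondary utterance.
    start = rng.integers(0, len(other) - len(main) + 1)
    seg = other[start:start + len(main)]
    # Scale the interfering segment relative to the main signal's power.
    p_main = np.mean(main ** 2) + 1e-12
    p_seg = np.mean(seg ** 2) + 1e-12
    scale = np.sqrt(p_main / (p_seg * 10 ** (snr_db / 10)))
    return main + scale * seg

main = np.sin(np.linspace(0, 100, 16000))   # 1 s of dummy 16 kHz audio
other = np.sin(np.linspace(0, 50, 8000))
mixed = mix_utterances(main, other)
print(mixed.shape)  # same length as the main utterance
```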
Training
WavLM-Base-Plus was pre-trained on 16kHz sampled speech audio. It has no tokenizer, since it was trained on audio alone; to use it for speech recognition, a tokenizer must be created and the model fine-tuned on labeled data. Because training incorporates phoneme-based input, text must be converted to a sequence of phonemes before fine-tuning.
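The text-to-phoneme conversion mentioned above can be sketched with a toy lookup table. In practice you would use a proper G2P tool (e.g. the `phonemizer` package); the lexicon and function name below are illustrative assumptions, not part of WavLM or Transformers.

```python
# Toy grapheme-to-phoneme lookup. Real pipelines use a G2P tool such as
# `phonemizer`; this tiny lexicon only illustrates the conversion step.
LEXICON = {
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
}

def text_to_phonemes(text):
    """Map each word to its phoneme sequence, with <unk> for OOV words."""
    phones = []
    for word in text.lower().split():
        phones.extend(LEXICON.get(word, ["<unk>"]))
    return phones

print(text_to_phonemes("Hello world"))
# ['HH', 'AH', 'L', 'OW', 'W', 'ER', 'L', 'D']
```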
Guide: Running Locally
- Setup Environment: Install the Hugging Face Transformers library along with PyTorch.
- Download Model: Retrieve the WavLM-Base-Plus model from Hugging Face's model hub.
- Prepare Data: Ensure input speech is sampled at 16kHz and convert text data to phonemes if needed.
- Fine-tune Model: Use examples from Hugging Face's repository for tasks like speech recognition or classification.
- Inference and Evaluation: After fine-tuning, evaluate the model on your specific task.
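The loading step above can be sketched with the Transformers library. This minimal example extracts hidden-state features from dummy audio; it assumes `transformers` and `torch` are installed, and it downloads the checkpoint from the Hugging Face hub on first run.

```python
import torch
from transformers import WavLMModel

# Load the pre-trained checkpoint (downloads weights on first run).
model = WavLMModel.from_pretrained("microsoft/wavlm-base-plus")
model.eval()

# One second of dummy 16 kHz audio; replace with a real waveform in practice.
waveform = torch.zeros(1, 16000)
with torch.no_grad():
    out = model(input_values=waveform)

# Hidden states have shape (batch, frames, hidden_size);
# hidden_size is 768 for the Base-Plus model.
print(out.last_hidden_state.shape)
```

For downstream tasks such as speech recognition, a task-specific head (e.g. `WavLMForCTC`) is fine-tuned on top of these representations.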
For optimal performance, consider using cloud GPUs such as those available on AWS, Google Cloud, or Azure.
License
WavLM-Base-Plus is released under the license published on its Hugging Face model page.