ntu-spml/distilhubert

Introduction
DistilHuBERT, developed by the NTU Speech Processing & Machine Learning Lab, is a streamlined model for speech representation learning. It handles speech tasks efficiently, pairing a much smaller model with performance that remains competitive across a range of downstream tasks.
Architecture
DistilHuBERT is based on the HuBERT model, which is a self-supervised speech representation learning method. It distills hidden representations from HuBERT, reducing the model size by 75% and increasing processing speed by 73%. The model is pretrained on 16kHz sampled speech audio and does not include a tokenizer, requiring additional fine-tuning for speech recognition tasks.
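As an illustration of using the pretrained checkpoint for feature extraction, here is a minimal sketch with the Hugging Face transformers library; the dummy waveform and the generic Auto* class choices are assumptions, not taken from an official example:

```python
import numpy as np
import torch
from transformers import AutoFeatureExtractor, AutoModel

# Load the distilled model and its feature extractor from the Hugging Face Hub.
feature_extractor = AutoFeatureExtractor.from_pretrained("ntu-spml/distilhubert")
model = AutoModel.from_pretrained("ntu-spml/distilhubert")

# Dummy one-second waveform at 16 kHz; replace with real speech sampled at 16 kHz.
waveform = np.zeros(16000, dtype=np.float32)

inputs = feature_extractor(waveform, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Frame-level speech representations: (batch, frames, hidden_size).
print(outputs.last_hidden_state.shape)
```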
Training
DistilHuBERT uses a novel multi-task learning framework to distill the capabilities of the larger HuBERT model. It requires less training time and data, making it accessible for smaller entities such as academic researchers and small companies. The methodology focuses on reducing memory and computation costs while retaining performance across ten different speech processing tasks.
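As a rough, toy illustration of what a multi-task layer-distillation objective can look like (the prediction heads, number of target layers, and loss terms below are illustrative assumptions, not the exact DistilHuBERT recipe):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LayerPredictionHeads(nn.Module):
    """Toy multi-task heads: one projection per teacher layer being distilled."""

    def __init__(self, hidden_size: int, num_targets: int = 3):
        super().__init__()
        self.heads = nn.ModuleList([nn.Linear(hidden_size, hidden_size) for _ in range(num_targets)])

    def forward(self, student_hidden):
        # Each head predicts the hidden states of one teacher layer from the shared student output.
        return [head(student_hidden) for head in self.heads]

def distillation_loss(predictions, teacher_layers):
    """Illustrative per-layer loss combining an L1 term and a cosine-similarity term."""
    total = torch.zeros(())
    for pred, target in zip(predictions, teacher_layers):
        total = total + F.l1_loss(pred, target)
        total = total + (1.0 - F.cosine_similarity(pred, target, dim=-1).mean())
    return total

# Example shapes only: batch of 2, 50 frames, hidden size 768.
student_hidden = torch.randn(2, 50, 768)
teacher_layers = [torch.randn(2, 50, 768) for _ in range(3)]

heads = LayerPredictionHeads(hidden_size=768)
loss = distillation_loss(heads(student_hidden), teacher_layers)
loss.backward()
print(float(loss))
```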
Guide: Running Locally
Pre-requisites:
- Ensure your speech input is sampled at 16kHz (see the resampling sketch below).
- Install relevant libraries such as PyTorch and transformers.
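If the source audio is at a different rate, it can be resampled first; a minimal sketch using torchaudio (the file path is a placeholder):

```python
import torchaudio
import torchaudio.functional as F

# Load an audio file (placeholder path) and resample it to the 16 kHz rate DistilHuBERT expects.
waveform, sample_rate = torchaudio.load("example.wav")
if sample_rate != 16000:
    waveform = F.resample(waveform, orig_freq=sample_rate, new_freq=16000)
    sample_rate = 16000
```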
Installation:
- Clone the repository from the DistilHuBERT GitHub.
- Set up a Python environment and install the dependencies listed in the repository.
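A quick way to confirm the environment is ready (exact versions will vary):

```python
# Sanity check: the core dependencies import and report their versions.
import torch
import transformers

print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
```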
Fine-tuning:
- Refer to the Hugging Face blog for detailed instructions on fine-tuning. Replace `Wav2Vec2ForCTC` with `HubertForCTC`.
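Following that recipe with the HuBERT classes, the CTC setup could look roughly like this (a sketch; the vocab.json file and special tokens come from the wav2vec2 fine-tuning blog's workflow and are assumptions here, not files shipped with DistilHuBERT):

```python
from transformers import AutoFeatureExtractor, HubertForCTC, Wav2Vec2CTCTokenizer, Wav2Vec2Processor

# Character-level CTC tokenizer built from a task-specific vocab.json (created as in the blog).
tokenizer = Wav2Vec2CTCTokenizer("vocab.json", unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|")
feature_extractor = AutoFeatureExtractor.from_pretrained("ntu-spml/distilhubert")
processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)

# Load DistilHuBERT with a freshly initialized CTC head sized to the new vocabulary.
model = HubertForCTC.from_pretrained(
    "ntu-spml/distilhubert",
    ctc_loss_reduction="mean",
    pad_token_id=processor.tokenizer.pad_token_id,
    vocab_size=len(processor.tokenizer),
)
```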
Running:
- Use available scripts to run the model on your datasets, such as `librispeech_asr`.
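For instance, a few librispeech_asr utterances can be streamed through the base model to obtain representations (a sketch; streaming is used here only to avoid downloading the full corpus):

```python
import torch
from datasets import load_dataset
from transformers import AutoFeatureExtractor, AutoModel

feature_extractor = AutoFeatureExtractor.from_pretrained("ntu-spml/distilhubert")
model = AutoModel.from_pretrained("ntu-spml/distilhubert")

# Stream a handful of validation utterances instead of downloading the whole corpus.
dataset = load_dataset("librispeech_asr", "clean", split="validation", streaming=True)

for i, sample in enumerate(dataset):
    audio = sample["audio"]
    inputs = feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"], return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state
    print(sample["id"], tuple(hidden.shape))
    if i == 2:
        break
```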
Cloud GPUs:
- Consider using cloud services like AWS, Google Cloud, or Azure for GPU access to handle intensive computation efficiently.
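On such an instance, move the model to the GPU when one is available (a minimal sketch):

```python
import torch
from transformers import AutoModel

# Use the GPU if the (cloud) machine exposes one, otherwise fall back to CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = AutoModel.from_pretrained("ntu-spml/distilhubert").to(device)
print("Running on:", device)
```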
License
DistilHuBERT is released under the Apache 2.0 License, allowing for broad usage and modification with proper attribution.