Wav2Vec2-Large-Robust
Introduction
Wav2Vec2-Large-Robust is a speech model developed by Facebook, pre-trained on speech audio sampled at 16kHz. It draws on datasets from several domains, including Libri-Light, CommonVoice, Switchboard, and Fisher, to improve robustness across different types of audio data. The model is designed for scenarios where the domain of the unlabeled pre-training data differs from that of the labeled fine-tuning data.
Architecture
The model follows a self-supervised learning approach, focusing on speech representations. It is built to handle diverse data domains, which improves its performance on unseen domains during testing. The model itself does not include a tokenizer as it is pre-trained on audio alone.
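Because the checkpoint ships without a tokenizer, inference input is just the raw 16kHz waveform, which a feature extractor normalizes and batches. A minimal sketch using the Transformers library (the extractor settings below mirror the checkpoint's usual defaults, but are assumptions here, and the noise signal merely stands in for real speech):

```python
import numpy as np
from transformers import Wav2Vec2FeatureExtractor

# The model consumes raw waveforms, not tokens; the feature extractor
# only normalizes and batches the audio. These settings are assumed to
# match the checkpoint's defaults (16 kHz, zero-mean/unit-variance).
feature_extractor = Wav2Vec2FeatureExtractor(
    feature_size=1,
    sampling_rate=16000,
    padding_value=0.0,
    do_normalize=True,
)

# One second of random noise standing in for real speech
speech = np.random.randn(16000).astype(np.float32)
inputs = feature_extractor(speech, sampling_rate=16000, return_tensors="np")
print(inputs["input_values"].shape)  # (1, 16000)
```

The batched `input_values` array can then be passed directly to the model's forward pass; no text preprocessing is involved until fine-tuning.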
Training
Wav2Vec2-Large-Robust is trained on unlabeled data from multiple speech datasets, which helps in generalizing across various domains. The training process involves pre-training on unlabeled in-domain data, significantly reducing the gap between models trained on in-domain and out-of-domain labeled data. Fine-tuning requires the creation of a tokenizer and labeled text data.
Guide: Running Locally
Prerequisites:
- Ensure your speech input is sampled at 16kHz.
- Set up a Python environment with the PyTorch and Transformers libraries installed.
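If your audio was recorded at a different rate, it must be resampled to 16kHz first. A minimal sketch using plain NumPy linear interpolation (`resample_linear` is a hypothetical helper; a proper polyphase resampler such as torchaudio's or librosa's is preferable in practice):

```python
import numpy as np

def resample_linear(audio: np.ndarray, orig_sr: int, target_sr: int = 16000) -> np.ndarray:
    """Naive linear-interpolation resampler (illustration only)."""
    duration = len(audio) / orig_sr
    n_out = int(round(duration * target_sr))
    old_t = np.arange(len(audio)) / orig_sr   # original sample times
    new_t = np.arange(n_out) / target_sr      # target sample times
    return np.interp(new_t, old_t, audio).astype(audio.dtype)

# One second of a 440 Hz tone at 44.1 kHz, resampled to the 16 kHz the model expects
sr_in = 44100
tone = np.sin(2 * np.pi * 440 * np.arange(sr_in) / sr_in).astype(np.float32)
resampled = resample_linear(tone, sr_in, 16000)
print(len(resampled))  # 16000
```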
Basic Steps:
- Clone the fairseq repository:
  git clone https://github.com/pytorch/fairseq.git
- Navigate to the wav2vec examples directory:
  cd fairseq/examples/wav2vec
- Follow the instructions for setting up Wav2Vec2 from the provided notebook.
Fine-tuning:
- Create a tokenizer and prepare labeled text data.
- Fine-tune the model as explained in the Hugging Face blog.
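The tokenizer-creation step can be sketched as follows. The toy vocabulary below is hypothetical; a real one is derived from the full set of characters appearing in your labeled transcripts, as described in the Hugging Face blog:

```python
import json
import os
import tempfile

from transformers import Wav2Vec2CTCTokenizer

# Hypothetical toy vocabulary: CTC special tokens plus a few characters.
# A real vocabulary covers every character in the labeled transcripts.
vocab = {
    "<pad>": 0, "<s>": 1, "</s>": 2, "<unk>": 3, "|": 4,
    "a": 5, "b": 6, "c": 7,
}

tmp_dir = tempfile.mkdtemp()
vocab_path = os.path.join(tmp_dir, "vocab.json")
with open(vocab_path, "w") as f:
    json.dump(vocab, f)

# "|" is the conventional word-delimiter token for Wav2Vec2 CTC models
tokenizer = Wav2Vec2CTCTokenizer(
    vocab_path,
    unk_token="<unk>",
    pad_token="<pad>",
    word_delimiter_token="|",
)
print(tokenizer("cab").input_ids)  # [7, 5, 6]
```

The resulting tokenizer maps transcript characters to label IDs for the CTC loss during fine-tuning.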
Suggested Cloud GPUs:
- AWS EC2 instances with NVIDIA GPUs.
- Google Cloud Platform’s AI Platform with NVIDIA Tesla GPUs.
- Azure Machine Learning with NVIDIA GPU VMs.
License
The Wav2Vec2-Large-Robust model is licensed under the Apache-2.0 License, allowing for wide use and modification in both academic and commercial settings.