hubert large ls960 ft
facebook
Introduction
HuBERT-Large-LS960-FT is Facebook's HuBERT Large model, fine-tuned for Automatic Speech Recognition (ASR). It builds on HuBERT, a self-supervised approach designed to address the challenges of speech representation learning. The model was fine-tuned on 960 hours of Librispeech audio sampled at 16 kHz, so speech passed to it should also be sampled at 16 kHz. HuBERT aims to improve ASR performance by learning acoustic and language models together from continuous audio input.
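Because the model works with 16 kHz audio, it can help to check a recording's sampling rate before transcribing it. The following is a minimal sketch using only Python's standard `wave` module. The helper names (`wav_sample_rate`, `needs_resampling`, `write_silence`) are hypothetical and exist only for this illustration; the sketch handles plain WAV files and does not resample anything itself.

```python
import os
import tempfile
import wave

EXPECTED_RATE = 16_000  # sampling rate (Hz) of the audio the model was fine-tuned on


def wav_sample_rate(path: str) -> int:
    """Return the sampling rate stored in a WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path: str) -> bool:
    """True if the file's sampling rate differs from the expected 16 kHz."""
    return wav_sample_rate(path) != EXPECTED_RATE


def write_silence(path: str, rate: int, seconds: float = 0.1) -> None:
    """Write a short mono 16-bit silent WAV file (used only for the demo)."""
    n_frames = int(rate * seconds)
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav_file.setframerate(rate)
        wav_file.writeframes(b"\x00\x00" * n_frames)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp_dir:
        for rate in (16_000, 44_100):
            demo_path = os.path.join(tmp_dir, f"demo_{rate}.wav")
            write_silence(demo_path, rate)
            print(rate, "needs resampling:", needs_resampling(demo_path))
```

A file flagged by `needs_resampling` would have to be converted to 16 kHz with an audio library of your choice before being passed to the processor.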
Architecture
HuBERT (Hidden-Unit BERT) uses an offline clustering step to produce target labels for a BERT-like prediction loss applied over masked regions of the input. The approach relies mainly on the consistency of the unsupervised clustering rather than on the quality of the individual cluster labels. Starting from a k-means teacher with 100 clusters and using two iterations of clustering, HuBERT matches or surpasses the performance of earlier models such as wav2vec 2.0.
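The idea of clustering frames into target labels and scoring predictions only on masked regions can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the authors' implementation: the centroids are assumed to be already fitted (two toy centroids here instead of 100), the frame features and logits are made up, and the function names are hypothetical.

```python
import math


def nearest_centroid(frame, centroids):
    """Index of the centroid closest to `frame` (squared Euclidean distance)."""
    def sq_dist(k):
        return sum((f - c) ** 2 for f, c in zip(frame, centroids[k]))
    return min(range(len(centroids)), key=sq_dist)


def cluster_targets(frames, centroids):
    """Offline step: assign every frame the ID of its nearest cluster."""
    return [nearest_centroid(frame, centroids) for frame in frames]


def masked_cross_entropy(logits, targets, masked_positions):
    """Mean cross-entropy between logits and cluster IDs, masked frames only."""
    total = 0.0
    for t in masked_positions:
        row = logits[t]
        peak = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_z - row[targets[t]]
    return total / len(masked_positions)


if __name__ == "__main__":
    # Toy 2-D "acoustic features" for four frames and two assumed centroids.
    frames = [[0.0, 0.1], [0.9, 1.0], [1.1, 0.9], [0.1, 0.0]]
    centroids = [[0.0, 0.0], [1.0, 1.0]]
    targets = cluster_targets(frames, centroids)

    # Hypothetical model outputs: one row of per-cluster logits per frame.
    logits = [[2.0, 0.0], [0.0, 2.0], [0.5, 0.5], [2.0, 0.0]]
    masked = [1, 2]  # frames 1 and 2 were hidden from the model

    print("targets:", targets)
    print("masked loss:", masked_cross_entropy(logits, targets, masked))
```

The cluster IDs play the role that word-piece IDs play in BERT: the model never sees the masked frames, yet it is scored on how well it predicts their labels.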
Training
This checkpoint was fine-tuned on the full 960 hours of Librispeech. More broadly, HuBERT has been evaluated with fine-tuning subsets ranging from 10 minutes to 960 hours of data. With a 1B-parameter model, HuBERT achieves relative reductions in Word Error Rate (WER) of up to 19% and 13% on the more challenging dev-other and test-other evaluation subsets, demonstrating its effectiveness in ASR tasks.
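To make the metric concrete, here is a minimal sketch of how WER and a relative WER reduction are typically computed. The function names are hypothetical, and the numbers in the demo are illustrative only; they are not results reported for HuBERT.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping only the previous row.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref_words)


def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which `new_wer` improves on `baseline_wer`."""
    return (baseline_wer - new_wer) / baseline_wer


if __name__ == "__main__":
    wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
    print(f"WER: {wer:.3f}")

    # Illustrative values only: a drop from 5.0 to 4.0 WER is a 20% relative reduction.
    print(f"relative reduction: {relative_reduction(5.0, 4.0):.0%}")
```

A relative reduction compares two error rates as a ratio, so a 19% relative reduction does not mean the WER fell by 19 absolute points.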
Guide: Running Locally
To run the model locally, follow these steps:
- Install Dependencies: Ensure that you have Python installed along with PyTorch and the Transformers library.
    pip install torch transformers datasets
- Load the Model and Processor: Use the following snippet to load the model and processor.
    import torch
    from transformers import Wav2Vec2Processor, HubertForCTC
    from datasets import load_dataset

    # The processor prepares raw audio and decodes predicted token IDs.
    processor = Wav2Vec2Processor.from_pretrained("facebook/hubert-large-ls960-ft")
    model = HubertForCTC.from_pretrained("facebook/hubert-large-ls960-ft")
- Prepare the Data: Load a dataset and convert one audio sample into model inputs.
    ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")
    # Declare the 16 kHz sampling rate explicitly so the processor can check it.
    input_values = processor(ds[0]["audio"]["array"], sampling_rate=16_000, return_tensors="pt").input_values
- Perform Inference: Run the model and decode the predicted token IDs into a transcription.
    # Gradients are not needed for inference.
    with torch.no_grad():
        logits = model(input_values).logits
    predicted_ids = torch.argmax(logits, dim=-1)
    transcription = processor.decode(predicted_ids[0])
- Hardware Considerations: For optimal performance, consider using cloud GPU services like AWS, Google Cloud, or Azure to handle the computation-intensive tasks.
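The inference step takes the argmax of the logits at every frame and then decodes the resulting IDs into text. Conceptually, greedy CTC decoding collapses repeated IDs, drops the blank token, and turns word delimiters into spaces. The sketch below illustrates that idea in plain Python; it is not the Transformers implementation, and the toy vocabulary, the `<pad>` blank token, and the `|` word delimiter are assumptions made for the example.

```python
from itertools import groupby

BLANK = "<pad>"       # assumed CTC blank token
WORD_DELIMITER = "|"  # assumed word-boundary token


def greedy_ctc_decode(frame_ids, id_to_token):
    """Turn per-frame argmax IDs into text with greedy CTC rules."""
    # 1. Collapse runs of identical IDs into a single ID.
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    # 2. Map IDs to tokens and drop blanks.
    tokens = [id_to_token[token_id] for token_id in collapsed]
    characters = [token for token in tokens if token != BLANK]
    # 3. Replace word delimiters with spaces and tidy the whitespace.
    text = "".join(characters).replace(WORD_DELIMITER, " ")
    return " ".join(text.split())


if __name__ == "__main__":
    # Hypothetical six-token vocabulary for the demo.
    vocab = {0: BLANK, 1: WORD_DELIMITER, 2: "H", 3: "E", 4: "L", 5: "O"}
    # The blank between the two runs of 4 keeps the double "L" from collapsing.
    frame_ids = [2, 2, 0, 3, 4, 4, 0, 4, 5, 5, 1]
    print(greedy_ctc_decode(frame_ids, vocab))
```

The blank token is what lets CTC output repeated letters: without a blank between them, two adjacent runs of the same ID would merge into one character.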
License
The HuBERT-Large-LS960-FT model is licensed under the Apache-2.0 License, allowing users to freely use, modify, and distribute the model within the terms of this license.