hubert xlarge ls960 ft
facebook
Introduction
The `hubert-xlarge-ls960-ft` model is an automatic speech recognition (ASR) model developed by Facebook AI. It is based on the HuBERT (Hidden-Unit BERT) architecture and is fine-tuned on 960 hours of LibriSpeech data. HuBERT is designed to address challenges specific to learning from raw speech: each utterance contains multiple sound units, and those units have variable lengths with no explicit segmentation.
Architecture
HuBERT is trained with self-supervised learning: an offline clustering step provides target labels for a BERT-like prediction loss. Because that loss is applied over the masked regions, the model has to learn a combined acoustic and language model over the continuous speech input. According to the HuBERT authors, this approach matches or improves upon the state-of-the-art wav2vec 2.0 results on the LibriSpeech and Libri-light benchmarks.
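To make the idea of a prediction loss over masked regions concrete, here is a minimal pure-Python sketch. It is not the model's actual training code: the function name `masked_prediction_loss` and the list-based inputs are assumptions chosen for illustration, and the real model computes its frame scores inside a Transformer rather than receiving them as plain lists.

```python
import math


def masked_prediction_loss(logits, targets, mask):
    """Mean cross-entropy over masked frames only (illustrative sketch).

    logits:  list of per-frame score lists, one score per cluster id
    targets: list of per-frame cluster ids from the offline clustering step
    mask:    list of booleans, True where the input frame was masked
    """
    total, count = 0.0, 0
    for frame_logits, target, is_masked in zip(logits, targets, mask):
        if not is_masked:
            continue  # unmasked frames contribute nothing to this loss
        # Numerically stable log of the softmax normalizer.
        peak = max(frame_logits)
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in frame_logits))
        total += log_norm - frame_logits[target]  # negative log-likelihood
        count += 1
    return total / count if count else 0.0


# Hypothetical toy input: three frames, three cluster ids, last two frames masked.
example_logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 3.0, 0.0]]
example_targets = [0, 1, 1]
example_mask = [False, True, True]
print(masked_prediction_loss(example_logits, example_targets, example_mask))
```

Only the masked frames enter the average, so the model cannot lower the loss by copying what it can already see; it has to infer the hidden units from the surrounding context.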
Training
Training has two stages. During self-supervised pre-training, k-means clustering generates the unsupervised target labels, and the clustering is refined over successive iterations, which yields significant improvements in ASR performance. The pre-trained model is then fine-tuned on the LibriSpeech dataset, with quality measured by Word Error Rate (WER) across the LibriSpeech subsets.
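The k-means target generation can be sketched as follows. This is a small pure-Python illustration under stated assumptions, not the procedure used to train the model: the helper names (`squared_distance`, `nearest`, `kmeans_labels`), the random initialization, and the fixed iteration count are all hypothetical, and real frame features would be much higher-dimensional than the toy vectors shown.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(frame, centroids):
    """Index of the centroid closest to the given frame."""
    return min(range(len(centroids)),
               key=lambda k: squared_distance(frame, centroids[k]))


def kmeans_labels(frames, num_clusters, num_iters=10, seed=0):
    """Cluster frame vectors and return (labels, centroids).

    The labels play the role of the discrete prediction targets.
    """
    rng = random.Random(seed)
    # Initialize centroids from randomly chosen frames (an assumption).
    centroids = [list(f) for f in rng.sample(frames, num_clusters)]
    for _ in range(num_iters):
        labels = [nearest(f, centroids) for f in frames]
        for k in range(num_clusters):
            members = [f for f, lab in zip(frames, labels) if lab == k]
            if members:  # keep the old centroid if a cluster went empty
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    labels = [nearest(f, centroids) for f in frames]
    return labels, centroids


# Hypothetical toy "frames": two well-separated groups of 2-D vectors.
toy_frames = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
toy_labels, toy_centroids = kmeans_labels(toy_frames, num_clusters=2)
print(toy_labels)
```

Each frame ends up with a cluster id, and those ids serve as targets for the prediction loss. Re-running the clustering on better features is what makes the procedure iterative.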
Guide: Running Locally
To use the `hubert-xlarge-ls960-ft` model for ASR tasks, follow these steps:
- **Install Dependencies**: Ensure you have the `torch`, `transformers`, and `datasets` libraries installed.

      pip install torch transformers datasets
- **Load the Model**: Use the Hugging Face `transformers` library to load the model and processor.

      import torch
      from transformers import Wav2Vec2Processor, HubertForCTC
      from datasets import load_dataset

      processor = Wav2Vec2Processor.from_pretrained("facebook/hubert-xlarge-ls960-ft")
      model = HubertForCTC.from_pretrained("facebook/hubert-xlarge-ls960-ft")
- **Prepare Data