facebook/hubert-xlarge-ls960-ft

Introduction

The hubert-xlarge-ls960-ft model is an automatic speech recognition (ASR) model from Facebook AI. It is based on the HuBERT (Hidden-Unit BERT) architecture, pre-trained with self-supervision and then fine-tuned on 960 hours of LibriSpeech audio. HuBERT was designed around three difficulties of self-supervised speech representation learning: each utterance contains multiple sound units, there is no lexicon of input sound units during pre-training, and sound units have variable lengths with no explicit segmentation.

Architecture

HuBERT uses a self-supervised learning approach in which an offline clustering step provides target labels for a BERT-like masked prediction loss. Because the prediction loss is applied only over masked regions, the model must learn a combined acoustic and language model over the continuous speech input. With this approach, HuBERT matches or improves upon the state-of-the-art wav2vec 2.0 results on the LibriSpeech (960h) and Libri-light (60,000h) benchmarks.
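
To make the objective concrete, here is an illustrative PyTorch sketch of a BERT-like masked prediction loss over clustered targets. The shapes, the masking rate, the 100-cluster count, and the linear prediction head are assumptions for the example, not HuBERT's actual implementation:

    import torch
    import torch.nn.functional as F

    # Illustrative sizes: 2 utterances, 100 frames, 768-dim hidden states,
    # 100 k-means clusters (values chosen for the example only).
    batch, frames, dim, n_clusters = 2, 100, 768, 100

    hidden_states = torch.randn(batch, frames, dim)              # transformer outputs
    cluster_ids = torch.randint(0, n_clusters, (batch, frames))  # offline clustering targets
    mask = torch.rand(batch, frames) < 0.5                       # frames whose input was masked

    # Predict a cluster ID for every frame, but score only the masked positions,
    # forcing the model to infer hidden units from the surrounding context.
    head = torch.nn.Linear(dim, n_clusters)
    logits = head(hidden_states)                                 # (batch, frames, n_clusters)
    loss = F.cross_entropy(logits[mask], cluster_ids[mask])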

Training

Pre-training targets are produced by unsupervised k-means clustering: the first iteration clusters MFCC features, and later iterations re-cluster the model's own intermediate representations, which progressively sharpens the target labels. The pre-trained model is then fine-tuned on the 960-hour labeled LibriSpeech dataset with a CTC objective, yielding significant Word Error Rate (WER) improvements across the LibriSpeech evaluation subsets.
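
As an illustration of the clustering step, the sketch below uses scikit-learn's KMeans over MFCC features to assign a pseudo-label to every frame. The file path, the 13 MFCC coefficients, and the 100-cluster count are assumptions for the example; the real pipeline re-clusters learned representations in later iterations:

    import librosa
    from sklearn.cluster import KMeans

    # Load audio at 16 kHz; "sample.wav" is a placeholder path.
    waveform, sr = librosa.load("sample.wav", sr=16000)

    # Frame-level MFCC features, as in a first clustering iteration.
    mfcc = librosa.feature.mfcc(y=waveform, sr=sr, n_mfcc=13)
    features = mfcc.T  # shape: (frames, 13)

    # Cluster frames into 100 discrete units; the cluster IDs become the
    # targets for the masked prediction loss during pre-training.
    kmeans = KMeans(n_clusters=100, n_init=10, random_state=0).fit(features)
    pseudo_labels = kmeans.labels_  # one target label per frame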

Guide: Running Locally

To use the hubert-xlarge-ls960-ft model for ASR tasks, follow these steps:

  1. Install Dependencies: Ensure you have the torch, transformers, and datasets libraries installed.

    pip install torch transformers datasets
    
  2. Load the Model: Use the Hugging Face transformers library to load the model and processor.

    import torch
    from transformers import Wav2Vec2Processor, HubertForCTC
    from datasets import load_dataset
    
    processor = Wav2Vec2Processor.from_pretrained("facebook/hubert-xlarge-ls960-ft")
    model = HubertForCTC.from_pretrained("facebook/hubert-xlarge-ls960-ft")
    
  3. Prepare Data and Transcribe: Load a 16 kHz audio sample, extract input values with the processor, and decode the model's CTC output, as in the sketch below.
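
A minimal inference sketch, reusing the processor and model loaded in step 2; the dummy LibriSpeech split below is just illustrative test data:

    # Load a small LibriSpeech sample (16 kHz audio) for demonstration.
    ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")

    # Extract input values; the model expects 16 kHz mono audio.
    input_values = processor(
        ds[0]["audio"]["array"], sampling_rate=16000, return_tensors="pt"
    ).input_values

    # Forward pass and greedy CTC decoding.
    with torch.no_grad():
        logits = model(input_values).logits
    predicted_ids = torch.argmax(logits, dim=-1)
    transcription = processor.decode(predicted_ids[0])
    print(transcription)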
