hubert large ls960 ft

facebook

Introduction

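HuBERT-Large-LS960-FT is Facebook's HuBERT Large model fine-tuned for Automatic Speech Recognition (ASR). HuBERT is a self-supervised approach to speech representation learning that addresses three challenges: each utterance contains multiple sound units, no lexicon of sound units is available during pre-training, and sound units have variable lengths with no explicit segmentation. This checkpoint was fine-tuned on 960 hours of Librispeech audio sampled at 16 kHz, so speech passed to the model must also be sampled at 16 kHz. Because the model predicts targets for masked regions of continuous audio, it learns a combined acoustic and language model.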
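Since the model expects 16 kHz audio, it can help to check a file's sample rate before transcription. The following is a minimal sketch using only Python's standard `wave` module; the helper names `wav_sample_rate` and `is_model_ready` are hypothetical and not part of the model card, and the sketch only handles WAV files.

```python
import wave

EXPECTED_RATE = 16_000  # 16 kHz, the rate stated for this model


def wav_sample_rate(path):
    """Return the sample rate in Hz stored in a WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def is_model_ready(path, expected_rate=EXPECTED_RATE):
    """Return True if the WAV file is already sampled at the expected rate."""
    return wav_sample_rate(path) == expected_rate
```

A file that fails this check would need to be resampled to 16 kHz with an audio library before being passed to the processor.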

Architecture

HuBERT (Hidden-Unit BERT) uses an offline clustering step to produce target labels for a BERT-like prediction loss. The loss is applied over masked input regions, so the model has to infer the hidden units from the surrounding unmasked audio. The approach relies on the consistency of the unsupervised clustering step. Starting from a simple k-means teacher with 100 clusters and using two iterations of clustering, HuBERT matches or surpasses previous models such as wav2vec 2.0.
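The idea of clustering frames offline and then scoring predictions only on masked frames can be illustrated with a toy sketch. This is not the HuBERT training code: the 1-D features stand in for real acoustic features, `kmeans_1d` and `masked_cross_entropy` are hypothetical helpers, and two clusters are used instead of 100 to keep the example small.

```python
import math
import random


def kmeans_1d(values, k, iterations=10, seed=0):
    """Tiny 1-D k-means. Returns (centroids, labels), one label per value."""
    rng = random.Random(seed)
    centroids = rng.sample(values, k)

    def assign():
        # Label each value with the index of its nearest centroid.
        return [min(range(k), key=lambda c: abs(v - centroids[c])) for v in values]

    for _ in range(iterations):
        labels = assign()
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = sum(members) / len(members)
    return centroids, assign()


def masked_cross_entropy(logits, targets, mask):
    """Mean cross-entropy over masked frames only; unmasked frames are ignored."""
    losses = []
    for frame_logits, target, is_masked in zip(logits, targets, mask):
        if not is_masked:
            continue
        log_norm = math.log(sum(math.exp(x) for x in frame_logits))
        losses.append(log_norm - frame_logits[target])
    return sum(losses) / len(losses) if losses else 0.0


# Toy "frames": two well-separated groups of 1-D features.
features = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
_, targets = kmeans_1d(features, k=2)          # offline clustering gives targets
mask = [False, True, True, True, False, False]  # one masked span of frames
uniform_logits = [[0.0, 0.0] for _ in features]  # an untrained "model"
loss = masked_cross_entropy(uniform_logits, targets, mask)
```

With uniform logits over two clusters the loss equals `log(2)`, the value expected from a model that has learned nothing yet.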

Training

This checkpoint was fine-tuned on the full 960 hours of Librispeech. The HuBERT paper also evaluates fine-tuning on smaller subsets, from 10 minutes up to 960 hours of labeled data. With a 1B-parameter model, HuBERT achieves up to 19% and 13% relative reductions in Word Error Rate (WER) on the more challenging dev-other and test-other evaluation subsets.
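For readers unfamiliar with the metric, the sketch below shows one common way to compute WER (word-level edit distance divided by the reference length) and a relative reduction between two WER values. The function names are hypothetical, and the numbers in the usage lines are made up for illustration; they are not results from the paper.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Word-level Levenshtein distance, keeping one row at a time.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


def relative_reduction(baseline_wer, new_wer):
    """Fraction by which new_wer improves on baseline_wer."""
    return (baseline_wer - new_wer) / baseline_wer


# Illustrative values only: one substitution and one deletion in six words.
example_wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
example_gain = relative_reduction(10.0, 8.1)  # a 19% relative reduction
```

A relative reduction compares two error rates, so a drop from 10.0 to 8.1 WER is a 19% relative reduction even though the absolute change is 1.9 points.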

Guide: Running Locally

To run the model locally, follow these steps:

  1. Install Dependencies:

    • Ensure that you have Python installed along with PyTorch, Transformers, and Datasets (the last is used to load the sample audio).
    pip install torch transformers datasets
    
  2. Load the Model and Processor:

    • Use the provided code snippet to load the model and processor.
    import torch
    from transformers import Wav2Vec2Processor, HubertForCTC
    from datasets import load_dataset
    
    processor = Wav2Vec2Processor.from_pretrained("facebook/hubert-large-ls960-ft")
    model = HubertForCTC.from_pretrained("facebook/hubert-large-ls960-ft")
    
  3. Prepare the Data:

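    • Load a dataset and convert one audio sample into model input.
    ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")
    audio = ds[0]["audio"]  # dict holding the waveform ("array") and its "sampling_rate"
    input_values = processor(audio["array"], sampling_rate=audio["sampling_rate"], return_tensors="pt").input_values  # batch size 1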
    
  4. Perform Inference:

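    • Run the model, take the most likely token for each frame, and decode the IDs to text.
    with torch.no_grad():  # inference only, so skip gradient tracking
        logits = model(input_values).logits
    predicted_ids = torch.argmax(logits, dim=-1)
    transcription = processor.decode(predicted_ids[0])
    print(transcription)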
    
  5. Hardware Considerations:

    • For optimal performance, consider using cloud GPU services like AWS, Google Cloud, or Azure to handle the computation-intensive tasks.
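Step 4 takes the argmax over the logits for each frame and passes the resulting IDs to `processor.decode`. For a CTC model, greedy decoding amounts to collapsing consecutive repeated IDs and then dropping the blank token. The sketch below illustrates that rule in plain Python; `ctc_greedy_decode` and the toy vocabulary are hypothetical and only loosely mirror the real tokenizer, assuming a blank at ID 0 and `|` as the word delimiter.

```python
def ctc_greedy_decode(frame_ids, id_to_token, blank_id=0, word_delimiter="|"):
    """Collapse repeated frame IDs, drop blanks, and join tokens into text."""
    collapsed = []
    previous = None
    for token_id in frame_ids:
        if token_id != previous:  # keep only the first ID of each run
            collapsed.append(token_id)
        previous = token_id
    chars = [id_to_token[i] for i in collapsed if i != blank_id]
    return "".join(chars).replace(word_delimiter, " ").strip()


# Toy vocabulary (an assumption for illustration, not the model's real one).
toy_vocab = {0: "<pad>", 1: "|", 2: "H", 3: "I", 4: "L", 5: "O", 6: "E"}
# Per-frame argmax IDs: H H _ E L L _ L O O | H I
frame_ids = [2, 2, 0, 6, 4, 4, 0, 4, 5, 5, 1, 2, 3]
text = ctc_greedy_decode(frame_ids, toy_vocab)  # "HELLO HI"
```

The blank between the two runs of `L` is what lets the decoder emit a double letter; without it, the repeated IDs would collapse into a single `L`.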

License

The HuBERT-Large-LS960-FT model is licensed under the Apache-2.0 License, allowing users to freely use, modify, and distribute the model within the terms of this license.
