Wav2Vec2-Large-960H-LV60-SELF

Introduction

Wav2Vec2-Large-960H-LV60-SELF is an Automatic Speech Recognition (ASR) model developed by Facebook AI. It transcribes speech directly from raw audio and achieves state-of-the-art accuracy while requiring comparatively little labeled data. The model was trained with a self-training approach and is suited to large-scale speech recognition tasks.

Architecture

The model is based on the wav2vec 2.0 architecture, which operates on raw audio sampled at 16 kHz. It is first pretrained on unlabeled audio (Libri-Light) and then fine-tuned for transcription on labeled speech (LibriSpeech). Pretraining uses a contrastive learning objective over quantized latent speech representations, which are learned jointly with the model.
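
To get a concrete sense of the network's size, the configuration shipped with the checkpoint can be inspected. The sketch below uses the Transformers library's Wav2Vec2Config; the printed values should reflect the "large" variant (24 Transformer layers with 1024 hidden units).

    from transformers import Wav2Vec2Config
    config = Wav2Vec2Config.from_pretrained("facebook/wav2vec2-large-960h-lv60-self")
    # Strides of the convolutional feature encoder, which downsamples
    # raw 16 kHz audio before it reaches the Transformer.
    print(config.conv_stride)
    # Depth and width of the Transformer encoder.
    print(config.num_hidden_layers, config.hidden_size)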

Training

The model was pretrained on the large unlabeled Libri-Light corpus (the "LV60" in its name) and fine-tuned on the 960 hours of labeled LibriSpeech audio; the "SELF" suffix refers to self-training, in which the model is additionally trained on its own pseudo-labels for unlabeled audio. It achieves Word Error Rates (WER) of 1.9% on LibriSpeech's "clean" test set and 3.9% on its "other" test set. These results show that the approach performs well with limited labeled data, which makes it attractive for low-resource languages and domains.

Guide: Running Locally

To run the Wav2Vec2 model locally, follow these steps:

  1. Install Dependencies: Ensure you have PyTorch, Transformers, and Datasets installed.

    pip install torch transformers datasets
    
  2. Load Model & Processor:

    from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
    # The processor bundles the feature extractor (raw audio -> model inputs)
    # and the CTC tokenizer (predicted ids -> text).
    processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-large-960h-lv60-self")
    model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-large-960h-lv60-self")
    
  3. Load Data: Use a dataset like LibriSpeech for evaluation.

    from datasets import load_dataset
    # A small dummy subset of LibriSpeech, convenient for a quick local test.
    ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")
    
  4. Process Audio Input: Tokenize and prepare the audio input.

    # Passing sampling_rate makes the model's 16 kHz input assumption explicit.
    input_values = processor(ds[0]["audio"]["array"], sampling_rate=16000, return_tensors="pt", padding="longest").input_values
    
  5. Transcribe: Perform inference using the model.

    import torch
    # Inference only, so gradient tracking is disabled.
    with torch.no_grad():
        logits = model(input_values).logits
    # Greedy CTC decoding: most likely token per frame; batch_decode
    # collapses repeats and strips blanks.
    predicted_ids = torch.argmax(logits, dim=-1)
    transcription = processor.batch_decode(predicted_ids)
    
  6. Evaluation: Evaluate the model using Word Error Rate (WER) with a library like jiwer, as in the sketch below.
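
     The following is a minimal sketch, assuming jiwer is installed (pip install jiwer); it reuses ds and transcription from the previous steps. LibriSpeech reference transcripts are uppercase, matching this model's output vocabulary.

    from jiwer import wer
    # Compare the model's transcription against the reference transcript.
    reference = ds[0]["text"]
    hypothesis = transcription[0]
    print(f"WER: {wer(reference, hypothesis):.2%}")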

For optimal performance, it is recommended to use a GPU, either locally or through a cloud service such as AWS, Google Cloud, or Azure, as the large model is computationally intensive.
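
If a CUDA-capable GPU is available, moving the model and inputs onto it follows the standard PyTorch pattern; this sketch is generic and not specific to this checkpoint:

    import torch
    device = "cuda" if torch.cuda.is_available() else "cpu"
    # Move the model weights and input tensors to the same device before inference.
    model = model.to(device)
    input_values = input_values.to(device)
    with torch.no_grad():
        logits = model(input_values).logits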

License

The Wav2Vec2-Large-960H-LV60-SELF model is released under the Apache 2.0 License, which permits both personal and commercial use, modification, and redistribution, provided the license text and notices are retained.
