data2vec-audio-base-960h

Introduction

The data2vec-audio-base-960h model by Facebook AI is a self-supervised speech model for automatic speech recognition (ASR). It is part of the data2vec framework, which applies a unified self-supervised learning approach across the speech, vision, and language modalities. This checkpoint is pre-trained and fine-tuned on 960 hours of 16 kHz-sampled LibriSpeech audio.

Architecture

data2vec uses a standard Transformer architecture to predict latent representations of the input data from a masked view of that input, in a self-distillation setup: a student network regresses targets produced by a teacher network from the unmasked input. This contrasts with traditional self-supervised methods that predict modality-specific targets such as discrete speech units, visual tokens, or words. Because the targets are contextualized latent representations, they encode information from the entire input rather than from isolated positions.
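
The following sketch illustrates this objective. It is a minimal toy example, not the actual data2vec code: the module shapes, masking scheme, and names are hypothetical, and the real model regresses an average of the top-K teacher layer outputs rather than only the final layer.

    import torch
    import torch.nn as nn

    # Toy student/teacher encoders standing in for the real Transformer stacks.
    def make_encoder():
        return nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=768, nhead=8, batch_first=True),
            num_layers=4)

    student, teacher = make_encoder(), make_encoder()
    teacher.load_state_dict(student.state_dict())  # teacher starts as a copy of the student
    for p in teacher.parameters():
        p.requires_grad = False                    # the teacher receives no gradients

    mask_emb = nn.Parameter(torch.randn(768))      # learned embedding for masked positions

    x = torch.randn(1, 100, 768)                   # latent speech features (batch, time, dim)
    mask = torch.rand(1, 100) < 0.5                # simplified span mask

    x_masked = x.clone()
    x_masked[mask] = mask_emb                      # student sees the masked view

    with torch.no_grad():
        targets = teacher(x)                       # teacher targets from the full, unmasked input

    pred = student(x_masked)
    loss = nn.functional.smooth_l1_loss(pred[mask], targets[mask])  # regress masked positions only
    loss.backward()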

Training

data2vec is trained with the same self-supervised objective across modalities: a student network predicts, from a masked view of the input, latent representations that a teacher network produces from the full, unmasked input. The teacher's weights are an exponential moving average of the student's, so the training targets are contextualized latent representations rather than fixed, modality-specific labels. The paper reports state-of-the-art or competitive results on benchmarks for speech recognition, image classification, and natural language understanding.
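
A sketch of that teacher update, under the same toy assumptions as above (data2vec anneals the decay rate tau over training; the fixed value here is illustrative):

    import torch

    @torch.no_grad()
    def ema_update(teacher, student, tau=0.999):
        # Teacher weights track the student as an exponential moving average;
        # called once after every optimizer step: ema_update(teacher, student).
        for t_param, s_param in zip(teacher.parameters(), student.parameters()):
            t_param.mul_(tau).add_(s_param, alpha=1.0 - tau)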

Guide: Running Locally

To run the data2vec-audio-base-960h model locally, follow these steps:

  1. Install Dependencies: Ensure you have the transformers, datasets, and torch libraries installed (soundfile is needed by datasets to decode the LibriSpeech audio files):

    pip install transformers datasets torch soundfile
    
  2. Load Model and Processor:

    from transformers import Wav2Vec2Processor, Data2VecAudioForCTC

    # the Wav2Vec2 processor handles feature extraction and CTC tokenization
    processor = Wav2Vec2Processor.from_pretrained("facebook/data2vec-audio-base-960h")
    model = Data2VecAudioForCTC.from_pretrained("facebook/data2vec-audio-base-960h")
    
  3. Load Dataset:

    from datasets import load_dataset

    # validation split of the "clean" LibriSpeech config (16kHz audio);
    # for a quick smoke test, "patrickvonplaten/librispeech_asr_dummy" is a tiny alternative
    ds = load_dataset("librispeech_asr", "clean", split="validation")
    
  4. Process and Transcribe Audio:

    import torch

    # the model expects 16kHz single-channel audio
    input_values = processor(ds[0]["audio"]["array"], sampling_rate=16_000, return_tensors="pt", padding="longest").input_values

    # retrieve logits and decode the argmax token ids to text
    with torch.no_grad():
        logits = model(input_values).logits
    predicted_ids = torch.argmax(logits, dim=-1)
    transcription = processor.batch_decode(predicted_ids)
    
  5. Evaluate Model Performance: Evaluate the model using the Word Error Rate (WER) metric on the LibriSpeech dataset (a minimal sketch follows this list).
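
One way to compute WER is with the Hugging Face evaluate library (not covered by the install step above; it also needs jiwer). The sketch below reuses the processor, model, and ds objects from the previous steps:

    import torch
    import evaluate  # pip install evaluate jiwer

    wer = evaluate.load("wer")

    def transcribe(batch):
        inputs = processor(batch["audio"]["array"], sampling_rate=16_000,
                           return_tensors="pt", padding="longest")
        with torch.no_grad():
            logits = model(inputs.input_values).logits
        batch["prediction"] = processor.batch_decode(torch.argmax(logits, dim=-1))[0]
        return batch

    results = ds.map(transcribe)
    print("WER:", wer.compute(predictions=results["prediction"], references=results["text"]))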

For faster inference on large amounts of audio, run the model on a GPU, either locally or through a cloud provider such as AWS, GCP, or Azure.

License

This model is licensed under the Apache 2.0 License, which allows for both personal and commercial use, with proper attribution.
