data2vec audio base 960h
facebook
Introduction
The data2vec-audio-base-960h model by Facebook AI is a speech model for automatic speech recognition (ASR). It is part of the Data2Vec framework, which applies a single self-supervised learning approach across speech, vision, and language modalities. The model was pre-trained and fine-tuned on 960 hours of Librispeech audio sampled at 16 kHz, so speech input should also be sampled at 16 kHz.
Architecture
Data2Vec uses a standard Transformer architecture to predict latent representations of the full input from a masked view of that input, in a self-distillation setup. This contrasts with traditional methods that predict modality-specific targets such as words, visual tokens, or units of human speech. Because the targets are contextualized latent representations, each one carries information from the entire input rather than from a local region.
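The self-distillation idea can be sketched with toy numbers. The snippet below is a minimal illustration, not the Data2Vec implementation: it assumes, following the Data2Vec paper, that the teacher's weights are an exponential moving average (EMA) of the student's weights, and it uses a plain squared error at masked positions as a simplification. The function names, the decay value tau, and the list-based "weights" and "outputs" are all hypothetical stand-ins for real network tensors.

```python
def ema_update(teacher_weights, student_weights, tau=0.999):
    """Move each teacher weight a small step toward the matching student weight.

    tau is a hypothetical decay value; larger values make the teacher change more slowly.
    """
    return [tau * t + (1.0 - tau) * s for t, s in zip(teacher_weights, student_weights)]


def masked_regression_loss(student_outputs, teacher_targets, mask):
    """Mean squared error between student and teacher vectors at masked time steps only.

    student_outputs: per-time-step vectors produced from the masked input.
    teacher_targets: per-time-step vectors produced from the full, unmasked input.
    mask: one boolean per time step, True where the student's input was masked.
    """
    errors = [
        (s - t) ** 2
        for s_vec, t_vec, is_masked in zip(student_outputs, teacher_targets, mask)
        if is_masked
        for s, t in zip(s_vec, t_vec)
    ]
    if not errors:
        raise ValueError("at least one time step must be masked")
    return sum(errors) / len(errors)


# Toy example: two time steps, two features each; only the first step is masked.
student_out = [[1.0, 1.0], [2.0, 2.0]]
teacher_out = [[0.0, 0.0], [2.0, 0.0]]
loss = masked_regression_loss(student_out, teacher_out, mask=[True, False])

# After each student optimization step, the teacher would be refreshed like this.
new_teacher = ema_update(teacher_weights=[1.0], student_weights=[0.0], tau=0.9)
```

Only the masked time step contributes to the loss, which is what pushes the student to infer the missing content from context.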
Training
The Data2Vec framework is trained with the same self-supervised method in every modality. Training focuses on predicting latent representations rather than modality-specific targets, drawing on the contextual information available in the full input. The method demonstrated state-of-the-art or competitive performance on benchmarks for speech recognition, image classification, and natural language understanding.
Guide: Running Locally
To run the data2vec-audio-base-960h model locally, follow these steps:
- Install Dependencies: Ensure you have the transformers, datasets, and torch libraries installed:
    pip install transformers datasets torch
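- Load Model and Processor:
    from transformers import Wav2Vec2Processor, Data2VecAudioForCTC

    # The processor bundles the feature extractor and the CTC tokenizer
    processor = Wav2Vec2Processor.from_pretrained("facebook/data2vec-audio-base-960h")
    # In transformers, the CTC head for this checkpoint is Data2VecAudioForCTC
    model = Data2VecAudioForCTC.from_pretrained("facebook/data2vec-audio-base-960h")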
- Load Dataset:
    from datasets import load_dataset

    ds = load_dataset("librispeech_asr", "clean", split="validation")
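- Process and Transcribe Audio:
    import torch

    # Convert the first utterance's waveform into model inputs (the model expects 16 kHz audio)
    audio = ds[0]["audio"]
    input_values = processor(audio["array"], sampling_rate=audio["sampling_rate"], return_tensors="pt", padding="longest").input_values

    # Forward pass without gradient tracking, then greedy (arg-max) CTC decoding
    with torch.no_grad():
        logits = model(input_values).logits
    predicted_ids = torch.argmax(logits, dim=-1)
    transcription = processor.batch_decode(predicted_ids)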
- Evaluate Model Performance: Evaluate the model using the Word Error Rate (WER) metric on the Librispeech dataset.
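The evaluation step can be made concrete with a small WER function. This is a minimal pure-Python sketch of the usual definition (word-level edit distance divided by the number of reference words), not the evaluation script used for the model; the function name word_error_rate is hypothetical, and in practice a dedicated metrics library would normally be used instead.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1        # reference word missing from the hypothesis
            insertion = curr[j - 1] + 1   # extra word in the hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One substitution (B -> X) and one deletion (D) against four reference words
example_wer = word_error_rate("A B C D", "A X C")
```

To score a whole split, the transcription from each utterance would be compared against that utterance's reference text. Corpus-level WER is usually computed by summing the edit counts and the reference word counts over all utterances before dividing, rather than by averaging per-utterance rates.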
Inference works on a CPU, but a GPU speeds it up considerably, especially when transcribing a full dataset split. If no local GPU is available, a cloud GPU service such as AWS, GCP, or Azure is an option.
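The transcription step in the guide takes the arg-max token per audio frame and then decodes the result. What that decoding does for a CTC model can be sketched in a few lines: consecutive repeated tokens are merged, then blank tokens are dropped. The snippet below is an illustration only, not the tokenizer's actual code; the function name, the blank ID of 0, and the toy vocabulary are assumptions made for the example.

```python
from itertools import groupby


def ctc_greedy_collapse(frame_ids, blank_id=0):
    """Collapse per-frame arg-max IDs into a token sequence.

    blank_id=0 is an assumption for this toy example; the real ID depends on the tokenizer.
    """
    # Step 1: merge runs of identical consecutive IDs into a single ID
    merged = [token for token, _ in groupby(frame_ids)]
    # Step 2: remove the blank token, which only marks "no new character here"
    return [token for token in merged if token != blank_id]


# Hypothetical toy vocabulary, not the model's real one
id_to_char = {1: "H", 2: "E", 3: "L", 4: "O"}

# A blank between the two runs of 3 keeps the double "L" from being merged away
frames = [0, 1, 1, 0, 2, 3, 3, 0, 3, 4, 4, 0]
text = "".join(id_to_char[i] for i in ctc_greedy_collapse(frames))
```

This also shows why the model emits many more frames than characters: most frames are blanks or repeats that the collapse removes.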
License
This model is licensed under the Apache 2.0 License, which allows for both personal and commercial use, with proper attribution.