facebook/wav2vec2-large-960h-lv60-self
Introduction
Wav2Vec2-Large-960h-LV60-Self is an Automatic Speech Recognition (ASR) model developed by Facebook AI. It transcribes speech directly from raw audio and delivers state-of-the-art accuracy while requiring comparatively little labeled data. The model was trained with a self-training objective and is well suited to large-scale speech recognition tasks.
Architecture
The model is based on the Wav2Vec 2.0 architecture, which processes raw audio sampled at 16 kHz. It is pretrained on unlabeled speech and then fine-tuned on labeled data, drawing on the Libri-Light and LibriSpeech datasets. During pretraining, it optimizes a contrastive objective over quantized latent representations that are learned jointly with the rest of the network.
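For reference, the contrastive objective from the wav2vec 2.0 paper takes roughly the following form, where $c_t$ is the context-network output at masked time step $t$, $q_t$ is the true quantized latent, $Q_t$ is a set containing $q_t$ and $K$ distractors, $\mathrm{sim}$ denotes cosine similarity, and $\kappa$ is a temperature:

$$
\mathcal{L}_m = -\log \frac{\exp(\mathrm{sim}(c_t, q_t)/\kappa)}{\sum_{\tilde{q} \in Q_t} \exp(\mathrm{sim}(c_t, \tilde{q})/\kappa)}
$$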
Training
The model was pretrained and fine-tuned on 960 hours of Libri-Light and LibriSpeech audio. It achieves Word Error Rates (WER) of 1.9% on the LibriSpeech "clean" test set and 3.9% on the "other" test set. This level of accuracy with limited labeled data makes the approach attractive for low-resource languages and domains.
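For context, WER counts the word substitutions $S$, deletions $D$, and insertions $I$ needed to turn a hypothesis into the reference transcript, normalized by the number of reference words $N$:

$$
\mathrm{WER} = \frac{S + D + I}{N}
$$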
Guide: Running Locally
To run the Wav2Vec2 model locally, follow these steps:
- Install Dependencies: Ensure you have PyTorch, Transformers, and Datasets installed.
pip install torch transformers datasets
- Load Model & Processor:
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-large-960h-lv60-self")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-large-960h-lv60-self")
- Load Data: Use a dataset like LibriSpeech for evaluation.
from datasets import load_dataset

ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")
- Process Audio Input: Tokenize and prepare the audio input; the model expects 16 kHz audio.
input_values = processor(ds[0]["audio"]["array"], sampling_rate=16_000, return_tensors="pt", padding="longest").input_values
- Transcribe: Perform inference with the model, then decode the predicted token IDs.
import torch

with torch.no_grad():
    logits = model(input_values).logits
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)
- Evaluation: Evaluate the model's transcriptions against the reference transcripts using Word Error Rate (WER) with a library like jiwer; a sketch follows this list.
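A minimal evaluation sketch, assuming the ds, processor, and model objects from the steps above, and that the reference transcripts live in the dataset's "text" column (jiwer's wer() accepts lists of reference and hypothesis strings):

import torch
from jiwer import wer

references, hypotheses = [], []
for sample in ds:
    # prepare one utterance (16 kHz raw waveform) for the model
    input_values = processor(sample["audio"]["array"], sampling_rate=16_000, return_tensors="pt", padding="longest").input_values
    with torch.no_grad():
        logits = model(input_values).logits
    # greedy CTC decoding: argmax over the vocabulary, then collapse repeats and blanks
    predicted_ids = torch.argmax(logits, dim=-1)
    hypotheses.append(processor.batch_decode(predicted_ids)[0])
    references.append(sample["text"])

print("WER:", wer(references, hypotheses))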
Because the model is computationally intensive, a GPU is recommended for anything beyond small experiments; cloud GPU services such as AWS, Google Cloud, or Azure work well. A sketch for moving inference onto a GPU follows.
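A minimal sketch for running inference on a GPU when one is available, assuming the model and input_values objects from the steps above:

import torch

# fall back to CPU if no CUDA device is present
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
input_values = input_values.to(device)

with torch.no_grad():
    logits = model(input_values).logits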
License
The Wav2Vec2-Large-960h-LV60-Self model is released under the Apache 2.0 License, which permits both personal and commercial use provided the license and attribution notices are preserved.