wav2vec2 base 960h
facebook
Introduction
Wav2Vec2-Base-960H is a speech recognition model developed by Facebook AI that transcribes speech audio into text. It is pretrained and fine-tuned on 960 hours of LibriSpeech data and expects input audio sampled at 16 kHz. The underlying wav2vec 2.0 approach achieves a low Word Error Rate (WER), the fraction of words transcribed incorrectly, even when little labeled data is available.
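Since WER is the metric quoted throughout, a minimal sketch of how it is commonly computed may help: the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. The function name `word_error_rate` is illustrative and not part of any library; real evaluations usually also normalize casing and punctuation first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

WER figures are usually quoted as percentages, so a return value of 0.018 corresponds to a WER of 1.8.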
Architecture
The Wav2Vec2 model learns representations from raw audio by solving a contrastive task defined over a quantization of its latent representations. During pretraining it masks spans of the speech input in the latent space and must identify the correct quantized representation for each masked position, which lets it learn strong representations from unlabeled speech. The wav2vec 2.0 paper reports 1.8/3.3 WER on the LibriSpeech clean/other test sets when all labeled data is used, and shows that the approach remains competitive with far less labeled data. These headline figures come from the paper's best configuration rather than from this base checkpoint.
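To make the contrastive task concrete, here is a minimal pure-Python sketch for a single masked time step. It scores the context vector against the true quantized target and a set of distractors using cosine similarity, then takes the negative log-probability of the true target. The function names and the default temperature are illustrative assumptions; this is not the training code, which operates on batched tensors and includes additional loss terms.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def contrastive_loss(context, target, distractors, temperature=0.1):
    """Negative log-probability of the true target among all candidates.

    context:     context-network output at one masked time step
    target:      the true quantized latent for that time step
    distractors: quantized latents sampled from other time steps
    temperature: illustrative default, treat as an assumption
    """
    candidates = [target] + list(distractors)
    logits = [cosine_similarity(context, q) / temperature for q in candidates]

    # Log-sum-exp with the maximum subtracted for numerical stability.
    max_logit = max(logits)
    log_denominator = max_logit + math.log(
        sum(math.exp(value - max_logit) for value in logits)
    )
    # logits[0] belongs to the true target.
    return log_denominator - logits[0]


if __name__ == "__main__":
    context = [1.0, 0.0]
    print(contrastive_loss(context, [1.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]]))
    print(contrastive_loss(context, [0.0, 1.0], [[1.0, 0.0]]))
```

The loss is close to zero when the context vector aligns with the true target and grows when a distractor is the better match, which is the pressure that drives representation learning here.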
Training
Wav2Vec2-Base-960H was first pretrained on unlabeled speech audio and then fine-tuned on labeled data from the LibriSpeech dataset. Because most of the representation is learned without transcripts, the approach performs well even when only a small amount of labeled data is available, which makes it effective in semi-supervised settings where transcribed speech is scarce.
Guide: Running Locally
To run the Wav2Vec2-Base-960H model locally:
- Install Dependencies: Ensure you have transformers, datasets, and torch installed in your Python environment.
- Load Model and Processor:

    from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

    processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
    model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

- Load Dataset: Use the datasets library to load audio data.

    from datasets import load_dataset

    ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")

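- Transcription: Process and transcribe audio using the loaded model.

    import torch

    # Convert the raw waveform of the first sample into model inputs (batch size 1)
    input_values = processor(ds[0]["audio"]["array"], return_tensors="pt", padding="longest").input_values
    # Per-frame scores over the vocabulary
    logits = model(input_values).logits
    # Pick the most likely token for each frame, then decode the ids to text
    predicted_ids = torch.argmax(logits, dim=-1)
    transcription = processor.batch_decode(predicted_ids)
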
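The last two calls in the transcription step amount to greedy CTC decoding: take the best token per frame, merge consecutive repeats, drop the blank token, and turn word delimiters into spaces. The sketch below shows roughly that logic in plain Python. The function name `greedy_ctc_decode` and the toy vocabulary are hypothetical and chosen for illustration; the real tokenizer has its own vocabulary and handles more cases.

```python
def greedy_ctc_decode(token_ids, id_to_token, blank_id=0, word_delimiter="|"):
    """Collapse per-frame token ids into text using the greedy CTC rule."""
    # Step 1: merge runs of identical consecutive ids.
    collapsed = []
    previous = None
    for token_id in token_ids:
        if token_id != previous:
            collapsed.append(token_id)
        previous = token_id

    # Step 2: drop blanks. A blank between two identical ids is what
    # allows genuine double letters to survive step 1.
    characters = [id_to_token[i] for i in collapsed if i != blank_id]

    # Step 3: map the word delimiter to a space.
    return "".join(characters).replace(word_delimiter, " ").strip()


if __name__ == "__main__":
    # Hypothetical toy vocabulary, not the model's actual one.
    toy_vocab = {0: "<pad>", 1: "|", 2: "H", 3: "I", 4: "O"}
    frame_ids = [2, 2, 0, 3, 3, 1, 1, 2, 0, 4, 4]
    print(greedy_ctc_decode(frame_ids, toy_vocab))
```

Greedy decoding is the simplest option; decoding with an external language model is a separate technique that this sketch does not cover.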
For faster inference, especially on long recordings or large batches, consider running the model on a GPU. Cloud providers such as AWS or Google Cloud Platform offer GPU instances if none is available locally.
License
The Wav2Vec2-Base-960H model is licensed under the Apache-2.0 License, which permits use, modification, and distribution provided that the license text and copyright notices are retained and significant changes are stated.