facebook/wav2vec2-large-robust-ft-swbd-300h
Introduction
The wav2vec2-large-robust-ft-swbd-300h model is a fine-tuned version of Facebook's Wav2Vec2 model, specifically adapted for automatic speech recognition tasks. It is designed to transcribe speech from audio recordings and is fine-tuned on the Switchboard corpus, a dataset of telephone conversations.
Architecture
The model is based on the Wav2Vec2 architecture, which leverages self-supervised learning to capture speech representations from raw audio. It is built using the PyTorch library and is compatible with the Hugging Face Transformers ecosystem. The model is robust across various domains due to its extensive pretraining on diverse audio datasets.
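As a quick illustration of how the checkpoint plugs into the Transformers ecosystem, its configuration can be inspected directly from Python. This is a minimal sketch, not part of the original model card, and the exact values printed depend on the published checkpoint:

from transformers import Wav2Vec2Config

# Download the checkpoint's configuration and print a few architecture details
config = Wav2Vec2Config.from_pretrained("facebook/wav2vec2-large-robust-ft-swbd-300h")
print(config.hidden_size, config.num_hidden_layers, config.num_attention_heads)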
Training
The model was pretrained using several datasets:
- Libri-Light: A large collection of clean, read speech drawn from LibriVox audiobooks.
- CommonVoice: A crowd-sourced dataset of read text snippets.
- Switchboard: A noisy telephone speech corpus.
- Fisher: A corpus of conversational telephone speech.
The model was then fine-tuned on 300 hours of the Switchboard dataset to enhance its accuracy in recognizing telephone speech.
Guide: Running Locally
To use the model for audio transcription, follow these steps:
- Install Dependencies: Ensure you have Python installed and run:
  pip install transformers datasets torch
- Load the Model:
  from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
  from datasets import load_dataset
  import torch

  processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-large-robust-ft-swbd-300h")
  model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-large-robust-ft-swbd-300h")
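  Optionally, the model can be switched to evaluation mode before inference; this small extra step is not part of the original guide:
  model.eval()  # disables dropout so repeated runs on the same audio are deterministic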
- Prepare the Data:
  ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")
  input_values = processor(ds[0]["audio"]["array"], return_tensors="pt", padding="longest").input_values
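  Note that Wav2Vec2 checkpoints expect 16 kHz mono audio. If your recordings use a different sampling rate, one option (a sketch assuming the same "audio" column layout as above) is to resample with the datasets Audio feature and pass the rate to the processor explicitly:
  from datasets import Audio

  # Resample the audio column to 16 kHz before feature extraction
  ds = ds.cast_column("audio", Audio(sampling_rate=16_000))
  input_values = processor(
      ds[0]["audio"]["array"],
      sampling_rate=16_000,  # must match the model's expected sampling rate
      return_tensors="pt",
      padding="longest",
  ).input_values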
- Transcribe Audio:
  logits = model(input_values).logits
  predicted_ids = torch.argmax(logits, dim=-1)
  transcription = processor.batch_decode(predicted_ids)
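  For inference it is common to wrap the forward pass in torch.no_grad() so no gradient graph is built; a minimal sketch of the same step:
  with torch.no_grad():
      logits = model(input_values).logits

  predicted_ids = torch.argmax(logits, dim=-1)
  print(processor.batch_decode(predicted_ids)[0])  # first transcription in the batch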
- Consider Cloud GPUs: For large-scale transcription tasks, using cloud GPUs from providers such as AWS, Google Cloud, or Azure can significantly speed up processing.
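  If a GPU is available, moving the model and its inputs onto it is a small change; this sketch assumes a CUDA device:
  device = "cuda" if torch.cuda.is_available() else "cpu"
  model = model.to(device)                 # move the weights to the GPU (or stay on CPU)
  input_values = input_values.to(device)   # inputs must live on the same device as the model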
License
The model is released under the Apache 2.0 License, allowing free use and distribution with attribution.