wav2vec2-lv-60-espeak-cv-ft

Introduction
The wav2vec2-lv-60-espeak-cv-ft model is designed for automatic speech recognition (ASR), specifically phoneme recognition in multiple languages. It is based on the pretrained wav2vec2-large-lv60 model and fine-tuned on the Common Voice dataset to transcribe phonetic labels.
Architecture
The model leverages the wav2vec2-large-lv60 architecture, a self-supervised model pretrained on a large corpus of audio data. The fine-tuning process adapts this architecture to recognize and transcribe phonetic labels across multiple languages by mapping phonemes of the training languages to the target language using articulatory features.
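Because the fine-tuned head predicts phonetic labels rather than orthographic characters, the checkpoint's tokenizer vocabulary is a phoneme inventory. A minimal sketch for inspecting it, assuming only the standard Hugging Face Transformers API (Wav2Vec2Processor and the tokenizer's get_vocab()):

```python
from transformers import Wav2Vec2Processor

# Load the processor that ships with the checkpoint (feature extractor + tokenizer).
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-lv-60-espeak-cv-ft")

# The tokenizer vocabulary maps output labels to ids; for this checkpoint the
# labels are phonetic symbols rather than characters.
vocab = processor.tokenizer.get_vocab()
print(len(vocab))                          # size of the label inventory
print(sorted(vocab, key=vocab.get)[:20])   # first few labels, including special tokens
```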
Training
The training process involves fine-tuning the wav2vec2-large-lv60 model on the Common Voice dataset, which includes multilingual phonetic labels. The model outputs a sequence of phonetic labels that can be mapped to words using a dictionary. This approach significantly improves cross-lingual phoneme recognition without requiring task-specific architectures.
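To make the dictionary step concrete, here is a minimal sketch of mapping a decoded phoneme sequence back to words with a pronunciation dictionary. The dictionary entries and the greedy matching below are purely illustrative and hypothetical; they are not part of the model or of the Transformers library:

```python
# Hypothetical pronunciation dictionary: phoneme sequences -> words.
# A real system would use a full lexicon (e.g. one derived from espeak).
pronunciation_dict = {
    ("h", "ə", "l", "oʊ"): "hello",
    ("w", "ɜ", "l", "d"): "world",
}

def phonemes_to_words(phonemes):
    """Greedy longest-match lookup of phoneme spans in the dictionary (illustrative only)."""
    words, i = [], 0
    while i < len(phonemes):
        for j in range(len(phonemes), i, -1):   # try the longest span first
            word = pronunciation_dict.get(tuple(phonemes[i:j]))
            if word:
                words.append(word)
                i = j
                break
        else:                                   # no dictionary entry: keep the raw phoneme
            words.append(phonemes[i])
            i += 1
    return words

print(phonemes_to_words(["h", "ə", "l", "oʊ", "w", "ɜ", "l", "d"]))  # ['hello', 'world']
```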
Guide: Running Locally
To run the model locally, follow these steps:
- Install Dependencies:
  - Ensure that you have Python and the Hugging Face Transformers library installed.
  - Install the required packages with `pip install transformers datasets torch`.
- Load Model and Processor:

  ```python
  from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
  from datasets import load_dataset
  import torch

  processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-lv-60-espeak-cv-ft")
  model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-lv-60-espeak-cv-ft")
  ```
- Load and Preprocess Data (if you use your own recordings instead of the sample dataset, see the resampling sketch after this list):

  ```python
  ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")
  input_values = processor(ds[0]["audio"]["array"], return_tensors="pt").input_values
  ```
- Perform Inference:

  ```python
  with torch.no_grad():
      logits = model(input_values).logits

  predicted_ids = torch.argmax(logits, dim=-1)
  transcription = processor.batch_decode(predicted_ids)
  ```

  The transcription output will be a sequence of phonetic labels.
- Hardware Suggestions:
  - For efficient processing, consider using cloud-based GPUs such as those provided by AWS, Google Cloud, or Azure (see the GPU sketch after this list).
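The LibriSpeech dummy split used above is already sampled at 16 kHz, which is what the wav2vec2 feature extractor expects. For your own recordings, here is a minimal resampling sketch using the standard `datasets` `Audio` feature and `cast_column`; the file name below is a hypothetical placeholder, and `processor` comes from the "Load Model and Processor" step:

```python
from datasets import Audio, Dataset

# Hypothetical local file; replace with your own recordings.
ds = Dataset.from_dict({"audio": ["my_recording.wav"]})
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))

# The Audio feature decodes and resamples on access, so the array handed to the
# processor is already at the expected 16 kHz rate.
sample = ds[0]["audio"]
input_values = processor(
    sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt"
).input_values
```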
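If a GPU is available, moving the model and inputs onto it is a small change. A minimal sketch using standard PyTorch device handling (illustrative, not specific to this checkpoint; `model`, `processor`, and `input_values` come from the steps above):

```python
import torch

# Use a GPU when one is available, otherwise fall back to CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

with torch.no_grad():
    logits = model(input_values.to(device)).logits

predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)
```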
License
This model is licensed under the Apache 2.0 License, which allows for both commercial and non-commercial use, distribution, and modification.