wav2vec2-base-10k-voxpopuli-ft-es
facebook
Introduction
The wav2vec2-base-10k-voxpopuli-ft-es model, developed by Facebook AI, is a fine-tuned version of the Wav2Vec2 base model for automatic speech recognition (ASR) in Spanish. The base model is pre-trained on the 10K unlabeled subset of the VoxPopuli corpus and then fine-tuned on transcribed Spanish data. Given Spanish speech audio, the model outputs the corresponding text.
Architecture
This model is based on the Wav2Vec2 architecture, which learns speech representations directly from raw audio waveforms. The base model is pre-trained on the 10K unlabeled subset of VoxPopuli, a large-scale multilingual speech corpus, and is then fine-tuned on transcribed Spanish data. Pre-training on unlabeled audio followed by supervised fine-tuning lets the model reuse general speech representations when it learns Spanish transcription.
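To make "raw audio in, representations out" more concrete, the sketch below estimates how many feature frames the convolutional front end produces for a waveform of a given length. The kernel and stride values are an assumption: they are the defaults of the standard Wav2Vec2 base configuration and should be checked against this checkpoint's config.json. The helper name `num_output_frames` is hypothetical.

```python
# Assumed values: the default Wav2Vec2 base feature-encoder configuration
# (conv_kernel / conv_stride). Verify them against the checkpoint's
# config.json before relying on the numbers below.
CONV_KERNELS = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDES = (5, 2, 2, 2, 2, 2, 2)


def num_output_frames(num_samples: int) -> int:
    """Return how many feature frames the conv stack emits for a raw waveform."""
    length = num_samples
    for kernel, stride in zip(CONV_KERNELS, CONV_STRIDES):
        # Output-length formula for an unpadded 1-D convolution.
        length = (length - kernel) // stride + 1
    return length


if __name__ == "__main__":
    one_second = 16_000  # samples at a 16 kHz sampling rate
    print(num_output_frames(one_second))  # 49 frames, roughly one every 20 ms
```

Under these assumptions the model emits one prediction for roughly every 20 ms of audio, which is why the transcript is decoded from a sequence of per-frame outputs rather than from one output per clip.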
Training
The training process involves two main stages:
- Pre-training: The model is trained on the 10K unlabeled subset of the VoxPopuli corpus to learn speech representations.
- Fine-tuning: The model is fine-tuned on transcribed Spanish data to improve its accuracy in recognizing Spanish speech. The details and results of this training process are documented in the paper titled "VoxPopuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation."
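Recognition accuracy for ASR models is commonly reported as word error rate (WER): the word-level edit distance between the reference transcript and the model output, divided by the number of reference words. The following is a minimal sketch of that metric, not the evaluation code used by the authors; the function name `word_error_rate` and the example sentences are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of four reference words.
    print(word_error_rate("el gato come pescado", "el gato como pescado"))  # 0.25
```

Lower is better: a WER of 0.0 means the hypothesis matches the reference word for word.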
Guide: Running Locally
To run this model locally, follow these steps:
- Install Required Libraries: Ensure you have the transformers, datasets, torchaudio, and torch libraries installed.
- Load the Model and Processor:
    from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

    # Download the fine-tuned checkpoint and its matching processor
    model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-10k-voxpopuli-ft-es")
    processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-10k-voxpopuli-ft-es")
- Load and Prepare the Dataset:
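    from datasets import load_dataset
    import torchaudio

    # Load a small slice of the Spanish Common Voice validation split
    ds = load_dataset("common_voice", "es", split="validation[:1%]")

    # Resample from 48 kHz to the 16 kHz rate passed to the processor
    resampler = torchaudio.transforms.Resample(48000, 16000)

    def map_to_array(batch):
        # Read the audio file, resample it, and keep the first channel
        speech, _ = torchaudio.load(batch["path"])
        speech = resampler(speech)
        batch["speech"] = speech[0]
        return batch

    ds = ds.map(map_to_array)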
- Run Inference:
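    import torch

    # Pad the first five clips into one batch
    inputs = processor(ds[:5]["speech"], sampling_rate=16000, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    predicted_ids = torch.argmax(logits, dim=-1)

    # Decode the per-frame token ids into text
    print(processor.batch_decode(predicted_ids))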
- Cloud GPU Suggestions: For efficient processing, consider using cloud GPUs such as those offered by AWS, Google Cloud, or Azure.
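In the Run Inference step, `torch.argmax` picks the most likely token for every audio frame, and `processor.batch_decode` turns those per-frame ids into text. For a CTC model this decoding amounts to merging consecutive repeated ids and dropping the blank token. The sketch below illustrates that idea in plain Python; the toy vocabulary, the blank id of 0, and the function names are assumptions for illustration, and the real mapping comes from the checkpoint's tokenizer.

```python
from itertools import groupby

# Hypothetical toy vocabulary for illustration only; index 0 plays the role
# of the CTC blank token.
BLANK_ID = 0
TOY_VOCAB = {1: "h", 2: "o", 3: "l", 4: "a"}


def ctc_greedy_collapse(frame_ids, blank_id=BLANK_ID):
    """Merge consecutive repeated ids, then drop blanks (greedy CTC decoding)."""
    merged = [token for token, _ in groupby(frame_ids)]
    return [token for token in merged if token != blank_id]


def ids_to_text(token_ids, vocab=TOY_VOCAB):
    """Map collapsed token ids to characters using the toy vocabulary."""
    return "".join(vocab[token] for token in token_ids)


if __name__ == "__main__":
    # Per-frame argmax ids for one clip, as the inference step would produce.
    frames = [1, 1, 0, 2, 2, 0, 0, 3, 3, 3, 4, 0]
    print(ids_to_text(ctc_greedy_collapse(frames)))  # hola
```

The blank token is what allows genuine double letters: `[3, 0, 3]` collapses to two separate tokens, while `[3, 3]` collapses to one.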
License
This model is licensed under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0). This license permits use for non-commercial purposes, provided appropriate credit is given.