wav2vec2-base-10k-voxpopuli-ft-es

Introduction

The wav2vec2-base-10k-voxpopuli-ft-es model, developed by Facebook AI, is a fine-tuned version of the Wav2Vec2 base model. It was pre-trained on the 10k unlabeled subset of the VoxPopuli corpus and then fine-tuned for automatic speech recognition in Spanish, so it transcribes Spanish speech audio into text.

Architecture

This model is based on the Wav2Vec2 architecture, which learns speech representations directly from raw audio. The base model is pre-trained on VoxPopuli, a large-scale multilingual speech corpus, before being fine-tuned on transcribed Spanish data. This two-stage approach yields a robust model for downstream speech recognition tasks.
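
To get a concrete feel for the base architecture's size, you can inspect the checkpoint's configuration with the transformers API. A minimal sketch (the printed values depend on the hosted config; the comment reflects typical base-model dimensions):

    from transformers import Wav2Vec2Config

    # load the configuration that ships with the checkpoint
    config = Wav2Vec2Config.from_pretrained("facebook/wav2vec2-base-10k-voxpopuli-ft-es")
    # the base variant typically has 12 transformer layers, 768-d hidden states, 12 heads
    print(config.num_hidden_layers, config.hidden_size, config.num_attention_heads)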

Training

The training process involves two main stages:

  1. Pre-training: The model is trained on the 10K unlabeled subset of the VoxPopuli corpus to learn speech representations.
  2. Fine-tuning: The model is fine-tuned on transcribed Spanish data to improve its accuracy in recognizing Spanish speech. The details and results of this training process are documented in the paper titled "VoxPopuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation."
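
Fine-tuning for ASR uses a CTC head on top of the pre-trained encoder, as reflected in the Wav2Vec2ForCTC class used in the guide below. A minimal sketch of a single fine-tuning step on dummy data (the audio and transcription here are placeholders, not the authors' training setup):

    import torch
    from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

    processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-10k-voxpopuli-ft-es")
    model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-10k-voxpopuli-ft-es")

    # one second of random noise standing in for 16 kHz speech
    audio = torch.randn(16000).numpy()
    inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
    # tokenize a target transcription into character label ids
    labels = processor.tokenizer("hola mundo", return_tensors="pt").input_ids

    # passing labels makes the forward pass return the CTC loss
    loss = model(inputs.input_values, labels=labels).loss
    loss.backward()  # a real training loop would follow with an optimizer step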

Guide: Running Locally

To run this model locally, follow these steps:

  1. Install Required Libraries: Ensure you have the transformers, datasets, torchaudio, and torch libraries installed.
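     For example, with pip (a standard Python environment is assumed; versions are left unpinned):
    pip install transformers datasets torchaudio torch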
  2. Load the Model and Processor:
    from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
    model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-10k-voxpopuli-ft-es")
    processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-10k-voxpopuli-ft-es")
    
  3. Load and Prepare the Dataset:
    from datasets import load_dataset
    import torchaudio
    
    ds = load_dataset("common_voice", "es", split="validation[:1%]")
    # Common Voice audio is 48 kHz; the model expects 16 kHz input
    resampler = torchaudio.transforms.Resample(48000, 16000)
    
    def map_to_array(batch):
        # load the audio file and resample it to 16 kHz
        speech, _ = torchaudio.load(batch["path"])
        speech = resampler(speech)
        batch["speech"] = speech[0]  # keep channel 0 as a 1-D waveform
        return batch
    
    ds = ds.map(map_to_array)
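
     A quick sanity check that the mapping worked (illustrative; assumes the steps above ran without errors):
    print(len(ds[0]["speech"]) / 16000, "seconds of audio in the first example")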
    
  4. Run Inference:
    import torch
    
    inputs = processor(ds[:5]["speech"], sampling_rate=16000, return_tensors="pt", padding=True)
    with torch.no_grad():  # inference only; no gradients needed
        logits = model(**inputs).logits
    predicted_ids = torch.argmax(logits, dim=-1)
    print(processor.batch_decode(predicted_ids))
    
  5. Cloud GPU Suggestions: For efficient processing, consider using cloud GPUs such as those offered by AWS, Google Cloud, or Azure.
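
If a GPU is available, moving the model and the processed inputs onto it speeds up inference. A minimal device-placement sketch, reusing the model, processor output, and logits computation from the steps above:

    import torch

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device)
    # move every tensor in the processed batch onto the same device
    inputs = {k: v.to(device) for k, v in inputs.items()}
    with torch.no_grad():
        logits = model(**inputs).logits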

License

This model is licensed under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0). This license permits use for non-commercial purposes, provided appropriate credit is given.
